Is depth useful for self-attention?

A theoretical perspective

Yoav Levine, Noam Wies, Or Sharir, Hofit Bata and Amnon Shashua

In a nutshell: In our new paper, we prove that double-exponential depth-efficiency takes place in self-attention networks, and we pinpoint a transition, at depth \(L=\log_3(d_x)\), beyond which the capacity of the self-attention width \(d_x\) (the representation dimension) to support this efficiency is exhausted. Our predictions accord closely with the extensive empirical ablations in Kaplan et al., accounting for the different behaviors in the depth-efficiency and depth-inefficiency regimes. Pointing at the network's width as a limiting factor, we predict that solutions for dramatically increasing the width (model parallelism, etc.) can facilitate the next leap in self-attention expressivity.

Background: Depth is less crucial in self-attention

The golden age of deep learning has popularized the depth-efficiency notion: from an expressiveness standpoint, increasing a neural network's size by adding more layers (deepening) is advantageous relative to other ways of adding parameters, such as increasing the dimension of the internal representation (widening). Beyond overwhelming empirical signals for this notion, depth-efficiency was theoretically supported from a variety of angles. Diminishing returns in very deep networks were mainly attributed to optimization issues, and indeed, alleviating these issues allowed network depths to grow from 10s to 100s and beyond, enabling deep convolutional networks (ConvNets) to advance the state-of-the-art in computer vision applications.

Since the introduction of the Transformer, along with its encoder-only variant, BERT, self-attention based deep learning architectures have taken over the field of natural language processing. However, in contrast to the depth "arms race" that took place in the ConvNet case, the leading self-attention networks are not much deeper than the original depth-12 BERT-base model. In fact, the strongest self-attention model trained to date, T5, has increased the parameter count of BERT-base by a factor of 100, while only increasing its depth by a factor of 4. The remaining size increase stems from an increase in layer widths, clearly countering the depth-efficiency notion.

A recent extensive empirical ablation study by Kaplan et al. provides systematic support for the above signal. Figure 1 above, taken from this study, shows that the overall (non-embedding) network size, given by \(12\cdot L\cdot d_x^2\) where \(L\) is the number of self-attention layers (network depth) and \(d_x\) is the hidden representation dimension (network width), is the main predictor of performance, regardless of the depth to width ratio. Experiments along the \(L>6\) (yellow) curve include self-attention networks of depths from \(L=12\) to \(L=200\), all approximately obeying the same improvement trend, which depends only on network size. This suggests that depth does not play as crucial a role in self-attention networks as it does in convolutional networks.
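To make the size formula concrete, here is a minimal sketch that evaluates \(12\cdot L\cdot d_x^2\) for two shapes with the same parameter count. The function name is our own; the first shape is the depth-12, width-768 configuration commonly reported for BERT-base, and the second (four times deeper, half as wide) is a hypothetical alternative chosen only to land on the same budget.

```python
def non_embedding_params(depth: int, width: int) -> int:
    """Approximate non-embedding parameter count, 12 * L * d_x^2."""
    return 12 * depth * width ** 2


# Two shapes with an identical budget but very different depth-to-width ratios.
# (12, 768) is the commonly reported BERT-base shape; (48, 384) is hypothetical.
for depth, width in [(12, 768), (48, 384)]:
    print(f"L={depth:3d}  d_x={width:4d}  params={non_embedding_params(depth, width):,}")
```

Both shapes come out at roughly 85M non-embedding parameters, which is why network size alone cannot distinguish between them.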

Our findings: Network width caps benefits of depth in self-attention

In our new work, we theoretically address the above question of the depth-to-width trade-off in self-attention networks, and reveal fundamental subtleties in the above picture. A seemingly plausible reading of the trend in the above figure is that widening a self-attention network is as effective as deepening it. Rather than reinforcing this hypothesis, we establish the contrary: the operation of stacking self-attention layers is so effective that it quickly saturates the capacity of the network's width.

Specifically, we establish the existence of a depth threshold which depends logarithmically on the width \(d_x\), denoted \(L_{\textrm{th}}(d_x)=\log_3(d_x)\). Below the threshold, we prove that double-exponential depth-efficiency takes place in self-attention networks:

Informal theorem: A self-attention network of depth below \(\log_3(d_x)\) can only be replicated by a shallower network if the latter is wider by a factor that is double-exponential in the depth ratio.

In the other regime, above the threshold, we establish a completely different behavior for the operation of stacking self-attention layers:

Informal theorem: For self-attention networks of depth above \(\log_3(d_x)\), width and depth contribute similarly to network expressivity.

A closer observation of the experimental ablation in Kaplan et al., displayed in figure 2 below, reveals an agreement with our theoretical indications. The figure shows that while for \(L\leq 6\) there is an advantage for depth, for \(L>6\) it completely disappears. When assigning actual width values which range around \(d_x=1000\), our theoretical threshold for depth-efficiency agrees with empirical findings, as \(L_{\textrm{th}}(d_x)\simeq 6.3\).
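The threshold arithmetic is easy to check directly. The following is a minimal sketch; the function name `depth_threshold` and the sample widths other than 1000 are our illustrative choices, not values taken from the paper.

```python
import math


def depth_threshold(width: int) -> float:
    """Depth threshold L_th(d_x) = log_3(d_x) from the informal theorems."""
    return math.log(width, 3)


# For d_x = 1000 this gives about 6.29, matching the ~6.3 quoted above.
for width in (256, 1000, 4096):
    print(f"d_x={width:5d}  L_th={depth_threshold(width):.2f}")
```

Because the threshold grows only logarithmically, even a large increase in width shifts it by a few layers at most.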

Practical derivatives:

The clear boundaries drawn between the two regimes suggest always allocating a parameter budget of \(12\cdot L\cdot d^2_x\) so that depth does not fall below the threshold of \(\log_3(d_x)\). Below this threshold, we have shown a clear disadvantage in the expressiveness of shallower networks. The table below contains the minimal depths per parameter budget by these considerations, which accord with empirical evidence in figure 2. Such insights may prove useful given the rapid increase in model sizes.
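One way to turn this rule into numbers is to solve, for a fixed budget \(N=12\cdot L\cdot d_x^2\), for the depth at which \(L=\log_3(d_x)\) holds exactly. The sketch below does this by bisection. It is an illustration under our own assumptions (real-valued depth, hypothetical function names, arbitrary sample budgets) and is not a reproduction of the table in the paper.

```python
import math


def width_for_budget(n_params: float, depth: float) -> float:
    """Width d_x implied by n_params = 12 * depth * d_x^2."""
    return math.sqrt(n_params / (12.0 * depth))


def minimal_depth(n_params: float, tol: float = 1e-9) -> float:
    """Smallest real-valued depth L satisfying L >= log_3(d_x) at this budget."""

    def gap(depth: float) -> float:
        # Negative while the network is still below the threshold.
        return depth - math.log(width_for_budget(n_params, depth), 3)

    # gap() increases with depth, since the implied width shrinks as depth grows,
    # so plain bisection finds the unique crossing point.
    lo, hi = 1e-6, 1e3
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0:
            lo = mid
        else:
            hi = mid
    return hi


# Sample budgets are arbitrary; in practice one would round the depth up.
for n_params in (1e8, 1e9, 1e10):
    depth = minimal_depth(n_params)
    print(f"N={n_params:.0e}  L_min={depth:.2f}  d_x={width_for_budget(n_params, depth):.0f}")
```

Rounding the result up to the next integer gives a usable layer count; how to round is a design choice the text above does not prescribe.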

Moreover, the observation that width is the limiting factor for depth-efficiency motivates the development of methods for dramatically increasing it in self-attention architectures. The successful ALBERT, a BERT variant which shares parameters between self-attention layers, allows for wider models to be trained for the same budget. For a more significant increase that also addresses computational efficiency, we point to the concept employed in ShuffleNet, which has proved to be very efficient for convolutional networks. Its authors suggest increasing the representation dimension while using only a fraction of it for computation in each layer. This way, the computation costs are contained, but the theoretical limitations posed by our work are relaxed. Generally, width increases have greater potential for speeding up network inference and training because they can be parallelized, as opposed to depth, which yields a sequential computation. Our theoretical indication that the contributions of depth and width are indeed of the same order, and that width moreover limits the ability to enjoy depth-efficiency, may motivate the development of further model parallelism methods for Transformers.

Check out all of the details in Limits to Depth Efficiencies of Self-Attention