Yoav Levine, Noam Wies, Or Sharir, Hofit Bata and Amnon Shashua

9 Dec 2020
**In a nutshell**: In our recent NeurIPS paper, we theoretically predict a width-dependent transition between depth-efficiency and depth-**in**efficiency in self-attention networks (see figure below). We conduct extensive empirical ablations that clearly reveal the theoretically predicted behaviors, and provide **explicit quantitative suggestions regarding the optimal depth-to-width allocation for a given self-attention network size** (see table).
Informed guidelines for increasing depth and width in tandem have boosted performance of convolutional networks (EfficientNet).
The race towards language models beyond 1-Trillion parameters renders such guidelines an essential ingredient in the case of self-attention.
**Our guidelines elucidate the depth-to-width tradeoff in self-attention networks of sizes up to the scale of GPT3 (which is too deep for its size), and beyond**.
We identify a network width of 30K as optimal for a 1-Trillion parameter self-attention network.

The following table shows our suggested depths and widths for models of sizes used in the recent GPT3 paper:

It seems that popular self-attention architectures of all sizes trained to date, up to GPT3's crossing of the 100B-parameter threshold, could generally benefit from deepening, with the appropriate widening (indicated by our guidelines).
With that, **our results clearly indicate the importance of widening self-attention networks when aiming for the 1 Trillion parameter mark**. We project the optimal architecture at that size to have depth 92 and width 30K, wider than any self-attention network trained to date.
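As a quick sanity check, the projected 1-Trillion-parameter shape can be recovered by inverting the standard non-embedding parameter count \(12\cdot L\cdot d_x^2\): fixing the depth at 92 and solving for the width that fills the budget gives roughly 30K. A minimal sketch (the helper name `width_for_budget` is ours, for illustration only):

```python
import math

def width_for_budget(num_params: float, depth: int) -> int:
    """Invert the non-embedding parameter count 12 * L * d_x^2
    to find the width d_x that fills a given budget at depth L."""
    return round(math.sqrt(num_params / (12 * depth)))

# Projected optimal 1-Trillion-parameter architecture at depth 92:
print(width_for_budget(1e12, 92))  # → 30096, i.e. ~30K
```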

Our theory predicts that self-attention has a **depth-inefficiency** parameter regime: a depth \(L_{\textrm{deep}}\) self-attention network will not outperform a shallower network of the same size with depth \(L_{\textrm{shallow}}< L_{\textrm{deep}}\). This behavior is predicted to occur until the shallower network's width crosses a threshold that grows exponentially with its depth. Beyond this threshold, self-attention is predicted to cross into a **depth-efficiency** regime: the deeper self-attention network will outperform the shallower one given the same parameter budget.

The experiments in the figures on top display this predicted behavior, and furthermore reveal that a deeper self-attention network can even underperform relative to a shallower one when its width is too small. As illustrated in the left figure below, **at different network sizes we find there to be an optimal depth** that achieves the best performance.
Our paper details a thorough evaluation comparing performance between networks of depths 6, 12, 18, 24, 30, 36, and 48 at various widths. A clear trend emerges: the network size at which the transition between regimes occurs indeed depends exponentially on the depth, *i.e.*, **the optimal depth is logarithmic in network size**. This is why increasing towards 1-Trillion parameters and beyond should be done mainly by widening, as quantified more precisely below.

We fit the exponential dependence (as shown in the right figure above), and project practical guidelines for the architectural design of contemporary huge language models. As an example, the table above shows that **up to the scale of ~10B, the trained GPT3 architectures were generally too shallow per their parameter count**, meaning that they are projected to underperform relative to the optimal architecture at that size (similarly to the depth-6 network in the white regime of the figure at the top).
Conversely, **the largest model trained to date, GPT3-175B, is too deep given its size**, and could have benefited from widening at the expense of depth (similarly to the depth-48 network in the gray regime of the figure at the top).
Overall, our findings strongly indicate that **the best route towards Trillion parameter self-attention models is via significant widening**.

The golden age of deep learning has popularized the *depth-efficiency* notion: from an expressiveness standpoint, increasing a neural network's size by adding more layers (deepening) is advantageous relative to other parameter-increase alternatives, such as increasing the dimension of the internal representation (widening). Beyond overwhelming empirical signals for this notion, depth-efficiency has been theoretically supported from a variety of angles. Diminishing returns in very deep networks were mainly attributed to optimization issues, and indeed alleviating these issues allowed network depths to grow from tens to hundreds of layers and beyond, allowing deep convolutional networks (ConvNets) to advance the state of the art in computer vision applications.

Since the introduction of the Transformer, along with its encoder-only variant, BERT, self-attention based deep learning architectures have taken over the field of natural language processing. However, in contrast to the depth "arms race" that took place in the ConvNet case, the leading self-attention networks are not much deeper than the original depth-24 BERT-Large model. In fact, the largest self-attention model trained to date, GPT3, has increased the parameter count of BERT-Large by a factor of 500, while only increasing its depth by a factor of 4. The remaining size increase stems from an increase in layer widths, clearly countering the depth-efficiency notion.

An empirical ablation study by OpenAI provides systematic support for the above signal. The figure below, taken from this study, leads the authors to conclude that the overall (non-embedding) network size, given by \(12\cdot L\cdot d_x^2\) where \(L\) is the number of self-attention layers (network depth) and \(d_x\) is the hidden representation dimension (network width), is the main predictor of performance, regardless of the depth to width ratio. Experiments along the \(L>6\) (yellow) curve reinforce this conclusion, suggesting that depth does not play as crucial a role in self-attention networks as it does in convolutional networks.
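For reference, this size formula can be evaluated directly, and it recovers GPT3-175B's reported scale from its published depth (96) and width (12288). A quick sketch:

```python
def self_attention_params(depth: int, width: int) -> int:
    """Non-embedding parameter count of a self-attention network,
    12 * L * d_x^2, as used in the OpenAI scaling study."""
    return 12 * depth * width ** 2

# GPT3-175B: depth L = 96, width d_x = 12288
print(self_attention_params(96, 12288) / 1e9)   # ≈ 174 (billion, non-embedding)

# BERT-Large: depth L = 24, width d_x = 1024
print(self_attention_params(24, 1024) / 1e9)    # ≈ 0.3 (billion)
```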
**Our theoretical framework reveals fundamental subtleties in the above picture, predicting depth-efficiency and depth-inefficiency parameter regimes in self-attention**.

Rather than reinforcing the seemingly plausible hypothesis suggested by the trend in the above figure, namely that widening a self-attention network is as effective as deepening it, we confirm the contrary. We show that the operation of stacking self-attention layers is so effective that it quickly saturates the capacity of the network's width.

Specifically, we establish the existence of a depth threshold which depends logarithmically on the width \(d_x\), denoted \(L_{\textrm{th}}(d_x)\sim\log(d_x)\). Below the threshold, we prove that double-exponential depth-efficiency takes place in self-attention networks:

Informal theorem: A self-attention network of depth under \(\log(d_x)\) can only be replicated by a shallower network if the latter is wider by a factor that is double-exponential in the ratio of the depths.

In the other regime, above the threshold, we establish a completely different behavior for the operation of stacking self-attention layers:

Informal theorem: For self-attention networks of depth over \(\log(d_x)\), width and depth contribute similarly to network expressivity.
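The two regimes above amount to a simple comparison of depth against the threshold. In the sketch below, the proportionality constant `c` in \(L_{\textrm{th}}(d_x)\approx c\cdot\log(d_x)\) is a hypothetical placeholder, since the theory determines the threshold only up to constant factors:

```python
import math

def depth_regime(depth: int, width: int, c: float = 1.0) -> str:
    """Classify a self-attention network against the depth threshold
    L_th(d_x) ~ c * log(d_x). The constant c is a hypothetical
    placeholder for illustration; the theory fixes it only up to
    constant factors."""
    threshold = c * math.log(width)
    # Below the threshold: depth-efficiency (deepening pays off
    # double-exponentially). Above it: width and depth contribute similarly.
    return "depth-efficiency" if depth < threshold else "depth-inefficiency"

print(depth_regime(6, 10_000))   # shallow and wide: below threshold
print(depth_regime(48, 1_000))   # deep and narrow: above threshold
```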

Our bounds do not separate wider networks from deeper ones in the depth-inefficiency regime. However, our experiments clearly show a surprising phenomenon: **for small enough network sizes, deeper self-attention networks perform worse than shallow ones**. We leave a theoretical treatment of this added regime for future work.

Beyond elucidating the behavior of vanilla self-attention architectures, our work theoretically motivates architectural changes that can provide the next leap in self-attention network expressiveness. By indicating the network width as the limiting factor for depth-efficiency, our analysis encourages the development of methods for increasing network width at low cost. For example, we point to the concept of ShuffleNet, which has proven efficient for convolutional networks: it suggests increasing the representation dimension while using only a fraction of it for computation in each layer. This way, the computation costs are contained, but the theoretical limitations related to width posed by our work are relaxed.

Generally, width increases have greater potential for speeding up network training and inference because they can be parallelized, whereas depth yields a sequential computation. A theoretical indication that the contributions of depth and width are indeed on the same order, and that width constrains depth from contributing further, along with an empirical indication that the route to 1-Trillion parameter models should be via widening, motivates the development of more extensive model parallelism methods for Transformers. Indeed, we view our work as part of an effort to provide timely interpretations as feedback for the tremendous empirical pull in our field.
*Check out all of the details in our paper, The depth-to-width interplay in self-attention.*