How AI training scales

In the last few years AI researchers have had increasing success in speeding up neural network training through data-parallelism, which splits large batches of data across many machines. Researchers have successfully used batch sizes of tens of thousands for image classification and language modeling, and even millions for RL agents that play the game Dota 2. These large batches allow increasing amounts of compute to be efficiently poured into the training of a single model, and are an important enabler of the fast growth in AI training compute. However, batch sizes that are too large show rapidly diminishing algorithmic returns, and it’s not clear why these limits are larger for some tasks and smaller for others.^{[^footnote-study]}

We have found that by measuring the gradient noise scale, a simple statistic that quantifies the signal-to-noise ratio of the network gradients^{[^footnote-gradients]}, we can approximately predict the maximum useful batch size. Heuristically, the noise scale measures the variation in the data as seen by the model (at a given stage in training). When the noise scale is small, looking at a lot of data in parallel quickly becomes redundant, whereas when it is large, we can still learn a lot from huge batches of data.

This type of statistic is widely used for sample size selection and has been proposed for use in deep learning, but has not been measured or applied systematically for modern training runs. We verified this prediction for a wide range of machine learning tasks shown in the figure above, including image recognition, language modeling, Atari games, and Dota. Specifically, we did training runs at a wide range of batch sizes (tuning the learning rate separately for each) for all of these tasks and compared the speedups in training to what the noise scale predicts should happen. Since large batch sizes often require careful and expensive tuning or special learning rate schedules to be effective, knowing an upper limit ahead of time provides a significant practical advantage in training new models.

We’ve found it helpful to visualize the results of these experiments in terms of a tradeoff between wall time for training and total bulk compute that we use to do the training (proportional to dollar cost). At very small batch sizes, doubling the batch allows us to train in half the time without using extra compute (we run twice as many chips for half as long). At very large batch sizes, more parallelization doesn’t lead to faster training. There is a “bend” in the curve in the middle, and the gradient noise scale predicts where that bend occurs.

Basic scaling diagram with overlap between economically feasible and technically feasible

We make these curves by setting a level of performance (say a score of 1000 on the Atari game of Beam Rider) and seeing how long it takes to train to that performance at various batch sizes. The results match our model’s predictions relatively closely, across many different values of the performance target.

The tradeoff between experience and training time needed to achieve a given score is predictable.

Patterns in the gradient noise scale

We’ve observed several patterns in the gradient noise scale which offer clues as to what the future of AI training may hold.

First, in our experiments, the noise scale typically increases by an order of magnitude or more over the course of training. Intuitively, this means the network learns the more “obvious” features of the task early in training and learns more intricate features later. For example, in the case of an image classifier, the network might first learn to identify small-scale features such as edges or textures that are present in most images, while only later putting these pieces together into more general concepts such as cats and dogs. To see the full variety of edges or textures, the network only needs to see a small number of images, so the noise scale is smaller; once the network knows more about larger objects, it can process many more images at once without seeing duplicative data.

We see some preliminary indications that the same effect holds across different models on the same dataset—more powerful models have a higher gradient noise scale, but only because they achieve a lower loss. Thus, there’s some evidence that the increasing noise scale over training isn’t just an artifact of convergence, but occurs because the model gets better. If this is true, then we expect future, more powerful models to have higher noise scale and therefore be more parallelizable.

Second, tasks that are subjectively more difficult are also more amenable to parallelization. In the context of supervised learning, there is a clear progression from MNIST, to SVHN, to ImageNet. In the context of reinforcement learning, there is a clear progression from Atari Pong to Dota 1v1 to Dota 5v5, with the optimal batch sizes differing by a factor of more than 10,000. Thus, as AI advances to new and more difficult tasks, we expect models to tolerate higher batch size.

Implications

The degree of data parallelism significantly affects the speed at which AI capabilities can progress. Faster training makes more powerful models possible and accelerates research through faster iteration times.

In an earlier study, AI and Compute, we observed that the compute being used to train the largest ML models is doubling every 3.5 months, and we noted that this trend is driven by a combination of economics (willingness to spend money on compute) and the algorithmic ability to parallelize training. The latter factor (algorithmic parallelizability) is harder to predict and its limits are not well-understood, but our current results represent a step toward systematizing and quantifying it. In particular, we have evidence that more difficult tasks and more powerful models on the same task will allow for more radical data-parallelism than we have seen to date, providing a key driver for the continued fast exponential growth in training compute. (And this is without even considering recent advances in model-parallelism, which may allow for even further parallelization on top of data-parallelism).

The continued growth of training compute, and its apparently predictable algorithmic basis, further highlights the possibility of rapid increases in AI capabilities over the next few years, and emphasizes the urgency of research into making sure such systems are safe and that they are used responsibly. A central challenge of AI policy will be to work out how to use measures like this to make predictions about the characteristics of future AI systems, and use this knowledge to conceive of policies that let society maximize the upsides and minimize the downsides of these technologies.

OpenAI is committed both to continuing to conduct rigorous analyses to give us foresight on what the future of AI holds, and to acting to address the issues that these analyses raise.

If you want to study the “Science of AI” and help us make neural network training more data-driven, consider applying to work at OpenAI.

How AI training scales

Patterns in the gradient noise scale

Implications

Authors

Acknowledgments

How AI training scales

More resources

Patterns in the gradient noise scale

Implications

Authors

Acknowledgments