Date: 2018-12-14

We have found that by measuring the gradient noise scale, a simple statistic that quantifies the signal-to-noise ratio of the network gradients[^footnote-gradients], we can approximately predict the maximum useful batch size. Heuristically, the noise scale measures the variation in the data as seen by the model (at a given stage in training). When the noise scale is small, looking at a lot of data in parallel quickly becomes redundant, whereas when it is large, we can still learn a lot from huge batches of data.
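To make the statistic concrete, here is a minimal sketch of how a noise scale could be estimated from per-example gradients. The text above does not spell out a formula, so the definition used here is an assumption: the sum of per-component gradient variances divided by the squared norm of the mean gradient. The function name `simple_noise_scale` and its input layout are hypothetical, chosen only for illustration.

```python
import numpy as np


def simple_noise_scale(per_example_grads):
    """Estimate a gradient noise scale from per-example gradients.

    ASSUMPTION (not stated in the text): noise scale = tr(Sigma) / |G|^2,
    where Sigma is the covariance of per-example gradients and G is the
    true (mean) gradient.

    per_example_grads: array of shape (n_examples, n_parameters).
    """
    g = np.asarray(per_example_grads, dtype=np.float64)
    n = g.shape[0]

    mean_grad = g.mean(axis=0)

    # Trace of the covariance = sum of per-component sample variances.
    trace_cov = g.var(axis=0, ddof=1).sum()

    # |mean_grad|^2 overestimates |G|^2 by tr(Sigma) / n on average,
    # so subtract that bias before dividing.
    grad_sq = mean_grad @ mean_grad - trace_cov / n

    return trace_cov / grad_sq


if __name__ == "__main__":
    # Synthetic check: true gradient of norm 1, noise std 2 in 10 dimensions.
    rng = np.random.default_rng(0)
    true_grad = np.zeros(10)
    true_grad[0] = 1.0
    grads = true_grad + 2.0 * rng.standard_normal((100_000, 10))
    print(simple_noise_scale(grads))
```

In the synthetic example the per-component variance is 4 across 10 dimensions and the true gradient has unit norm, so the estimate should land near 40. With real networks the ratio changes over the course of training, which is why the text qualifies it as measured "at a given stage in training".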

This type of statistic is widely used for sample size selection and has been proposed for use in deep learning, but has not been measured or applied systematically for modern training runs. We verified that the noise scale predicts the maximum useful batch size for a wide range of machine learning tasks shown in the figure above, including image recognition, language modeling, Atari games, and Dota. Specifically, we did training runs at a wide range of batch sizes (tuning the learning rate separately for each) for all of these tasks and compared the speedups in training to what the noise scale predicts should happen. Since large batch sizes often require careful and expensive tuning or special learning rate schedules to be effective, knowing an upper limit ahead of time provides a significant practical advantage in training new models.

We’ve found it helpful to visualize the results of these experiments as a tradeoff between the wall-clock time for training and the total compute used for training (which is proportional to dollar cost). At very small batch sizes, doubling the batch allows us to train in half the time without using extra compute (we run twice as many chips for half as long). At very large batch sizes, more parallelization doesn’t lead to faster training. There is a “bend” in the curve in the middle, and the gradient noise scale predicts where that bend occurs.
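The shape of this tradeoff can be illustrated with a toy model. The sketch below assumes a specific functional form that is not given in the text: the number of optimization steps scales as `1 + noise_scale / batch_size`, and total compute is steps times batch size. This form reproduces both limits described above, but the function `training_cost` and its parameters are hypothetical and meant only as an illustration.

```python
def training_cost(batch_size, noise_scale, min_steps=1.0):
    """Toy model of the time/compute tradeoff.

    ASSUMPTION (not stated in the text): steps needed to reach a fixed
    level of performance = min_steps * (1 + noise_scale / batch_size).

    Returns (steps, examples): steps is a proxy for wall-clock time,
    examples (steps * batch_size) is a proxy for total compute.
    """
    steps = min_steps * (1.0 + noise_scale / batch_size)
    examples = steps * batch_size
    return steps, examples


if __name__ == "__main__":
    noise_scale = 1000.0
    for batch_size in [1, 10, 100, 1_000, 10_000, 100_000]:
        steps, examples = training_cost(batch_size, noise_scale)
        print(f"batch={batch_size:>7}  steps={steps:10.3f}  examples={examples:12.1f}")
```

Under this assumed form, doubling a batch far below the noise scale roughly halves the step count at nearly constant compute, while doubling a batch far above it barely changes the step count and roughly doubles compute. The bend sits where the batch size is comparable to the noise scale.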

