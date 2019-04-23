One way to reduce this is by recomputing the attention matrix from checkpoints during backpropagation, a well-established technique in deep learning for reducing memory usage at the cost of more computation. When done for the attention matrix in Transformers, it means the largest memory cost becomes independent of the number of layers, letting us train networks with substantially greater depth than possible previously. In practice, we found that Transformers with depth up to 128 layers outperformed shallower networks on benchmark tasks like CIFAR-10.
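
To make the recomputation idea concrete, here is a minimal sketch in plain Python. It is a toy, not the code used for our models: tokens are single scalars, each layer has one scalar weight `w`, the loss is the sum of the final outputs, and every function name is hypothetical. It compares a backward pass that keeps every layer's attention matrix with one that keeps only each layer's input as a checkpoint and rebuilds the matrix when it is needed.

```python
import math

def attention_matrix(x, w):
    """Row-wise softmax of scores S[i][j] = w * x[i] * x[j] (an n x n matrix)."""
    rows = []
    for xi in x:
        scores = [w * xi * xj for xj in x]
        m = max(scores)                      # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows

def layer_forward(x, w):
    """Toy residual attention layer: out[i] = x[i] + sum_j A[i][j] * x[j]."""
    A = attention_matrix(x, w)
    out = [xi + sum(a * xj for a, xj in zip(row, x)) for xi, row in zip(x, A)]
    return out, A

def layer_backward(x, w, A, g):
    """Given g = dL/dout, return (dL/dx, dL/dw) for one layer."""
    n = len(x)
    dx = list(g)                             # residual path
    dw = 0.0
    for i in range(n):
        dA = [g[i] * x[j] for j in range(n)]
        dot = sum(A[i][j] * dA[j] for j in range(n))
        for j in range(n):
            dx[j] += A[i][j] * g[i]          # value path
            dS = A[i][j] * (dA[j] - dot)     # softmax backward
            dw += dS * x[i] * x[j]
            dx[i] += dS * w * x[j]           # query path
            dx[j] += dS * w * x[i]           # key path
    return dx, dw

def grads_store_all(x0, ws):
    """Baseline: keep every layer's attention matrix until the backward pass."""
    xs, As = [x0], []
    for w in ws:
        out, A = layer_forward(xs[-1], w)
        xs.append(out)
        As.append(A)
    g = [1.0] * len(x0)                      # loss = sum of final outputs
    dws = [0.0] * len(ws)
    for l in reversed(range(len(ws))):
        g, dws[l] = layer_backward(xs[l], ws[l], As[l], g)
    return dws

def grads_recompute(x0, ws):
    """Keep only layer inputs (checkpoints); rebuild each attention matrix on demand."""
    xs = [x0]
    for w in ws:
        out, _ = layer_forward(xs[-1], w)    # attention matrix is discarded here
        xs.append(out)
    g = [1.0] * len(x0)
    dws = [0.0] * len(ws)
    for l in reversed(range(len(ws))):
        A = attention_matrix(xs[l], ws[l])   # recomputed from the checkpoint
        g, dws[l] = layer_backward(xs[l], ws[l], A, g)
    return dws

def attention_entries_held(n, num_layers, recompute):
    """Peak number of attention-matrix entries kept alive for the backward pass."""
    return n * n if recompute else num_layers * n * n
```

Both functions return the same gradients; the second simply pays for one extra `attention_matrix` call per layer. In this toy, the stored attention entries drop from `num_layers * n * n` to `n * n`, so that term no longer grows with depth. The checkpoints themselves still grow with depth, but only by `n` values per layer.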

To train these deeper models, we made several adjustments to the ordering of operations in the Transformer and modified the initialization scheme. Full details are in our paper.

