Date: 2018-03-07

As an alternative to the last step, we can treat $\Phi - W$ as a gradient and plug it into a more sophisticated optimizer like Adam.
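To make the Adam variant concrete, here is a minimal NumPy sketch under stated assumptions. The toy task family (quadratic losses with randomly drawn centers), the hand-written `Adam` class, and the names `inner_sgd` and `reptile_adam` are all illustrative and not taken from the released code. The only point is that the difference between the initialization and the adapted weights is handed to the optimizer as if it were a gradient.

```python
import numpy as np


def inner_sgd(phi, center, k=5, inner_lr=0.1):
    """Run k SGD steps on the toy task loss 0.5 * ||w - center||^2, starting from phi."""
    w = phi.copy()
    for _ in range(k):
        grad = w - center          # gradient of the toy quadratic loss
        w = w - inner_lr * grad
    return w


class Adam:
    """Plain Adam update rule, written out by hand to keep the sketch self-contained."""

    def __init__(self, shape, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(shape)   # first-moment estimate
        self.v = np.zeros(shape)   # second-moment estimate
        self.t = 0                 # step counter for bias correction

    def step(self, params, grad):
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        return params - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)


def reptile_adam(num_iters=2000, seed=0):
    """Outer loop: sample a task, adapt with SGD, feed (phi - W) to Adam."""
    rng = np.random.default_rng(seed)
    task_mean = np.array([1.0, -2.0])   # hypothetical center of the task distribution
    phi = np.zeros(2)                   # initialization being meta-learned
    opt = Adam(phi.shape, lr=0.01)
    for _ in range(num_iters):
        center = task_mean + 0.1 * rng.standard_normal(2)  # sample a task
        w = inner_sgd(phi, center)                         # adapted weights W
        pseudo_grad = phi - w                              # treated as a gradient
        phi = opt.step(phi, pseudo_grad)
    return phi


if __name__ == "__main__":
    print(reptile_adam())
```

On this toy problem the learned initialization should drift toward the mean of the task centers, since that is the point from which a few SGD steps reach any sampled task most easily.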

It is at first surprising that this method works at all. If $k=1$, this algorithm would correspond to “joint training”, that is, performing SGD on the mixture of all tasks. While joint training can learn a useful initialization in some cases, it learns very little when zero-shot learning is not possible (e.g. when the output labels are randomly permuted). Reptile requires $k>1$, where the update depends on the higher-order derivatives of the loss function; as we show in the paper, this behaves very differently from $k=1$ (joint training).
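The difference between one and several inner steps can be checked numerically. The sketch below is illustrative only: it assumes a linear-regression task with a squared-error loss, and the helper names `minibatch_grad` and `reptile_direction` are hypothetical. With one inner step, the rescaled Reptile direction is exactly the minibatch gradient, as in joint training. With two steps, an extra term involving the Hessian of the second minibatch appears.

```python
import numpy as np


def minibatch_grad(w, X, y):
    """Gradient of the loss 0.5 * mean((X @ w - y)**2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)


def reptile_direction(phi, batches, inner_lr):
    """Take one SGD step per minibatch, then return (phi - W) / inner_lr."""
    w = phi.copy()
    for X, y in batches:
        w = w - inner_lr * minibatch_grad(w, X, y)
    return (phi - w) / inner_lr


rng = np.random.default_rng(0)
inner_lr = 0.1
phi = rng.standard_normal(3)
# Two minibatches drawn from the same (toy) task.
batches = [(rng.standard_normal((8, 3)), rng.standard_normal(8)) for _ in range(2)]

# Gradients of each minibatch evaluated at the initialization phi.
g1 = minibatch_grad(phi, *batches[0])
g2 = minibatch_grad(phi, *batches[1])

# One inner step: the direction is just the first minibatch gradient.
d1 = reptile_direction(phi, batches[:1], inner_lr)

# Two inner steps: the second gradient is evaluated at the moved weights,
# which introduces a term with the Hessian of the second minibatch.
d2 = reptile_direction(phi, batches, inner_lr)
X2, y2 = batches[1]
H2 = X2.T @ X2 / len(y2)                  # Hessian of the second minibatch loss
second_order_term = -inner_lr * H2 @ g1   # exact here because the loss is quadratic

print("k=1 matches plain gradient:", np.allclose(d1, g1))
print("k=2 matches g1 + g2 + Hessian term:", np.allclose(d2, g1 + g2 + second_order_term))
```

Because the loss is quadratic, the expansion is exact in this example. For a general loss the same term is only the leading correction in the inner step size.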

To analyze why Reptile works, we approximate the update using a Taylor series. We show that the Reptile update maximizes the inner product between gradients of different minibatches from the same task, corresponding to improved generalization. This finding may have implications outside of the meta-learning setting for explaining the generalization properties of SGD. Our analysis suggests that Reptile and MAML perform a very similar update, including the same two terms with different weights.
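As a sketch of that argument for the simplest case of two inner steps, write $\bar g_i$ and $\bar H_i$ for the gradient and Hessian of minibatch $i$ evaluated at the initialization $\Phi$, and $\alpha$ for the inner step size. This notation is assumed here for illustration, and the full derivation is in the paper. Expanding the second gradient around $\Phi$ gives

$$
\begin{aligned}
g_{\text{Reptile}} &= g_1 + g_2 = \bar g_1 + \bar g_2 - \alpha \bar H_2 \bar g_1 + O(\alpha^2), \\
\mathbb{E}\left[\bar H_2 \bar g_1\right] &= \tfrac{1}{2}\, \mathbb{E}\left[\frac{\partial}{\partial \Phi}\left(\bar g_1 \cdot \bar g_2\right)\right].
\end{aligned}
$$

The second line holds because the two minibatches are exchangeable in expectation. Stepping along $-\bar H_2 \bar g_1$ therefore increases the inner product $\bar g_1 \cdot \bar g_2$ on average. Writing $\text{AvgGrad} = \mathbb{E}[\bar g_1]$ and $\text{AvgGradInner} = \mathbb{E}[\bar H_2 \bar g_1]$, the expected updates in this two-step case are, to leading order,

$$
\begin{aligned}
\mathbb{E}\left[g_{\text{Reptile}}\right] &= 2\,\text{AvgGrad} - \alpha\,\text{AvgGradInner} + O(\alpha^2), \\
\mathbb{E}\left[g_{\text{MAML}}\right] &= \text{AvgGrad} - 2\alpha\,\text{AvgGradInner} + O(\alpha^2),
\end{aligned}
$$

which is the sense in which the two methods share the same two terms with different weights.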

In our experiments, we show that Reptile and MAML yield similar performance on the Omniglot and Mini-ImageNet benchmarks for few-shot classification. Reptile also converges to the solution faster, since the update has lower variance.