2018-04-18

EPG consists of two optimization loops. In the inner loop, an agent learns, from scratch, to solve a particular task sampled from a family of tasks. For example, the family might be “move gripper to target location [x, y]”, and one particular task in that family could be “move gripper to position [50, 100]”. The inner loop uses stochastic gradient descent (SGD) to optimize the agent’s policy against a loss function proposed by the outer loop. The outer loop evaluates the returns achieved after inner-loop learning and uses Evolution Strategies (ES) to adjust the parameters of the loss function, proposing a new loss that will lead to higher returns.
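
The two loops can be sketched on a toy problem. The code below is a minimal illustration under strong assumptions, not the EPG implementation: the "policy" is just a commanded 2D position `theta`, the learned loss is a hypothetical two-coefficient quadratic with parameters `phi` that is handed the target directly, plain gradient descent stands in for SGD, and the ES update is a basic variant with normalized returns. All function names and hyperparameters are made up for this example.

```python
import numpy as np


def sample_tasks(rng, n_tasks):
    # Tasks from the family "move gripper to target location [x, y]".
    return rng.uniform(-1.0, 1.0, size=(n_tasks, 2))


def inner_loop(phi, targets, steps=20, lr=0.1):
    # One agent per task, each starting from scratch (theta = 0) and running
    # gradient descent on the proposed loss (a toy assumption, not EPG's loss):
    #   L_phi(theta) = phi[0] * ||theta - target||^2 + phi[1] * ||theta||^2
    theta = np.zeros_like(targets)
    for _ in range(steps):
        grad = 2.0 * phi[0] * (theta - targets) + 2.0 * phi[1] * theta
        theta = theta - lr * grad
    # Return after learning: negative squared distance to the target,
    # averaged over the tasks in the batch.
    return float(-np.mean(np.sum((theta - targets) ** 2, axis=1)))


def evolve_loss(generations=80, population=32, sigma=0.1, outer_lr=0.1, seed=0):
    # Outer loop: ES over the loss parameters phi.
    rng = np.random.default_rng(seed)
    phi = np.zeros(2)
    for _ in range(generations):
        targets = sample_tasks(rng, n_tasks=4)
        noise = rng.standard_normal((population, 2))
        # Each perturbed loss is scored by the return reached AFTER inner-loop
        # learning, not by the loss value itself.
        returns = np.array(
            [inner_loop(phi + sigma * eps, targets) for eps in noise]
        )
        # Normalize returns so the step size does not depend on their scale.
        advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
        # Move phi toward perturbations that produced higher returns.
        phi = phi + outer_lr * (noise.T @ advantages) / population
    return phi


if __name__ == "__main__":
    learned_phi = evolve_loss()
    test_targets = sample_tasks(np.random.default_rng(123), n_tasks=64)
    print("evolved loss parameters:", learned_phi)
    print("return with evolved loss:", inner_loop(learned_phi, test_targets))
    print("return with initial loss:", inner_loop(np.zeros(2), test_targets))
```

In this sketch the outer loop never differentiates through the inner loop. It only needs the final return of each perturbed loss, which is what makes an evolutionary method a natural fit for optimizing the loss parameters.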

Having a learned loss offers several advantages over current RL methods. Using ES to evolve the loss function allows us to optimize the true objective (final trained policy performance) rather than short-term returns. In addition, EPG improves on standard RL algorithms by allowing the loss function to adapt to the environment and the agent’s history.