Humans solve complicated challenges by breaking them up into small, manageable components. Grilling pancakes consists of a series of high-level actions, such as measuring flour, whisking eggs, transferring the mixture to the pan, turning the stove on, and so on. Humans are able to learn new tasks rapidly by sequencing together these learned components, even though the task might take millions of low-level actions, i.e., individual muscle contractions.
On the other hand, today’s reinforcement learning methods operate through brute force search over low-level actions, requiring an enormous number of attempts to solve a new task. These methods become very inefficient at solving tasks that take a large number of timesteps.
Our solution is based on the idea of hierarchical reinforcement learning, where agents represent complicated behaviors as a short sequence of high-level actions. This lets our agents solve much harder tasks: while the solution might require 2000 low-level actions, the hierarchical policy turns this into a sequence of 10 high-level actions, and it’s much more efficient to search over the 10-step sequence than the 2000-step sequence.
Our algorithm, meta-learning shared hierarchies (MLSH), learns a hierarchical policy where a master policy switches between a set of sub-policies. The master selects an action every every N timesteps, where we might take N=200. A sub-policy executed for N timesteps constitutes a high-level action, and for our navigation tasks, sub-policies correspond to walking or crawling in different directions.
In most prior work, hierarchical policies have been explicitly hand-engineered. Instead, we aim to discover this hierarchical structure automatically through interaction with the environment. Taking a meta-learning perspective, we define a good hierarchy as one that quickly reaches high reward quickly when training on unseen tasks. Hence, the MLSH algorithm aims to learn sub-policies that enable fast learning on previously unseen tasks.
We train on a distribution over tasks, sharing the sub-policies while learning a new master policy on each sampled task. By repeatedly training new master policies, this process automatically finds sub-policies that accommodate the master policy’s learning dynamics.
In our AntMaze environment, a Mujoco Ant robot is placed into a distribution of 9 different mazes and must navigate from the starting position to the goal. Our algorithm is successfully able to find a diverse set of sub-policies that can be sequenced together to solve the maze tasks, solely through interaction with the environment. This set of sub-policies can then be used to master a larger task than the ones they were trained on (see video at beginning at post).