Skip to main content

October 31, 2018

Reinforcement learning with prediction-based rewards

We’ve developed Random Network Distillation (RND), a prediction-based method for encouraging reinforcement learning agents to explore their environments through curiosity, which for the first time exceeds average human performance on Montezuma’s Revenge(opens in a new window).

We’ve developed Random Network Distillation (RND), a prediction-based method for encouraging reinforcement learning agents to explore their environments through curiosity, which for the first timeA exceeds average human performance on Montezuma’s Revenge(opens in a new window). RND achieves state-of-the-art performance, periodically finds all 24 rooms and solves the first level without using demonstrations or having access to the underlying state of the game.

RND incentivizes visiting unfamiliar states by measuring how hard it is to predict the output of a fixed random neural network on visited states. In unfamiliar states it’s hard to guess the output, and hence the reward is high. It can be applied to any reinforcement learning algorithm, is simple to implement and efficient to scale. Below we release a reference implementation of RND that can reproduce the results from our paper.

Progress in Montezuma’s Revenge

For an agent to achieve a desired goal it must first explore what is possible in its environment and what constitutes progress towards the goal. Many games’ reward signals provide a curriculum such that even simple exploration strategies are sufficient for achieving the game’s goal. In the seminal work introducing DQN(opens in a new window), Montezuma’s Revenge was the only game where DQN got 0% of the average human score (4.7K). Simple exploration strategies are highly unlikely to gather any rewards, or see more than a few of the 24 rooms in the level. Since then advances in Montezuma’s Revenge have been seen by many as synonymous with advances in exploration.

Dot graph showing progress in Montezuma's Revenge

Significant progress was made in 2016(opens in a new window) by combining DQN with a count-based exploration bonus, resulting in an agent that explored 15 rooms, achieved a high score of 6.6K and an average reward of around 3.7K. Since then, significant(opens in a new window) improvement(opens in a new window) in the score achieved by an RL agent has come only from exploiting access to demonstrations(opens in a new window) from human experts(opens in a new window), or access to the underlying state of the emulator(opens in a new window).

We ran a large scale RND experiment with 1024 rollout workers resulting in a mean return of 10K over 9 runs and a best mean return of 14.5K. Each run discovered between 20 and 22 rooms. In addition one of our smaller scale but longer running experiments yielded one run (out of 10) that achieved a best return of 17.5K corresponding to passing the first level and finding all 24 rooms. The graph below compares these two experiments showing the mean return as a function of parameter updates.

Graph showing mean episodic return over parameter updates

The visualization below shows the progress of the smaller scale experiment in discovering the rooms. Curiosity drives the agent to discover new rooms and find ways of increasing the in-game score, and this extrinsic reward drives it to revisit those rooms later in the training.


Large-scale study of curiosity-driven learning

Prior to developing RND, we, together with collaborators from UC Berkeley, investigated learning without any environment-specific rewards. Curiosity gives us an easier way to teach agents to interact with any environment, rather than via an extensively engineered task-specific reward function that we hope corresponds to solving a task. Projects like ALE(opens in a new window)Universe(opens in a new window)Malmo(opens in a new window)Gym(opens in a new window)Gym Retro(opens in a new window)Unity(opens in a new window)DeepMind Lab(opens in a new window)CommAI(opens in a new window) make a large number of simulated environments available for an agent to interact with through a standardized interface. An agent using a generic reward function not specific to the particulars of an environment can acquire a basic level of competency in a wide range of environments, resulting in the agent’s ability to determine what useful behaviors are even in the absence of carefully engineered rewards.

In standard reinforcement learning set-ups, at every discrete time-step the agent sends an action to the environment, and the environment responds by emitting the next observation, transition reward and an indicator of episode end. In our previous paper(opens in a new window) we require the environment to output(opens in a new window) only(opens in a new window) the next observation. There, the agent learns a next-state predictor model from its experience, and uses the error of the prediction as an intrinsic reward. As a result it is attracted to the unpredictable. For example, it will find a change in a game score to be rewarding only if the score is displayed on the screen and the change is hard to predict. The agent will typically find interactions with new objects rewarding, as the outcomes of such interactions are usually harder to predict than other aspects of the environment.

Similar to prior(opens in a new window) work(opens in a new window), we tried to avoid modeling all aspects of the environment, whether they are relevant or not, by choosing to model features of the observation. Surprisingly, we found that even random features worked well.

What do curious agents do?

We tested our agent across 50+ different environments and observed a range of competence levels from seemingly random actions to deliberately interacting with the environment. To our surprise, in some environments the agent achieved the game’s objective even though the game’s objective was not communicated to it through an extrinsic reward.


Breakout The agent experiences spikes of intrinsic reward when it sees a new configuration of bricks early on in training and when it passes the level for the first time after training for several hours.


Pong We trained the agent to control both paddles at the same time and it learned to keep the ball in play resulting in long rallies. Even when trained against the in-game AI, the agent tried to prolong the game rather than win.


Bowling The agent learned to play the game better than agents trained to maximize the (clipped) extrinsic reward directly. We think this is because the agent gets attracted to the difficult-to-predict flashing of the scoreboard occurring after the strikes.


Mario The intrinsic reward is particularly well-aligned with the game’s objective of advancing through the levels. The agent is rewarded for finding new areas because the details of a newly found area are impossible to predict. As a result the agent discovers 11 levels, finds secret rooms, and even defeats bosses.

The noisy-TV Problem

Like a gambler at a slot machine attracted to chance outcomes, the agent sometimes gets trapped by its curiosity as the result of the noisy-TV problem. The agent finds a source of randomness in the environment and keeps observing it, always experiencing a high intrinsic reward for such transitions. Watching a TV playing static noise is an example of such a trap. We demonstrate this literally by placing the agent in a Unity maze environment with a TV playing random channels.


While the noisy-TV problem is a concern in theory, for largely deterministic environments like Montezuma’s Revenge, we anticipated that curiosity would drive the agent to discover rooms and interact with objects. We tried several variants of next-state prediction based curiosity combining the exploration bonus with the score from the game.


In these experiments the agent controls the environment through a noisy controller that repeats the last action instead of the current one with some probability. This setup with sticky actions was suggested(opens in a new window) as a best practice for training agents on fully deterministic games like Atari to prevent memorization. Sticky actions make the transition from room to room unpredictable.

Random Network Distillation

Since next-state prediction is inherently susceptible to the noisy-TV problem, we identified the following relevant sources of prediction errors:

  • Factor 1: Prediction error is high where the predictor fails to generalize from previously seen examples. Novel experience then corresponds to high prediction error.

  • Factor 2: Prediction error is high because the prediction target is stochastic.

  • Factor 3: Prediction error is high because information necessary for the prediction is missing, or the model class of predictors is too limited to fit the complexity of the target function.

We determined Factor 1 is a useful source of error since it quantifies the novelty of experience, whereas Factors 2 and 3 cause the noisy-TV problem. To avoid Factors 2 and 3, we developed RND, a new exploration bonus that is based on predicting the output of a fixed and randomly initialized neural network on the next state, given the next state itself.

Nextstate Vs Rnd Stacked 5

The intuition is that predictive models have low error in states similar to the ones they have been trained on. In particular the agent’s predictions of the output of a randomly initialized neural network will be less accurate in novel states than in states the agent visited frequently. The advantage of using a synthetic prediction problem is that we can have it be deterministic (bypassing Factor 2) and inside the class of functions the predictor can represent (bypassing Factor 3) by choosing the predictor to be of the same architecture as the target network. These choices make RND immune to the noisy-TV problem.

We combine the exploration bonus with the extrinsic rewards through a variant of Proximal Policy Optimization(opens in a new window) (PPO(opens in a new window)) that uses two value heads for the two reward streams. This allows us to use different discount rates for the different rewards, and combine episodic and non-episodic returns. With this additional flexibility, our best agent often finds 22 out of the 24 rooms on the first level in Montezuma’s Revenge, and occasionally passes the first level after finding the remaining two rooms. The same method gets state-of-the-art performance on Venture and Gravitar.

Six graphs comparing game score of PPO and RND

The visualization of the RND bonus below shows a graph of the intrinsic reward over the course of an episode of Montezuma’s Revenge where the agent finds the torch for the first time.


Implementation matters

Big-picture considerations like susceptibility to the noisy-TV problem are important for the choice of a good exploration algorithm. However, we found that getting seemingly-small details right in our simple algorithm made the difference between an agent that never leaves the first room and an agent that can pass the first level. To add stability to the training, we avoided saturation of the features and brought the intrinsic rewards to a predictable range. We also noticed significant improvements in performance of RND every time we discovered and fixed a bug (our favorite one involved accidentally zeroing an array which resulted in extrinsic returns being treated as non-episodic; we realized this was the case only after being puzzled by the extrinsic value function looking suspiciously periodic). Getting such details right was a significant part of achieving high performance even with algorithms conceptually similar to prior work. This is one reason to prefer simpler algorithms where possible.

Future directions

We suggest the following paths forward for future research:

  • Analyze the benefits of different exploration methods and find novel ways of combining them.

  • Train a curious agent on many different environments without reward and investigate the transfer to target environments with rewards.

  • Investigate global exploration that involves coordinated decisions over long time horizons.

If you are interested in working on overcoming these challenges then apply to work with us!


  1. A

    There is an anonymous ICLR submission(opens in a new window) concurrent with our own work which exceeds human performance, though not to the same extent.


Yura Burda, Harri Edwards

Thanks to those who contributed to these papers and this blog post:

Large-Scale Study of Curiosity-Driven Learning(opens in a new window): Yuri Burda*, Harrison Edwards*, Deepak Pathak*, Amos Storkey, Trevor Darrell, Alexei A. Efros

Exploration by Random Network Distillation(opens in a new window): Yuri Burda*, Harrison Edwards*, Amos Storkey, Oleg Klimov

Equal contributions:

Blog post: Karl Cobbe, Alex Nichol, Joshua Achiam, Phillip Isola, Alex Ray, Jonas Schneider, Jack Clark, Greg Brockman, Ilya Sutskever, Ben Barry, Amos Storkey, Alexei Efros, Deepak Pathak, Trevor Darrell, Andrew Brock, Antreas Antoniou, Stanislaw Jastrzebski, Ashley Pilipiszyn, Justin Jay Wang