MADDPG extends a reinforcement learning algorithm called DDPG, taking inspiration from actor-critic reinforcement learning techniques; other groups are exploring variations and parallel implementations of these ideas.
We treat each agent in our simulation as an “actor”, and each actor gets advice from a “critic” that helps the actor decide what actions to reinforce during training. Traditionally, the critic tries to predict the value (i.e. the reward we expect to get in the future) of an action in a particular state, which is used by the agent—the actor—to update its policy. This is more stable than directly using the reward, which can vary considerably. To make it feasible to train multiple agents that can act in a globally-coordinated way, we enhance our critics so they can access the observations and actions of all the agents, as the following diagram shows.
Our agents don’t need to access the central critic at test time; they act based on their observations in combination with their predictions of other agents behaviors’. Since a centralized critic is learned independently for each agent, our approach can also be used to model arbitrary reward structures between agents, including adversarial cases where the rewards are opposing.
We tested our approach on a variety of tasks and it performed better than DDPG on all of them. In the above animations you can see, from left to right: two AI agents trying to go to a specific location and learning to split up to hide their intended location from the opposing agent; one agent communicating the name of a landmark to another agent; and three agents coordinating to travel to landmarks without bumping into each other.
Where traditional RL struggles
Traditional decentralized RL approaches—DDPG, actor-critic learning, deep Q-learning, and so on—struggle to learn in multiagent environments, as at every time step each agent will be trying to learn to predict the actions of other agents while also taking its own actions. This is especially true in competitive situations. MADDPG employs a centralized critic to supply agents with information about their peers’ observations and potential actions, transforming an unpredictable environment into a predictable one.
Using policy gradient methods presents further challenges: because these exhibit high variance learning the right policy is difficult to do when the reward is inconsistent. We also found that adding in a critic, while improving stability, still failed to solve several of our environments such as cooperative communication. It seems that considering the actions of others during training is important for learning collaborative strategies.
Before we developed MADDPG, when using decentralized techniques, we noticed that listener agents would often learn to ignore the speaker if it sent inconsistent messages about where to go to. The agent would then set all the weights associated with the speaker’s message to 0, effectively deafening itself. Once this happens, it’s hard for training to recover, since the speaker will never know if it says the right thing due to the absence of any feedback. To fix this, we looked at a technique outlined in a recent hierarchical reinforcement project, which lets us force the listener to incorporate the utterances of the speaker in its decision-making process. This fix didn’t work, because though it forces the listener to pay attention to the speaker, it doesn’t help the speaker figure out what to say that is relevant. Our centralized critic method helps deal with these challenges, by helping the speaker to learn which utterances might be relevant to the actions of other agents. For more of our results, you can watch the following video:
Agent modeling has a rich history within artificial intelligence research and many of these scenarios have been studied before. Lots of previous research considered games with only a small number of time steps with a small state space. Deep learning lets us deal with complex visual inputs and RL gives us tools to learn behaviors over long time periods. Now that we can use these capabilities to train multiple-agents at once without them needing to know the dynamics of the environment (how the environment changes at each time-step), we can tackle a wider range of problems involving communication and language while learning from environments’ high-dimensional information. If you’re interesting in exploring different approaches to evolving agents then consider joining OpenAI.