Universe allows an AI agent to use a computer like a human does: by looking at screen pixels and operating a virtual keyboard and mouse. We must train AI systems on the full range of tasks we expect them to solve, and Universe lets us train a single agent on any task a human can complete with a computer.
In April, we launched Gym, a toolkit for developing and comparing reinforcement learning (RL) algorithms. With Universe, any program can be turned into a Gym environment. Universe works by automatically launching the program behind a VNC remote desktop—it doesn’t need special access to program internals, source code, or bot APIs.
Today’s release consists of a thousand environments including Flash games, browser tasks, and games like slither.io and GTA V. Hundreds of these are ready for reinforcement learning, and almost all can be freely run with the universe Python library as follows:
import gym import universe # register Universe environments into Gym env = gym.make('flashgames.DuskDrive-v0') # any Universe environment ID here observation_n = env.reset() while True: # agent which presses the Up arrow 60 times per second action_n = [[('KeyEvent', 'ArrowUp', True)] for _ in observation_n] observation_n, reward_n, done_n, info = env.step(action_n) env.render()
Our goal is to develop a single AI agent that can flexibly apply its past experience on Universe environments to quickly master unfamiliar, difficult environments, which would be a major step towards general intelligence. There are many ways to help: giving us permission on your games, training agents across Universe tasks, (soon) integrating new games, or (soon) playing the games.
With support from EA, Microsoft Studios, Valve, Wolfram, and many others, we’ve already secured permission for Universe AI agents to freely access games and applications such as Portal, Fable Anniversary, World of Goo, RimWorld, Slime Rancher, Shovel Knight, SpaceChem, Wing Commander III, Command & Conquer: Red Alert 2, Syndicate, Magic Carpet, Mirror’s Edge, Sid Meier’s Alpha Centauri, and Wolfram Mathematica. We look forward to integrating these and many more.
The area of artificial intelligence has seen rapid progress over the last few years. Computers can now see, hear, and translate languages with unprecedented accuracies. They are also learning to generate images, sound, and text. A reinforcement learning system, AlphaGo, defeated the world champion at Go. However, despite all of these advances, the systems we’re building still fall into the category of “Narrow AI”—they can achieve super-human performance in a specific domain, but lack the ability to do anything sensible outside of it. For instance, AlphaGo can easily defeat you at Go, but you can’t explain the rules of a different board game to it and expect it to play with you.
Systems with general problem solving ability—something akin to human common sense, allowing an agent to rapidly solve a new hard task—remain out of reach. One apparent challenge is that our agents don’t carry their experience along with them to new tasks. In a standard training regime, we initialize agents from scratch and let them twitch randomly through tens of millions of trials as they learn to repeat actions that happen to lead to rewarding outcomes. If we are to make progress towards generally intelligent agents, we must allow them to experience a wide repertoire of tasks so they can develop world knowledge and problem solving strategies that can be efficiently reused in a new task.
Universe exposes a wide range of environments through a common interface: the agent operates a remote desktop by observing pixels of a screen and producing keyboard and mouse commands. The environment exposes a VNC server and the universe library turns the agent into a VNC client.
Our design goal for universe was to support a single Python process driving 20 environments in parallel at 60 frames per second. Each screen buffer is 1024x768, so naively reading each frame from an external process would take 3GB/s of memory bandwidth. We wrote a batch-oriented VNC client in Go, which is loaded as a shared library in Python and incrementally updates a pair of buffers for each environment. After experimenting with many combinations of VNC servers, encodings, and undocumented protocol options, we now routinely drive dozens of environments at 60 frames per second with 100ms latency—almost all due to server-side encoding.
Here are some important properties of our current implementation:
General. An agent can use this interface (which was originally designed for humans) to interact with any existing computer program without requiring an emulator or access to the program’s internals. For instance, it can play any computer game, interact with a terminal, browse the web, design buildings in CAD software, operate a photo editing program, or edit a spreadsheet.
Familiar to humans. Since people are already well versed with the interface of pixels/keyboard/mouse, humans can easily operate any of our environments. We can use human performance as a meaningful baseline, and record human demonstrations by simply saving VNC traffic. We’ve found demonstrations to be extremely useful in initializing agents with sensible policies with behavioral cloning (i.e. use supervised learning to mimic what the human does), before switching to RL to optimize for the given reward function.
Easy to debug. We can observe our agent while it is training or being evaluated—we just attach a VNC client to the environment’s (shared) VNC desktop. We can also save the VNC traffic for future analysis.
We were all quite surprised that we could make VNC work so well. As we scale to larger games, there’s a decent chance we’ll start using additional backend technologies. But preliminary signs indicate we can push the existing implementation far: with the right settings, our client can coax GTA V to run at 20 frames per second over the public internet.
We have already integrated a large number of environments into Universe, and view these as just the start. Each environment is packaged as a Docker image and hosts two servers that communicate with the outside world: the VNC server which sends pixels and receives keyboard/mouse commands, and a WebSocket server which sends the reward signal for reinforcement learning tasks (as well as any auxiliary information such as text, or diagnostics) and accepts control messages (such as the specific environment ID to run).
Universe includes the Atari 2600 games from the Arcade Learning Environment. These environments now run asynchronously inside the quay.io/openai/universe.gym-core Docker image and allow the agent to connect over the network, which means the agent must handle lag and low frame rates. Running over a local network in the cloud, we usually see 60 frames per second, observation lags of 20ms, and action lags of 10ms; over the public internet this drops to 20 frames per second, 80ms observation lags, and 30ms action lags.
We turned to Flash games as a starting point for scaling Universe—they are pervasive on the Internet, generally feature richer graphics than Atari, but are still individually simple. We’ve sifted through over 30,000 so far, and estimate there’s an order of magnitude more.
Our initial Universe release includes 1,000 Flash games (100 with reward functions), which we distribute in the quay.io/openai/universe.flashgames Docker image with consent from the rightsholders. This image starts a TigerVNC server and boots a Python control server, which uses Selenium to open a Chrome browser to an in-container page with the desired game, and automatically clicks through any menus needed to start the game.
Extracting rewards. While environments without reward functions can be used for unsupervised learning or to generate human demonstrations, RL needs a reward function. Unlike with the Atari games, we can’t simply read out success criteria from the process memory, as there is too much variation in how each game stores this information. Fortunately, many games have an on-screen score which we can use as a reward function, as long as we can parse it. While off-the-shelf OCR such as Tesseract performs great on standard fonts with clean backgrounds, it struggles with the diverse fonts, moving backgrounds, flashy animations, or occluding objects common in many games. We developed a convolutional neural network-based OCR model that runs inside the Docker container’s Python controller, parses the score (from a screen buffer maintained via a VNC self-loop), and communicates it over the WebSocket channel to the agent.
Humanity has collectively built the Internet into an immense treasure trove of information, designed for visual consumption by humans. Universe includes browser-based environments which require AI agents to read, navigate, and use the web just like people—using pixels, keyboard, and mouse.
Today, our agents are mostly learning to interact with common user interface elements like buttons, lists and sliders, but in the future they could complete complex tasks, such as looking up things they don’t know on the internet, managing your email or calendar, completing Khan Academy lessons, or working on Amazon Mechanical Turk and CrowdFlower tasks.
Mini World of Bits. We first set out to create a new benchmark that captures the salient challenges of browser interactions in a simple setting. We call this benchmark Mini World of Bits. We think of it as an analogue to MNIST, and believe that mastering these environments provides valuable signal towards models and training techniques that will perform well on full websites and more complex tasks. Our initial Mini World of Bits benchmark consists of 80 environments that range from simple (e.g. click a specific button) to difficult (e.g. reply to a contact in a simulated email client).
Real-world browser tasks. We’ve begun work on more realistic browser tasks. The agent takes an instruction, and performs a sequence of actions on a website. One such environment hands the agent details of a desired flight booking, and then requires it to manipulate a user interface to search for the flight. (We use cached recordings of these sites to avoid spamming them, or booking lots of real flights.)
This infrastructure is general-purpose: we can integrate any game, website, or application which can run in a Docker container (most convenient) or a Windows virtual machine (less convenient). We’d like the community’s help to continue extending the breadth of Universe environments, including completing the integrations of our partners’ games, Android apps (emulators can run inside of Docker), fold.it, Unity games, HTML5 games, online educational games, and really anything else people think of.
Microsoft’s Project Malmo team will be integrating with Universe, and we look forward to supporting other AI frameworks as well.
Running an environment
$ git clone https://github.com/openai/universe && pip install -e universe
We package each collection of similar environments into a “runtime”, which is a server exposing two ports: 5900 (used for the VNC protocol to exchange pixels/keyboard/mouse) and 15900 (used for a WebSocket control protocol). For example, the quay.io/openai/universe.flashgames Docker image is a runtime that can serve many different Flash game environments.
Starting a runtime. You can boot your first runtime from the console as follows:
# -p 5900:5900 and -p 15900:15900 expose the VNC and WebSocket ports # --privileged/--cap-add/--ipc=host needed to make Selenium work $ docker run --privileged --cap-add=SYS_ADMIN --ipc=host \ -p 5900:5900 -p 15900:15900 quay.io/openai/universe.flashgames
This will download and run the Flash games Docker container. You can view and control the remote desktop by connecting your own VNC viewer to port 5900, such as via TurboVNC or the browser-based VNC client served via the webserver on port 15900. The default password is openai. OSX also has a native VNC viewer which can be accessed by running open vnc://localhost:5900 in Terminal. (Unfortunately, the OSX viewer doesn’t implement Tight encoding, which is the best option for bigger games.)
Writing your own agent. You can write your own agent quite easily, using your favorite framework such as TensorFlow or Theano. (We’ve provided a starter TensorFlow agent.) At each time step, the agent’s observation includes a NumPy pixel array, and the agent must emit a list of VNC events (mouse/keyboard actions). For example, the following agent will activate Dusk Drive and press forward constantly:
import gym import universe # register Universe environments into Gym env = gym.make('flashgames.DuskDrive-v0') # any Universe [environment ID](https://github.com/openai/universe/blob/master/universe/__init__.py#L297) here # If using docker-machine, replace "localhost" with your Docker IP env.configure(remotes="vnc://localhost:5900+15900") observation_n = env.reset() while True: # agent which presses the Up arrow 60 times per second action_n = [[('KeyEvent', 'ArrowUp', True)] for _ in observation_n] observation_n, reward_n, done_n, info = env.step(action_n) env.render()
You can keep your own VNC connection open, and watch the agent play, or even use the keyboard and mouse alongside the agent in a human/agent co-op mode.
Environment management. Because environments run as server processes, they can run on remote machines, possibly within a cluster or even over the public internet. We’ve documented a few ways to manage remote runtimes. At OpenAI, we use an “allocator” HTTP service, which provisions runtimes across a Kubernetes cluster on demand, and which we can use to connect a single agent process to hundreds of simultaneous environments.
Validating the Universe infrastructure
Universe agents must deal with real-world griminess that traditional RL agents are shielded from: agents must run in real-time and account for fluctuating action and observation lag. While the full complexity of Universe is designed to be out of reach of current techniques, we also have ensured it’s possible to make progress today.
Universe Pong. Our first goal was solving gym-core.PongDeterministic-v3. Pong is one of the easiest Atari games, but it had the potential to be intractable as a Universe task, since the agent has to learn to perform very precise maneuvers at 4x realtime (as the environment uses a standard frameskip of 4). We used this environment to validate that Universe’s variable latencies still allowed for learning precise and rapid reactions. Today’s release includes universe-starter-agent, which takes an hour to train to a Pong score of +17 out of 21. Humans playing the same version of Pong were only able reach a score of -11 on a scale between -21 and 21, due to the game’s high speed.
Additional experiments. We applied RL to several racing Flash games, which worked after applying some standard tricks such as reward normalization. Some browser tasks where we tried RL had difficult exploration problems, but were solvable with behavioral cloning from human demonstration data.
Some of our successful agents are shown below. While solving Universe will require an agent far outside the reach of current techniques, these videos show that many interesting Universe environments can be fruitfully approached with today’s algorithms.
Remote training. We’ve also experimented with slither.io agents on our physical infrastructure (which has access to Titan X GPUs) and environments in the cloud. Generally, the agent will control 32 simultaneous environments at 5 frames per second—observations are available much more frequently, but lower framerates help today’s RL algorithms, which struggle with dependencies over many timesteps.
Our agent’s “reaction time” averages around 150ms over the public internet: 110ms for an observation to arrive, 10ms to compute the action, and 30ms for the action to take effect. (For comparison, human reaction time averages around 250ms.) Reaction times drop to 80ms over a local network, and 40ms within a single machine.
Research progress requires meaningful performance measurement. In upcoming weeks, we’ll release a transfer learning benchmark, allowing researchers to determine if they are making progress on general problem solving ability.
Universe draws inspiration from the history of the ImageNet dataset in the Computer Vision community. Fei-Fei Li and her collaborators deliberately designed the ImageNet benchmark to be nearly impossible, but error rates have dropped rapidly from 28% in 2010 to 3% in 2016, which reaches (or in some cases even surpasses) human-level performance.
If the AI community does the same with Universe, then we will have made real progress towards systems with broad, general intelligence.
Help us improve Universe
Universe will only succeed with the community’s help. There are many ways to contribute (and one particularly great way is to join us):
Grant us permission to use your game, program, website, or app.
If your program would yield good training tasks for an AI then we’d love your permission to package it in Universe. Good candidates have an on-screen number (such as a game score) which can be parsed as a reward, or well-defined objectives, either natively or definable by the user.
Train agents on Universe tasks.
AI advances require the entire community to collaborate, and we welcome the community’s help in training agents across these tasks. We’ve released a starter agent which should be a helpful starting point for building your own agents. In upcoming weeks, we’ll release the sub-benchmarks we think are the right places to start.
Integrate new environments. (Coming soon)
We have many more environments waiting to be integrated than we can handle on our own. In upcoming weeks, we’ll release our environment integration tools, so anyone can contribute new environment integrations. In the meanwhile, we’ll be running a beta for environment integrators.
Contribute demonstrations. (Coming soon)
We’re compiling a large dataset of human demonstrations on Universe environments, which will be released publicly. If you’d like to play games for the good of science, please sign up for our beta.