The internet contains an enormous amount of publicly available videos that we can learn from. You can watch a person make a gorgeous presentation, a digital artist draw a beautiful sunset, and a Minecraft player build an intricate house. However, these videos only provide a record of what happened but not precisely how it was achieved, i.e., you will not know the exact sequence of mouse movements and keys pressed. If we would like to build large-scale foundation models in these domains as we’ve done in language with GPT, this lack of action labels poses a new challenge not present in the language domain, where “action labels” are simply the next words in a sentence.
In order to utilize the wealth of unlabeled video data available on the internet, we introduce a novel, yet simple, semi-supervised imitation learning method: Video PreTraining (VPT). We start by gathering a small dataset from contractors where we record not only their video, but also the actions they took, which in our case are keypresses and mouse movements. With this data we train an inverse dynamics model (IDM), which predicts the action being taken at each step in the video. Importantly, the IDM can use past and future information to guess the action at each step. This task is much easier and thus requires far less data than the behavioral cloning task of predicting actions given past video frames only, which requires inferring what the person wants to do and how to accomplish it. We can then use the trained IDM to label a much larger dataset of online videos and learn to act via behavioral cloning.
VPT zero-shot results
We chose to validate our method in Minecraft because it (1) is one of the most actively played video games in the world and thus has a wealth of freely available video data and (2) is open-ended with a wide variety of things to do, similar to real-world applications such as computer usage. Unlike prior works in Minecraft that use simplified action spaces aimed at easing exploration, our AI uses the much more generally applicable, though also much more difficult, native human interface: 20Hz frame rate with the mouse and keyboard.
Trained on 70,000 hours of IDM-labeled online video, our behavioral cloning model (the “VPT foundation model”) accomplishes tasks in Minecraft that are nearly impossible to achieve with reinforcement learning from scratch. It learns to chop down trees to collect logs, craft those logs into planks, and then craft those planks into a crafting table; this sequence takes a human proficient in Minecraft approximately 50 seconds or 1,000 consecutive game actions.
Additionally, the model performs other complex skills humans often do in the game, such as swimming, hunting animals for food, and eating that food. It also learned the skill of “pillar jumping”, a common behavior in Minecraft of elevating yourself by repeatedly jumping and placing a block underneath yourself.
Fine-tuning with behavioral cloning
Foundation models are designed to have a broad behavior profile and be generally capable across a wide variety of tasks. To incorporate new knowledge or allow them to specialize on a narrower task distribution, it is common practice to fine-tune these models to smaller, more specific datasets. As a case study into how well the VPT foundation model can be fine-tuned to downstream datasets, we asked our contractors to play for 10 minutes in brand new Minecraft worlds and build a house from basic Minecraft materials. We hoped that this would amplify the foundation model’s ability to reliably perform “early game” skills such as building crafting tables. When fine-tuning to this dataset, not only do we see a massive improvement in reliably performing the early game skills already present in the foundation model, but the fine-tuned model also learns to go even deeper into the technology tree by crafting both wooden and stone tools. Sometimes we even see some rudimentary shelter construction and the agent searching through villages, including raiding chests.
Perhaps the most important hypothesis of our work is that it is far more effective to use labeled contractor data to train an IDM (as part of the VPT pipeline) than it is to directly train a BC foundation model from that same small contractor dataset. To validate this hypothesis we train foundation models on increasing amounts of data from 1 to 70,000 hours. Those trained on under 2,000 hours of data are trained on the contractor data with ground-truth labels that were originally collected to train the IDM, and those trained on over 2,000 hours are trained on internet data labeled with our IDM. We then take each foundation model and fine-tune it to the house building dataset described in the previous section.
As foundation model data increases, we generally see an increase in crafting ability, and only at the largest data scale do we see the emergence of stone tool crafting.
Fine-tuning with reinforcement learning
When it is possible to specify a reward function, reinforcement learning (RL) can be a powerful method for eliciting high, potentially even super-human, performance. However, many tasks require overcoming hard exploration challenges, and most RL methods tackle these with random exploration priors, e.g. models are often incentivized to act randomly via entropy bonuses. The VPT model should be a much better prior for RL because emulating human behavior is likely much more helpful than taking random actions. We set our model the challenging task of collecting a diamond pickaxe, an unprecedented capability in Minecraft made all the more difficult when using the native human interface.
Crafting a diamond pickaxe requires a long and complicated sequence of subtasks. To make this task tractable, we reward agents for each item in the sequence.
We found that an RL policy trained from a random initialization (the standard RL method) barely achieves any reward, never learning to collect logs and only rarely collecting sticks. In stark contrast, fine-tuning from a VPT model not only learns to craft diamond pickaxes (which it does in 2.5% of 10-minute Minecraft episodes), but it even has a human-level success rate at collecting all items leading up to the diamond pickaxe. This is the first time anyone has shown a computer agent capable of crafting diamond tools in Minecraft, which takes humans over 20 minutes (24,000 actions) on average.
VPT paves the path toward allowing agents to learn to act by watching the vast numbers of videos on the internet. Compared to generative video modeling or contrastive methods that would only yield representational priors, VPT offers the exciting possibility of directly learning large scale behavioral priors in more domains than just language. While we only experiment in Minecraft, the game is very open-ended and the native human interface (mouse and keyboard) is very generic, so we believe our results bode well for other similar domains, e.g. computer usage.
For more information, please see our paper. We are also open sourcing our contractor data, Minecraft environment, model code, and model weights, which we hope will aid future research into VPT. Furthermore, we have partnered with the MineRL NeurIPS competition this year. Contestants can use and fine-tune our models to try to solve many difficult tasks in Minecraft. Those interested can check out the competition webpage and compete for a blue-sky prize of $100,000 in addition to a regular prize pool of $20,000. Grants are available to self-identified underrepresented groups and individuals.