Skip to main content

June 11, 2018

Improving language understanding with unsupervised learning

Language Unsupervised

Illustration: Ben Barry

We’ve obtained state-of-the-art results on a suite of diverse language tasks with a scalable, task-agnostic system, which we’re also releasing. Our approach is a combination of two existing ideas: transformers(opens in a new window) and unsupervised pre-training(opens in a new window). These results provide a convincing example that pairing supervised learning methods with unsupervised pre-training works very well; this is an idea that many have explored in the past, and we hope our result motivates further research into applying this idea on larger and more diverse datasets.






Textual entailment



MNLI matched

Textual entailment



MNLI mismatched

Textual entailment




Textual entailment




Textual entailment




Textual entailment




Semantic similarity




Semantic similarity




Semantic similarity




Reading comprehension




Commonsense reasoning




Commonsense reasoning




Sentiment analysis




Linguistic acceptability




Multi task benchmark



Our system works in two stages; first we train a transformer model on a very large amount of data in an unsupervised manner—using language modeling as a training signal—then we fine-tune this model on much smaller supervised datasets to help it solve specific tasks. We developed this approach following our sentiment neuron work, in which we noted that unsupervised learning techniques can yield surprisingly discriminative features when trained on enough data. Here, we wanted to further explore this idea: can we develop one model, train it in an unsupervised way on a large amount of data, and then fine-tune the model to achieve good performance on many different tasks? Our results indicate that this approach works surprisingly well; the same core model can be fine-tuned for very different tasks with minimal adaptation.

This work builds on the approach introduced in Semi-supervised Sequence Learning(opens in a new window), which showed how to improve document classification performance by using unsupervised pre-training of an LSTM followed by supervised fine-tuning. It also extends ULMFiT(opens in a new window), research that shows how a single dataset-agnostic LSTM language model can be fine-tuned to get state-of-the-art performance on a variety of document classification datasets; our work shows how a Transformer-based model can be used in this approach to succeed at a broader range of tasks beyond document classification, such as commonsense reasoning, semantic similarity, and reading comprehension. It is also similar to but more task-agnostic than ELMo(opens in a new window), which incorporates pre-training but uses task-customized architectures to get state-of-the-art results on a broad suite of tasks.

Very little tuning was used to achieve our results. All datasets use a single forward language model, without any ensembling, and the majority of the reported results use the exact same hyperparameter settings.

A result we are particularly excited about is the performance of our approach on three datasets—COPA(opens in a new window)RACE(opens in a new window), and ROCStories(opens in a new window)—designed to test commonsense reasoning and reading comprehension. Our model obtains new state-of-the-art results on these datasets by a wide margin. These datasets are thought to require multi-sentence reasoning and significant world knowledge to solve suggesting that our model improves these skills predominantly via unsupervised learning. This suggests there’s hope for developing complex language understanding capabilities via unsupervised techniques.

Why unsupervised learning?

Supervised learning is at the core of most of the recent success of machine learning. However, it can require large, carefully cleaned, and expensive to create datasets to work well. Unsupervised learning is attractive because of its potential to address these drawbacks. Since unsupervised learning removes the bottleneck of explicit human labeling it also scales well with current trends of increasing compute(opens in a new window) and availability of raw data. Unsupervised learning is a very(opens in a new window) active(opens in a new window) area(opens in a new window) of(opens in a new window) research(opens in a new window) but practical uses of it are often still limited.

There’s been a recent push to try to further language capabilities by using unsupervised learning to augment systems with large amounts of unlabeled data; representations of words trained via unsupervised techniques can use large datasets consisting of terabytes of information and, when integrated with supervised learning, improve performance on a wide range of NLP tasks. Until recently, these unsupervised techniques for NLP (for example, GLoVe(opens in a new window) and word2vec(opens in a new window)) used simple models (word vectors) and training signals (the local co-occurence of words). Skip-Thought Vectors(opens in a new window) is a notable early demonstration of the potential improvements more complex approaches can realize. But new techniques are now being used which are further boosting performance. These include the use of pre-trained sentence representation models, contextualized word vectors (notably ELMo(opens in a new window) and CoVE(opens in a new window)), and approaches which use customized architectures to fuse unsupervised pre-training with supervised fine-tuning, like our own.

Two side-by-side graphs demonstrating how Zero Shot Transfer can directly accelerate Supervised Fine-Tuning

We also noticed we can use the underlying language model to begin to perform tasks without ever training on them. For example, performance on tasks like picking the right answer to a multiple choice question steadily increases as the underlying language model improves. While the absolute performance of these methods is still often quite low compared to the supervised state-of-the-art (for question answering it still outperformed by a simple sliding-window baseline) it is encouraging that this behavior is robust across a broad set of tasks. Randomly initialized networks containing no information about the task and the world perform no-better than random using these heuristics. This provides some insight into why generative pre-training can improve performance on downstream tasks.

We can also use the existing language functionality in the model to perform sentiment analysis. For the Stanford Sentiment Treebank dataset, which consists of sentences from positive and negative movie reviews, we can use the language model to guess whether a review is positive or negative by inputting the word “very” after the sentence and seeing whether the model predicts the word “positive” or “negative” as more likely. This approach, without adapting the model at all to the task, performs on par with classic baselines ~80% accuracy(opens in a new window).

Our work is also a validation of the robustness and usefulness of the transformer architecture, indicating that it is sufficiently flexible to achieve state-of-the-art results on a wide range of tasks without requiring complicated task-specific customization or hyperparameter tuning.


This project has a few outstanding issues which are worth noting:

  • Compute Requirements: Many previous approaches to NLP tasks train relatively small models on a single GPU from scratch. Our approach requires an expensive pre-training step—1 month on 8 GPUs. Luckily, this only has to be done once and we’re releasing our model so others can avoid it. It is also a large model (in comparison to prior work) and consequently uses more compute and memory—we used a 37-layer (12 block) Transformer architecture, and we train on sequences of up to 512 tokens. Most experiments were conducted on 4 and 8 GPU systems. The model does fine-tune to new tasks very quickly which helps mitigate the additional resource requirements.

  • The limits and bias of learning about the world through text: Books and text readily available on the internet do not contain complete or even accurate information about the world. Recent work(opens in a new window) has shown that certain kinds of information are difficult to learn via just text and other work(opens in a new window) has shown that models learn and exploit biases in data distributions.

  • Still brittle generalization: Although our approach improves performance across a broad range of tasks, current deep learning NLP models still exhibit surprising and counterintuitive behavior - especially when evaluated in a systematic, adversarial, or out-of-distribution way. Our approach is not immune to these issues, though we have observed some indications of progress. Our approach shows improved lexical robustness over previous purely neural approaches to textual entailment. On the dataset introduced in Glockner et al. (2018)(opens in a new window) our model achieves 83.75%, performing similarly to KIM(opens in a new window), which incorporates external knowledge via WordNet.


  • Scaling the approach: We’ve observed that improvements in the performance of the language model are well correlated with improvements on downstream tasks. We’re currently using commodity hardware (a single 8 GPU machine) and a training dataset of only a few thousand books (~5GB of text). This suggests there is significant room for improvement using the well-validated approach of more compute and data.

  • Improved fine-tuning: Our approach is currently very simple. It is likely that substantial improvements can be made using more intricate adaptation and transfer techniques such as those explored in ULMFiT(opens in a new window).

  • Better understanding of why generative pre-training helps: Although we’ve discussed some ideas we are partial to here, more targeted experiments and research will help distinguish between competing explanations. For instance, how much of the benefits we observe are due to improved ability to process broader context versus improved world knowledge?

Appendix: Dataset examples



We’re increasingly interested in understanding the relationship between the compute we expend on training models and the resulting output(opens in a new window). The total compute used to train this model was 0.96 petaflop days (pfs-days).

8 P600 GPU's * 30 days * 12 TFLOPS/GPU * 0.33 utilization = 
= .96 pfs-days