
September 19, 2019

Fine-tuning GPT-2 from human preferences


We’ve fine-tuned the 774M parameter GPT-2 language model using human feedback for various tasks, successfully matching the preferences of the external human labelers, though those preferences did not always match our own. Specifically, for summarization tasks the labelers preferred sentences copied wholesale from the input (we’d only asked them to ensure accuracy), so our models learned to copy. Summarization required 60k human labels; simpler tasks which continue text in various styles required only 5k. Our motivation is to move safety techniques closer to the general task of “machines talking to humans,” which we believe is key to extracting information about human values.

We believe language is a key ingredient in making reinforcement learning practical and safe for real-world tasks. Previous work on learning models of human preferences has focused on simple simulated environments (Atari games or robotics tasks) which do not capture the complexity of language. Language is also a necessary ingredient for algorithms such as amplification and debate, which target the reasoning behind preferences.

This work applies human preference learning to several natural language tasks: continuing text with positive sentiment or physically descriptive language using the BookCorpus, and summarizing content from the TL;DR and CNN/Daily Mail datasets. Each of these tasks can be viewed as a text completion problem: starting with some text X, we ask what text Y should follow.[A]

We start with a pretrained language model (the 774M parameter version of GPT-2) and fine-tune the model by asking human labelers which of four samples is best. Fine-tuning for the stylistic continuation tasks is sample efficient: 5,000 human samples suffice for strong performance according to humans. For summarization, models trained with 60,000 comparisons learn to copy whole sentences from the input while skipping irrelevant preamble; this copying is an easy way to ensure accurate summaries, but may exploit the fact that labelers rely on simple heuristics.
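As a rough illustration of the comparison-based training signal, a reward model can be fit by treating the labeler's choice as a softmax over scalar rewards for the four candidate completions. The sketch below is ours, not the released code: random scores stand in for reward-model outputs, and all names are placeholders.

```python
# Minimal PyTorch sketch of reward-model training from four-way comparisons.
# The labeler's pick is modeled as a softmax over the four candidates' scalar
# rewards; names, shapes, and the toy data are ours, not from the released code.
import torch
import torch.nn.functional as F

def comparison_loss(candidate_rewards, preferred_idx):
    # candidate_rewards: (batch, 4) scalar reward per candidate completion
    # preferred_idx:     (batch,)   index of the completion the labeler chose
    return F.cross_entropy(candidate_rewards, preferred_idx)

# Toy usage: random scores stand in for reward-model outputs on four samples.
candidate_rewards = torch.randn(8, 4, requires_grad=True)
preferred_idx = torch.randint(0, 4, (8,))
loss = comparison_loss(candidate_rewards, preferred_idx)
loss.backward()
```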

Stylistic text continuation

For the stylistic continuation tasks, samples comparing the raw 774M GPT-2 model and our fine-tuned versions are shown below.[B]


According to the same human labelers used to train them, our fine-tuned models are preferred to the base GPT-2 model (zero-shot) 88% and 86% of the time for sentiment and descriptiveness, respectively.

Summarization

We also applied human fine-tuning to two summarization tasks: summarization of articles from the CNN/Daily Mail dataset, and summarization of Reddit snippets from the TL;DR dataset.

These tasks are harder: our main models use 60,000 four-way comparisons. We also need online data collection, where the samples shown to humans are collected throughout training as the policy changes; an offline data collection strategy which shows humans only samples from the base GPT-2 language model performed poorly.

Our models achieve very good performance according to human labelers, but are likely exploiting the fact that labelers rely on simple heuristics: labelers prefer even the lead-3 baseline, which simply copies the first three sentences, over our models. However, when combining supervised fine-tuning with human fine-tuning, our models outperform lead-3 on ROUGE scores.

Samples from zero-shot and supervised baselines, as well as RL fine-tuning of each, are shown below.


The reader may have noticed a few things about these samples. First, our RL fine-tuned model is mostly a smart copying engine: it typically summarizes content by copying entire sentences from the article or Reddit snippet. By contrast, the zero-shot and supervised fine-tuned samples are more novel:

Model                      CNN/Daily Mail    TL;DR
Reference summaries        96.7              98.9
Zero-shot                  91.7              96.3
Fine-tuned                  2.5              29.0
Supervised                 83.6              96.9
Supervised + fine-tuned    69.6              94.0

Sentence novelty: Percentage of sentences in summaries that do not appear in source text.
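For reference, a sentence-novelty figure like the one above can be computed by counting summary sentences that never appear verbatim in the source; the sentence splitter and exact-match rule below are our simplifications, not the authors' implementation.

```python
# Sketch of a sentence-novelty metric: the percentage of summary sentences
# that do not appear verbatim in the source text. The regex sentence splitter
# and the exact substring match are simplifying assumptions.
import re

def sentence_novelty(source: str, summary: str) -> float:
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', summary) if s.strip()]
    if not sentences:
        return 0.0
    novel = [s for s in sentences if s not in source]
    return 100.0 * len(novel) / len(sentences)

print(sentence_novelty("The cat sat. The dog barked.", "The cat sat. A bird sang."))  # 50.0
```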

The RL fine-tuned model does vary where it copies from: while it copies the start of the input 28.3% and 77.6% of the time on TL;DR and CNN/Daily Mail, these numbers fall to 0.2% and 1.4% if the input starts with uninformative preamble (defined as “hi”, “hello”, “hey”, “ok”, “okay”, or “so” for TL;DR, or a colon in the first three words for CNN/Daily Mail, as in “Winner: Simon Wood took home the TV crown [...]”).
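The preamble check just described might be implemented roughly as follows; this is our paraphrase of the stated rules, not the code used for the analysis.

```python
# Sketch of the "uninformative preamble" heuristics described above.
# TL;DR: the post begins with a filler word; CNN/Daily Mail: a colon appears
# within the first three words (e.g. "Winner: Simon Wood took home the TV crown").
PREAMBLE_WORDS = {"hi", "hello", "hey", "ok", "okay", "so"}

def starts_with_preamble_tldr(post: str) -> bool:
    words = post.lower().split()
    return bool(words) and words[0].strip(",.!:") in PREAMBLE_WORDS

def starts_with_preamble_cnndm(article: str) -> bool:
    return any(":" in word for word in article.split()[:3])

print(starts_with_preamble_tldr("So, my roommate ate my leftovers again..."))   # True
print(starts_with_preamble_cnndm("Winner: Simon Wood took home the TV crown"))  # True
```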

The visualization below shows where the summarization models copy from, illustrated by the longest common subsequence of bigrams between context and summary for randomly chosen contexts.

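The overlap statistic behind that visualization, the longest common subsequence of bigrams between context and summary, can be computed with standard dynamic programming; the whitespace tokenization below is our simplification.

```python
# Sketch of the overlap statistic: the longest common subsequence (LCS) of
# word bigrams between a context and its summary. Whitespace tokenization is
# a simplifying assumption.
def bigrams(text):
    words = text.split()
    return [tuple(words[i:i + 2]) for i in range(len(words) - 1)]

def lcs_length(a, b):
    # Standard dynamic-programming LCS over the two bigram sequences.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(a)][len(b)]

context = "the quick brown fox jumps over the lazy dog"
summary = "the quick brown fox naps"
print(lcs_length(bigrams(context), bigrams(summary)))  # 3
```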

Second, while summaries from GPT-2 zero-shot and the supervised fine-tuned version of GPT-2 are more novel as measured by n-grams or sentences, they are also more novel in terms of content. That is, they’re not true:

Model                      CNN/Daily Mail    TL;DR
Zero-shot                   6/30              6/30
Fine-tuned                 29/30             26/30
Supervised                 19/30              8/30
Supervised + fine-tuned    20/30             11/30

Summary accuracy: Frequency with which generated summaries were judged accurate by the authors, on 30 articles from each dataset.

There are at least two ways of interpreting these results. The first is that copying is the easiest way to be accurate. The labelers were told to penalize inaccuracy but not copying. The zero-shot model copies some of the time, and when it copied it was accurate, so copying was reinforced. The result is a model that mostly copies, but at least does not lie.

However, this does not fully explain the results of human evaluation: both our model and a simple lead-3 baseline which copies the first three sentences are strongly preferred by the labelers to the human reference summaries in both datasets. The authors do not agree: we find the reference summaries are accurate and better capture the overall message. This reveals a mismatch between the notion of quality we wanted our model to learn and what the human labelers actually evaluated. Labelers want to work as quickly as possible, and they can work very quickly by following the heuristic of “if the summary copies, then select it.”

Challenges and lessons learned

Online data collection is hard

Online data collection was necessary to achieve the best results on summarization, but led to multiple difficulties:

  1. Software complexity. Interleaving data gathering, reward model training, and RL fine-tuning led to a far more complex system than if each component was separate.

  2. Machine learning complexity. An ML bug in any component would break the whole system, and it was awkward to debug one component in isolation.

  3. Quality control issues. Online label collection required low latency between generating a sample and receiving data back from Scale (typically ~30 minutes). Quality control with low latency is hard, and regressions in data quality were often not detected until after training runs were complete.

We believe the right middle ground between offline and online data collection is batched data collection: we would alternate between collecting large batches of data (with higher latency) and training on collected data. The cost of human data means that volume will always be low, so it is easy to retrain from scratch (or rather, from the GPT-2 starting point) each time.
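Structurally, the batched alternative looks roughly like the loop below, with every component passed in as a callable; all names are placeholders rather than our actual training code.

```python
# Rough structural sketch of batched data collection: alternate between
# gathering a large (higher-latency) batch of human comparisons and retraining
# from the GPT-2 starting point. Every name here is a placeholder.
def batched_training(pretrain, sample, label, fit_reward, rl_finetune, rounds=4):
    policy = pretrain()
    comparisons = []
    for _ in range(rounds):
        comparisons += label(sample(policy))    # one large, high-latency batch
        reward_model = fit_reward(comparisons)  # retrain reward model on all data so far
        policy = rl_finetune(pretrain(), reward_model)  # restart from GPT-2 each round
    return policy

# Toy invocation with stand-in callables, just to show the control flow.
policy = batched_training(
    pretrain=lambda: "gpt2-774M",
    sample=lambda p: [f"{p}-sample-{i}" for i in range(4)],
    label=lambda samples: [(samples, 0)],       # pretend a labeler picked index 0
    fit_reward=lambda data: len(data),
    rl_finetune=lambda p, rm: f"{p}+rl(reward_model={rm})",
)
print(policy)
```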

Ambiguous tasks make labeling hard

A single human may have a clear notion of whether a given sample is separately accurate, grammatical, nonredundant, or hits the key points, but comparing two summaries often requires subjective weighing of different kinds of deficiencies. When possible, it seems better to design less ambiguous labeling tasks that get at the same information. For example, rather than asking a person to compare summaries, we could ask for a verbal description of the problems with a summary, or a suggested correction. Even if two people disagree on the most important problem, they may agree that the other picked some problem, and more agreement eases data quality control and the overall experimental process.

Bugs can optimize for bad behavior

One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. A mechanism such as Toyota’s Andon cord could have prevented this, by allowing any labeler to stop a problematic training process.
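In terms of the combined reward used in the paper, R = r − β·KL(policy ∥ pretrained model), the net effect of the bug was to negate the learned reward r while the doubly flipped KL term kept its usual penalizing sign; the sketch below is our reconstruction of that arithmetic, not the actual training code.

```python
# Sketch of the sign-flip bug's net effect, assuming the combined per-sample
# reward R = r - beta * KL(policy || pretrained GPT-2) from the paper.
# This reconstructs the description above; it is not the actual training code.
beta = 0.1

def intended_reward(r, kl):
    return r - beta * kl    # maximize the human-preference reward, stay close to GPT-2

def buggy_reward(r, kl):
    # Net effect: r is negated, while the twice-flipped KL term keeps its
    # penalizing sign, so the policy stays fluent but minimizes r.
    return -r - beta * kl

print(intended_reward(r=1.0, kl=2.0))  #  0.8
print(buggy_reward(r=1.0, kl=2.0))     # -1.2
```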

Looking forward

We’ve demonstrated reward learning from human preferences on two kinds of natural language tasks, stylistic continuation and summarization. Our results are mixed: for continuation we achieve good results with very few samples, but our summarization models are only “smart copiers”: they copy from the input text but skip over irrelevant preamble. The advantage of smart copying is truthfulness: the zero-shot and supervised models produce natural, plausible-looking summaries that are often lies. We believe the limiting factor in our experiments is data quality exacerbated by the online data collection setting, and plan to use batched data collection in the future.

We believe the application of reward learning to language is important both from a capability and safety perspective. On the capability side, reinforcement learning lets us correct mistakes that supervised learning would not catch, but RL with programmatic reward functions “can be detrimental to model quality.” On the safety side, reward learning for language allows important criteria like “don’t lie” to be represented during training, and is a step towards scalable safety methods such as debate and amplification.

Footnotes

A. For summarization, the text is the article plus the string “TL;DR:”.

B. Each fine-tuned model is trained using 5,000 four-way comparisons by humans.

Authors

Daniel Ziegler, Nisan Stiennon, Jeffrey Wu, Tom Brown, Dario Amodei, Alec Radford, Paul Christiano, Geoffrey Irving

Acknowledgments

Thanks to Ethan Perez, Jack Clark, Ashley Pilipiszyn & Greg Brockman for feedback on this post. Thanks to Justin Jay Wang, Brooke Chan & Shan Carter for design and development.