AI-written critiques help humans notice flaws
We trained “critique-writing” models to describe flaws in summaries. Human evaluators find flaws in summaries much more often when shown our model’s critiques. Larger models are better at self-critiquing, with scale improving critique-writing more than summary-writing. This shows promise for using AI systems to assist human supervision of AI systems on difficult tasks.
We want to ensure that future AI systems performing very difficult tasks remain aligned with human intent. Many previous works on aligning language models rely on human evaluations as a training signal. However, humans struggle to evaluate very difficult tasks: it is hard, for example, to spot every bug in a codebase or every factual error in a long essay. Models may then learn to give outputs that look good to humans but have errors we systematically fail to notice.
To mitigate this problem, we want to train AI assistants that help humans provide feedback on hard tasks. These assistants should point out flaws, help humans understand what’s going on, and answer their questions. An example of this is our past work on book summarization: reading the entire book is a lot of work, but humans assisted with chapter summaries have a much easier time evaluating a book summary.
As a proof of concept, we used supervised learning to train language models to write critiques of topic-based summaries of short stories, Wikipedia articles, and other texts from the internet. We use these models to assist human evaluators and study scaling properties of critique writing.
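For concreteness, here is a minimal sketch of how a critique-writing training example might be framed for supervised fine-tuning. The prompt layout, field names, and the `build_critique_example` helper are illustrative assumptions, not the exact format used in this work.

```python
# Minimal sketch of packing one (text, topic, summary) -> critique pair for
# supervised fine-tuning. The prompt layout and field names are assumptions
# made for illustration, not the exact training format.

def build_critique_example(passage: str, topic: str, summary: str, critique: str) -> dict:
    """Build a single fine-tuning example whose target is a human-written critique."""
    prompt = (
        f"Text:\n{passage}\n\n"
        f"Topic: {topic}\n\n"
        f"Summary:\n{summary}\n\n"
        "Critique of the summary:"
    )
    # The model is trained to continue the prompt with the critique text.
    return {"prompt": prompt, "completion": " " + critique}

example = build_critique_example(
    passage="(a short story, Wikipedia article, or other internet text)",
    topic="the narrator's relationship with her sister",
    summary="(a topic-based summary of the passage)",
    critique="The summary omits that the sisters reconcile at the end.",
)
```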
Experiments with AI assistance
To see how useful our models are for evaluation assistance, we show labelers 8 model-written critiques of each summary, while a control group receives no assistance. We use topic-based summaries from three sources: summaries written by our models, summaries written by humans, and summaries written by humans who were asked to include important but subtle flaws.
Even though summarization isn’t actually a difficult task for humans and our models aren’t more capable than humans, our models already provide meaningful assistance: when asked to evaluate model-written summaries, the assisted group finds 50% more flaws than the control group. For deliberately misleading summaries, assistance increases how often humans spot the intended flaw from 27% to 45%.
Scaling properties of critiques
Assistance on model-written summaries only works if models are able to critique their own outputs. We ask humans to rate the helpfulness of model-written self-critiques, and find that larger models are better at self-critiquing.
We also find that large models can use their self-critiques to directly improve their own outputs, which small models are unable to do. Better critiques help models make better improvements than worse critiques do, or than no critiques at all.
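As a rough illustration of this critique-then-refine behavior, the sketch below chains two calls to a generic `generate` function: one to elicit a self-critique and one to rewrite the summary in light of it. The prompt wording and the `generate` placeholder are assumptions for illustration, not the exact procedure used in these experiments.

```python
# Rough sketch of a critique-then-refine loop. All prompt wording is assumed.

def generate(prompt: str) -> str:
    """Placeholder for a call to the underlying language model."""
    raise NotImplementedError

def refine_with_self_critique(passage: str, topic: str, summary: str) -> str:
    # 1. Ask the model to critique its own summary.
    critique = generate(
        f"Text:\n{passage}\n\nTopic: {topic}\n\nSummary:\n{summary}\n\n"
        "Critique of the summary:"
    )
    # 2. Ask it to rewrite the summary so the critique no longer applies.
    improved = generate(
        f"Text:\n{passage}\n\nTopic: {topic}\n\nSummary:\n{summary}\n\n"
        f"Critique: {critique}\n\n"
        "Improved summary that addresses the critique:"
    )
    return improved
```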
Do models tell us everything they know?
To provide the best evaluation assistance on difficult tasks, we would like models to communicate all problems that they “know about.” Whenever a model correctly predicts that an answer is flawed, can the model also produce a concrete critique that humans understand?
This is particularly important for supervising models that could attempt to mislead human supervisors or hide information. We would like to train equally smart assistance models to point out what humans don’t notice.
Unfortunately, we found that models are better at discriminating than at critiquing their own answers, indicating they know about some problems that they can’t or don’t articulate. Furthermore, the gap between discrimination and critique ability did not appear to decrease for larger models. Reducing this gap is an important priority for our alignment research.
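One way to think about this gap: on a set of questions, each paired with a flawed and an unflawed answer, compare how often the model prefers the unflawed answer (discrimination) with how often humans find its critique of the flawed answer helpful (critique ability). The sketch below is a simplified, assumed formulation of that comparison; the metrics in the underlying work are more involved.

```python
# Simplified sketch of comparing discrimination ability to critique ability.
# The Pair fields and both rates are assumed formulations for illustration only.

from dataclasses import dataclass

@dataclass
class Pair:
    # Each pair holds one flawed and one unflawed answer to the same question.
    model_prefers_unflawed: bool   # discrimination: model picks the better answer
    critique_rated_helpful: bool   # humans say the critique of the flawed answer surfaces the flaw

def discrimination_rate(pairs: list[Pair]) -> float:
    return sum(p.model_prefers_unflawed for p in pairs) / len(pairs)

def critique_rate(pairs: list[Pair]) -> float:
    return sum(p.critique_rated_helpful for p in pairs) / len(pairs)

def discrimination_critique_gap(pairs: list[Pair]) -> float:
    # Positive gap: the model "knows" an answer is flawed more often than it
    # can articulate a critique that humans find helpful.
    return discrimination_rate(pairs) - critique_rate(pairs)
```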
Next steps
An important limitation of this work is that topic-based summarization is not actually a difficult task: humans understand it quite well and it takes them only about 10 minutes to evaluate a summary. To understand the limits of AI-assisted evaluation better, we need to work with tasks that are much more difficult for humans to evaluate.
Nevertheless, these results make us optimistic that we can train models to provide humans with meaningful feedback assistance. This is an important pillar of our alignment strategy, starting with the work on debate and recursive reward modeling. In the long run, we want to build assistants that can be trusted to take on all of the cognitive labor needed for evaluation, so humans can focus on communicating their preferences.