Finding GPT-4’s mistakes with GPT-4
CriticGPT, a model based on GPT-4, writes critiques of ChatGPT responses to help human trainers spot mistakes during RLHF
We've trained a model, based on GPT-4, called CriticGPT to catch errors in ChatGPT's code output. We found that when people get help from CriticGPT to review ChatGPT code, they outperform those without help 60% of the time. We are beginning the work to integrate CriticGPT-like models into our RLHF labeling pipeline, providing our trainers with explicit AI assistance. This is a step towards being able to evaluate outputs from advanced AI systems that can be difficult for people to rate without better tools.
The GPT-4 series of models, which powers ChatGPT, is aligned to be helpful and interactive through “Reinforcement Learning from Human Feedback” (RLHF). A key part of RLHF is collecting comparisons in which people, called AI trainers, rate different ChatGPT responses against each other.
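As a rough illustration of what one such comparison record might contain (the schema below is hypothetical, for illustration only, not OpenAI's actual labeling format), a minimal sketch in Python:

from dataclasses import dataclass

@dataclass
class Comparison:
    """One RLHF comparison: an AI trainer ranks two candidate responses.
    Hypothetical schema for illustration only."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "a" or "b", as judged by the trainer

# Example: the trainer marks the factually correct response as preferred.
example = Comparison(
    prompt="What does os.path.abspath do in Python?",
    response_a="It returns a normalized, absolute version of a path.",
    response_b="It resolves symlinks and returns the canonical path.",  # subtly wrong
    preferred="a",
)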
As we make advances in reasoning and model behavior, ChatGPT becomes more accurate and its mistakes become more subtle. This can make it hard for AI trainers to spot inaccuracies when they do occur, making the comparison task that powers RLHF much harder. This is a fundamental limitation of RLHF, and it may make it increasingly difficult to align models as they gradually become more knowledgeable than any person who could provide feedback.
To help with this challenge, we trained CriticGPT to write critiques that highlight inaccuracies in ChatGPT answers. Consider, for example, a ChatGPT-written snippet like the following fragment:
import os

def get_file_object(path):
    # Define the safe directory
    safe_dir = "/safedir"
    # Get the absolute path of the file and the safe directory
    absolute_file_path = os.path.abspath(path)
    absolute_safe_dir = os.path.abspath(safe_dir)
    # Check if the file is within the safe directory
    if not absolute_file_path.startswith(absolute_safe_dir):
        # FileAccessError is a custom exception, assumed to be defined elsewhere
        raise FileAccessError("Access to files outside of "
                              "'/safedir' is not allowed.")
    ...
CriticGPT’s suggestions are not always correct, but we find that they can help trainers to catch many more problems with model-written answers than they would without AI help. Additionally, when people use CriticGPT, the AI augments their skills, resulting in more comprehensive critiques than when people work alone, and fewer hallucinated bugs than when the model works alone. In our experiments a second random trainer preferred critiques from the Human+CriticGPT team over those from an unassisted person more than 60% of the time.
Methods
CriticGPT was also trained with RLHF, similarly to ChatGPT. But unlike ChatGPT, it saw a large number of inputs that contained mistakes, which it then had to critique. We asked AI trainers to manually insert these mistakes into code written by ChatGPT and then write example feedback as if they had caught the bug they had just inserted. The same person then compared multiple critiques of the modified code, so they could easily tell when a critique caught their inserted bug. In our experiments we studied whether CriticGPT could catch both inserted bugs and “naturally occurring” ChatGPT bugs that a previous trainer had caught. We find that CriticGPT critiques are preferred by trainers over ChatGPT critiques in 63% of cases on naturally occurring bugs, in part because the new critic produces fewer “nitpicks” (small complaints that are unhelpful) and hallucinates problems less often.
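In outline, each record produced by this procedure pairs the tampered code with the trainer's own write-up of the inserted bug. The sketch below shows one hypothetical shape for such a record; the field names are illustrative, not our actual pipeline:

from dataclasses import dataclass

@dataclass
class TamperedExample:
    """One record from the bug-insertion procedure (hypothetical fields)."""
    question: str            # the original coding request
    original_code: str       # ChatGPT's answer before tampering
    tampered_code: str       # the same answer with a trainer-inserted bug
    reference_critique: str  # the trainer's feedback describing that bug

# The trainer who inserted the bug then compares several critiques of
# tampered_code, using reference_critique to judge whether each one
# actually caught the inserted bug.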
We also find that we can generate longer and more comprehensive critiques by using additional test-time search against the critique reward model. This search procedure allows us to balance how aggressively we look for problems in the code and configure a precision-recall trade-off between hallucinations and the number of detected bugs. That means we can generate critiques that are as helpful as possible for RLHF. See our research paper for more details.
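One simplified way to picture this trade-off (a best-of-n sketch under assumed interfaces, not the exact search procedure from the paper): sample several candidate critiques, score each with the critique reward model, and add a tunable bonus per flagged problem, so that raising the bonus favors longer, more aggressive critiques at the cost of more potential hallucinations.

from typing import Callable, List

def select_critique(
    candidates: List[str],
    reward_model: Callable[[str], float],  # higher = trainers would prefer it
    count_claims: Callable[[str], int],    # number of distinct problems flagged
    aggressiveness: float = 0.0,           # > 0 favors more thorough critiques
) -> str:
    """Pick the candidate critique maximizing reward plus a per-claim bonus.

    Hedged sketch of test-time search against a critique reward model:
    increasing `aggressiveness` surfaces more bugs but risks more hallucinated
    ones, moving along the precision-recall trade-off described above.
    """
    return max(
        candidates,
        key=lambda c: reward_model(c) + aggressiveness * count_claims(c),
    )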
Limitations
We trained CriticGPT on ChatGPT answers that are quite short. To supervise the agents of the future, we will need to develop methods that can help trainers to understand long and complex tasks.
Models still hallucinate and sometimes trainers make labeling mistakes after seeing those hallucinations.
Sometimes real-world mistakes can be spread across many parts of an answer. Our work focuses on errors that can be pointed out in one place, but in the future we will need to tackle dispersed errors as well.
CriticGPT can only help so much: if a task or response is extremely complex even an expert with model help may not be able to correctly evaluate it.
Next Steps
In order to align increasingly complex AI systems, we'll need better tools. In our research on CriticGPT, we found that applying RLHF to GPT-4 shows promise in helping humans produce better RLHF data for GPT-4. We are planning to scale this work further and put it into practice.