Skip to main content

June 27, 2024

Finding GPT-4’s mistakes with GPT-4

CriticGPT, a model based on GPT-4, writes critiques of ChatGPT responses to help human trainers spot mistakes during RLHF

Bug Finding Research

We've trained a model, based on GPT-4, called CriticGPT to catch errors in ChatGPT's code output. We found that when people get help from CriticGPT to review ChatGPT code they outperform those without help 60% of the time. We are beginning the work to integrate CriticGPT-like models into our RLHF labeling pipeline, providing our trainers with explicit AI assistance. This is a step towards being able to evaluate outputs from advanced AI systems that can be difficult for people to rate without better tools.

The GPT-4 series of models, which powers ChatGPT, is aligned to be helpful and interactive through “Reinforcement Learning from Human Feedback” (RLHF). A key part of RLHF is collecting comparisons in which people, called AI trainers, rate different ChatGPT responses against each other.

As we make advances in reasoning and model behavior, ChatGPT becomes more accurate and its mistakes become more subtle. This can make it hard for AI trainers to spot inaccuracies when they do occur, making the comparison task that powers RLHF much harder. This is a fundamental limitation of RLHF, and it may make it increasingly difficult to align models as they gradually become more knowledgeable than any person that could provide feedback.

To help with this challenge, we trained CriticGPT to write critiques that highlight inaccuracies in ChatGPT answers.

Task
Write a function in Python that takes as input a string path representing a file path and returns a File object for the file at “path”. Files outside of the directory “/safedir” should not be allowed to be accessed.
Solution
import os 

def get_file_object(path):
    # Define the safe directory
    safe_dir = "/safedir" 
    # Get the absolute path of the file and the safe directory 
    absolute_file_path = os.path.abspath(path) 
    absolute_safe_dir = os.path.abspath(safe_dir) 

# Check if the file is within the safe directory 
if not absolute_file_path.startswith(absolute_safe_dir):
        raise FileAccessError("Access to files outside of "
                              "'/safedir' is not allowed.")
...
CriticGPT
Checking if an absolute path of a file is within a directory by using startswith() is insecure. Users could exploit this with symlinks or by naming directories similarly. Using os.path.commonpath([absolute_file_path, absolute_safe_dir]) or a more robust method of path containment is advisable.

CriticGPT’s suggestions are not always correct, but we find that they can help trainers to catch many more problems with model-written answers than they would without AI help. Additionally, when people use CriticGPT, the AI augments their skills, resulting in more comprehensive critiques than when people work alone, and fewer hallucinated bugs than when the model works alone. In our experiments a second random trainer preferred critiques from the Human+CriticGPT team over those from an unassisted person more than 60% of the time.

Desktop light

CriticGPT helps trainers to write more comprehensive critiques than they do without help while producing fewer hallucinations than critiques from the model alone.

Methods

CriticGPT was also trained with RLHF, similarly to ChatGPT. But unlike ChatGPT it saw a large number of inputs that contained mistakes which it then had to critique. We asked AI trainers to manually insert these mistakes into code written by ChatGPT and then write example feedback as if they had caught the bug that they just inserted. The same person then compared multiple critiques of the modified code so they could easily tell when a critique caught their inserted bug. In our experiments we studied both whether CriticGPT could catch inserted bugs and “naturally occurring” ChatGPT bugs that a previous trainer had caught. We find that CriticGPT critiques are preferred by trainers over ChatGPT critiques in 63% of cases on naturally occurring bugs, in part because the new critic produces fewer “nitpicks” (small complaints that are unhelpful) and hallucinates problems less often.

We also find that we can generate longer and more comprehensive critiques by using additional test-time search against the critique reward model. This search procedure allows us to balance how aggressively we look for problems in the code and configure a precision-recall trade-off between hallucinations and the number of detected bugs. That means we can generate critiques that are as helpful as possible for RLHF. See our research paper for more details.

Limitations

  • We trained CriticGPT on ChatGPT answers that are quite short. To supervise the agents of the future, we will need to develop methods that can help trainers to understand long and complex tasks.

  • Models still hallucinate and sometimes trainers make labeling mistakes after seeing those hallucinations.

  • Sometimes real-world mistakes can be spread across many parts of an answer. Our work focuses on errors that can be pointed out in one place, but in the future we need to tackle dispersed errors as well.

  • CriticGPT can only help so much: if a task or response is extremely complex even an expert with model help may not be able to correctly evaluate it.

Next Steps

In order to align AI systems that are increasingly complex, we’ll need better tools. In our research on CriticGPT, we found that applying RLHF to GPT-4 has promise to help humans produce better RLHF data for GPT-4. We are planning to scale this work further and put it into practice.

Authors

Nat McAleese, Maja Trębacz

Acknowledgments

Greg Brockman, Juan Felipe Ceron Uribe, Elie Georges, Wes McCabe, Evgenia Nitishinskaya, Rai (Michael Pokorny), Freddie Sulit