Evaluating fairness in ChatGPT
We've analyzed how ChatGPT responds to users based on their name, using language model research assistants to protect privacy.
Creating our models takes more than data—we also carefully design the training process to reduce harmful outputs and improve usefulness. Research has shown that language models can still sometimes absorb and repeat social biases from training data, such as gender or racial stereotypes.
In this study, we explored how subtle cues about a user's identity—like their name—can influence ChatGPT's responses. This matters because people use chatbots like ChatGPT in a variety of ways, from helping them draft a resume to asking for entertainment tips, which differ from the scenarios typically studied in AI fairness research, such as screening resumes or credit scoring.
While previous research has focused on third-person fairness, where institutions use AI to make decisions about others, this study examines first-person fairness, or how biases affect users directly in ChatGPT. As a starting point, we measured how ChatGPT’s awareness of different users’ names in an otherwise identical request might affect its response to each of those users. Names often carry cultural, gender, and racial associations, making them a relevant factor for investigating bias—especially since users frequently share their names with ChatGPT for tasks like drafting emails. ChatGPT can remember information like names across conversations, unless the user has turned off the Memory feature.
To focus our study on fairness, we looked at whether using names leads to responses that reflect harmful stereotypes. While we expect and want ChatGPT to tailor its response to user preferences, we want it to do so without introducing harmful bias. To illustrate the types of differences in responses and harmful stereotypes that we looked for, consider the following examples:
Examples of response differences
hi
Hey Jack! How’s it going?
hi
Hi Jill! How is your day going?
Our study found no difference in overall response quality for users whose names connote different genders, races or ethnicities. When names occasionally do spark differences in how ChatGPT answers the same prompt, our methodology found that less than 1% of those name-based differences reflected a harmful stereotype.
How we studied it
Because we wanted to measure if stereotypical differences occur even a small percentage of the time (beyond what would be expected purely by chance), we studied how ChatGPT responds across millions of real requests. To protect privacy while still understanding real-world usage, we instructed a language model (GPT-4o) to analyze patterns across a large number of real ChatGPT transcripts, and to share those trends (but not the underlying chats) within the research team. This way, researchers were able to analyze and understand real-world trends, while maintaining the privacy of the chats. We refer to this language model as a “Language Model Research Assistant” (LMRA) in the paper to distinguish it from the language models that generate the chats we are studying in ChatGPT.
An example of the type of prompt we used below:
To check whether the language model’s ratings agreed with what a human rater would say, we then asked both the language model and human raters to evaluate the same public chats. We then used the LMRA (but not human raters) to analyze patterns across ChatGPT conversations. For gender, the answers that the language model gave aligned with human raters’ answers more than 90% of the time, while for racial and ethnic stereotypes, the rates of agreement were lower. The LMRA detected lower rates of harmful racial stereotypes than those associated with gender. Further work is needed to define a harmful stereotype and to improve the accuracy of the LMRA.
Our findings
We found that when ChatGPT knows the user’s name, it gives equally high-quality answers regardless of the gender or racial connotations of the name, e.g., accuracy and hallucination rates were consistent across groups. We also found that a name's association with gender, race, or ethnicity did lead to differences in responses that the language model assessed as reflecting harmful stereotypes in around 0.1% of overall cases, with biases in some domains on older models up to around 1%.
The breakdown of harmful stereotype ratings by domain is shown below:
ChatGPT-4o-mini responses rated by GPT-4o
Ratings of harmful gender stereotypes among responses of GPT-4o-mini, as rated by the LMRA (powered by GPT-4o).
Within each domain, the LMRA identified tasks that most often had a harmful stereotype. Open-ended tasks with longer responses were more likely to include a harmful stereotype. For example, “Write a story” was found to include a stereotype more often than any other prompt tested.
While stereotype rates are low, less than 1 in 1000 averaged across all domains and tasks, our evaluation serves as a benchmark for us to measure how successful we are in reducing this rate over time. When we split this measure by task type and evaluate task-level bias across our models, we see that the model that showed the highest level of bias was GPT-3.5 Turbo, with newer models all having less than 1% bias across all tasks.
Harmful Stereotype Ratings Across Models
The LMRA proposed natural language explanations of what the differences are in each task. It highlighted occasional differences in the tone, language complexity, and degree of detail of ChatGPT’s response across all tasks. In addition to some clear stereotypes, these differences also included things that some users might welcome and others might not. For instance in the “Write a story” task, responses to users with female-sounding names more often feature female protagonists than those for male-sounding names.
Although individual users are unlikely to notice these differences, we think they are important to measure and understand as even rare patterns could be harmful in the aggregate. This approach also gives us a new way to track changes statistically over time. The research methods we created for this study, could also be generalized to studying biases in ChatGPT beyond names. For more details, we encourage you to read our full report which examines 3 metrics of fairness across 2 genders, 4 races and ethnicities, 66 tasks, 9 domains and 6 language models.
Limitations
Understanding fairness in language models is a large research area, and we acknowledge that our study has limitations. Not everyone shares their name, and other information besides names likely also has an impact on ChatGPT’s first-person fairness. It primarily focuses on English-language interactions, binary gender associations based on common U.S. names, and four races and ethnicities (Black, Asian, Hispanic and White). This study only covers text interactions, though we note that first-person fairness with respect to speaker demographics in audio is analyzed in the GPT-4o system card (see “Disparate Performance on Voice Inputs”). While we think the methodology is a step forward, there's more work to be done to understand biases related to other demographics, languages, and cultural contexts. We plan to build on this research to improve fairness more broadly.
Conclusion
While it’s difficult to boil harmful stereotypes down into a single number, we believe that developing new methods to measure and understand bias is an important step towards being able to track and mitigate it over time. The method we used in this study is now part of our standard suite of model performance evaluations, and will inform deployment decision making for future systems. These learnings will also support our efforts to further clarify the operational meaning of fairness in our systems. Fairness continues to be an active area of research, and we’ve shared examples of our fairness research in our GPT-4o and OpenAI o1 system cards (e.g., comparing accuracy of voice recognition across different speaker demographics).
We believe that transparency and continuous improvement are key to both addressing bias and building trust with our users and the broader research community. To support reproducibility and further fairness research, we are also sharing detailed system messages used in this study so external researchers can conduct first-person bias experiments of their own (details in our paper).
We welcome feedback and collaboration. If you have insights or wish to work with us on improving AI fairness, we’d be glad to hear from you—and if you want to focus on solving these challenges with us, we are hiring.