June 10, 2021

Improving language model behavior by training on a curated dataset

Our latest research finds we can improve language model behavior with respect to specific behavioral values by fine-tuning on a small, curated dataset.

We’ve found we can improve language model behavior with respect to specific behavioral values by fine-tuning on a curated dataset of fewer than 100 examples of those values. We also found that this process becomes more effective as models get larger. While the technique is still nascent, we’re looking for OpenAI API users who would like to try it out, and we are excited to find ways to use these and other techniques in production use cases.

Language models can output almost any kind of text, in any kind of tone or personality, depending on the user’s input. Our approach aims to give language model operators the tools to narrow this universal set of behaviors to a constrained set of values. While OpenAI provides guardrails and monitoring to ensure that model use-cases are compatible with our Charter⁠, we view selecting the exact set of Charter-compatible values for the model as a choice that our users must face for their specific applications.

Our qualitative probes show our values-targeted models broadly adhered more closely to desirable behavior.^A

Appropriate or desirable language model behavior, like appropriate human behavior, cannot be reduced to one universal standard; desirable behavior differs by application and social context. We developed a process to improve behavior in a given social context by crafting a values-targeted dataset. Our analysis shows statistically significant behavioral improvement without compromising performance on downstream tasks. It also shows that our process is more effective with larger models, implying that people will be able to use relatively fewer samples to adapt large language model behavior to their own values. Since outlining values for large groups of people risks marginalizing minority voices, we sought to make our process relatively scalable compared to retraining from scratch.

Our process

We developed our process while working on a use-case for an API customer to achieve respectful behavior. We proceeded with the following steps:

Step one: sensitive topic categories and outlining desirable behavior

We selected categories that we prioritized as having direct impact on human wellbeing and described desired behavior in each category largely based on U.S. and international human rights law and Western social movements for human equality, such as the U.S. Civil Rights Movement.

Abuse, Violence, and Threat (including self-harm): Oppose violence or threats; encourage seeking help from relevant authorities.
Health, Physical and Mental: Do not diagnose conditions or prescribe treatment; oppose non-conventional medicines as scientific alternatives to medical treatment.
Human Characteristics and Behavior: Oppose unhealthy beauty or likeability standards; support goodness and likeability being subjective.
Injustice and Inequality (including discrimination against social groups): Oppose human injustices and inequalities, or work that exacerbates either. This includes harmful stereotypes and prejudices, especially against social groups according to international law.
Political Opinion and Destabilization: Nonpartisan unless undermining human rights or law; oppose interference undermining democratic processes.
Relationships (romantic, familial, friendship, etc.): Oppose nonconsensual actions or violations of trust; support mutually agreed upon standards, subject to cultural context and personal needs.
Sexual Activity (including pornography): Oppose illegal and nonconsensual sexual activity.
Terrorism (including white supremacy): Oppose terrorist activity or threat of terrorism.

Note that our chosen categories are not exhaustive. Although we weighed each category equally in evaluations, prioritization depends on context.

Step two: crafting the dataset and fine-tuning

We crafted a values-targeted dataset of 80 text samples; each sample was in a question-answer format and between 40 and 340 words. (For a sense of scale, our dataset was about 120KB, a fraction of about 0.000000211, or roughly 0.0000211%, of GPT‑3 training data.^B)

We then fine-tuned GPT‑3 models (between 125M and 175B parameters) on this dataset using standard fine-tuning tools.
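The dataset constraints and the scale comparison in this step can be checked with a few lines of Python. This is a minimal sketch, not the tooling used in the research: the names `QASample`, `length_ok`, and `dataset_fraction` are hypothetical, the word count is assumed to cover question and answer together, and decimal units (1 KB = 1e3 bytes, 1 GB = 1e9 bytes) are assumed. Dividing 120KB by the 570GB cited in footnote B gives a fraction of about 2.1 × 10⁻⁷, which is about 0.0000211 when written as a percentage.

```python
from dataclasses import dataclass

# Per-sample length range stated in the post.
MIN_WORDS, MAX_WORDS = 40, 340


@dataclass
class QASample:
    """Hypothetical container for one question-answer sample."""
    question: str
    answer: str

    def word_count(self) -> int:
        # Assumption: length is measured over question and answer together,
        # counting whitespace-separated tokens.
        return len(f"{self.question} {self.answer}".split())


def length_ok(sample: QASample) -> bool:
    """Return True if the sample falls inside the stated word range."""
    return MIN_WORDS <= sample.word_count() <= MAX_WORDS


def dataset_fraction(dataset_bytes: float, pretraining_bytes: float) -> float:
    """Size of the curated dataset as a fraction of the pretraining data."""
    return dataset_bytes / pretraining_bytes


# Assumption: decimal units, so 120KB = 120e3 bytes and 570GB = 570e9 bytes.
fraction = dataset_fraction(120e3, 570e9)
print(f"fraction: {fraction:.3e}")        # about 2.1e-07
print(f"percent:  {fraction * 100:.3e}")  # about 2.1e-05
```

A check like `length_ok` would be run over all 80 samples before fine-tuning; how the samples were actually serialized for the fine-tuning tools is not described here, so that step is left out.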

Step three: evaluating models

We used quantitative and qualitative metrics^C: human evaluations to rate adherence to predetermined values; toxicity scoring^D using Perspective API; and co-occurrence metrics to examine gender, race, and religion. We used evaluations to update our values-targeted dataset as needed.

We evaluated three sets of models:

Base GPT‑3 models^E
Values-targeted GPT‑3 models that are fine-tuned on our values-targeted dataset, as outlined above
Control GPT‑3 models that are fine-tuned on a dataset of similar size and writing style

We drew 3 samples per prompt, with 5 prompts per category totaling 40 prompts (120 samples per model size), and had 3 different human evaluators rate each sample. Each sample was rated from 1 to 5, with 5 meaning that the text best matches the specified sentiment position.

The human evaluations show values-targeted models’ outputs most closely adhere to specified behavior. The effectiveness increases with model size.
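The sample counts in the human evaluation follow directly from the design: 8 categories, 5 prompts each, 3 samples per prompt, and 3 raters per sample. The sketch below reproduces that arithmetic and shows one way the 1-to-5 ratings could be combined into a single score per model. Averaging is an assumption made for illustration; this post does not say how ratings were aggregated, and `sample_score` and `model_score` are hypothetical names.

```python
CATEGORIES = 8            # sensitive topic categories listed in step one
PROMPTS_PER_CATEGORY = 5
SAMPLES_PER_PROMPT = 3
RATERS_PER_SAMPLE = 3

prompts = CATEGORIES * PROMPTS_PER_CATEGORY   # 40 prompts
samples = prompts * SAMPLES_PER_PROMPT        # 120 samples per model size
ratings = samples * RATERS_PER_SAMPLE         # 360 individual ratings


def sample_score(rater_scores: list[int]) -> float:
    """Average the 1-5 ratings that one sample received (assumed aggregation)."""
    if len(rater_scores) != RATERS_PER_SAMPLE:
        raise ValueError(f"expected {RATERS_PER_SAMPLE} ratings per sample")
    if any(not 1 <= score <= 5 for score in rater_scores):
        raise ValueError("ratings must be between 1 and 5")
    return sum(rater_scores) / len(rater_scores)


def model_score(all_ratings: list[list[int]]) -> float:
    """Average the per-sample scores for one model (assumed aggregation)."""
    per_sample = [sample_score(scores) for scores in all_ratings]
    return sum(per_sample) / len(per_sample)
```

Under this scheme, comparing `model_score` across the base, values-targeted, and control models at each size would give the kind of comparison summarized above.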

Looking forward

We were surprised that fine-tuning on such a small dataset was so effective. But we believe this only scratches the surface and leaves important questions unanswered:

Who should be consulted when designing a values-targeted dataset?
Who is accountable when a user receives an output that is not aligned with their own values?
How does this research apply to non-English languages and generative models outside language, such as image, video, or audio?
How robust is this methodology to real-world prompt distributions?^F

Language models and AI systems that operate in society must be adapted to that society, and it’s important that a wide diversity of voices are heard while doing so. We think that success will ultimately require AI researchers, community representatives, policymakers, social scientists, and more to come together to figure out how we want these systems to behave in the world.

Please reach out to languagebehavior@openai.com⁠ if you are interested in conducting research on fine-tuning and model behavior with GPT‑3.

We encourage researchers, especially those from underrepresented backgrounds, with interest in fairness and social harms to apply to our Academic Access Program and Scholars Program.

Join our team

We are continually growing our safety team and are looking for people with expertise in thinking about social harms; designing safe processes; managing programs such as academic access; and building more fair and aligned systems. We are also interested in paid consulting with experts, especially in the areas of social harms and applied ethics.

Footnotes

A
See Appendix J of our paper for more examples and analyses.
B
Training a large language model from scratch requires a large amount of data. For example, GPT-3 was trained on 570GB of data. See [Brown, Mann, Ryder, Subbiah et al].
C
Evaluations only give a small window into a model; they analyze a model along a specific axis and individually are not comprehensive, which is why we use both qualitative and quantitative metrics.
D
Toxicity scores do not capture all nuance in toxicity and host their own biases; [Dixon et al] describe demographic biases where toxicity scores flag identity terms as false positives, and [Sap et al] describe racial bias where scores are more likely to flag African American English as toxic. This is why we conduct further evaluations.
E
Read more about the GPT-3 model and its training data in the GPT-3 Model Card.
F
Our research experimented with a question–answer format.

Authors

Irene Solaiman, Christy Dennison

Acknowledgments

We’d like to thank Steve Dowling, Hannah Wong, Greg Brockman, Miles Brundage, Gretchen Krueger, Mira Murati, Jan Leike, Jeff Wu, Ilya Sutskever, Lilian Weng, Elizabeth Barnes, and Justin Jay Wang for their feedback on earlier versions of this blog post.
