
May 7, 2024

Our approach to data and AI

Just over a year after ChatGPT launched, AI is changing how we live, work, and learn. It has also prompted important conversations about data in the age of AI. Here's more on our approach, a new Media Manager for creators and content owners, and where we're headed.


AI should expand opportunities for everyone. By transforming information in new ways, AI systems help us solve problems and express ourselves. Today, our AI tools like ChatGPT are being used around the world to help farmers in Kenya and India increase crop yields (Digital Green), researchers accelerate drug discovery (Moderna), governments support their workforces (State of Pennsylvania), educators advance student learning, and people with visual impairments navigate our world (Be My Eyes). AI tools like DALL·E and Sora (currently in research preview) are empowering creatives from aspiring artists to filmmakers.

Our mission is to benefit all of humanity. This encompasses not only our users, but also creators and publishers. While we believe legal precedents and sound public policy make learning fair use, we also feel that it’s important we contribute to the development of a broadly beneficial social contract for content in the AI age. 

We believe AI systems should benefit and respect the choices of creators and content owners. We’re continually improving our industry-leading systems to reflect content owner preferences, and are dedicated to building products and business models to fuel vibrant ecosystems for creators and publishers.

We are not professional writers, artists, or journalists, nor are we in those lines of business. We focus on building tools to help these professions create and achieve more. To accomplish this, we listen to and work closely with members of these communities, and look forward to our continued dialogues. Today, we’re sharing more about where we are and where we’re headed.

We respect the choices of creators and content owners on AI

Decades ago, the robots.txt standard was introduced and voluntarily adopted across the Internet, giving web publishers a way to indicate which portions of their websites web crawlers could access.

Last summer, OpenAI pioneered the use of web crawler permissions for AI, enabling web publishers to express their preferences about the use of their content in AI. We take these signals into account each time we train a new model. 
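
For example, a publisher who wants GPTBot (OpenAI's web crawler) to skip part of a site can add a few lines to its robots.txt file. GPTBot is the published user agent; the /members/ path below is purely a hypothetical example:

    # robots.txt -- /members/ is an illustrative path
    User-agent: GPTBot
    Disallow: /members/

Paths not listed under a Disallow rule remain accessible by default, so publishers can be as broad or as granular as they like.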

That said, we understand these are incomplete solutions, as many creators do not control websites where their content may appear, and content is often quoted, reviewed, remixed, reposted and used as inspiration across multiple domains. We need an efficient, scalable solution for content owners to express their preferences about the use of their content in AI systems.

We’re building Media Manager so content owners can manage how their works are used in AI

OpenAI is developing Media Manager, a tool that will enable creators and content owners to tell us what they own and specify how they want their works to be included or excluded from machine learning research and training. Over time, we plan to introduce additional choices and features.

This will require cutting-edge machine learning research to build a first-of-its-kind tool that helps us identify copyrighted text, images, audio, and video across multiple sources and reflect creator preferences.

We’re collaborating with creators, content owners, and regulators as we develop Media Manager. Our goal is to have the tool in place by 2025, and we hope it will set a standard across the AI industry.

We’re building products to benefit users, creators and publishers in a vibrant ecosystem

Today, we live in an attention economy that prioritizes advertisers over users and quantity over quality. Our ambition is to use AI to change this: to empower creators and publishers and to enhance the user experience.

We’re continuously making our products more useful discovery engines. We recently
improved source links in ChatGPT(opens in a new window) to give users better context and web publishers new ways to connect with our audiences. 

We’re also working with partners to display their content in our products and increase their connection to readers. We’ve announced partnerships with global news publishers from the
Financial Times, to Le Monde, Prisa Media, Axel Springer and more, to display their content in ChatGPT and enrich the user experience on news topics. More innovation is on the way. This content may also be used to train ChatGPT to better surface relevant publisher content to users and to improve our tools for newsrooms. 

Our partnerships are crafted to benefit partners and their users, making our models more useful to their employees, customers, and communities. To help advance educational resources, we partnered with the nonprofits Khan Academy and UK-based ExamSolutions to improve our models' math performance, accelerating their ability to expand access to personalized AI tutoring on their platforms.

Understanding our foundation models and how we build them

We design our AI models to be learning machines, not databases

AI models learn from relationships in information to create something new; they don't store data like a database. When we train language models, we take trillions of words and ask a computer to come up with an equation that best describes the relationships among them and the underlying process that produced them. After training is complete, the AI model does not retain access to the data analyzed in training. ChatGPT is like a teacher who has learned from lots of prior study and can explain things because she has learned the relationships between concepts, but doesn't store the materials in her head.
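
To make that distinction concrete, here is a deliberately tiny sketch in Python: a bigram model that "trains" by counting which words tend to follow which. It illustrates only the intuition; real GPT models learn billions of parameters by gradient descent rather than tabulating word pairs. The key property is the same, though: after training, only learned statistics remain, not the training text itself.

    from collections import Counter, defaultdict

    corpus = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "the cat chased the dog",
    ]

    # "Training": count how often each word follows another. Once this is
    # done, the original sentences are never consulted again; only aggregate
    # statistics about word relationships remain.
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1

    def predict_next(word):
        """Return the most likely next word and its probability."""
        counts = following.get(word)
        if not counts:
            return None
        best, count = counts.most_common(1)[0]
        return best, count / sum(counts.values())

    print(predict_next("the"))  # ('cat', 0.333...): a learned relationship, not a stored quote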

Our models are designed to help us generate new content and ideas, not to repeat or "regurgitate" content. AI models can state facts, which are in the public domain. If, on rare occasions, a model inadvertently repeats expressive content, it is a failure of the machine learning process. This failure is more likely to occur with content that appears frequently in training datasets, such as content quoted on many different public websites. We employ state-of-the-art techniques throughout training and at output, for both our API and ChatGPT, to prevent repetition, and we're continually making improvements through ongoing research and development.
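
As one illustration of what an output-time check can look like (a minimal sketch under our own assumptions, not a description of OpenAI's production systems), a simple filter can flag a candidate response that reproduces a long verbatim word sequence from a reference text:

    def has_long_overlap(candidate: str, reference: str, n: int = 8) -> bool:
        """Return True if `candidate` repeats any n consecutive words of `reference`."""
        ref_words = reference.split()
        # Collect every n-word sequence that appears in the reference text.
        ref_ngrams = {tuple(ref_words[i:i + n]) for i in range(len(ref_words) - n + 1)}
        cand_words = candidate.split()
        # Flag the candidate if any of its n-word sequences matches one of them.
        return any(
            tuple(cand_words[i:i + n]) in ref_ngrams
            for i in range(len(cand_words) - n + 1)
        )

A generation pipeline could reject or re-sample flagged outputs; techniques such as dataset deduplication are also commonly applied during training itself.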

We use broad and diverse data to build the best AI for everyone

We want our AI models to learn from as many languages, cultures, subjects, and industries as possible so they can benefit as many people as possible. The more diverse datasets are, the more diverse the models’ knowledge, understanding, and languages become – like a person who has been exposed to a wide range of cultural perspectives and experiences – and the more people and countries AI can safely serve. 

Each new generation of foundation models is trained from scratch on a new dataset. We constantly improve our architecture and increase the scale and diversity of our datasets significantly beyond our previous models. Unlike larger companies in the AI field, we do not have a large corpus of data collected over decades. We primarily rely on publicly available information to teach our models how to be helpful.

We train our models using:

  • Select publicly available data, mostly collected from industry-standard machine learning datasets and web crawls, similar to search engines. We exclude sources that we know have paywalls, primarily aggregate personally identifiable information, contain content that violates our policies, or have opted out.

  • Proprietary data from data partnerships. We partner to access non-publicly available content, such as archives and metadata. Our partners range from a major private video library, whose images and videos help train Sora, to the Government of Iceland, which is working with us to help preserve its native language. We don't pursue paid partnerships for purely publicly available information.

  • Human feedback from AI trainers, red teamers, employees, and users whose data control settings allow model improvements.

We take care to reduce the processing of personal and sensitive information, and we train our models not to provide private or sensitive information about people. We use a number of techniques to process raw data for safe use in training, and increasingly use AI models to help us clean, prepare and generate data. 
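
As a sketch of what one such preprocessing step might look like (the patterns here are illustrative assumptions, not our actual pipeline), a simple filter can replace obvious personal identifiers in raw text before it is used in training:

    import re

    # Illustrative patterns only; a production pipeline would combine many
    # such filters with learned classifiers and human review.
    EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
    PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

    def redact_pii(text: str) -> str:
        """Replace e-mail addresses and phone-like numbers with placeholders."""
        text = EMAIL.sub("[EMAIL]", text)
        text = PHONE.sub("[PHONE]", text)
        return text

    print(redact_pii("Contact jane.doe@example.com or +1 (555) 123-4567."))
    # -> "Contact [EMAIL] or [PHONE]."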

We do not train on our customers' business data, including data from ChatGPT Team, ChatGPT Enterprise, or our API Platform. ChatGPT Free and Plus users can control whether they contribute to future model improvements in their settings.

We’re building in partnership

AI evolves quickly, and we know our goals can’t be accomplished alone. We’re committed to collaborating with creators and publishers, creating mutually beneficial partnerships, supporting healthy ecosystems, and exploring new economic models. We thank our users and partners for working with us on these important subjects.