OpenAI’s approach to frontier risk

An Update for the UK AI Safety Summit

Published October 26, 2023

On July 21, 2023, OpenAI joined other leading AI labs in making a set of voluntary commitments to promote safety, security, and trust in AI. These commitments encompassed a range of risk areas, centrally including the frontier risks that are the focus of the upcoming AI Safety Summit.

In this update, we describe our progress on those voluntary commitments and further detail our evolving approach to mitigating frontier risks, including our ongoing work to develop a Preparedness Framework. 

On October 3, 2023, we publicly released the system card for our text-to-image model DALL-E 3, the first major public release of a new frontier model within the scope of our voluntary commitments. In line with both our mission and the voluntary commitments, we carried out critical safety work, including pre-deployment safety evaluation and red-teaming. In addition, we are working toward new methods to empower people to track the provenance of AI-generated media, and we have continued to invest in responsible practices through our rollout of voice and image analysis capabilities in ChatGPT.

We have also met our voluntary commitment to “establish or join a forum or mechanism through which [we] can develop, advance, and adopt shared standards and best practices for frontier AI safety,” by co-founding the Frontier Model Forum. This new industry body, jointly established with Microsoft, Google DeepMind, and Anthropic, is a venue to advance AI safety research and promote responsible development practices for frontier AI systems.

Preparedness Framework

Frontier AI models have the potential to benefit all of humanity, but also pose increasingly severe risks. To manage these risks as AI models continue to improve, we are developing a Preparedness Framework, which deepens our proactive, risk-based approach to responsible frontier model development, especially in relation to catastrophic risks. 

The Preparedness Framework will detail our approach to developing rigorous frontier model capability evaluations and monitoring, as well as establishing a governance structure for accountability and oversight across the development process. The risks we plan to track under this framework span multiple categories, including cybersecurity, persuasion, chemical and biological threats, and autonomy.

The Preparedness Framework will also provide for a spectrum of actions to protect against catastrophic outcomes. The empirical understanding of catastrophic risk is nascent and developing rapidly. We will therefore dynamically update our assessment of current frontier model risk levels so that it reflects our latest evaluation and monitoring findings. We are standing up a dedicated team (Preparedness) to drive this effort, including performing the necessary research and monitoring.

The Preparedness Framework is meant to complement and extend our existing risk mitigation work, which contributes to the safety and alignment of new, highly capable systems, both before and after deployment. These existing efforts include the work of our Safety Systems team, which conducts research and builds systematic solutions to ensure our best models can be deployed safely, and our Superalignment team, which focuses on the machine learning challenges of aligning superintelligent AI systems with human intent.

They also include a joint Deployment Safety Board (DSB) with Microsoft, which approves decisions by either party to deploy models above a certain capability threshold. The DSB focuses specifically on deployment decisions rather than on earlier steps such as deciding whether or not to train models of a certain scale or capability level. It has some of the features that are often discussed in the context of responsible scaling policies, such as a focus on the most capable systems, a strong emphasis on adversarial testing, and explicit consideration of alignment. We have learned valuable lessons from the DSB's review of GPT-4, which was the first eligible deployment, and will be using those lessons to inform the design and implementation of the Preparedness Framework. Both the DSB and the Preparedness Framework, and their respective roles, may evolve over time as we learn more about risks and mitigations.

Note: We refer to our policy as a Preparedness Framework rather than a Responsible Scaling Policy because we can experience dramatic increases in capability without significant increase in scale, e.g., via algorithmic improvements. The Preparedness Framework governs our development of increasingly capable frontier models regardless of whether those increasing capabilities stem from scale, algorithmic improvements, or other optimizations.

Priority research and investment on societal, safety, and security risks

We need scientific breakthroughs, societal preparedness, and advanced security systems to control and integrate AI systems much smarter than us. We are investing in these breakthroughs by creating two new teams, Superalignment and Preparedness, and by further investing in our security systems.

Our current techniques for aligning AI, such as reinforcement learning from human feedback, rely on humans’ ability to supervise AI. But these techniques will not work for superintelligence, because humans will be unable to reliably supervise AI systems much smarter than us. We have set a goal of solving this problem within four years by investing in a new team called Superalignment, co-led by Ilya Sutskever (cofounder and Chief Scientist of OpenAI) and Jan Leike (Head of Alignment). Our goal is to build a near-human-level automated alignment researcher and to use a large amount of compute to scale our efforts to align superintelligence. We plan to dedicate 20% of the compute we had secured by June 2023 to this effort. The team will share results broadly to also contribute to the alignment and safety of non-OpenAI models.

Beyond the challenge of aligning superintelligence, we believe that increasingly severe risks may emerge from potential misuse of increasingly capable frontier models. We are creating a dedicated new team called Preparedness to identify, track, and prepare for these risks. We intend to track frontier risks, including cybersecurity, CBRN, persuasion, and autonomous replication and adaptation, and to share actions to protect against the impacts of catastrophic risk. Because the empirical understanding of catastrophic risk is nascent, we will iteratively update our assessment of current frontier model risk levels so that it reflects our latest evaluation and monitoring findings.

We are continuing to invest in cybersecurity and insider threat safeguards to protect proprietary and unreleased model weights. We have launched the Cybersecurity Grant Program and the OpenAI Bug Bounty Program to coordinate like-minded researchers working for our collective safety. The Cybersecurity Grant Program is a $1M initiative to boost and quantify AI-powered cybersecurity capabilities and to foster high-level AI and cybersecurity discourse. We also invite the public to report vulnerabilities, bugs, or security flaws they discover in our systems. The OpenAI Bug Bounty Program allows us to recognize and reward the valuable insights of individuals who contribute to keeping our technology and company secure.

Model evaluations and red teaming

We evaluate each new major model we release for safety, including through red-teaming. For example, before publicly releasing GPT-4, external red-teamers tested the model for the following frontier risks: (1) aiding the development of chemical, biological, radiological, and nuclear (CBRN) weapons, (2) increased cyber risk, (3) risks stemming from tool use, and (4) self-replication capabilities. As part of our red-teaming of DALL-E 3 within the scope of our voluntary commitments, we tested the model’s ability to provide visual information needed to develop, acquire, or disperse CBRN weapons.

We have also shared an open call for an OpenAI Red Teaming Network to publicly invite domain experts interested in improving the safety of OpenAI’s models to join our red-teaming efforts.

CBRN. Certain LLM capabilities can have dual-use potential, meaning that the models can be used for both commercial and military or proliferation applications. We subjected GPT-4 to stress testing, boundary testing, and red-teaming in four dual-use domains to explore whether our models could provide the information proliferators would need to develop, acquire, or disperse CBRN weapons. We found that on its own, access to GPT-4 is an insufficient condition for proliferation, but that it could alter the information available to proliferators, especially in comparison to traditional search tools. Red teamers selected a set of questions to prompt both GPT-4 and traditional search engines, finding that the time to research completion was reduced when using GPT-4. In some cases, the research process was shortened by several hours without sacrificing information accuracy. We therefore concluded that a key risk driver is GPT-4’s ability to generate publicly accessible but difficult-to-find information, shortening the time users spend on research and compiling this information in a way that is understandable to a non-expert user.

Prior to releasing DALL-E 3, we evaluated how text-to-image generation changed this risk profile by testing the model’s ability to generate diagrams and visual instructions for producing and acquiring CBRN-related materials. As with GPT-4, we performed internal and external testing of DALL-E 3: we tested the model for risks internally and provided early access to external experts from a range of industries to help probe the system and map and evaluate risks. We subjected DALL·E 3 to red-teaming in the same four dual-use domains to explore whether it could provide the information needed to develop, acquire, or disperse CBRN weapons. Red teamers found minimal risk in these areas, due to a combination of the model’s inaccuracy on these subjects, refusals, and the broader need for additional access and "ingredients" necessary for successful proliferation.
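For illustration, the kind of time-to-completion comparison described above can be recorded with a small harness like the sketch below. This is not our actual evaluation tooling; the data structures and field names are hypothetical.

```python
# Illustrative only: a minimal harness for recording the kind of uplift
# comparison described above. All names and fields are hypothetical.
from dataclasses import dataclass
from statistics import mean

@dataclass
class Trial:
    question_id: str
    tool: str               # e.g. "gpt-4" or "web-search"
    minutes_to_answer: float
    accuracy: float         # expert-assigned score in [0, 1]

def summarize(trials: list[Trial]) -> dict[str, dict[str, float]]:
    """Average time and accuracy per tool, for side-by-side comparison."""
    summary: dict[str, dict[str, float]] = {}
    for tool in {t.tool for t in trials}:
        subset = [t for t in trials if t.tool == tool]
        summary[tool] = {
            "avg_minutes": mean(t.minutes_to_answer for t in subset),
            "avg_accuracy": mean(t.accuracy for t in subset),
        }
    return summary

if __name__ == "__main__":
    trials = [
        Trial("q1", "web-search", 42.0, 0.9),
        Trial("q1", "gpt-4", 18.5, 0.9),
    ]
    print(summarize(trials))
```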

Cyber capabilities. We also assessed GPT-4’s potential for use in vulnerability discovery and exploitation, and in social engineering. To test the model’s ability to aid in computer vulnerability discovery, assessment, and exploitation, we contracted external cybersecurity experts, who found that GPT-4 could explain some vulnerabilities if the source code was small enough to fit in the model’s context window, but that it performed poorly at building exploits for the vulnerabilities that were identified. To test for social engineering capabilities, expert red teamers assessed whether GPT-4 represented an improvement over current tools in relevant tasks such as target identification, spear-phishing, and bait-and-switch phishing. They found that the model was not a ready-made upgrade to current social engineering capabilities, as it struggled with factual tasks like enumerating targets and applying recent information to produce more effective phishing content. However, with the appropriate background knowledge about a target, GPT-4 was effective at drafting realistic social engineering content. Based on these findings, we post-trained GPT-4 to refuse malicious cybersecurity requests and scaled our internal safety systems, including for monitoring, detection, and response.
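As an aside on the context-window constraint noted above, whether a given source file fits in a model’s context can be checked with the open-source tiktoken tokenizer. The sketch below is illustrative; the file name is hypothetical, and the 8,192-token budget corresponds to the base GPT-4 context length.

```python
# Illustrative sketch: check whether a source file fits in a model's context
# window before asking the model to analyze it. The 8,192-token budget below
# matches the base GPT-4 context length; adjust for other model variants.
import tiktoken

def fits_in_context(source_code: str, model: str = "gpt-4",
                    context_tokens: int = 8192,
                    reserved_for_output: int = 1024) -> bool:
    enc = tiktoken.encoding_for_model(model)
    prompt_tokens = len(enc.encode(source_code))
    return prompt_tokens <= context_tokens - reserved_for_output

with open("suspect_module.py") as f:   # hypothetical file under review
    code = f.read()
print("fits:", fits_in_context(code))
```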

Self-replication. Prior to releasing GPT-4, we also facilitated a preliminary evaluation by the Alignment Research Center (ARC) of the model’s ability to carry out actions to autonomously replicate and gather resources. We granted ARC early access to the models as part of our red-teaming so that their team could assess risks from power-seeking behavior; the specific form of power-seeking ARC assessed was the model’s ability to autonomously replicate and acquire resources. In preliminary experiments, ARC found early versions of GPT-4 to be ineffective at an autonomous replication task, and therefore concluded that the model was unlikely to be able to autonomously replicate itself.

Model reporting and information sharing

Transparency is an important element of building accountable AI systems. A key part of our approach to accountability is publishing a document, which we currently call a system card, for new AI systems that we deploy. Our system cards aim to inform readers about key factors impacting the system’s behavior, especially in areas pertinent to responsible usage, and draw inspiration from previous research on model and system cards. Prior to making the voluntary commitments, OpenAI had published two system cards: the GPT-4 System Card and the DALL-E 2 System Card. Since then, we published a System Card prior to releasing DALL-E 3 in ChatGPT, our first major public release of a new model since signing the voluntary commitments. In continued efforts to release our technology responsibly, we also published a System Card for GPT-4’s vision capabilities prior to making them available in ChatGPT.

Reporting structure for vulnerabilities found after model release

Since making the voluntary commitments, we have initiated a working group within the Frontier Model Forum to create a mechanism for the responsible disclosure of dangerous capabilities among AI labs. This mechanism will aim to enable frontier labs and other AI labs to confidentially disclose significant risks identified in frontier models. Our initial focus encompasses national-security-related domains such as Chemical, Biological, Radiological, and Nuclear (CBRN) capabilities, along with other dangerous capabilities such as self-replication, deception, and manipulation. Disclosure methods will encompass evaluations, insights from red-teaming exercises, and other evidence of common threats among member labs in areas where broader disclosure would present significant risks.

We also announced OpenAI’s bug bounty program as a way to recognize and reward individuals who report security vulnerabilities in our systems. Our rewards range from $200 for low-severity findings to $20,000 for exceptional discoveries. We have partnered with Bugcrowd, a leading bug bounty platform, to create a submission and reward process, available on the Bug Bounty Program page.

Post-deployment monitoring for patterns of misuse

We work hard to prevent foreseeable risks before deployment. However, there are also limits to what anyone can learn in a lab. Even after extensive research and testing, we cannot predict all of the beneficial ways people will use our technology, nor all the ways people may abuse it. Building the capacity to quickly detect and address unforeseen risks is a high priority for us, as this capacity is a critical safeguard for frontier systems where not all risks can be fully anticipated. We build internal measures designed to detect unexpected types of abuse, maintain processes to respond to them, and use the learnings to improve our usage policies, safety systems, and model outputs. After releasing a system, we proactively investigate, monitor, and vet inbound reports to detect abuse or unforeseen risks, and we aim to quickly and iteratively address surfaced issues via policy and technical solutions. We are continuing to scale our operations and reduce response time.
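One automated layer of such monitoring can be sketched with OpenAI’s moderation endpoint, which classifies text against usage-policy categories. The example below is illustrative only, not our production pipeline; real monitoring combines many additional signals and human review.

```python
# Minimal sketch of automated abuse screening on logged content using the
# OpenAI moderation endpoint; production pipelines combine many signals.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def flag_for_review(texts: list[str]) -> list[str]:
    """Return the subset of texts the moderation endpoint flags."""
    flagged = []
    for text in texts:
        result = client.moderations.create(input=text).results[0]
        if result.flagged:
            flagged.append(text)
    return flagged

print(flag_for_review(["example user message to screen"]))
```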

Security controls including securing model weights

We dedicate significant resources to the protection of OpenAI’s technology, intellectual property, and data.

We deploy our most powerful AI models as services. We do not distribute weights for such models outside of OpenAI and our technology partner Microsoft, and we provide third-party access to our most capable models via API so the model weights, source code, and other sensitive information remain controlled.
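In practice, third parties interact with the hosted models through API calls like the sketch below: authentication uses an API key, and model weights never leave our controlled infrastructure.

```python
# Third-party access goes through the hosted API; weights, source code, and
# other sensitive assets stay on OpenAI's (and Microsoft's) infrastructure.
from openai import OpenAI

client = OpenAI()  # authenticated with an API key, not with model weights
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Summarize the Preparedness Framework in one sentence."}],
)
print(response.choices[0].message.content)
```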

We also implement commercially reasonable technical, administrative, and organizational measures designed to prevent personal information loss, misuse, and unauthorized access. This includes undergoing third-party audits of our security program, including SOC 2 Type 2. We have also started a bug bounty program that invites independent researchers to report vulnerabilities in our systems in exchange for cash rewards. Our Trust Portal allows customers and other stakeholders to review our security controls and audit reports. As part of our cybersecurity efforts, we regularly conduct internal and third-party penetration testing, and audit the suitability and effectiveness of our security controls.

Identifiers of AI-generated material

We are developing a technical approach to provenance to assist in identifying audiovisual content created by our models. Once this approach is developed, we will deploy it broadly across our new frontier systems. We are assessing a range of provenance techniques, each with distinct pros and cons, that broadly fall into three buckets: watermarking, classifiers, and metadata-based approaches.

Since making our voluntary commitments, we have been researching and testing a provenance classifier to help identify whether or not an image was generated by DALL·E 3. We are currently evaluating this classifier internally and provided a public update as part of the DALL·E 3 release.
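To illustrate the metadata bucket from the list above (not the provenance classifier itself), the sketch below inspects an image file for embedded provenance-related fields. The key names are illustrative, and because metadata can easily be stripped, it cannot serve as the sole provenance signal.

```python
# Toy illustration of the metadata-based bucket only: look for provenance-
# related fields embedded in an image file. Metadata is easy to strip, which
# is one reason it cannot be the sole provenance signal.
from PIL import Image

PROVENANCE_KEYS = {"Software", "provenance", "c2pa", "DigitalSourceType"}  # illustrative

def metadata_hints(path: str) -> dict:
    img = Image.open(path)
    hints = {k: v for k, v in img.info.items()
             if any(key.lower() in str(k).lower() for key in PROVENANCE_KEYS)}
    exif = img.getexif()
    # Tag 305 is the standard EXIF "Software" tag.
    if 305 in exif:
        hints["EXIF:Software"] = exif[305]
    return hints

print(metadata_hints("generated.png"))  # hypothetical file
```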

Data Input Controls and Audit

OpenAI’s large language models, including the models that power ChatGPT, are developed using three primary sources of information: (1) information that is publicly available on the internet, (2) information that we license from third parties, and (3) information that our users or our human trainers provide.  

The vast majority of our training data comes from publicly available information on the internet – for example, we do not seek information behind paywalls or from the “deep web.” We apply filters and remove certain data that we do not want our models to learn from or output, such as hate speech, adult content, sites that primarily aggregate personal information, and spam.
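For illustration only, a crude version of such filtering might look like the sketch below; our actual pipeline relies on trained classifiers and many more signals, and all names here are hypothetical.

```python
# Illustrative only: a crude pre-training filter that drops documents matching
# a denylist of source domains and flagged categories. Real filtering uses
# trained classifiers and many more signals than shown here.
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"example-spam-aggregator.com"}               # hypothetical
BLOCKED_CATEGORIES = {"hate", "adult", "personal_info", "spam"}

def keep_document(doc: dict) -> bool:
    """doc = {'url': ..., 'text': ..., 'labels': set of classifier labels}"""
    domain = urlparse(doc["url"]).netloc
    if domain in BLOCKED_DOMAINS:
        return False
    return not (doc["labels"] & BLOCKED_CATEGORIES)

corpus = [{"url": "https://example.org/a", "text": "...", "labels": set()}]
filtered = [d for d in corpus if keep_document(d)]
```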

We have also implemented measures to enable creators, rightsholders, and website operators to express their preferences regarding AI training with respect to the content that they own or control. For example, OpenAI has given website operators an easy means of excluding their content from being accessed by OpenAI’s “GPTBot” web crawler, relying on the robots.txt web standard. Similarly, OpenAI has documented the user-agent string (“ChatGPT-User”) used by ChatGPT and ChatGPT plugins to access websites, so that site operators can block access for those purposes as well. We provide instructions online for how to disallow either agent from accessing their sites. We also provide a self-service form for image creators to opt their content out of training for our future DALL-E image generation models.
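For illustration, the sketch below shows how a compliant crawler’s fetch decision follows from a site’s robots.txt directives, using Python’s standard-library robot parser; the example.com directives shown are hypothetical.

```python
# Sketch of how a crawler decides whether it may fetch a page, given the
# robots.txt directives a site operator publishes. The directives below
# opt out of GPTBot crawling while allowing other agents.
import urllib.robotparser

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/article"))       # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))  # True
```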