Date: 2023-10-26

We evaluate each new major model release for safety, including through red teaming. For example, before publicly releasing GPT-4, external red teamers tested the model for the following frontier risks: (1) aiding the development of chemical, biological, radiological, and nuclear (CBRN) weapons, (2) increasing cyber risk, (3) risks stemming from tool use, and (4) self-replication capabilities. As part of our red teaming of DALL·E 3, within the scope of our voluntary commitments, we red teamed the model’s ability to provide visual information needed to develop, acquire, or disperse CBRN weapons.

We have also issued an open call for the OpenAI Red Teaming Network, publicly inviting domain experts interested in improving the safety of OpenAI’s models to join our red teaming efforts.

CBRN. Certain LLM capabilities can have dual-use potential, meaning that the models can be used for both commercial and military or proliferation applications. We subjected GPT-4 to stress testing, boundary testing, and red teaming in four dual-use domains to explore whether our models could provide the necessary information to proliferators seeking to develop, acquire, or disperse CBRN weapons. We found that, on its own, access to GPT-4 is an insufficient condition for proliferation, but that it could alter the information available to proliferators, especially in comparison to traditional search tools. Red teamers selected a set of questions to pose to both GPT-4 and traditional search engines, and found that the time to research completion was reduced when using GPT-4. In some cases, the research process was shortened by several hours without sacrificing information accuracy. We therefore concluded that a key risk driver is GPT-4’s ability to generate publicly accessible but difficult-to-find information, which shortens the time users spend on research, and to compile this information in a way that a non-expert user can understand.

Prior to releasing DALL·E 3, we evaluated how text-to-image generation changed the risk profile by testing the model’s ability to generate diagrams and visual instructions for producing and acquiring information related to CBRN risks. As with GPT-4, we performed both internal and external testing of DALL·E 3: we tested the model for risks internally and provided early access to external experts from a range of industries to help probe the system and to map and evaluate risks. We subjected DALL·E 3 to red teaming in four dual-use domains to explore whether it could provide the information needed to develop, acquire, or disperse CBRN weapons. Red teamers found minimal risk in these areas due to a combination of inaccuracy in these subject areas, refusals, and the broader need for further access and "ingredients" necessary for successful proliferation.

Cyber capabilities. We also assessed whether GPT-4 could be used for vulnerability discovery and exploitation, and for social engineering. To test the model’s ability to aid in computer vulnerability discovery, assessment, and exploitation, we contracted external cybersecurity experts. They found that GPT-4 could explain some vulnerabilities if the source code was small enough to fit in the model’s context window, but that it performed poorly at building exploits for the vulnerabilities that were identified. To test for social engineering capabilities, expert red teamers tested whether GPT-4 represented an improvement over current tools in relevant tasks such as target identification, spear-phishing, and bait-and-switch phishing. They found that the model was not a ready-made upgrade to current social engineering capabilities, as it struggled with factual tasks like enumerating targets and applying recent information to produce more effective phishing content. However, given appropriate background knowledge about a target, GPT-4 was effective in drafting realistic social engineering content. Based on these findings, we post-trained GPT-4 to refuse malicious cybersecurity requests and scaled our internal safety systems, including monitoring, detection, and response.

Self-replication. Prior to releasing GPT-4, we also facilitated a preliminary evaluation by the Alignment Research Center (ARC) of the model’s ability to carry out actions to autonomously replicate and gather resources. We granted ARC early access to the model as part of our red teaming so that their team could assess risks from power-seeking behavior. The specific form of power-seeking that ARC assessed was the ability of the model to autonomously replicate and acquire resources. In the preliminary experiments they conducted, ARC found that early versions of GPT-4 were ineffective at an autonomous replication task. They therefore concluded that the model was unlikely to be able to autonomously replicate itself.

