Upgrading the Moderation API with our new multimodal moderation model
We’re introducing a new model built on GPT-4o that is more accurate at detecting harmful text and images, enabling developers to build more robust moderation systems.
Today we are introducing a new moderation model, omni-moderation-latest, in the Moderation API. Based on GPT-4o, the new model supports both text and image inputs and is more accurate than our previous model, especially in non-English languages. Like the previous version, this model uses OpenAI's GPT-based classifiers to assess whether content should be flagged across categories such as hate, violence, and self-harm, while also adding the ability to detect additional harm categories. Additionally, it provides more granular control over moderation decisions by calibrating probability scores to reflect the likelihood of content matching the detected category. The new moderation model is free to use for all developers through the Moderation API.
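To illustrate, here is a minimal sketch of what a multimodal moderation request might look like using the OpenAI Python SDK. The text and image URL are placeholders, and an API key is assumed to be configured in the environment; the exact request shape is documented in the Moderation API guide.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Classify a piece of text and an image together in a single request.
response = client.moderations.create(
    model="omni-moderation-latest",
    input=[
        {"type": "text", "text": "...text to classify goes here..."},
        {
            "type": "image_url",
            "image_url": {
                # A public URL or a base64-encoded data URL for the image.
                "url": "https://example.com/image.png",
            },
        },
    ],
)

result = response.results[0]
print(result.flagged)          # True if any category was flagged
print(result.category_scores)  # calibrated per-category probability scores
```

Each result includes per-category boolean flags alongside the calibrated probability scores described below.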
Since we first launched the Moderation API in 2022, the volume and variety of content that automated moderation systems need to handle has increased, especially as more AI apps have reached massive scale in production. We hope today’s upgrades help more developers benefit from the latest research and investments in our safety systems.
Companies across various sectors, from social media platforms and productivity tools to generative AI platforms, are using the Moderation API to build safer products for their users. For instance, Grammarly is using the Moderation API as part of the safety guardrails in its AI communications assistance to ensure its products' outputs are safe and fair. Similarly, ElevenLabs uses the Moderation API along with in-house solutions to scan content generated by its audio AI products, preventing and flagging outputs that violate its policies.
The updated moderation model includes a number of major improvements:
Multimodal harm classification across six categories: the new model can evaluate the likelihood that an image, in isolation or in conjunction with text, contains harmful content. This is supported today for the following categories: violence (violence and violence/graphic), self-harm (self-harm, self-harm/intent, and self-harm/instructions), and sexual (sexual but not sexual/minors). The remaining categories are currently text-only, and we are working to expand multimodal support to more categories in the future.
Two new text-only harm categories: the new model can detect harm in two additional categories compared to our previous models: illicit, which covers instructions or advice on how to commit wrongdoing (a phrase like "how to shoplift", for example), and illicit/violent, which covers the same for wrongdoing that also includes violence.
More accurate scores, especially for non-English content: in a test across 40 languages, the new model improved 42% over the previous model on our internal multilingual eval and improved in 98% of the languages tested. For low-resource languages like Khmer or Swati, it improved 70%, and we saw the biggest improvements in Telugu (6.4x), Bengali (5.6x), and Marathi (4.6x). While the previous model had limited support for non-English languages, the new model's performance in Spanish, German, Italian, Polish, Vietnamese, Portuguese, French, Chinese, Indonesian, and English all exceeds even the previous model's performance on English.
[Chart: text-moderation-007 vs. omni-moderation-latest multilingual performance]
Calibrated scores: the new model’s scores now more accurately represent the probability that a piece of content violates the relevant policies and will be significantly more consistent across future moderation models, which makes it easier to set and maintain per-category thresholds (see the sketch below).
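Because the scores are calibrated probabilities, developers can tune decision thresholds per category rather than relying on a single flag. The sketch below is illustrative only: the threshold values, the block/review/allow routing, and the assumption that the Python SDK exposes scores as attributes (with category names like illicit/violent mapped to underscored fields) are not recommendations and should be checked against the Moderation API guide.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def route(text: str) -> str:
    """Return "block", "review", or "allow" for a piece of user content.

    The thresholds are illustrative placeholders, not recommendations.
    """
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    ).results[0]

    scores = result.category_scores  # calibrated probabilities between 0 and 1
    violence = scores.violence
    illicit_violent = scores.illicit_violent or 0.0  # illicit scores exist only on the new model

    # Block outright when the model is highly confident of a violation.
    if violence >= 0.9 or illicit_violent >= 0.85:
        return "block"
    # Route borderline cases to human moderators instead of auto-blocking.
    if violence >= 0.5 or illicit_violent >= 0.4:
        return "review"
    return "allow"

print(route("Some user-generated text to check."))
```

Routing borderline content to human review while auto-blocking only high-confidence violations is one way calibrated scores can ease the workload on human moderators.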
AI content moderation systems help enforce platform policies and ease the workload on human moderators, which is crucial to sustaining the health of digital platforms. That’s why, just like our previous model, we’re making the new moderation model free to use for all developers through the Moderation API, with rate limits depending on usage tier. To get started, see our Moderation API guide.