HealthBench is a new benchmark that evaluates AI models in realistic healthcare scenarios. Built with input from 250+ physicians, it aims to provide a shared standard for model performance and safety in health.
SimpleQA is a factuality benchmark that measures the ability of language models to answer short, fact-seeking questions.
We've analyzed how ChatGPT responds to users based on their name, using AI research assistants to protect privacy.
We are introducing OpenAI o1, a new large language model trained with reinforcement learning to perform complex reasoning. o1 thinks before it answers—it can produce a long internal chain of thought before responding to the user.
This report outlines the safety work carried out prior to releasing GPT-4o, including external red teaming, frontier risk evaluations according to our Preparedness Framework, and an overview of the mitigations we built in to address key risk areas.
Introducing the most cost-efficient small model on the market
We present a holistic approach to building a robust and useful natural language classification system for real-world content moderation.
Using new techniques for scaling sparse autoencoders, we automatically identified 16 million patterns in GPT-4's computations.
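To make the idea concrete, here is a minimal sketch of the sparse-autoencoder mechanism: an encoder projects a model activation into a much larger latent space, a top-k constraint keeps only a few latents active, and a decoder reconstructs the activation from that sparse code. The dimensions, weights, and `k` below are toy assumptions for illustration; real runs operate on GPT-4 activations with millions of latents and learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_hidden, k = 8, 32, 4  # toy sizes; real setups use millions of latents

# Random encoder/decoder weights (training would learn these).
W_enc = rng.normal(size=(d_model, d_hidden)) / np.sqrt(d_model)
W_dec = rng.normal(size=(d_hidden, d_model)) / np.sqrt(d_hidden)
b_enc = np.zeros(d_hidden)

def encode(x):
    """Project an activation into the latent space, keeping only the top-k latents."""
    z = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU pre-activations
    idx = np.argsort(z)[:-k]                # indices of everything but the k largest
    z[idx] = 0.0                            # enforce sparsity
    return z

def decode(z):
    """Reconstruct the original activation from the sparse latent code."""
    return z @ W_dec

x = rng.normal(size=d_model)  # stand-in for a model activation
z = encode(x)
x_hat = decode(z)
```

Each surviving latent corresponds to one candidate "pattern": training drives the reconstruction error down while the sparsity constraint forces individual latents to specialize.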
We’re announcing GPT-4o (“o” for “omni”), our new flagship model, which can reason across audio, vision, and text in real time.