Introducing the Realtime API
Developers can now build fast speech-to-speech experiences into their applications
Update on October 30, 2024: We've added five new voices with greater range and expressiveness. Cached pricing is now also available for text and audio inputs, lowering the price to $2.50/1M cached text input tokens and $20/1M cached audio input tokens. Learn more here.
Update on October 17, 2024: Audio inputs and outputs are now available in the Chat Completions API. Get started here.
Today, we're introducing a public beta of the Realtime API, enabling all paid developers to build low-latency, multimodal experiences in their apps. Similar to ChatGPT’s Advanced Voice Mode, the Realtime API supports natural speech-to-speech conversations using the six preset voices already supported in the API.
We’re also introducing audio input and output in the Chat Completions API to support use cases that don’t require the low-latency benefits of the Realtime API. With this update, developers can pass any text or audio inputs into GPT-4o and have the model respond with their choice of text, audio, or both.
From language apps and educational software to customer support tools, developers have already been leveraging voice experiences to connect with their users. Now with the Realtime API, and soon with audio in the Chat Completions API, developers no longer have to stitch together multiple models to power these experiences. Instead, you can build natural conversational experiences with a single API call.
How it works
Previously, to create a similar voice assistant experience, developers had to transcribe audio with an automatic speech recognition model like Whisper, pass the text to a text model for inference or reasoning, and then play the model’s output using a text-to-speech model. This approach often resulted in loss of emotion, emphasis and accents, plus noticeable latency. With the Chat Completions API, developers can handle the entire process with a single API call, though it remains slower than human conversation. The Realtime API improves this by streaming audio inputs and outputs directly, enabling more natural conversational experiences. It can also handle interruptions automatically, much like Advanced Voice Mode in ChatGPT.
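As a concrete illustration, here is a minimal sketch of that older, chained approach using the OpenAI Python SDK. The model names, file paths, and prompt are illustrative assumptions rather than a prescribed recipe.

```python
# A minimal sketch of the chained speech pipeline described above:
# ASR -> text inference -> TTS. Model names and filenames are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Transcribe the user's speech with an ASR model (e.g. Whisper).
with open("user_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. Run text inference on the transcript.
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful voice assistant."},
        {"role": "user", "content": transcript.text},
    ],
)
reply_text = completion.choices[0].message.content

# 3. Convert the text reply back to speech with a TTS model.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
speech.write_to_file("assistant_reply.mp3")
```

Each hop in this chain adds latency and discards prosody, which is exactly what the Realtime API's direct audio streaming is designed to avoid.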
Under the hood, the Realtime API lets you create a persistent WebSocket connection to exchange messages with GPT-4o. The API supports function calling, which makes it possible for voice assistants to respond to user requests by triggering actions or pulling in new context. For example, a voice assistant could place an order on behalf of the user or retrieve relevant customer information to personalize its responses.
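To make that flow concrete, here is a minimal sketch of a Realtime API session over a WebSocket, assuming the beta's event protocol (session.update, response.create, response.done) and the wss://api.openai.com/v1/realtime endpoint. The place_order tool, the instructions, and the header-passing argument are assumptions used to illustrate function calling, not a definitive client; consult the API reference for the current event names.

```python
# A minimal sketch of a Realtime API session over WebSocket, assuming the
# beta's event protocol; event names and parameters may evolve.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main():
    # Note: `extra_headers` is named `additional_headers` in newer
    # releases of the websockets library.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Configure the session: modalities, voice, and a tool the model
        # may invoke via function calling (hypothetical example).
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",
                "instructions": "You are a friendly ordering assistant.",
                "tools": [{
                    "type": "function",
                    "name": "place_order",  # hypothetical tool
                    "description": "Place an order for the user.",
                    "parameters": {
                        "type": "object",
                        "properties": {"item": {"type": "string"}},
                        "required": ["item"],
                    },
                }],
            },
        }))

        # Ask the model to respond. In a real app, microphone audio would
        # first be streamed in via input_audio_buffer.append events.
        await ws.send(json.dumps({"type": "response.create"}))

        # Read server events until the response is complete.
        async for message in ws:
            event = json.loads(message)
            print(event["type"])
            if event["type"] == "response.done":
                break

asyncio.run(main())
```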
Powering customer support agents, language learning assistants, and more
As part of our iterative deployment strategy, we’ve been testing the Realtime API with a handful of partners to gather feedback while we build. Promising early use cases include customer support agents and language learning assistants.
Availability & pricing
The Realtime API will begin rolling out today in public beta to all paid developers. Audio capabilities in the Realtime API are powered by the new GPT-4o model gpt-4o-realtime-preview.
Audio in the Chat Completions API will be released in the coming weeks as a new model, gpt-4o-audio-preview. With gpt-4o-audio-preview, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.
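As a sketch of what such a request could look like with the OpenAI Python SDK, assuming the modalities and audio parameters described for this launch (the voice, output format, and prompt below are illustrative):

```python
# A minimal sketch of audio output via the Chat Completions API, assuming
# the gpt-4o-audio-preview model and the `modalities`/`audio` parameters
# described for this launch.
import base64
from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {"role": "user", "content": "Explain the Realtime API in one sentence."},
    ],
)

message = completion.choices[0].message
# The spoken reply arrives as base64-encoded audio alongside a transcript.
with open("reply.wav", "wb") as f:
    f.write(base64.b64decode(message.audio.data))
print(message.audio.transcript)
```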
The Realtime API uses both text tokens and audio tokens. Text input is priced at $5 per 1M tokens and text output at $20 per 1M tokens. Audio input is priced at $100 per 1M tokens and audio output at $200 per 1M tokens. This equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be priced the same.
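For back-of-the-envelope budgeting, the per-token prices above can be turned into a session cost estimate. The token counts in the example call below are illustrative; in practice they would come from the usage the API reports for a session.

```python
# A quick cost estimate from the per-1M-token prices quoted above.
PRICE_PER_1M = {
    "text_in": 5.00,
    "text_out": 20.00,
    "audio_in": 100.00,
    "audio_out": 200.00,
}

def estimate_cost(text_in: int, text_out: int, audio_in: int, audio_out: int) -> float:
    """Return the approximate USD cost of a session from its token counts."""
    return (
        text_in * PRICE_PER_1M["text_in"]
        + text_out * PRICE_PER_1M["text_out"]
        + audio_in * PRICE_PER_1M["audio_in"]
        + audio_out * PRICE_PER_1M["audio_out"]
    ) / 1_000_000

# Example usage with illustrative token counts.
print(f"${estimate_cost(1_000, 2_000, 6_000, 12_000):.4f}")
```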
Safety & privacy
The Realtime API uses multiple layers of safety protections to mitigate the risk of API abuse, including automated monitoring and human review of flagged model inputs and outputs. The Realtime API is built on the same version of GPT-4o that powers Advanced Voice Mode in ChatGPT, which we carefully assessed using both automated and human evaluations, including evaluations according to our Preparedness Framework, detailed in the GPT-4o System Card. The Realtime API also leverages the same audio safety infrastructure we built for Advanced Voice Mode, which our testing shows has helped to reduce the potential for harm.
It is against our usage policies to repurpose or distribute output from our services to spam, mislead, or otherwise harm others – and we actively monitor for potential abuse. Our policies also require developers to make it clear to their users that they are interacting with AI, unless it's obvious from the context.
Prior to launch, we tested the Realtime API with our external red teaming network and found that the Realtime API didn’t introduce any high-risk gaps not covered by our existing mitigations. As with all API services, the Realtime API is subject to our Enterprise privacy commitments. We do not train our models on the inputs or outputs used in this service without your explicit permission.
Getting started
Developers can start building with the Realtime API over the coming days in the Playground, or by using our docs and the reference client.
We’ve also worked with LiveKit and Agora to create client libraries of audio components like echo cancellation, reconnection, and sound isolation, and with Twilio to integrate the Realtime API with Twilio’s Voice APIs, which enable developers to seamlessly build, deploy, and connect AI virtual agents to customers via voice calls.
What’s next
As we work towards general availability, we’re actively collecting feedback to improve the Realtime API. Some of the capabilities we plan to introduce include:
More modalities: To start, the Realtime API will support voice, and we plan to add additional modalities like vision and video over time.
Increased rate limits: Today the API is rate limited to approximately 100 simultaneous sessions for Tier 5 developers, with lower limits for Tiers 1-4. We will increase these limits over time to support larger deployments.
Official SDK support: We will integrate support for the Realtime API into the OpenAI Python and Node.js SDKs.
Prompt Caching: We will add support for Prompt Caching so previous conversation turns can be reprocessed at a discount.
Expanded model support: The Realtime API will also support GPT-4o mini in upcoming versions of that model.
We're looking forward to seeing how developers leverage these new capabilities to create compelling new audio experiences for their users across a variety of use cases, from education to translation, customer service, accessibility, and beyond.