
October 1, 2024

Introducing the Realtime API

Developers can now build fast speech-to-speech experiences into their applications

[Image: DALL·E-generated impressionist oil painting of undulating orange audio waves in a harmonious, layered composition.]

Today, we're introducing a public beta of the Realtime API, enabling all paid developers to build low-latency, multimodal experiences in their apps. Similar to ChatGPT's Advanced Voice Mode, the Realtime API supports natural speech-to-speech conversations using the six preset voices already supported in the API.

We're also introducing audio input and output in the Chat Completions API to support use cases that don't require the low-latency benefits of the Realtime API. With this update, developers can pass any text or audio inputs into GPT-4o and have the model respond with their choice of text, audio, or both.

From language apps and educational software to customer support experiences, developers have already been leveraging voice experiences to connect with their users. Now with the Realtime API and soon with audio in the Chat Completions API, developers no longer have to stitch together multiple models to power these experiences. Instead, you can build natural conversational experiences with a single API call.

How it works

Previously, to create a similar voice assistant experience, developers had to transcribe audio with an automatic speech recognition model like Whisper, pass the text to a text model for inference or reasoning, and then play the model's output using a text-to-speech model. This approach often lost emotion, emphasis, and accent, and added noticeable latency. With the Chat Completions API, developers can handle the entire process with a single API call, though it remains slower than human conversation. The Realtime API improves on this by streaming audio inputs and outputs directly, enabling more natural conversational experiences. It can also handle interruptions automatically, much like Advanced Voice Mode in ChatGPT.
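For reference, here is a minimal sketch of that earlier stitched-together pipeline using the OpenAI Python SDK. The file names, model choices, and voice are illustrative, and each of the three network round trips contributes to the latency described above.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Transcribe the user's speech with Whisper.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2. Run the transcript through a text model.
chat = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = chat.choices[0].message.content

# 3. Convert the text reply back into speech.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer)
with open("answer.mp3", "wb") as out:
    out.write(speech.content)
```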

Under the hood, the Realtime API lets you create a persistent WebSocket connection to exchange messages with GPT-4o. The API supports function calling, which makes it possible for voice assistants to respond to user requests by triggering actions or pulling in new context. For example, a voice assistant could place an order on behalf of the user or retrieve relevant customer information to personalize its responses.
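As an illustration, a bare-bones text-in, audio-out session might look like the Python sketch below. It reflects our reading of the beta's WebSocket endpoint and JSON event names (session.update, conversation.item.create, response.create, response.audio.delta); check the API reference for the exact schema. The lookup_order tool is a hypothetical function included only to show how function calling is declared.

```python
import asyncio
import base64
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main() -> None:
    # On older `websockets` releases the keyword argument is `extra_headers`.
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # Configure the session: instructions, voice, and a tool the model
        # may invoke via function calling.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": "You are a concise, friendly voice assistant.",
                "voice": "alloy",
                "tools": [{
                    "type": "function",
                    "name": "lookup_order",  # hypothetical tool for illustration
                    "description": "Look up an order by its ID.",
                    "parameters": {
                        "type": "object",
                        "properties": {"order_id": {"type": "string"}},
                        "required": ["order_id"],
                    },
                }],
            },
        }))

        # Add a user turn and ask the model to respond.
        await ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "message",
                "role": "user",
                "content": [{"type": "input_text", "text": "Where is order 42?"}],
            },
        }))
        await ws.send(json.dumps({"type": "response.create"}))

        # Stream server events until the response is complete. Audio arrives
        # as base64-encoded chunks; a real app would decode and play them.
        audio_chunks = []
        async for raw in ws:
            event = json.loads(raw)
            if event["type"] == "response.audio.delta":
                audio_chunks.append(base64.b64decode(event["delta"]))
            elif event["type"] == "response.done":
                break

asyncio.run(main())
```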

Powering customer support agents, language learning assistants, and more

As part of our iterative deployment strategy, we’ve been testing the Realtime API with a handful of partners to gather feedback while we build. A couple of promising early use cases include:

Healthify, a nutrition and fitness coaching app, uses the Realtime API to enable natural conversations with its AI coach Ria, while involving human dietitians when needed for personalized support.

Speak, a language learning app, uses the Realtime API to power its role-play feature, encouraging users to practice conversations in a new language.

Availability & pricing

The Realtime API will begin rolling out today in public beta to all paid developers. Audio capabilities in the Realtime API are powered by the new GPT-4o model gpt-4o-realtime-preview.

Audio in the Chat Completions API will be released in the coming weeks as a new model, gpt-4o-audio-preview. With gpt-4o-audio-preview, developers can input text or audio into GPT-4o and receive responses in text, audio, or both.
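For example, a single Chat Completions request can take text in and return both a transcript and spoken audio. The sketch below uses the OpenAI Python SDK; the modalities and audio parameters shown reflect our understanding of the preview, so verify them against the API reference before relying on them.

```python
import base64

from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],               # ask for both text and audio back
    audio={"voice": "alloy", "format": "wav"},  # voice and container are illustrative
    messages=[{"role": "user", "content": "In one sentence, what is a WebSocket?"}],
)

# The spoken reply arrives base64-encoded alongside a text transcript.
message = completion.choices[0].message
with open("reply.wav", "wb") as f:
    f.write(base64.b64decode(message.audio.data))
```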

The Realtime API uses both text tokens and audio tokens. Text input tokens are priced at $5 per 1M tokens and text output tokens at $20 per 1M tokens. Audio input tokens are priced at $100 per 1M tokens and audio output tokens at $200 per 1M tokens. This equates to approximately $0.06 per minute of audio input and $0.24 per minute of audio output. Audio in the Chat Completions API will be priced the same.
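As a back-of-the-envelope check, those per-minute approximations make it easy to estimate what a session will cost. The helper below is purely illustrative and uses only the prices listed above.

```python
# Approximate Realtime API prices in USD, taken from the figures above.
AUDIO_IN_PER_MIN = 0.06       # ≈ $100 per 1M audio input tokens
AUDIO_OUT_PER_MIN = 0.24      # ≈ $200 per 1M audio output tokens
TEXT_IN_PER_TOKEN = 5 / 1_000_000
TEXT_OUT_PER_TOKEN = 20 / 1_000_000

def estimate_session_cost(audio_in_min: float, audio_out_min: float,
                          text_in_tokens: int = 0, text_out_tokens: int = 0) -> float:
    """Rough cost of one session in dollars."""
    return (audio_in_min * AUDIO_IN_PER_MIN
            + audio_out_min * AUDIO_OUT_PER_MIN
            + text_in_tokens * TEXT_IN_PER_TOKEN
            + text_out_tokens * TEXT_OUT_PER_TOKEN)

# A 10-minute call with roughly equal talk time on each side:
print(f"${estimate_session_cost(5, 5):.2f}")  # ≈ $1.50
```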

Safety & privacy

The Realtime API uses multiple layers of safety protections to mitigate the risk of API abuse, including automated monitoring and human review of flagged model inputs and outputs. The Realtime API is built on the same version of GPT-4o that powers Advanced Voice Mode in ChatGPT, which we carefully assessed using both automated and human evaluations, including evaluations according to our Preparedness Framework, detailed in the GPT-4o System Card. The Realtime API also leverages the same audio safety infrastructure we built for Advanced Voice Mode, which our testing shows has helped to reduce the potential for harm.

It is against our usage policies to repurpose or distribute output from our services to spam, mislead, or otherwise harm others, and we actively monitor for potential abuse. Our policies also require developers to make it clear to their users that they are interacting with AI, unless it's obvious from the context.

Prior to launch, we tested the Realtime API with our external red teaming network and found that the Realtime API didn’t introduce any high-risk gaps not covered by our existing mitigations. As with all API services, the Realtime API is subject to our Enterprise privacy commitments. We do not train our models on the inputs or outputs used in this service without your explicit permission.

Getting started

Developers can start building with the Realtime API over the coming days in the Playground, or by using our docs and the reference client.

We've also worked with LiveKit and Agora to create client libraries of audio components like echo cancellation, reconnection, and sound isolation, and with Twilio to integrate the Realtime API with Twilio's Voice APIs, which let developers seamlessly build, deploy, and connect AI virtual agents to customers via voice calls.

What’s next

As we work towards general availability, we’re actively collecting feedback to improve the Realtime API. Some of the capabilities we plan to introduce include:

  • More modalities: To start, the Realtime API will support voice, and we plan to add additional modalities like vision and video over time.

  • Increased rate limits: Today the API is rate limited to approximately 100 simultaneous sessions for Tier 5 developers, with lower limits for Tiers 1-4. We will increase these limits over time to support larger deployments.

  • Official SDK support: We will integrate support for the Realtime API into the OpenAI Python and Node.js SDKs.

  • Prompt Caching: We will add support for Prompt Caching so previous conversation turns can be reprocessed at a discount.

  • Expanded model support: The Realtime API will also support GPT-4o mini in upcoming versions of that model.

We're looking forward to seeing how developers leverage these new capabilities to create compelling new audio experiences for their users across a variety of use cases from education to translation, customer service, accessibility and beyond.