May 7, 2026

Advancing voice intelligence with new models in the API

A new generation of realtime voice models that can reason, translate, and transcribe as people speak.


We’re introducing three audio models in the API that unlock a new class of voice apps for developers. With these models, developers can build voice experiences that feel more natural, respond more intelligently, and take action in real time:

GPT‑Realtime‑2, our first voice model with GPT‑5‑class reasoning that can handle harder requests and carry the conversation forward naturally.
GPT‑Realtime‑Translate, a new live translation model that translates speech from 70+ input languages into 13 output languages while keeping pace with the speaker.
GPT‑Realtime‑Whisper, a new streaming speech-to-text model that transcribes speech live as the speaker talks.

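Try GPT-Realtime-2

Start a session, then talk naturally with GPT-Realtime-2.

What can you ask?

After starting a session, try saying one of the examples below.

I'm suddenly hosting dinner tonight. I only have 30 minutes, two of my friends are vegetarian, and one person hates mushrooms. My kitchen is tiny, too. Put together a simple menu for me.
I'm welcoming guests at a live event in Japan. Give me a warm, natural welcome in Japanese, like a host kicking off a special occasion.
My order number is Orbit-742Q. Say it back to me clearly so I can confirm it's right.
Help me practice telling my team that we hit our launch goal. Say it first in a calm but confident way, then again with more excitement.
I'm putting together a quiz for a road trip. Give me three questions that look easy but are easy to get wrong, and explain each answer in one sentence.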


Voice is becoming one of the most natural ways for people to use software. It lets someone ask for help while driving, change a travel plan while walking through an airport, get support in their preferred language, or move through a task without stopping to type.

But building useful voice products takes more than fast turn-taking or a natural-sounding voice. A voice agent needs to understand what someone means, keep track of context, recover when a request changes, use tools while the conversation continues, and respond in a way that feels appropriate to the moment.

Together, the models we are launching move realtime audio from simple call-and-response toward voice interfaces that can actually do work: listen, reason, translate, transcribe, and take action as a conversation unfolds.

Voice as an interface between people and products

As voice becomes a more natural way to use software, we’re seeing developers build around three emerging patterns in voice AI:

Voice-to-action, where people can describe what they need and the system can reason through the request, use tools, and complete the task. For example, Zillow is building an assistant that can listen, reason, and act on requests like: “find me homes within my BuyAbility, avoid busy streets, and schedule a tour for Saturday.”
Systems-to-voice, where software can turn context into live spoken guidance. For example, a travel app could proactively tell a traveler: “Your inbound flight is delayed, but you can still make your connection. I found the new gate, mapped the fastest route through the terminal, and your bag is still expected to transfer.”
Voice-to-voice, where AI can help live conversations continue across languages, tasks, or changing context. For example, Deutsche Telekom is building voice support experiences where customers can speak in the language they’re most comfortable using, while the model translates the conversation in real time.

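Diagram showing three voice AI workflows: voice-to-action, which connects voice to apps such as coding and development, shopping, in-car, and scheduling tools; systems-to-voice, which connects apps, calendars, CRMs, and support dashboards to voice; and voice-to-voice, which connects two voice agents.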

These patterns can also work together. Priceline is working toward a future where travelers can manage entire trips by voice: searching for flights and hotels conversationally, handling changes like adjusting a hotel reservation after a flight delay or getting real-time updates on TSA wait times, and translating conversations once travelers are on the ground.

Realtime voice: helping voice models reason and take action

GPT‑Realtime‑2 is built for live voice interactions where the model keeps the conversation moving while it reasons through a request, calls tools, handles corrections or interruptions, and responds in a way that fits the moment.

Preambles: Developers can enable short phrases before a main response, like “let me check that” or “one moment while I look into it,” so users know the agent is working on the request.
Parallel tool calls and tool transparency: The model can call multiple tools at once and make those actions audible with phrases like “checking your calendar” or “looking that up now,” helping agents stay responsive while completing tasks.
Stronger recovery behavior: The model can recover more gracefully by saying things like “I’m having trouble with that right now,” instead of failing silently or breaking the conversation.
Longer context for agentic workflows: We’re increasing the context window from 32K to 128K to support longer, more coherent sessions and more complex task flows.
Stronger domain understanding: The model better retains specialized terminology, proper nouns, healthcare terms, and other vocabulary that matters in production settings.
More controllable tone and delivery: The model can better adjust its tone—speaking calmly while resolving an issue, empathetically when a user is frustrated, or upbeat when confirming a successful action.
Adjustable reasoning effort: Developers can now select from minimal, low, medium, high, and xhigh reasoning levels, with low as the default, balancing lower latency for straightforward interactions with more deliberate reasoning for complex requests.
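The behaviors listed above can be pictured with a small, self-contained sketch. It is not the Realtime API: `check_calendar`, `look_up`, `run_tools_in_parallel`, and `pick_effort` are hypothetical names invented for illustration, and the tools only simulate latency. It shows the general idea of announcing each tool with a short spoken phrase and then running the calls concurrently, plus a validator for the five reasoning levels named in the announcement.

```python
import asyncio

# Reasoning levels named in the announcement; "low" is the stated default.
REASONING_EFFORTS = ("minimal", "low", "medium", "high", "xhigh")
DEFAULT_EFFORT = "low"


def pick_effort(requested=None):
    """Return a valid reasoning level, falling back to the default."""
    if requested is None:
        return DEFAULT_EFFORT
    if requested not in REASONING_EFFORTS:
        raise ValueError(f"unknown reasoning effort: {requested!r}")
    return requested


# Hypothetical tools: stand-ins for real calendar and search back ends.
async def check_calendar(day):
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"{day}: free after 2pm"


async def look_up(query):
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"3 results for {query}"


async def run_tools_in_parallel(calls, speak):
    """Announce each tool, then await all of them concurrently.

    `calls` is a list of (status_phrase, coroutine) pairs; `speak` is any
    callable that delivers a phrase to the user (here, a list append).
    """
    for phrase, _ in calls:
        speak(phrase)
    # gather() runs the coroutines concurrently and keeps input order.
    return await asyncio.gather(*(coro for _, coro in calls))


def demo():
    spoken = []
    results = asyncio.run(
        run_tools_in_parallel(
            [
                ("checking your calendar", check_calendar("Saturday")),
                ("looking that up now", look_up("homes near me")),
            ],
            speak=spoken.append,
        )
    )
    return spoken, results
```

Calling `demo()` returns the phrases that would be spoken and the tool results in the same order as the calls. In a real agent the model decides which tools to call and what to say; the sketch only illustrates why concurrent calls plus audible status phrases keep an agent responsive while it works.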

The gains show up on audio evals that map closely to production voice agents. GPT‑Realtime‑2 (high) scores 15.2% higher than GPT‑Realtime‑1.5 on Big Bench Audio for audio intelligence. GPT‑Realtime‑2 (xhigh) scores 13.8% higher than GPT‑Realtime‑1.5 on Audio MultiChallenge for instruction following, showing stronger reasoning, context management, and control in live conversations.

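Big Bench Audio evaluates the challenging reasoning capabilities of language models that support audio input. Audio MultiChallenge evaluates the multi-turn conversational intelligence of spoken dialogue systems, including instruction following, context integration, self-consistency, and handling of natural speech corrections.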

The strengths of GPT‑Realtime‑2 show up across a variety of use cases.

During early testing, businesses used GPT‑Realtime‑2 to build voice agents that help customers and employees get things done through natural conversation.

Realtime translation: build live multilingual voice experiences

GPT‑Realtime‑Translate helps developers build live multilingual voice experiences where each person can speak in their preferred language, hear the conversation translated in real time, and read live transcriptions. It supports more than 70 input languages and 13 output languages, making it useful for customer support, cross-border sales, education, events, media, and creator platforms serving global audiences.

For developers, live translation needs to preserve meaning while keeping pace with the speaker, even when people speak naturally, switch context, or use regional pronunciation and domain-specific language. For example, Deutsche Telekom is testing the model for multilingual voice interactions, where lower latency and stronger fluency can make cross-language conversations feel more natural.

In this video, Vimeo shows how GPT‑Realtime‑Translate can translate a product education video live as it plays, so global customers can hear updates in their preferred language without waiting for a separately produced version.

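“Building voice AI for India means handling a wide range of regional phonetic characteristics. In our evaluations across Hindi, Tamil, and Telugu, GPT-Realtime-Translate delivered a 12.5% lower Word Error Rate than any other model we tested, with lower fallback rates, higher task completion, and latency that kept conversations natural. It sets a new standard for multilingual voice AI.”

Prateek Sachan, Co-founder and CTO, BolnaAI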

Realtime transcription: build low-latency transcription experiences

GPT‑Realtime‑Whisper is a new streaming transcription model built for low-latency speech-to-text. It transcribes audio as people speak, so live products can feel faster, more responsive, and more natural—from captions that appear in the moment, to meeting notes that keep up with the conversation.

The model makes live speech usable inside business workflows as it happens. Teams can power captions for meetings, classrooms, broadcasts, and events; generate notes and summaries while conversations are still in progress; build voice agents that need to understand users continuously; and create faster follow-up workflows for customer support, healthcare, sales, recruiting, and other high-volume spoken interactions.
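As one way to picture the captioning use case, here is a minimal sketch of turning streamed transcript fragments into caption lines. It assumes the transcription arrives as incremental text fragments; the `CaptionBuffer` class, its `width` parameter, and the fragment format are hypothetical and are not taken from the Realtime API.

```python
class CaptionBuffer:
    """Collects streamed transcript fragments into fixed-width caption lines."""

    def __init__(self, width=32):
        self.width = width
        self.pending = ""  # text not yet emitted as a full caption line

    def feed(self, delta):
        """Add a fragment; return any caption lines that are now complete."""
        self.pending += delta
        lines = []
        while len(self.pending) > self.width:
            # Break at the last space that fits, so words are not split.
            cut = self.pending.rfind(" ", 0, self.width + 1)
            if cut <= 0:
                cut = self.width  # a single very long word: hard break
            line = self.pending[:cut].strip()
            if line:
                lines.append(line)
            self.pending = self.pending[cut:].lstrip()
        return lines

    def flush(self):
        """Return the remaining text when the speaker stops."""
        rest, self.pending = self.pending.strip(), ""
        return rest
```

A caller would pass each fragment to `feed()` as it arrives, display any returned lines immediately, and call `flush()` at the end of an utterance. Holding back only the unfinished line keeps captions close to the speaker without splitting words.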

Safety

The Realtime API incorporates multiple layers of safeguards and mitigations to help prevent misuse. We employ active classifiers over Realtime API sessions, meaning certain conversations can be halted if they are detected as violating our harmful content guidelines. Developers can also add their own safety guardrails using the Agents SDK.

Our usage policies⁠⁠ prohibit repurposing or distributing outputs from our services for spam, deception, or other harmful purposes. Developers must also make it clear to end users when they’re interacting with AI, unless it’s already obvious from the context.

The Realtime API fully supports EU Data Residency for EU-based applications and is covered by our enterprise privacy commitments.

Pricing & availability

GPT‑Realtime‑2, GPT‑Realtime‑Translate and GPT‑Realtime‑Whisper are available in the Realtime API. GPT‑Realtime‑2 is priced at $32 / 1M audio input tokens ($0.40 for cached input tokens) and $64 / 1M audio output tokens. GPT‑Realtime‑Translate is priced at $0.034 per minute. GPT‑Realtime‑Whisper is priced at $0.017 per minute.
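To make the pricing concrete, the sketch below estimates costs from the figures quoted above. It rests on two assumptions: the $0.40 cached-input price is read as per 1M tokens, like the other token prices, and only audio tokens are counted, since no text-token pricing is given here. The function names are illustrative.

```python
from decimal import Decimal

# Prices quoted in the announcement (USD).
AUDIO_INPUT_PER_1M = Decimal("32")
CACHED_INPUT_PER_1M = Decimal("0.40")  # assumed to be per 1M tokens as well
AUDIO_OUTPUT_PER_1M = Decimal("64")
TRANSLATE_PER_MIN = Decimal("0.034")
WHISPER_PER_MIN = Decimal("0.017")

MILLION = Decimal(1_000_000)


def realtime2_cost(input_tokens, cached_tokens, output_tokens):
    """Estimate GPT-Realtime-2 audio cost in USD for one session.

    `input_tokens` counts uncached audio input tokens only.
    """
    return (
        input_tokens * AUDIO_INPUT_PER_1M
        + cached_tokens * CACHED_INPUT_PER_1M
        + output_tokens * AUDIO_OUTPUT_PER_1M
    ) / MILLION


def per_minute_cost(minutes, rate):
    """Estimate cost in USD for a per-minute priced model."""
    return Decimal(minutes) * rate
```

Under these assumptions, a session with 100,000 uncached input tokens, 50,000 cached input tokens, and 20,000 output tokens comes to $3.20 + $0.02 + $1.28 = $4.50. One hour of audio costs $2.04 with GPT-Realtime-Translate and $1.02 with GPT-Realtime-Whisper.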

Get started

You can test the new realtime voice models in the Playground.

To start building, open this prompt in Codex⁠ to add GPT‑Realtime‑2 to an existing app or start a new one. If you don’t have Codex yet, download the Codex app⁠ first.

Author

OpenAI
