OpenAI GPT-Realtime-2: new voice models for AI agents

OpenAI launches three new voice models on May 7, 2026

On May 7, 2026, OpenAI announced three new voice models for the Realtime API: GPT-Realtime-2, GPT-Realtime-Translate and GPT-Realtime-Whisper. Together they bring GPT-5-class reasoning, live translation from more than 70 input languages and streaming transcription to AI voice agents.

For businesses using an AI phone agent, this means shorter wait times, better multilingual conversations and agents that handle more complex tasks autonomously. This article covers what each model does, what it costs and what the launch means for Heyloha customers.

GPT-Realtime-2: a voice model that reasons

GPT-Realtime-2 is OpenAI's first voice model with GPT-5-class reasoning. It can handle complex requests, call tools in parallel and continue the conversation naturally while it thinks. The context window has increased from 32,000 to 128,000 tokens, enabling longer and more coherent sessions.

Four innovations stand out. Adjustable reasoning effort (minimal, low, medium, high, xhigh) lets developers trade latency against reasoning depth. Preambles let the agent say a short phrase like 'one moment' before it starts working on a request. Parallel tool calls with audio feedback ('checking your calendar now') keep conversations fluid. Stronger domain understanding means healthcare terminology, proper nouns and jargon are retained better.
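To make the latency trade-off concrete, here is a minimal sketch of how an application might choose one of the five effort levels per turn. Only the level names come from the announcement. The TurnProfile type, the pick_effort function and the heuristic itself are hypothetical illustrations, not part of the OpenAI API.

```python
from dataclasses import dataclass

# The five effort levels named in the announcement, ordered from fastest
# to most thorough.
EFFORT_LEVELS = ("minimal", "low", "medium", "high", "xhigh")


@dataclass(frozen=True)
class TurnProfile:
    """Hypothetical description of a conversational turn (not an OpenAI type)."""
    tool_calls_expected: int  # how many tools the agent will likely need
    latency_sensitive: bool   # True for quick back-and-forth such as greetings


def pick_effort(turn: TurnProfile) -> str:
    """Illustrative heuristic: more expected tool use means more effort."""
    # Clamp to the highest level so large tool counts still map to "xhigh".
    index = min(turn.tool_calls_expected, len(EFFORT_LEVELS) - 1)
    if turn.latency_sensitive:
        # Step down one level so the caller hears a reply sooner.
        index = max(index - 1, 0)
    return EFFORT_LEVELS[index]
```

With this heuristic a greeting maps to minimal effort, while a multi-step booking that needs several tools maps to high or xhigh. A real deployment would tune the thresholds against measured latency.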

OpenAI reports 15.2% higher scores on Big Bench Audio than GPT-Realtime-1.5 and 13.8% higher on Audio MultiChallenge. Zillow, an early adopter, reported a 26-point lift in call success rate (95% vs 69%) on their hardest benchmark.
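A 26-point lift is a difference in percentage points, which is not the same as a 26% relative improvement. The short calculation below, using only the two Zillow figures quoted above, shows both readings. The function name is our own.

```python
def lift(new_rate: float, old_rate: float) -> tuple[float, float]:
    """Return (percentage-point lift, relative lift in percent)."""
    points = new_rate - old_rate
    relative = points / old_rate * 100
    return points, relative


points, relative = lift(95.0, 69.0)
print(f"{points:.0f} points, {relative:.1f}% relative")
# prints "26 points, 37.7% relative"
```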

GPT-Realtime-Translate: live translation across more than 70 languages

GPT-Realtime-Translate translates speech in real time from more than 70 input languages into 13 output languages. Two people can each speak in their own preferred language and hear the other person translated into it. The model also produces live transcripts during the conversation.

Use cases: multilingual customer support, cross-border sales calls, online education, events and streaming platforms with global audiences. BolnaAI reported 12.5% lower Word Error Rates for Hindi, Tamil and Telugu compared to other models tested. Deutsche Telekom is testing the model for customer support where customers can speak in the language they are most comfortable with.

GPT-Realtime-Whisper: streaming transcription with low latency

GPT-Realtime-Whisper is a new streaming speech-to-text model. It transcribes speech as someone talks, with adjustable latency. Lower settings produce faster partial transcripts; higher settings improve transcript quality.
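For a sense of what consuming partial transcripts involves, here is a minimal sketch of a caption buffer that shows settled text plus the latest partial guess. The event shape (a dict with "type" and "text") and the idea of a separate "final" event are assumptions made for illustration. They are not the Realtime API's actual event schema.

```python
class CaptionBuffer:
    """Keeps settled text and the newest partial transcript for display."""

    def __init__(self) -> None:
        self.final_segments: list[str] = []
        self.partial = ""

    def handle(self, event: dict) -> str:
        """Apply one hypothetical transcript event and return the caption."""
        if event["type"] == "final":
            # A settled segment replaces whatever partial text was showing.
            self.final_segments.append(event["text"])
            self.partial = ""
        elif event["type"] == "partial":
            # Each partial supersedes the previous partial for this segment.
            self.partial = event["text"]
        return self.render()

    def render(self) -> str:
        parts = self.final_segments + ([self.partial] if self.partial else [])
        return " ".join(parts)
```

A lower latency setting would deliver partial events sooner and revise them more often. A higher setting would mean fewer revisions before text settles.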

Practical applications: live captions for meetings and events, notes that keep up with conversations, voice agents that continuously follow the user, and faster follow-up workflows in customer support, healthcare and sales.

Three new patterns for voice AI

OpenAI identifies three patterns developers are now building around. Voice-to-action: the user describes what they want and the system reasons, uses tools and completes the task. Zillow is building an assistant that responds to requests like 'find homes within my budget, avoid busy streets and book a tour for Saturday'.

Systems-to-voice: software turns context into live spoken guidance. A travel app can proactively say: 'Your inbound flight is delayed, but you can still make your connection. The new gate is X, the fastest route is Y'.

Voice-to-voice: AI helps live conversations continue across language barriers. Deutsche Telekom is building voice support where customers can speak in their preferred language and the model translates in real time.

What this means for Heyloha customers

Heyloha has been running on the OpenAI Realtime API since March 2026. The Heyloha phone agent already uses the production version of OpenAI's voice technology, with fast responses, natural intonation and automatic language detection.

GPT-Realtime-2 is now on our roadmap. We are evaluating the model on quality, latency and cost before rolling it out to customers. The improved reasoning and larger context window suit conversations that require multiple steps, like booking appointments or answering complex product questions.

For live translation, we are looking at GPT-Realtime-Translate as a complement to existing multilingual chat. Heyloha already supports 5 platform languages and automatic language detection. With this model, seamless multilingual voice becomes a realistic next step.

Pricing and availability

GPT-Realtime-2 costs 32 dollars per 1 million audio input tokens (0.40 dollars for cached input) and 64 dollars per 1 million audio output tokens. GPT-Realtime-Translate costs 0.034 dollars per minute. GPT-Realtime-Whisper costs 0.017 dollars per minute.
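As a rough illustration of how these prices add up, the sketch below computes the cost of a session from the rates quoted above. The token counts and durations in the example are made up. The announcement as summarised here does not say how many audio tokens a minute of speech consumes, so that conversion is left to the caller.

```python
from decimal import Decimal

# Prices in US dollars, taken from the paragraph above.
PRICE_PER_M_AUDIO_INPUT = Decimal("32")
PRICE_PER_M_CACHED_INPUT = Decimal("0.40")
PRICE_PER_M_AUDIO_OUTPUT = Decimal("64")
TRANSLATE_PER_MINUTE = Decimal("0.034")
WHISPER_PER_MINUTE = Decimal("0.017")

MILLION = Decimal(1_000_000)


def realtime2_cost(input_tokens: int, cached_input_tokens: int,
                   output_tokens: int) -> Decimal:
    """Cost of one GPT-Realtime-2 session, given its audio token counts."""
    total = (
        input_tokens * PRICE_PER_M_AUDIO_INPUT
        + cached_input_tokens * PRICE_PER_M_CACHED_INPUT
        + output_tokens * PRICE_PER_M_AUDIO_OUTPUT
    )
    return total / MILLION


def per_minute_cost(minutes: int, rate: Decimal) -> Decimal:
    """Cost of a per-minute model (Translate or Whisper) for whole minutes."""
    return minutes * rate


# Illustrative numbers only: 10,000 fresh input tokens, 5,000 cached input
# tokens and 8,000 output tokens.
example = realtime2_cost(10_000, 5_000, 8_000)
print(f"GPT-Realtime-2 example session: {example:.3f} dollars")
print(f"10 min of Translate: {per_minute_cost(10, TRANSLATE_PER_MINUTE):.3f} dollars")
print(f"10 min of Whisper: {per_minute_cost(10, WHISPER_PER_MINUTE):.3f} dollars")
```

Decimal is used instead of float so that small per-minute rates do not pick up binary rounding noise. Heyloha customers do not need this calculation, since these costs are included in Heyloha plans.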

All three models are available via the OpenAI Realtime API. The Realtime API supports EU Data Residency for European applications. Heyloha customers do not pay OpenAI directly: Heyloha plans are all-in and agent costs are included. See pricing for an overview.

Frequently asked questions

What is GPT-Realtime-2? GPT-Realtime-2 is OpenAI's voice model for AI agents with GPT-5-class reasoning, a 128,000-token context window and adjustable reasoning effort. It was announced on May 7, 2026.

What is the difference between GPT-Realtime-2 and GPT-Realtime-Whisper? GPT-Realtime-2 is a speech-to-speech model that listens, reasons and responds. GPT-Realtime-Whisper is a speech-to-text model that transcribes without responding. Use GPT-Realtime-2 for a phone agent and GPT-Realtime-Whisper for live captions.

Which languages does GPT-Realtime-Translate support? GPT-Realtime-Translate translates more than 70 input languages into 13 output languages, including Dutch, English, German, French, Spanish and Hindi.

Does Heyloha use GPT-Realtime-2 yet? Heyloha has used the OpenAI Realtime API for the phone agent since March 2026. GPT-Realtime-2 is currently being evaluated for a future update.

Try Heyloha

Want to experience what a modern AI voice agent based on OpenAI's Realtime API can do? Start free with Heyloha and call your own number. No credit card needed, and your agent is live in 30 minutes.