
OpenAI Realtime API voice models reset the AI voice agent market

OpenAI Realtime API voice models launch 7 May 2026 with GPT-Realtime-2, Translate and Whisper. 128K context window, low-cost translation, GPT-5 class reasoning.

OpenAI Realtime API voice models just shipped a generation jump with GPT-Realtime-2, GPT-Realtime-Translate and GPT-Realtime-Whisper, and the per-minute price on translation is the headline that should worry any voice startup whose moat is made of latency. OpenAI announced the three models on 7 May and moved the Realtime API from beta to general availability on the same day.

Key facts
  • OpenAI Realtime API voice models GPT-Realtime-2, GPT-Realtime-Translate and GPT-Realtime-Whisper launched 7 May 2026 alongside Realtime API general availability.
  • GPT-Realtime-2 carries GPT-5 class reasoning, a 128K context window, five reasoning effort levels and tool-use precision improvements.
  • GPT-Realtime-Translate handles over 70 input languages with 13 output languages at $0.034 per minute.
  • GPT-Realtime-Whisper delivers streaming transcription at $0.017 per minute, beating standalone Whisper-3 on common benchmarks.

Why the OpenAI Realtime API voice models change the maths

The OpenAI Realtime API voice models matter because the audio AI market spent two years building voice agents on top of cascaded ASR-LLM-TTS pipelines. Those pipelines added 700 to 1,200 ms of latency per turn and demanded a small army of fine-tuned speech models. GPT-Realtime-2 collapses that into a single endpoint that listens, reasons, calls tools, optionally takes an image as input and answers in voice, all without leaving the model. That is the architectural step Google’s Gemini Live and Anthropic’s Claude voice tier have been working toward, and OpenAI got there first to general availability.

Pricing is the second part of the story. At $0.034 per minute, live translation across 70-plus input languages is priced below Google Cloud Translate’s streaming tier and below ElevenLabs’ multilingual offering. We covered the ElevenLabs AI music push last month, and the same company is now under genuine pricing pressure on its core voice business. Whisper as a paid streaming product at $0.017 per minute also upends the calculus of self-hosting Whisper-3 for anything below several thousand hours per month.
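To put the per-minute prices into monthly terms, here is a minimal sketch of the arithmetic. The two rates are the figures quoted in this article; the 3,000-hour volume is purely illustrative, and the function name is ours rather than anything from an OpenAI SDK.

```python
from decimal import Decimal

# Per-minute list prices quoted in this article (USD).
TRANSLATE_USD_PER_MIN = Decimal("0.034")
WHISPER_USD_PER_MIN = Decimal("0.017")


def monthly_cost_usd(hours_per_month: int, usd_per_minute: Decimal) -> Decimal:
    """Return the monthly bill for a given audio volume at a flat per-minute rate."""
    minutes = Decimal(hours_per_month) * 60
    return minutes * usd_per_minute


if __name__ == "__main__":
    # 3,000 hours/month is an illustrative volume, not a figure from OpenAI.
    rates = [("Translate", TRANSLATE_USD_PER_MIN), ("Whisper", WHISPER_USD_PER_MIN)]
    for label, rate in rates:
        print(f"{label}: ${monthly_cost_usd(3000, rate):,.2f} per month")
```

At that volume hosted Whisper comes to $3,060 a month, which is the number a self-hosted GPU bill plus engineering time would have to beat.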

GPT-Realtime-2 reasoning levels and tool calling

GPT-Realtime-2 is the first speech-to-speech model OpenAI describes as having GPT-5 class reasoning. Developers can pick from minimal, low, medium, high and xhigh reasoning effort, with low set as the default. Low is the sensible balance for call-centre work, restaurant ordering, hands-free admin and other latency-sensitive jobs. High is where the model thinks before it speaks, useful for legal triage and any voice agent that needs to weigh options before committing to an action. xhigh is the deep-reasoning lane that OpenAI is positioning for diagnostic and decision-support voice products.
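As a rough illustration of how a team might encode that choice, the sketch below maps workloads to the five effort levels named above. The workload labels and the `pick_effort` helper are hypothetical; the article does not give the API parameter that carries the level, so the sketch stops short of building a request.

```python
from enum import Enum


class ReasoningEffort(str, Enum):
    """The five effort levels named in the article, lowest to highest."""

    MINIMAL = "minimal"
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    XHIGH = "xhigh"


# Hypothetical workload labels mapped to the levels the article suggests.
EFFORT_BY_WORKLOAD = {
    "call_centre": ReasoningEffort.LOW,
    "restaurant_ordering": ReasoningEffort.LOW,
    "legal_triage": ReasoningEffort.HIGH,
    "decision_support": ReasoningEffort.XHIGH,
}


def pick_effort(workload: str) -> ReasoningEffort:
    """Return the effort level for a workload, falling back to the stated default (low)."""
    return EFFORT_BY_WORKLOAD.get(workload, ReasoningEffort.LOW)
```

Keeping the mapping in one table makes the latency trade-off an explicit, reviewable decision rather than a value scattered across call sites.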

The context window jumped from 32K to 128K tokens, which is the change voice agent builders have been begging for. A 128K window means a single Realtime session can carry an entire two-hour customer history, multiple tool schemas, an MCP server’s worth of memory and the in-call audio without summarisation tricks. Performance benchmarks support the story – OpenAI says GPT-Realtime-2 scores 15.2% higher on Big Bench Audio and 13.8% higher on Audio MultiChallenge than GPT-Realtime-1.5. Those are real margins, not vendor noise.
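A simple budget check shows what the larger window buys. This is a sketch under stated assumptions: the per-component token counts are invented for illustration, and real counts depend on the tokenizer and on how audio is metered.

```python
CONTEXT_WINDOW_TOKENS = 128_000  # window size quoted in the article


def remaining_tokens(components: dict[str, int], window: int = CONTEXT_WINDOW_TOKENS) -> int:
    """Return tokens left after summing all components; negative means over budget."""
    return window - sum(components.values())


# Illustrative token estimates only, not measured values.
session = {
    "customer_history": 60_000,
    "tool_schemas": 8_000,
    "in_call_audio": 40_000,
}

if __name__ == "__main__":
    print("headroom at 128K:", remaining_tokens(session))
    print("headroom at 32K:", remaining_tokens(session, window=32_000))
```

With these example figures the session fits in 128K with room to spare, while the same load would overflow the previous 32K window and force summarisation.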

Translate and Whisper: the OpenAI Realtime API voice models for everyone else

Model | Price | Use case | MTW read
GPT-Realtime-2 | £25 (about $32) per 1M audio input tokens | Production voice agents with reasoning | The new default, kills the cascaded stack.
GPT-Realtime-Translate | $0.034/min | Live simultaneous translation | Cheaper than Google streaming translate, kills boutique startups.
GPT-Realtime-Whisper | $0.017/min | Streaming transcription | Pure utility play, hosted Whisper makes self-hosting irrational below scale.

GPT-Realtime-Translate is the model that should make UK and EU contact-centre operators rethink their localisation budgets. Its 70-plus input languages and 13 output targets cover the European and Asian markets a London-based business actually serves, and at $0.034 per minute it is now cheaper than the legacy outsourced live-translation services in finance and aviation. The catch is target-language coverage – Welsh, Irish Gaelic and Scottish Gaelic are not on the 13-output list yet, and OpenAI has been opaque about when smaller languages get parity. The Anthropic AWS £79B (about $100B) deal we wrote about in April was partly about exactly this kind of voice-translation workload, and OpenAI just sharpened the competition.

What UK developers should watch

UK developers should care about three things. First, the Realtime API now supports SIP phone calling and remote MCP servers, which means a voice agent can answer a real PSTN number in minutes rather than the weeks Twilio integrations used to demand. Second, image inputs landed at the same time – your voice agent can accept a photo mid-call and act on it, which closes the gap with Google’s Gemini 3.1 Flash Lite multimodal route. Third, the pricing is denominated in USD, so any UK procurement model needs an FX buffer baked in.
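For the FX point, a minimal sketch of padding a USD bill into a GBP budget line. The 0.79 exchange rate and the 5% buffer in the example are assumptions for illustration, not figures from OpenAI or a recommendation.

```python
from decimal import Decimal, ROUND_HALF_UP


def gbp_budget(usd_cost: Decimal, gbp_per_usd: Decimal, buffer_pct: Decimal) -> Decimal:
    """Convert a USD bill to GBP and pad it by buffer_pct percent, rounded to pence."""
    padded = usd_cost * gbp_per_usd * (1 + buffer_pct / 100)
    return padded.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    # Assumed inputs: a $1,000 monthly bill, 0.79 GBP per USD, 5% buffer.
    print(gbp_budget(Decimal("1000"), Decimal("0.79"), Decimal("5")))
```

Using `Decimal` rather than floats keeps the pence exact, which matters once the figure lands in a procurement spreadsheet.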

The risk for OpenAI is the same as ever – reliability. We wrote about the Claude double outage in April and the OpenAI Realtime API’s beta period had its own service interruptions in Q1. General availability raises the SLA bar, and any UK-regulated voice product needs to plan for fallback ASR. The model is excellent. The dependency is real.
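A fallback path can be as simple as trying transcription backends in order. The sketch below is vendor-neutral by design: each backend is a plain callable you would supply yourself, and no real SDK call is shown or implied.

```python
from typing import Callable, Sequence

# A backend is any function that turns raw audio bytes into text.
Transcriber = Callable[[bytes], str]


def transcribe_with_fallback(audio: bytes, backends: Sequence[Transcriber]) -> str:
    """Try each backend in order and return the first successful transcript."""
    errors: list[Exception] = []
    for backend in backends:
        try:
            return backend(audio)
        except Exception as exc:  # broad on purpose: any backend failure triggers fallback
            errors.append(exc)
    raise RuntimeError(f"all {len(backends)} transcription backends failed: {errors}")
```

In practice the first entry would wrap the hosted service and the second a self-hosted or alternative-vendor model, with timeouts enforced inside each wrapper so a hung primary does not stall the call.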

OpenAI Realtime API voice models and the multimodal angle

The launch ships image inputs alongside the new audio models, which closes a real gap. A voice agent on a customer support line can now accept a photo of a damaged product mid-call and respond with a description, an action recommendation and a tool call – all from the same Realtime session. The cascaded alternative would have required a separate vision model, a separate state-passing layer and at least one network hop between systems. OpenAI’s bundled approach cuts that to a single endpoint and a single billing meter, which is exactly the simplification the production voice agent market has been demanding since 2024.

The MCP integration is the other quietly important detail. Remote MCP server support means a voice agent can connect to a corporate knowledge base, a CRM, a calendar and a payment processor without rebuilding the function-calling layer for each. Anthropic’s MCP rollout passed 97 million installs last month, which we covered in our MCP adoption analysis, and OpenAI adopting the standard for Realtime API is the moment the protocol becomes the de facto inter-model integration layer. UK voice product teams that already shipped an MCP server for chat now get voice for free.

MTW verdict

OpenAI Realtime API voice models reset the voice agent market for the second time in eighteen months. GPT-Realtime-Translate at $0.034 per minute is the single product launch that will reshape pricing across UK voice startups. Build on it, but architect a fallback path to self-hosted Whisper-3 or Gemini Live before you commit a customer contract to a single vendor.
