
🤖 Ghostwritten by Claude Opus 4.6 · Fact-checked & edited by GPT 5.4
OpenAI's May 7, 2026, realtime voice launch matters for one reason above all: developers can now price voice features before they build them. As reported by TechCrunch, OpenAI introduced three API models with clear metering: GPT-Realtime-2 at $32 per 1M audio-input tokens and $64 per 1M audio-output tokens, GPT-Realtime-Translate at $0.034 per minute, and GPT-Realtime-Whisper at $0.017 per minute. That turns live voice reasoning, translation, and transcription into budgetable infrastructure rather than experimental capability.
For practitioners, the immediate takeaway is architectural. If a use case needs live voice reasoning, GPT-Realtime-2 is the relevant building block. If it needs multilingual spoken translation, GPT-Realtime-Translate is the simpler pricing model. If it needs streaming speech-to-text, GPT-Realtime-Whisper offers the cheapest entry point of the three. The right choice depends less on novelty than on call volume, interaction length, and how much complexity a team wants to own in the pipeline.
This guide focuses on those three models, what each one is for, and how to think about pricing and system design around them.
TL;DR: OpenAI's realtime voice API now exposes three distinct building blocks: live voice reasoning, live translation, and streaming transcription, each with a different pricing model.
As reported by TechCrunch on May 7, 2026, the launch breaks cleanly into three products:
| Model | Primary Use Case | Pricing Unit | Cost |
|---|---|---|---|
| GPT-Realtime-2 | Live voice reasoning | Per 1M audio tokens | $32 input / $64 output |
| GPT-Realtime-Translate | Real-time translation | Per minute | $0.034/min |
| GPT-Realtime-Whisper | Streaming speech-to-text | Per minute | $0.017/min |
The pricing split is the first architectural clue. GPT-Realtime-2 is billed like a model you reason with, while Translate and Whisper are billed like media services. That makes the latter two easier to budget directly against minutes of usage, while GPT-Realtime-2 requires teams to watch token consumption more closely.
One broader market note is worth keeping in view: around the same week, xAI also expanded its voice surface with Custom Voices on Grok 4.3, underscoring how quickly realtime voice became a competitive infrastructure layer across vendors. Still, the core story here is OpenAI's three-model lineup and the four prices attached to it.
TL;DR: GPT-Realtime-2 is the model for live voice agents, priced at $32 per 1M audio-input tokens and $64 per 1M audio-output tokens.
GPT-Realtime-2 is the reasoning layer in the lineup. Its role is straightforward: handle live voice interactions where the system needs to listen and respond in an ongoing conversation.
That makes it the relevant option for voice agents and other spoken interactions that require ongoing reasoning during the conversation.
The key planning challenge is not capability but cost predictability. Because GPT-Realtime-2 is billed per audio token rather than per minute, teams should treat it differently from the translation and transcription models. The practical move is to instrument a pilot, measure token usage on representative calls, and then project spend from actual traffic patterns.
What should not happen is guessing at a token-to-minute conversion and treating that estimate as a budget. The launch reporting gives the token prices, but it does not provide a canonical latency benchmark or a universal audio-token conversion ratio. For that reason, GPT-Realtime-2 is best evaluated with real traces from the intended workflow rather than generic assumptions.
In practice, that means asking a few concrete questions early:
Those answers determine whether an end-to-end voice agent is economically attractive long before model quality becomes the deciding factor.
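The pilot-then-project approach described above can be sketched in a few lines of Python. This is a minimal illustration, not an official tool: the two prices come from the launch reporting, while the `CallTrace` record, the function names, and the pilot token counts are all hypothetical. Real token counts have to come from measured traces.

```python
from dataclasses import dataclass
from decimal import Decimal

# Prices from the launch reporting, in USD per 1M audio tokens.
INPUT_PRICE_PER_M = Decimal("32")
OUTPUT_PRICE_PER_M = Decimal("64")
ONE_MILLION = Decimal(1_000_000)


@dataclass(frozen=True)
class CallTrace:
    """Token counts measured on one pilot call (hypothetical record shape)."""
    input_tokens: int
    output_tokens: int


def call_cost(trace: CallTrace) -> Decimal:
    """Cost of one call at the published per-token prices."""
    return (
        Decimal(trace.input_tokens) * INPUT_PRICE_PER_M
        + Decimal(trace.output_tokens) * OUTPUT_PRICE_PER_M
    ) / ONE_MILLION


def project_monthly_spend(traces: list[CallTrace], monthly_calls: int) -> Decimal:
    """Scale the mean measured cost per call to an expected monthly volume."""
    if not traces:
        raise ValueError("need at least one pilot trace")
    mean_cost = sum((call_cost(t) for t in traces), Decimal(0)) / len(traces)
    return mean_cost * monthly_calls


# Made-up pilot numbers for illustration only; they are not a conversion ratio.
pilot = [CallTrace(10_000, 5_000), CallTrace(20_000, 10_000)]
print(project_monthly_spend(pilot, monthly_calls=1_000))
```

Averaging cost per call, rather than converting tokens to minutes, keeps the projection tied to what was actually measured in the pilot.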
TL;DR: GPT-Realtime-Translate supports 70+ source languages into 13 target languages at $0.034 per minute, making multilingual voice workflows easy to model financially.
GPT-Realtime-Translate is the simplest of the three models to budget because its pricing is linear: $0.034 per minute. According to TechCrunch's reporting, it handles 70+ source languages and translates into 13 target languages.
That makes it a natural fit for scenarios such as:
The economics are easy to sketch:
| Monthly Call Volume | Avg. Call Length | Monthly Minutes | Monthly Cost |
|---|---|---|---|
| 1,000 calls | 5 min | 5,000 min | $170 |
| 5,000 calls | 5 min | 25,000 min | $850 |
| 10,000 calls | 8 min | 80,000 min | $2,720 |
| 50,000 calls | 8 min | 400,000 min | $13,600 |
At that rate, a 5-minute translated interaction costs $0.17. That is the kind of number a product manager or engineering lead can immediately plug into a forecast.
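Because the pricing is linear, the table above reduces to one multiplication. The following Python sketch reproduces those rows; the rate is from the launch reporting, and the function name is illustrative.

```python
from decimal import Decimal

# USD per minute, from the launch reporting.
TRANSLATE_RATE_PER_MIN = Decimal("0.034")


def monthly_translate_cost(calls: int, avg_minutes: int) -> Decimal:
    """Linear cost: total minutes times the per-minute rate."""
    return calls * avg_minutes * TRANSLATE_RATE_PER_MIN


# The same scenarios as the table above: (monthly calls, average call length).
scenarios = [(1_000, 5), (5_000, 5), (10_000, 8), (50_000, 8)]
for calls, minutes in scenarios:
    cost = monthly_translate_cost(calls, minutes)
    print(f"{calls:>6,} calls x {minutes} min = ${cost:,.2f}")
```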
The main constraint is not price but fit. The model translates from a broad set of source languages into a narrower set of target languages, so teams should confirm that their required output languages are covered before they commit to a workflow around it. The launch reporting establishes only the counts (70+ source languages, 13 target languages); implementation planning still depends on the exact target-language needs of the product.
TL;DR: GPT-Realtime-Whisper provides streaming speech-to-text at $0.017 per minute, making it the lowest-cost entry point in OpenAI's realtime voice lineup.
GPT-Realtime-Whisper is the transcription component: streaming speech-to-text at $0.017 per minute. Of the three models, it is the easiest to justify for teams that already know they need live transcripts.
Typical use cases include live captions, meeting transcripts, and other realtime text pipelines.
The cost math is straightforward:
| Use Case | Monthly Hours | Monthly Cost |
|---|---|---|
| 50 meetings × 1 hr | 50 hrs (3,000 min) | $51 |
| 200 meetings × 1 hr | 200 hrs (12,000 min) | $204 |
| 24/7 single-channel monitoring | 730 hrs (43,800 min) | $744.60 |
| Call center (10 agents, 6 hrs/day, 30 days) | 1,800 hrs (108,000 min) | $1,836 |
That pricing makes GPT-Realtime-Whisper attractive wherever streaming matters more than a packaged SaaS interface. A team that can integrate an API directly may find the economics compelling, especially for high-volume transcription workloads.
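For transcription workloads, most of the estimating work is turning a staffing or monitoring description into audio minutes. The sketch below does that for the call-center row in the table above. It assumes a 30-day month, which is what the 1,800-hour figure implies, and the function names are illustrative.

```python
from decimal import Decimal

# USD per minute, from the launch reporting.
WHISPER_RATE_PER_MIN = Decimal("0.017")


def streaming_minutes(channels: int, hours_per_day: int, days: int) -> int:
    """Total audio minutes across `channels` concurrent streams."""
    return channels * hours_per_day * days * 60


def whisper_cost(minutes: int) -> Decimal:
    """Per-minute transcription cost for a given number of audio minutes."""
    return minutes * WHISPER_RATE_PER_MIN


# Call-center row: 10 agents, 6 hours a day, assuming a 30-day month.
call_center = streaming_minutes(channels=10, hours_per_day=6, days=30)
print(call_center, whisper_cost(call_center))
```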
The architectural distinction to keep in mind is simple: this model is for streaming transcription. If a workflow only needs transcripts after the fact, the realtime path may not be necessary. But if the transcript needs to appear while someone is speaking, this is the model in the lineup designed for that job.
TL;DR: The main design decision is whether to use GPT-Realtime-2 for live voice reasoning or compose separate transcription, reasoning, and output layers for more control.
The three-model launch creates two broad architectural patterns.
The first is the direct path: use GPT-Realtime-2 for live voice reasoning in a single realtime interaction loop.
The second is the composed path: use GPT-Realtime-Whisper for streaming transcription, pass text into a separate reasoning layer, and then handle output through another component as needed.
A simple comparison looks like this:
| Approach | Cost Model | Control | Operational Complexity |
|---|---|---|---|
| GPT-Realtime-2 | Audio-token based | Lower component-level control | Lower pipeline complexity |
| Whisper + separate reasoning/output layers | Mixed metering across components | Higher component-level control | Higher pipeline complexity |
The tradeoff is not abstract. A single-model path can simplify implementation and reduce the number of moving parts. A composed pipeline can make it easier to swap components, tune each stage independently, or route different workloads through different systems.
For example:
The important point is that the launch gives developers discrete building blocks rather than a single monolithic voice product. That is useful both technically and financially.
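One way to make the comparison concrete is to put both paths on a per-call cost basis. The Python sketch below is a rough framing under stated assumptions: the GPT-Realtime-2 and GPT-Realtime-Whisper prices come from the launch reporting, but the reasoning and output costs of the composed path are hypothetical inputs that depend on whichever components a team chooses. The example numbers are invented for illustration.

```python
from dataclasses import dataclass
from decimal import Decimal

ONE_MILLION = Decimal(1_000_000)
# Prices from the launch reporting.
REALTIME2_INPUT_PER_M = Decimal("32")   # USD per 1M audio-input tokens
REALTIME2_OUTPUT_PER_M = Decimal("64")  # USD per 1M audio-output tokens
WHISPER_PER_MIN = Decimal("0.017")      # USD per minute


def direct_cost(input_tokens: int, output_tokens: int) -> Decimal:
    """Per-call cost of the direct path, from measured audio-token counts."""
    return (
        input_tokens * REALTIME2_INPUT_PER_M
        + output_tokens * REALTIME2_OUTPUT_PER_M
    ) / ONE_MILLION


@dataclass(frozen=True)
class ComposedCall:
    """One call on the composed path."""
    minutes: Decimal
    reasoning_cost: Decimal  # hypothetical: cost of the separate reasoning layer
    output_cost: Decimal     # hypothetical: cost of the separate output layer


def composed_cost(call: ComposedCall) -> Decimal:
    """Whisper minutes plus whatever the other two layers cost per call."""
    return call.minutes * WHISPER_PER_MIN + call.reasoning_cost + call.output_cost


# Invented example values; replace with measurements from a pilot.
direct = direct_cost(input_tokens=10_000, output_tokens=5_000)
composed = composed_cost(ComposedCall(Decimal(5), Decimal("0.10"), Decimal("0.20")))
print(f"direct=${direct} composed=${composed}")
```

Cost is only one axis of the decision. The comparison says nothing about latency or quality, which still need to be measured on each path.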
TL;DR: The practical way to evaluate these models is to start with usage forecasts and latency requirements, then choose the simplest architecture that meets both.
The strongest developer takeaway from this launch is that voice should now be treated like any other metered infrastructure category. Before building, teams can estimate spend from expected minutes, expected token usage, and projected traffic.
A useful planning sequence looks like this:
That last point is easy to miss. Model pricing is now transparent enough to make spreadsheet planning possible, but model spend is still only one line item. A voice product that looks cheap at the API layer can become expensive once routing, monitoring, and operational safeguards are included.
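A spreadsheet-style sketch of that point, in Python: model spend is one entry alongside the other operational line items. The model figure reuses the 10,000-call, 8-minute translation scenario priced earlier; the overhead amounts are placeholders with no basis in the launch reporting.

```python
from decimal import Decimal


def total_monthly_cost(model_spend: Decimal, overheads: dict[str, Decimal]) -> Decimal:
    """Model spend plus every non-model line item."""
    return model_spend + sum(overheads.values(), Decimal(0))


def model_share(model_spend: Decimal, overheads: dict[str, Decimal]) -> Decimal:
    """Fraction of the total budget that goes to the model API."""
    return model_spend / total_monthly_cost(model_spend, overheads)


# 80,000 translated minutes at $0.034 per minute.
model_spend = Decimal("2720")
# Placeholder overheads, invented for illustration.
overheads = {
    "routing": Decimal("500"),
    "monitoring": Decimal("300"),
    "safeguards": Decimal("400"),
}

total = total_monthly_cost(model_spend, overheads)
print(f"total=${total:,.2f} model share={model_share(model_spend, overheads):.0%}")
```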
The broader industry context reinforces the point. May 2026 saw a wider push toward agentic and voice infrastructure across major labs. OpenAI's realtime voice models are one expression of that shift, but the practical lesson is vendor-agnostic: voice is no longer a novelty feature. It is a metered system component, and teams should evaluate it with the same discipline they apply to compute, storage, and networking.
GPT-Realtime-2 is priced at $32 per 1M audio-input tokens and $64 per 1M audio-output tokens. GPT-Realtime-Translate costs $0.034 per minute. GPT-Realtime-Whisper costs $0.017 per minute.
GPT-Realtime-2 is the live voice reasoning model in the lineup. It is the model to evaluate for voice agents and other spoken interactions that require ongoing reasoning during the conversation.
TechCrunch reported that GPT-Realtime-Translate accepts 70+ source languages and translates into 13 target languages. Teams should confirm that their required target languages are included before building around it.
GPT-Realtime-Whisper is the right fit when transcription needs to happen as audio arrives rather than after a recording is complete. That makes it suitable for live captions, meeting transcripts, and other realtime text pipelines.
Start with actual usage patterns: session length, monthly volume, and, for GPT-Realtime-2, real token consumption from a pilot. Then measure end-to-end responsiveness in the intended environment, since the launch reporting provides pricing but not benchmark latency figures.
OpenAI's realtime voice launch is important less because it makes voice possible and more because it makes voice legible. With one token-priced reasoning model and two per-minute media models, developers can now compare architectures in financial terms instead of treating voice as a fuzzy R&D category. That changes how voice projects should be scoped: start with pricing, validate with pilot data, and choose the narrowest model that fits the job.