OpenAI's Realtime Voice Models: A Developer's Pricing & Architecture Guide

🤖 Ghostwritten by Claude Opus 4.6 · Fact-checked & edited by GPT 5.4

OpenAI's May 7, 2026 realtime voice launch matters for one reason above all: developers can now price voice features before they build them. As reported by TechCrunch, OpenAI introduced three API models with clear metering: GPT-Realtime-2 at $32 per 1M audio-input tokens and $64 per 1M audio-output tokens, GPT-Realtime-Translate at $0.034 per minute, and GPT-Realtime-Whisper at $0.017 per minute. That turns live voice reasoning, translation, and transcription into budgetable infrastructure rather than experimental capability.

For practitioners, the immediate takeaway is architectural. If a use case needs live voice reasoning, GPT-Realtime-2 is the relevant building block. If it needs multilingual spoken translation, GPT-Realtime-Translate covers it with flat per-minute pricing. If it needs streaming speech-to-text, GPT-Realtime-Whisper offers the cheapest entry point of the three. The right choice depends less on novelty than on call volume, interaction length, and how much complexity a team wants to own in the pipeline.

This guide focuses on those three models, what each one is for, and how to think about pricing and system design around them.

The Three Models at a Glance

TL;DR: OpenAI's realtime voice API now exposes three distinct building blocks: live voice reasoning, live translation, and streaming transcription. The first is billed per audio token; the other two are billed per minute, at different rates.

As reported by TechCrunch on May 7, 2026, the launch breaks cleanly into three products:

Model	Primary Use Case	Pricing Unit	Cost
GPT-Realtime-2	Live voice reasoning	Per 1M audio tokens	$32 input / $64 output
GPT-Realtime-Translate	Real-time translation	Per minute	$0.034/min
GPT-Realtime-Whisper	Streaming speech-to-text	Per minute	$0.017/min

The pricing split is the first architectural clue. GPT-Realtime-2 is billed like a model you reason with, while Translate and Whisper are billed like media services. That makes the latter two easier to budget directly against minutes of usage, while GPT-Realtime-2 requires teams to watch token consumption more closely.

One broader market note is worth keeping in view: around the same week, xAI also expanded its voice surface with Custom Voices on Grok 4.3, underscoring how quickly realtime voice became a competitive infrastructure layer across vendors. Still, the core story here is OpenAI's three-model lineup and the four prices attached to it.

GPT-Realtime-2: Live Voice Reasoning

TL;DR: GPT-Realtime-2 is the model for live voice agents, priced at $32 per 1M audio-input tokens and $64 per 1M audio-output tokens.

GPT-Realtime-2 is the reasoning layer in the lineup. Its role is straightforward: handle live voice interactions where the system needs to listen and respond in an ongoing conversation.

That makes it the relevant option for use cases such as:

Voice agents that need to manage multi-turn conversations
Hands-free assistants for workers in motion
Interactive systems that need spoken input and spoken output in the same loop

The key planning challenge is not capability but cost predictability. Because GPT-Realtime-2 is billed per audio token rather than per minute, teams should treat it differently from the translation and transcription models. The practical move is to instrument a pilot, measure token usage on representative calls, and then project spend from actual traffic patterns.

What should not happen is guessing at a token-to-minute conversion and treating that estimate as a budget. The launch reporting gives the token prices, but it does not provide a canonical latency benchmark or a universal audio-token conversion ratio. For that reason, GPT-Realtime-2 is best evaluated with real traces from the intended workflow rather than generic assumptions.

In practice, that means asking a few concrete questions early:

How long are typical conversations?
How verbose are system responses likely to be?
How much variance exists between short and long calls?
What monthly token spend does that imply at expected volume?

Those answers determine whether an end-to-end voice agent is economically attractive long before model quality becomes the deciding factor.
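A minimal sketch of that projection, assuming a pilot has logged audio-input and audio-output token counts for each call. The function names and the `(input_tokens, output_tokens)` tuple layout are illustrative assumptions; only the $32 and $64 per 1M token rates come from the reporting cited above.

```python
from statistics import mean

# Rates quoted in the launch reporting (USD per 1M audio tokens).
INPUT_USD_PER_1M = 32.0
OUTPUT_USD_PER_1M = 64.0


def call_cost(input_tokens, output_tokens):
    """Cost in USD of one call, given its measured audio token counts."""
    return (input_tokens * INPUT_USD_PER_1M
            + output_tokens * OUTPUT_USD_PER_1M) / 1_000_000


def project_monthly_spend(pilot_calls, expected_monthly_calls):
    """Project monthly spend from per-call token counts logged in a pilot.

    pilot_calls: iterable of (input_tokens, output_tokens) tuples, one per call.
    """
    costs = sorted(call_cost(i, o) for i, o in pilot_calls)
    if not costs:
        raise ValueError("pilot_calls is empty; measure real calls first")
    avg = mean(costs)
    return {
        "mean_cost_per_call": avg,
        "min_cost_per_call": costs[0],   # cheapest call seen in the pilot
        "max_cost_per_call": costs[-1],  # most expensive call seen in the pilot
        "projected_monthly_usd": avg * expected_monthly_calls,
    }
```

The function deliberately takes measured token counts rather than call minutes, so no token-to-minute ratio is assumed anywhere. The spread between the minimum and maximum cost per call gives a rough view of the variance question raised above.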

GPT-Realtime-Translate: Multilingual Voice at a Per-Minute Rate

TL;DR: GPT-Realtime-Translate supports 70+ source languages into 13 target languages at $0.034 per minute, making multilingual voice workflows easy to model financially.

GPT-Realtime-Translate is easy to budget because its pricing is linear: $0.034 per minute. According to TechCrunch's reporting, it handles 70+ source languages and translates into 13 target languages.

That makes it a natural fit for scenarios such as:

Multilingual support lines
Live event or webinar translation
Cross-language communication in operational settings

The economics are easy to sketch:

Monthly Call Volume	Avg. Call Length	Monthly Minutes	Monthly Cost
1,000 calls	5 min	5,000 min	$170
5,000 calls	5 min	25,000 min	$850
10,000 calls	8 min	80,000 min	$2,720
50,000 calls	8 min	400,000 min	$13,600

At that rate, a 5-minute translated interaction costs $0.17. That is the kind of number a product manager or engineering lead can immediately plug into a forecast.
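The table above can be reproduced with a few lines of arithmetic. This sketch assumes usage is billed linearly on total minutes with no per-call rounding, which the reporting does not specify; the function name is illustrative.

```python
from decimal import Decimal

# Per-minute rate quoted in the launch reporting (USD).
TRANSLATE_USD_PER_MIN = Decimal("0.034")


def monthly_translate_cost(calls, avg_minutes):
    """Return (total minutes, cost in USD) for a month of translated calls."""
    minutes = calls * avg_minutes
    return minutes, TRANSLATE_USD_PER_MIN * minutes


# Same scenarios as the table: (monthly calls, average call length in minutes).
for calls, avg_len in [(1_000, 5), (5_000, 5), (10_000, 8), (50_000, 8)]:
    minutes, cost = monthly_translate_cost(calls, avg_len)
    print(f"{calls:>6,} calls x {avg_len} min = {minutes:>7,} min -> ${cost:,.2f}")
```

`Decimal` is used instead of floats so that currency totals come out exact rather than accumulating binary rounding error.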

The main constraint is not price but fit. The model translates from a broad set of source languages into a narrower set of target languages, so teams should confirm that their required output languages are covered before they commit to a workflow around it. The launch reporting establishes the 70+-to-13 framing; implementation planning still depends on the exact target-language needs of the product.

GPT-Realtime-Whisper: Streaming Speech-to-Text

TL;DR: GPT-Realtime-Whisper provides streaming speech-to-text at $0.017 per minute, making it the lowest-cost entry point in OpenAI's realtime voice lineup.

GPT-Realtime-Whisper is the transcription component: streaming speech-to-text at $0.017 per minute. Of the three models, it is the easiest to justify for teams that already know they need live transcripts.

Typical use cases include:

Live meeting transcription
Real-time captioning
Voice input layers for applications where typing is inconvenient
Streaming text feeds that downstream systems can analyze

The cost math is straightforward:

Use Case	Monthly Hours	Monthly Cost
50 meetings × 1 hr	50 hrs (3,000 min)	$51
200 meetings × 1 hr	200 hrs (12,000 min)	$204
24/7 single-channel monitoring (average month)	730 hrs (43,800 min)	$744.60
Call center (10 agents, 6 hrs/day, 30 days)	1,800 hrs (108,000 min)	$1,836

That pricing makes GPT-Realtime-Whisper attractive wherever streaming matters more than a packaged SaaS interface. A team that can integrate an API directly may find the economics compelling, especially for high-volume transcription workloads.
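As a check on the table, the sketch below derives each row's hours and cost from its inputs. It assumes linear per-minute billing, an average month of 730 hours for the 24/7 row, and a 30-day month for the call center row; the names are illustrative.

```python
from decimal import Decimal

# Per-minute rate quoted in the launch reporting (USD).
WHISPER_USD_PER_MIN = Decimal("0.017")

# 365 days * 24 hours / 12 months, the figure behind the 24/7 row.
HOURS_PER_AVG_MONTH = 365 * 24 // 12


def monthly_cost_from_hours(hours):
    """Cost in USD of transcribing the given number of audio hours."""
    minutes = hours * 60
    return WHISPER_USD_PER_MIN * minutes


workloads = {
    "50 meetings x 1 hr": 50 * 1,
    "200 meetings x 1 hr": 200 * 1,
    "24/7 single-channel monitoring": HOURS_PER_AVG_MONTH,
    "Call center (10 agents, 6 hrs/day, 30 days)": 10 * 6 * 30,
}

for label, hours in workloads.items():
    cost = monthly_cost_from_hours(hours)
    print(f"{label}: {hours:,} hrs -> ${cost:,.2f}")
```

Note that the call center figure scales with concurrent audio streams: ten agents talking for six hours produce sixty billable audio hours per day, not six.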

The architectural distinction to keep in mind is simple: this model is for streaming transcription. If a workflow only needs transcripts after the fact, the realtime path may not be necessary. But if the transcript needs to appear while someone is speaking, this is the model in the lineup designed for that job.

Architecture Choices: End-to-End Voice vs. Composed Pipelines

TL;DR: The main design decision is whether to use GPT-Realtime-2 for live voice reasoning or compose separate transcription, reasoning, and output layers for more control.

The three-model launch creates two broad architectural patterns.

The first is the direct path: use GPT-Realtime-2 for live voice reasoning in a single realtime interaction loop.

The second is the composed path: use GPT-Realtime-Whisper for streaming transcription, pass text into a separate reasoning layer, and then handle output through another component as needed.

A simple comparison looks like this:

Approach	Cost Model	Control	Operational Complexity
GPT-Realtime-2	Audio-token based	Lower component-level control	Lower pipeline complexity
Whisper + separate reasoning/output layers	Mixed metering across components	Higher component-level control	Higher pipeline complexity

The tradeoff is not abstract. A single-model path can simplify implementation and reduce the number of moving parts. A composed pipeline can make it easier to swap components, tune each stage independently, or route different workloads through different systems.

For example:

A team building a straightforward voice agent may prefer GPT-Realtime-2 for simplicity.
A team that already has a preferred reasoning stack may use GPT-Realtime-Whisper as the front end and keep the rest of the pipeline modular.
A multilingual workflow may layer GPT-Realtime-Translate into a support or event pipeline where per-minute predictability matters more than custom orchestration.
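The cost side of that comparison can be sketched per call as follows. Everything except the three quoted rates is an assumption: `CallProfile`, its field names, and the idea that the reasoning and speech-output stages of a composed pipeline each have a per-call cost that the team supplies from its own stack. No prices for those stages are given in the source.

```python
from dataclasses import dataclass

# Rates quoted in the launch reporting (USD).
WHISPER_USD_PER_MIN = 0.017
RT2_INPUT_USD_PER_1M = 32.0
RT2_OUTPUT_USD_PER_1M = 64.0


@dataclass
class CallProfile:
    """One representative call (hypothetical structure for illustration)."""
    minutes: float           # audio duration sent to transcription
    rt2_input_tokens: int    # audio-input tokens measured in a pilot
    rt2_output_tokens: int   # audio-output tokens measured in a pilot
    reasoning_usd: float     # per-call cost of your own reasoning layer
    output_usd: float        # per-call cost of your own speech-output layer


def direct_cost(p: CallProfile) -> float:
    """Per-call cost of the single-model GPT-Realtime-2 path."""
    return (p.rt2_input_tokens * RT2_INPUT_USD_PER_1M
            + p.rt2_output_tokens * RT2_OUTPUT_USD_PER_1M) / 1_000_000


def composed_cost(p: CallProfile) -> float:
    """Per-call cost of Whisper plus separately supplied reasoning and output."""
    return p.minutes * WHISPER_USD_PER_MIN + p.reasoning_usd + p.output_usd


def compare_paths(p: CallProfile) -> dict:
    """Return both per-call costs and which path is cheaper for this profile."""
    d, c = direct_cost(p), composed_cost(p)
    return {"direct_usd": d, "composed_usd": c,
            "cheaper": "direct" if d <= c else "composed"}
```

Price is only one column of the comparison table; a path that wins here can still lose on pipeline complexity or on how much control the team needs over each stage.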

The important point is that the launch gives developers discrete building blocks rather than a single monolithic voice product. That is useful both technically and financially.

Cost Modeling and Latency Planning

TL;DR: The practical way to evaluate these models is to start with usage forecasts and latency requirements, then choose the simplest architecture that meets both.

The strongest developer takeaway from this launch is that voice should now be treated like any other metered infrastructure category. Before building, teams can estimate spend from expected minutes, expected token usage, and projected traffic.

A useful planning sequence looks like this:

Estimate volume. How many calls, sessions, or meeting hours will the system process each month?
Choose the pricing model that matches the job. Per-minute pricing (Translate, Whisper) is easier to forecast; token pricing (GPT-Realtime-2) comes with live reasoning and has to be forecast from measured usage.
Measure real usage early. Especially for GPT-Realtime-2, pilot data matters more than assumptions.
Set a latency budget. Realtime systems succeed or fail on responsiveness, but the launch reporting does not provide benchmark latency numbers, so teams should measure their own end-to-end path.
Include non-model costs. Telephony, orchestration, observability, storage, and fallback handling all affect the true cost of a production voice system.

That last point is easy to miss. Model pricing is now transparent enough to make spreadsheet planning possible, but model spend is still only one line item. A voice product that looks cheap at the API layer can become expensive once routing, monitoring, and operational safeguards are included.
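One way to keep those line items visible is to roll them into the same estimate as model spend. This is a sketch under stated assumptions: the function name and dictionary keys are illustrative, and every dollar figure is supplied by the team, since the source gives no numbers for non-model costs.

```python
# Non-model line items named in the planning sequence above.
NON_MODEL_LINE_ITEMS = (
    "telephony",
    "orchestration",
    "observability",
    "storage",
    "fallback_handling",
)


def total_monthly_cost(model_usd, non_model_usd):
    """Combine model spend with non-model estimates into one monthly figure.

    non_model_usd: dict mapping each name in NON_MODEL_LINE_ITEMS to a
    monthly USD estimate. A missing item raises instead of defaulting to 0,
    so no category is silently left out of the forecast.
    """
    missing = [k for k in NON_MODEL_LINE_ITEMS if k not in non_model_usd]
    if missing:
        raise ValueError(f"missing estimates for: {', '.join(missing)}")
    overhead = sum(non_model_usd[k] for k in NON_MODEL_LINE_ITEMS)
    total = model_usd + overhead
    return {
        "model_usd": model_usd,
        "non_model_usd": overhead,
        "total_usd": total,
        # Fraction of total spend that goes to the model API itself.
        "model_share": model_usd / total if total else 0.0,
    }
```

A low `model_share` is the signal described above: the API layer looks cheap, but most of the budget is going to everything around it.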

The broader industry context reinforces the point. May 2026 saw a wider push toward agentic and voice infrastructure across major labs. OpenAI's realtime voice models are one expression of that shift, but the practical lesson is vendor-agnostic: voice is no longer a novelty feature. It is a metered system component, and teams should evaluate it with the same discipline they apply to compute, storage, and networking.

Frequently Asked Questions

Q: What are the prices for OpenAI's new realtime voice models?

GPT-Realtime-2 is priced at $32 per 1M audio-input tokens and $64 per 1M audio-output tokens. GPT-Realtime-Translate costs $0.034 per minute. GPT-Realtime-Whisper costs $0.017 per minute.

Q: What is GPT-Realtime-2 for?

GPT-Realtime-2 is the live voice reasoning model in the lineup. It is the model to evaluate for voice agents and other spoken interactions that require ongoing reasoning during the conversation.

Q: What languages does GPT-Realtime-Translate support?

TechCrunch reported that GPT-Realtime-Translate accepts 70+ source languages and translates into 13 target languages. Teams should confirm that their required target languages are included before building around it.

Q: When should a team choose GPT-Realtime-Whisper?

GPT-Realtime-Whisper is the right fit when transcription needs to happen as audio arrives rather than after a recording is complete. That makes it suitable for live captions, meeting transcripts, and other realtime text pipelines.

Q: What should developers measure first before adopting these models?

Start with actual usage patterns: session length, monthly volume, and, for GPT-Realtime-2, real token consumption from a pilot. Then measure end-to-end responsiveness in the intended environment, since the launch reporting provides pricing but not benchmark latency figures.
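For the responsiveness half of that answer, a minimal timing harness might look like the sketch below. It makes no assumptions about any vendor API: `run_turn` is a hypothetical callable that the team writes to perform one full round trip through its own pipeline, and the latency budget is whatever the product requires.

```python
import time
from statistics import quantiles


def measure_latencies(run_turn, n_turns):
    """Time n_turns calls of run_turn and return the durations in seconds."""
    samples = []
    for _ in range(n_turns):
        start = time.perf_counter()
        run_turn()  # one end-to-end round trip, supplied by the caller
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples_s, budget_s):
    """Report median and 95th-percentile latency against a budget in seconds."""
    if len(samples_s) < 2:
        raise ValueError("need at least two samples to compute percentiles")
    # quantiles(..., n=100) returns the 99 cut points between percentiles.
    cuts = quantiles(samples_s, n=100, method="inclusive")
    p50, p95 = cuts[49], cuts[94]
    return {"p50_s": p50, "p95_s": p95, "within_budget": p95 <= budget_s}
```

Checking the budget against the 95th percentile rather than the mean is a design choice: in a conversation, the slow turns are the ones users notice.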

Key Takeaways

OpenAI launched three realtime voice models on May 7, 2026, as reported by TechCrunch.
GPT-Realtime-2 is priced at $32 per 1M audio-input tokens and $64 per 1M audio-output tokens for live voice reasoning.
GPT-Realtime-Translate costs $0.034 per minute and supports 70+ source languages into 13 target languages.
GPT-Realtime-Whisper costs $0.017 per minute for streaming speech-to-text.
The real story is unit economics: developers can now model voice features before committing to an architecture.
The best architecture depends on workload shape: direct realtime reasoning for simplicity, or composed pipelines for more control.

Conclusion

OpenAI's realtime voice launch is important less because it makes voice possible and more because it makes voice legible. With one token-priced reasoning model and two per-minute media models, developers can now compare architectures in financial terms instead of treating voice as a fuzzy R&D category. That changes how voice projects should be scoped: start with pricing, validate with pilot data, and choose the narrowest model that fits the job.