How it works
You open a WebSocket to wss://api.opper.ai/v3/realtime, send a session.start frame that picks the model and configures the agent, then stream microphone audio in and play assistant audio out as it arrives. Opper resolves the provider from the model id, handles the upstream handshake (including provider quirks like xAI’s client-secret exchange), meters usage as you go, and gives you one event vocabulary so your client code stays the same no matter which model you target.
Just trying it out?
The Realtime quickstart is the fastest way to a working session. It has one runnable code block per environment (server or browser), and you’ll be live in under five minutes. For a production-shaped example with browser UI, microphone capture, audio playback, tool calls, and provider switching, see the brainstorm-time cookbook.
Authentication
There are two patterns, depending on whether your client runs server-side or in a browser.
Server-side: bearer token
Server-to-server clients (Node, Python, Go, anything that runs in your backend) connect directly with a project-scoped runtime API key in the Authorization header. The quickstart uses this pattern.
Other key types (opmak-…) are rejected. Only project-scoped runtime keys can open realtime sessions.
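A minimal sketch of the server-side pattern, assuming the Authorization header uses the standard Bearer scheme that the section title suggests. The helper name build_auth_headers and the placeholder key are hypothetical. How the headers are attached to the upgrade request depends on your WebSocket client library.

```python
REALTIME_URL = "wss://api.opper.ai/v3/realtime"


def build_auth_headers(api_key: str) -> dict[str, str]:
    """Build the handshake headers for a server-side realtime connection."""
    if not api_key:
        raise ValueError("an API key is required")
    # opmak- keys are rejected for realtime sessions, so fail fast locally
    # instead of waiting for a 401 at preflight.
    if api_key.startswith("opmak-"):
        raise ValueError("opmak- keys cannot open realtime sessions")
    # Assumption: standard "Bearer <key>" scheme.
    return {"Authorization": f"Bearer {api_key}"}


# Placeholder value for illustration only, not a real key format.
headers = build_auth_headers("example-runtime-key")
```

Pass REALTIME_URL and headers to whichever WebSocket client you use. Most server-side libraries accept extra handshake headers, though the parameter name varies.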
Browser: ephemeral tickets
Browsers can’t set an Authorization header on the native WebSocket constructor. To open a realtime session directly from a browser, your backend mints a single-use ticket and the browser redeems it on the upgrade request.
Step 1: server-side mint. Your backend hits POST /v3/realtime-sessions with its normal API key. The response carries a short-lived client_secret and an expires_at timestamp.
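The mint call could be prepared roughly like this. This is a sketch under several assumptions: the HTTPS host mirrors the wss:// host, the API key travels as a Bearer token, and the bound fields sit under a top-level config key in the JSON body. The helper names are hypothetical. Only the path, client_secret, and expires_at come from the description above.

```python
import json
import urllib.request

# Assumption: same host as the realtime WebSocket endpoint.
MINT_URL = "https://api.opper.ai/v3/realtime-sessions"


def build_mint_request(api_key: str, config: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST that mints a browser ticket."""
    body = json.dumps({"config": config}).encode("utf-8")  # body shape is assumed
    return urllib.request.Request(
        MINT_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def parse_mint_response(raw: bytes) -> dict:
    """Keep only the two fields the browser flow needs."""
    payload = json.loads(raw)
    return {
        "client_secret": payload["client_secret"],
        "expires_at": payload["expires_at"],
    }
```

To send it, pass the request to urllib.request.urlopen and feed the response bytes to parse_mint_response. Hand only client_secret to the browser and keep the API key on the server.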
Step 2: browser redeem. The browser presents the ticket on the upgrade request in the Sec-WebSocket-Protocol subprotocol header (the second WebSocket constructor argument). Prefer this form. Credentials in the URL query string end up in access logs, browser history, and Referer headers, while the subprotocol header is request-only and stays out of those places.
- Single-use. Replays return 401.
- Default TTL 60 seconds, max 5 minutes.
Pre-binding for security
Any field you populate in config at mint time is locked for that ticket. The browser can’t override it on session.start, and it can’t change it later with session.update either. This is the main safety guarantee. A leaked ticket can only open the session your backend authorized. It can’t pivot to a more expensive model or a different system prompt, and it can’t unlock the field by updating it after start.
At minimum, bind model. Tighter setups also bind instructions, tools, and voice. Fields left zero stay open for the browser to set in session.start.
To force a boolean off (for example, to stop the browser from enabling output transcription), list the field in locked_fields. Without it, a zero value in config doesn’t lock anything, so the browser could still flip the bool on.
A ticket with locked_fields but no config is also valid: the listed fields are force-set to their zero value regardless of what the browser sends. Unknown field names are rejected at mint time with 400 Bad Request.
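The locking rules above can be modeled locally to reason about what a ticket allows. The sketch below is an illustration of the described behaviour, not Opper's implementation. The ZERO_VALUES table covers only a few fields, and its zero values are assumptions.

```python
import copy

# Assumed zero value per field; only a subset of config fields is modeled here.
ZERO_VALUES = {
    "model": "",
    "voice": "",
    "instructions": "",
    "tools": [],
    "input_transcription": False,
    "output_transcription": False,
}


def effective_config(ticket_config: dict, locked_fields: list, browser_config: dict) -> dict:
    """Combine what the backend bound with what the browser asked for."""
    unknown = [name for name in locked_fields if name not in ZERO_VALUES]
    if unknown:
        # Mirrors the 400 Bad Request the mint endpoint returns.
        raise ValueError(f"unknown locked fields: {unknown}")
    result = dict(browser_config)
    # Fields listed in locked_fields are forced to their zero value.
    for name in locked_fields:
        result[name] = copy.deepcopy(ZERO_VALUES[name])
    # Populated (non-zero) ticket fields override whatever the browser sent.
    for name, value in ticket_config.items():
        if value:
            result[name] = value
    return result
```

For example, a ticket that binds model and lists output_transcription in locked_fields keeps the browser's voice choice but ignores its attempt to switch models or enable transcription.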
Preflight rejections
Before the WebSocket upgrade, Opper checks balance, plan, and concurrency caps. Failures come back as HTTP status codes rather than opaque WS closes, so you can react to them:
| Status | Meaning |
|---|---|
| 401 Unauthorized | Missing key, non-runtime key, or project-less key |
| 402 Payment Required | Balance exhausted, spend cap hit, or plan doesn’t support realtime |
| 429 Too Many Requests | Concurrent-session cap reached for this project |
| 503 Service Unavailable | Realtime endpoint not configured for the requested provider |
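One way a client might react to these statuses is a small lookup that separates failures worth retrying from those that need a configuration or billing change. The retry flags are a judgment call made for this sketch, not documented behaviour, and classify_preflight is a hypothetical helper.

```python
# status -> (meaning, worth retrying later without changing anything)
PREFLIGHT_FAILURES = {
    401: ("missing key, non-runtime key, or project-less key", False),
    402: ("balance exhausted, spend cap hit, or plan lacks realtime", False),
    429: ("concurrent-session cap reached for this project", True),
    503: ("realtime endpoint not configured for the requested provider", False),
}


def classify_preflight(status: int) -> tuple[str, bool]:
    """Return (meaning, retry_later) for a failed upgrade attempt."""
    return PREFLIGHT_FAILURES.get(status, (f"unexpected status {status}", False))
```

With this split, a 429 can feed a backoff loop that waits for another session to end, while the other codes surface to the operator.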
Session lifecycle
A session goes through four phases. Most of the time you only care about the middle two.
1. Connect. WebSocket upgrade. A successful response is 101 Switching Protocols. After this the connection is open, but the agent isn’t running yet.
2. Configure (session.start). Send this once as the first message after upgrade. The config selects the model, voice, system instructions, tool list, VAD parameters, and optional transcription toggles. Opper validates the config against the resolved model’s capabilities and rejects anything unsupported before dialing the upstream. The same validation runs on every later session.update, so unsupported modalities, voice, or reasoning_effort values return an error mid-stream.
3. Interact. Stream audio in (audio.append + audio.commit if not using server VAD), send text turns (text.input), receive audio.delta / text.delta frames, handle tool.call events, return tool.result.
4. Close. Either side can close the WebSocket. On a client-initiated close, Opper performs a final billing flush and tears down the upstream connection. On a server-initiated close (caps, idle timeout, balance exhaustion), you’ll see a closing event (session.terminating) with a structured reason before the connection ends.
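The four phases can be tracked on the client with a small state holder that refuses frames sent out of order. This is a sketch of one possible guard, not part of any SDK. It assumes the session.started event marks the move into the interact phase, and the class and method names are hypothetical.

```python
from enum import Enum


class Phase(Enum):
    CONNECT = 1    # upgraded, session.start not sent yet
    CONFIGURE = 2  # session.start sent, waiting for session.started
    INTERACT = 3   # audio, text, and tool traffic
    CLOSE = 4      # session.terminating or session.ended seen


class SessionPhases:
    def __init__(self) -> None:
        self.phase = Phase.CONNECT

    def on_send(self, event_type: str) -> None:
        """Call before sending a frame; raises if the frame is out of order."""
        if self.phase is Phase.CLOSE:
            raise RuntimeError("session is closing")
        if self.phase is Phase.CONNECT:
            if event_type != "session.start":
                raise RuntimeError("session.start must be the first frame")
            self.phase = Phase.CONFIGURE
        elif event_type == "session.start":
            raise RuntimeError("session.start is sent only once")

    def on_receive(self, event_type: str) -> None:
        """Call for every server event to keep the phase current."""
        if event_type == "session.started":
            self.phase = Phase.INTERACT
        elif event_type in ("session.terminating", "session.ended"):
            self.phase = Phase.CLOSE
```

Catching ordering mistakes locally gives a clearer error than a closed socket after the session-start grace period.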
Configuring the session
Everything goes in config on the first session.start frame:
| Field | Type | Notes |
|---|---|---|
| model | string | Required. Provider-prefixed model id, e.g. openai/gpt-realtime-2. See Per-provider notes. |
| voice | string | Provider-specific voice name. Optional; the model falls back to its default. |
| instructions | string | System prompt for the agent. |
| modalities | string[] | Output modalities. OpenAI realtime accepts a single value (["audio"] or ["text"]); xAI and Gemini accept either form. Default depends on model. |
| turn_detection | object | { type: "server_vad", threshold, prefix_padding_ms, silence_duration_ms }. Defaults are server-side. |
| tools | object[] | Tool schema the model can call. Same shape as the regular Calls API tools. |
| reasoning_effort | string | gpt-realtime-2 only. minimal / low / medium / high / xhigh. |
| input_transcription | bool | Surface transcript.committed events with the user’s speech transcribed. Off by default. |
| input_transcription_model | string | OpenAI only. Optional selector. Falls back to gpt-4o-mini-transcribe. |
| output_transcription | bool | Surface text.delta events with the assistant’s spoken words. Off by default. |
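A session.start frame built from the fields in this table might look like the sketch below. The type key on the envelope is an assumption about the wire format, and build_session_start is a hypothetical helper. The local checks only mirror what the table states and do not replace Opper's own validation.

```python
import json

KNOWN_FIELDS = {
    "voice", "instructions", "modalities", "turn_detection", "tools",
    "reasoning_effort", "input_transcription", "input_transcription_model",
    "output_transcription",
}
REASONING_EFFORTS = {"minimal", "low", "medium", "high", "xhigh"}


def build_session_start(model: str, **options) -> str:
    """Serialize the first frame of a session as JSON text."""
    if not model:
        raise ValueError("model is required")
    unknown = set(options) - KNOWN_FIELDS
    if unknown:
        raise ValueError(f"unknown config fields: {sorted(unknown)}")
    effort = options.get("reasoning_effort")
    if effort is not None and effort not in REASONING_EFFORTS:
        raise ValueError(f"unsupported reasoning_effort: {effort}")
    config = {"model": model}
    # Leave out unset options so server-side defaults apply.
    config.update({k: v for k, v in options.items() if v is not None})
    # Assumption: events carry their name in a "type" key.
    return json.dumps({"type": "session.start", "config": config})
```

A call such as build_session_start("openai/gpt-realtime-2", voice="alloy", turn_detection={"type": "server_vad"}) returns the text to send as the first WebSocket message.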
Event vocabulary
All events are JSON over text WebSocket frames.
Client → server
| Event | Purpose |
|---|---|
| session.start | First frame. Opens the session with config. |
| session.update | Update the config mid-session (provider-dependent; Gemini ignores this). |
| audio.append | Append base64-encoded PCM16 mic audio at the session’s input sample rate. |
| audio.commit | Mark end of speech (only needed when not using server VAD). |
| audio.clear | Discard buffered mic audio that hasn’t been committed yet. |
| text.input | Send a typed user message. |
| image.input | Send a still image as a user message. image_url accepts a base64 data URI (data:image/png;base64,...) on every vision-capable realtime model; OpenAI also accepts https URLs, while Gemini Live requires inline data and rejects https with a clear error. Optional text rides on the same message item so the model sees the question alongside the image. Optional image_detail ("auto" / "low" / "high") is honored by OpenAI and ignored by Gemini. Only supported on realtime models whose capabilities include vision; the gateway responds with an error event otherwise. Image bytes are not stored in the session trace recording. |
| response.create | Force the model to respond now (when not using server VAD). |
| response.cancel | Cancel an in-flight response. |
| tool.result | Return a tool result the model asked for via tool.call. |
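As an illustration of the audio.append encoding, the sketch below converts float samples to base64 PCM16. The type and audio key names and the little-endian byte order are assumptions, and pcm16_append_frame is a hypothetical helper. Resampling to the session's input sample rate is left to the caller.

```python
import base64
import json
import struct


def pcm16_append_frame(samples: list) -> str:
    """Encode float samples in [-1.0, 1.0] as an audio.append frame."""
    # Clip first so out-of-range samples cannot overflow the int16 range.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(s * 32767) for s in clipped]
    # "<h" is a little-endian signed 16-bit integer (assumed byte order).
    pcm = struct.pack(f"<{len(ints)}h", *ints)
    return json.dumps({
        "type": "audio.append",  # assumed envelope key
        "audio": base64.b64encode(pcm).decode("ascii"),  # assumed payload key
    })
```

Send one frame per captured chunk. When server VAD is off, follow the last chunk of a turn with audio.commit.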
Server → client
| Event | Purpose |
|---|---|
| session.started | Confirmation that the upstream session is live. Carries session_id, input_sample_rate, output_sample_rate, audio_format. |
| audio.delta | Assistant audio chunk (base64-encoded PCM16). Play as it arrives. |
| text.delta | Streaming assistant text. Only fires if output_transcription is on (or the model emits text natively). |
| transcript.committed | User speech transcript, after VAD commits a turn. Only fires if input_transcription is on. |
| speech.started / speech.stopped | VAD signals for the user’s microphone. Use these to interrupt playback when the user barges in. |
| response.started / response.completed | Lifecycle markers for an assistant turn. |
| tool.call | Model wants to call a tool. You must reply with a matching tool.result. |
| session.terminating | Server is closing the session. Carries a structured reason (see below). |
| session.ended | Final frame before the upstream connection closes. |
| error | Provider or protocol error mid-session. |
Possible values of session.terminating.error.code: session_timeout, idle_timeout, balance_exhausted, project_spend_cap_hit, org_spend_cap_hit, billing_not_supported.
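A receive loop usually reduces to a dispatcher over these event names. The sketch below buffers audio, drops queued playback on barge-in, and records the termination code. The type, audio, and text key names are assumptions. Only the error.code path on session.terminating is taken from the text above. EventSink is a hypothetical class.

```python
import base64
import json


class EventSink:
    """Collects the effects of server events for a single session."""

    def __init__(self) -> None:
        self.playback = bytearray()   # decoded PCM16 waiting to be played
        self.text = []                # assistant text fragments
        self.termination_code = None  # set when the server announces a close

    def handle(self, raw: str) -> None:
        event = json.loads(raw)
        kind = event.get("type")  # assumed envelope key
        if kind == "audio.delta":
            self.playback.extend(base64.b64decode(event["audio"]))
        elif kind == "text.delta":
            self.text.append(event["text"])
        elif kind == "speech.started":
            # Barge-in: the user is talking, so drop audio not yet played.
            self.playback.clear()
        elif kind == "session.terminating":
            self.termination_code = event.get("error", {}).get("code")
```

Call handle for every text frame received. A real client would drain playback into an audio device and decide from termination_code whether reconnecting makes sense.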
Tool calls
Tools work the same way as in the Calls API: declare them in config.tools with a JSON Schema for parameters. When the model decides to call one, you get a tool.call event. Reply with tool.result and the model continues the conversation. Opper handles the provider wire format internally, so your handler is the same across upstreams.
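The round trip can be sketched as a registry of handlers keyed by tool name. The field names call_id, name, arguments, and output are assumptions about the event shape, and get_weather is a hypothetical tool used only for illustration.

```python
import json
from typing import Any, Callable

TOOLS: dict[str, Callable[..., Any]] = {}


def register(name: str):
    """Decorator that adds a handler to the registry under a tool name."""
    def wrap(fn):
        TOOLS[name] = fn
        return fn
    return wrap


@register("get_weather")
def get_weather(city: str) -> dict:
    return {"city": city, "forecast": "unknown"}  # stub result


def handle_tool_call(event: dict) -> dict:
    """Turn a tool.call event into the tool.result frame to send back."""
    args = event.get("arguments", {})
    if isinstance(args, str):  # some wire formats pass arguments as JSON text
        args = json.loads(args)
    handler = TOOLS.get(event["name"])
    if handler is None:
        output = {"error": f"unknown tool: {event['name']}"}
    else:
        output = handler(**args)
    return {"type": "tool.result", "call_id": event["call_id"], "output": output}
```

Returning an error payload for an unknown tool, instead of raising, still gives the model the matching tool.result it is waiting for.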
Per-provider notes
The protocol is the same across providers, but each upstream has quirks worth knowing.
OpenAI
- Models: openai/gpt-realtime-2, openai/gpt-realtime.
- Voices: alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, cedar (latest models).
- Sample rate: 24 kHz symmetric (in/out).
- Reasoning effort: gpt-realtime-2 accepts reasoning_effort and bills reasoning tokens. Other models ignore it.
- Transcription: input transcription runs in a parallel pipeline upstream, so the user transcript can arrive after the assistant has started responding. Enabling input_transcription adds a separate per-minute bill line (input_transcription_ms) at the rate of the transcription model.
Transcription is set-at-start. input_transcription and input_transcription_model cannot be toggled via session.update on OpenAI. Mid-session changes return an error. Request transcription at session.start or not at all. xAI and Gemini fold transcription into their per-minute audio rate and don’t have this restriction.
Session limits
| Limit | Default | Trigger |
|---|---|---|
| Concurrent sessions per project | 5 | 429 at preflight |
| Max session duration | 30 minutes | session.terminating with session_timeout |
| Idle timeout (no client traffic) | 60 seconds | session.terminating with idle_timeout |
| Session-start grace period | 10 seconds | WebSocket close if session.start never arrives |
Billing
Realtime sessions bill incrementally as the conversation runs. Per-model rates are published on GET /v3/models. Look for audio_input, audio_output, audio_input_per_minute, audio_output_per_minute, and input_transcription_per_minute on the relevant model.
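For models priced per minute, a rough estimate is duration times rate. The sketch below assumes the per-minute fields are plain numbers in a single currency unit and that you have already extracted them for the model in question. It ignores the audio_input and audio_output rates, and the rates in the usage note are made up.

```python
def estimate_audio_cost(
    rates: dict,
    input_seconds: float,
    output_seconds: float,
    transcription_seconds: float = 0.0,
) -> float:
    """Estimate per-minute audio charges for one session."""
    def charge(key: str, seconds: float) -> float:
        # A missing rate is treated as zero, i.e. not billed on that line.
        return rates.get(key, 0.0) * seconds / 60.0

    return (
        charge("audio_input_per_minute", input_seconds)
        + charge("audio_output_per_minute", output_seconds)
        + charge("input_transcription_per_minute", transcription_seconds)
    )
```

With hypothetical rates of 0.06 per input minute and 0.24 per output minute, two minutes of microphone audio and thirty seconds of assistant audio come to about 0.24. Treat the result as an approximation, since actual metering happens server-side.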
Tracing
Each session emits generation records visible in the Trace explorer, filterable by model or session_id.
See also
- Realtime protocol reference: event vocabulary quick-ref for the API reference section.
- brainstorm-time cookbook: full working voice app with browser UI, tools, and provider switching.
- Models: the full list of supported realtime model ids.