Realtime voice

The Opper Realtime API lets you build voice-to-voice applications — your customer speaks, the model responds with audio in real time, and tool calls flow back and forth without you holding the connection together yourself. One protocol, multiple upstream models: switch between OpenAI, xAI, and Gemini by changing a string.

How it works

You open a WebSocket to wss://api.opper.ai/v3/realtime, send a session.start frame that picks the model and configures the agent, then stream microphone audio in and play assistant audio out as it arrives. Opper resolves the provider from the model id, handles the upstream handshake (including provider-specific quirks like xAI’s client-secret exchange), meters usage incrementally, and surfaces a unified event vocabulary so your client code looks the same regardless of which model you target.

┌──────────┐   audio.append    ┌──────────┐   provider-specific   ┌────────────┐
│  Client  │ ───────────────▶  │  Opper   │ ──────────────────▶   │  Upstream  │
│ (browser │                   │ Realtime │                       │  (OpenAI / │
│ or your  │ ◀───────────────  │  proxy   │ ◀──────────────────   │   xAI /    │
│ backend) │   audio.delta     │          │   normalized frames   │  Gemini)   │
└──────────┘                   └──────────┘                       └────────────┘

Just trying it out?

The Realtime quickstart is the fastest path to a working session — one runnable code block per environment (server or browser) and you’re live in under five minutes. For a complete production-shaped example with browser UI, microphone capture, audio playback, tool calls, and provider switching, see the brainstorm-time cookbook.

Authentication

Two patterns are supported depending on whether your client is server-side or browser.

Server-side: bearer token

Server-to-server clients (Node, Python, Go, anything that runs in your backend) connect directly with a project-scoped runtime API key in the Authorization header. This is the pattern the quickstart linked above uses.

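// Node.js; assumes the `ws` package, whose constructor accepts a headers option
import WebSocket from "ws";

const ws = new WebSocket("wss://api.opper.ai/v3/realtime", {
  headers: { Authorization: `Bearer ${process.env.OPPER_API_KEY}` },
});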

Management keys (opmak-…) are rejected — only project-scoped runtime keys can open realtime sessions.

Browser: ephemeral tickets

Browsers can’t set an Authorization header on the native WebSocket constructor. To open a realtime session directly from a browser, your backend mints a single-use ticket and the browser redeems it on the upgrade request.

Step 1: server-side mint. Your backend hits POST /v3/realtime-sessions with its normal API key. The response carries a short-lived client_secret and an expires_at timestamp.

// On your backend (e.g. /api/realtime-ticket)
const resp = await fetch("https://api.opper.ai/v3/realtime-sessions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.OPPER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    config: {
      model: "openai/gpt-realtime-2",
      voice: "marin",
      instructions: "You are a concise voice assistant.",
      tools: [/* ... */],
    },
    ttl_seconds: 60,
  }),
});
const { client_secret, expires_at, ws_url } = await resp.json();
// Return client_secret + ws_url to the browser

Step 2: browser redeem. The browser opens the WebSocket carrying the ticket in the Sec-WebSocket-Protocol subprotocol header. This is the recommended form — credentials in the URL query string end up in access logs, browser history, and Referer headers; the subprotocol header is request-only and stays out of those surfaces.

// In the browser — recommended
const ws = new WebSocket(
  "wss://api.opper.ai/v3/realtime",
  [`opper-ticket.${clientSecret}`],
);

ws.onopen = () => {
  // The ticket already carries the bound config — for fields that weren't
  // pre-bound at mint time, send them now:
  ws.send(JSON.stringify({
    type: "session.start",
    config: { modalities: ["audio"] },
  }));
};

A query-parameter fallback is also accepted for environments that can’t set a subprotocol (uncommon — most clients support the second WebSocket constructor argument):

// Fallback only
const ws = new WebSocket(`wss://api.opper.ai/v3/realtime?ticket=${clientSecret}`);

Ticket properties:

Single-use — replays return 401.
Default TTL 60 seconds, max 5 minutes.

Pre-binding for security

Any field you populate in config at mint time is locked for that ticket: the browser cannot override it on session.start or on any subsequent session.update. This is the load-bearing safety guarantee. A leaked ticket can only open the specific session your backend authorized; it cannot pivot to a more expensive model or a different system prompt, and it cannot unlock a field by updating it after start.

Recommended minimum pre-binding: at least model. Tighter setups also bind instructions, tools, and voice. Fields left at their zero value stay open for the browser to set in session.start.

To force a boolean off (for example, to prohibit the browser from enabling output transcription), list the field in locked_fields. Without it, a zero value in config doesn’t lock anything, so the browser could still flip the bool on. A ticket with locked_fields but no config is also valid: the listed fields are force-set to their zero value regardless of what the browser sends. Unknown field names are rejected at mint time with 400 Bad Request.

// Forbid the browser from enabling output transcription
{
  config: {
    model: "openai/gpt-realtime-2",
    output_transcription: false,
  },
  locked_fields: ["output_transcription"],
}

// Server-side mint with strong binding — browser cannot change any of these
{
  config: {
    model: "openai/gpt-realtime-2",
    voice: "marin",
    instructions: "...locked system prompt...",
    tools: [/* customer-approved tools only */],
    reasoning_effort: "low",
  },
  ttl_seconds: 30,
}
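To make the locking rules concrete, the sketch below models how a ticket's pre-bound fields could combine with whatever the browser sends. It only illustrates the rules described above and is not Opper's server implementation; `isZero` and `resolveConfig` are hypothetical names, and the exact definition of a "zero value" is an assumption.

```javascript
// Illustrative model of the pre-binding rules; NOT Opper's actual server code.
// ASSUMPTION: "zero" means undefined, null, false, "", 0, or an empty array.
function isZero(value) {
  if (Array.isArray(value)) return value.length === 0;
  return value === undefined || value === null || value === false ||
    value === "" || value === 0;
}

// ticket: { config?, locked_fields? } as minted by the backend.
// browserConfig: the config the browser sends on session.start.
function resolveConfig(ticket, browserConfig) {
  const bound = ticket.config ?? {};
  const locked = new Set(ticket.locked_fields ?? []);
  const result = { ...browserConfig };

  for (const [field, value] of Object.entries(bound)) {
    // Non-zero fields set at mint time always win over the browser's value.
    if (!isZero(value)) result[field] = value;
  }
  for (const field of locked) {
    // Locked fields take the ticket's value, or are dropped (zero) if unset.
    if (field in bound) result[field] = bound[field];
    else delete result[field];
  }
  return result;
}
```

With the first ticket shown above, a browser that sends `output_transcription: true` would still end up with `false`, while a field the ticket never mentions, such as `voice`, passes through unchanged.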

Preflight rejections

Before the WebSocket upgrade, Opper checks balance, plan, and concurrency caps. Failures come back as HTTP status codes — not opaque WS closes — so you can react cleanly:

Status	Meaning
`401 Unauthorized`	Missing key, non-runtime key, or project-less key
`402 Payment Required`	Balance exhausted, spend cap hit, or plan doesn’t support realtime
`429 Too Many Requests`	Concurrent-session cap reached for this project
`503 Service Unavailable`	Realtime endpoint not configured for the requested provider
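One way to act on these statuses is a small lookup that turns each code into a client-side decision. This is a sketch only: `preflightAction` and its `action` labels are illustrative choices and not part of the Opper API.

```javascript
// Maps a preflight HTTP status to a suggested client reaction.
// The action names are hypothetical; pick whatever fits your app.
function preflightAction(status) {
  switch (status) {
    case 401: return { action: "fix_credentials", retry: false };
    case 402: return { action: "check_billing", retry: false };
    case 429: return { action: "retry_later", retry: true };
    case 503: return { action: "switch_provider", retry: false };
    default:  return { action: "unknown", retry: false };
  }
}
```

In Node with the `ws` package, a rejected upgrade is reported through the `unexpected-response` event, whose response object carries `statusCode`. Browsers do not expose the HTTP status of a failed WebSocket upgrade to scripts, so this kind of handling is mainly useful for server-side clients.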

Session lifecycle

A session goes through four phases. Most of the time you only care about the middle two.

1. Connect. WebSocket upgrade. A successful response is 101 Switching Protocols. After this the connection is open but the agent isn’t running yet.
2. Configure (session.start). Send this once as the first message after upgrade. The config selects the model, voice, system instructions, tool list, VAD parameters, and optional transcription toggles. Opper validates the config against the resolved model’s capabilities and rejects anything unsupported before dialing the upstream. Capability validation also runs on every subsequent session.update: unsupported modalities, voice, or reasoning_effort values return an error mid-stream.
3. Interact. Stream audio in (audio.append + audio.commit if not using server VAD), send text turns (text.input), receive audio.delta / text.delta frames, handle tool.call events, return tool.result.
4. Close. Either side can close the WebSocket. On a client-initiated close, Opper performs a final billing flush and tears down the upstream connection. On a server-initiated close (caps, idle timeout, balance exhaustion), you’ll see a closing event (session.terminating) with a structured reason before the connection ends.
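The four phases can be tracked on the client with a small state table. The sketch below is one possible shape: the phase names, `nextPhase`, and `canSend` are hypothetical, and holding back audio until session.started arrives is a conservative choice made here (the sample rates arrive in that event), not a documented requirement.

```javascript
// Client-side view of the four phases. Phase names are illustrative.
const TRANSITIONS = {
  connecting:  { open: "configuring", close: "closed" },
  configuring: { "session.started": "interacting", close: "closed" },
  interacting: { close: "closed" },
  closed:      {},
};

// Returns the next phase, or the current one if the signal is not relevant.
function nextPhase(phase, signal) {
  return TRANSITIONS[phase]?.[signal] ?? phase;
}

// session.start is only valid as the first frame; everything else waits
// until the session is live (ASSUMPTION: conservative client-side policy).
function canSend(phase, frameType) {
  if (frameType === "session.start") return phase === "configuring";
  return phase === "interacting";
}
```

Feed `nextPhase` the WebSocket `open` and `close` signals plus the `type` of each server event, and consult `canSend` before writing a frame.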

Configuring the session

Everything goes in config on the first session.start frame:

Field	Type	Notes
`model`	string	Required. Provider-prefixed model id, e.g. `openai/gpt-realtime-2`. See Per-provider notes.
`voice`	string	Provider-specific voice name. Optional — model falls back to its default.
`instructions`	string	System prompt for the agent.
`modalities`	string[]	Output modalities. OpenAI realtime accepts a single value (`["audio"]` or `["text"]`); xAI and Gemini accept either form. Default depends on model.
`turn_detection`	object	`{ type: "server_vad", threshold, prefix_padding_ms, silence_duration_ms }`. Defaults are server-side.
`tools`	object[]	Function-calling schema. Same shape as the regular Calls API tools.
`reasoning_effort`	string	`gpt-realtime-2` only. `minimal` / `low` / `medium` / `high` / `xhigh`.
`input_transcription`	bool	Surface `transcript.committed` events with the user’s speech transcribed. Off by default.
`input_transcription_model`	string	OpenAI only. Optional selector — falls back to `gpt-4o-mini-transcribe`.
`output_transcription`	bool	Surface `text.delta` events with the assistant’s spoken words. Off by default.
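As an illustration of the table, here is a sketch that assembles a session.start frame and catches two mistakes on the client before the server has to reject them. `providerOf` and `buildSessionStart` are hypothetical helpers, and the `silence_duration_ms` value is an arbitrary example rather than a recommended setting.

```javascript
// Extracts the provider prefix from an id such as "openai/gpt-realtime-2".
function providerOf(model) {
  const slash = model.indexOf("/");
  if (slash <= 0 || slash === model.length - 1) {
    throw new Error(`model must be provider-prefixed, got "${model}"`);
  }
  return model.slice(0, slash);
}

// Builds the first frame of a session as a JSON string.
function buildSessionStart(config) {
  if (typeof config.model !== "string") {
    throw new Error("config.model is required");
  }
  const provider = providerOf(config.model);
  // Per the table, OpenAI realtime accepts exactly one output modality.
  if (provider === "openai" && config.modalities &&
      config.modalities.length !== 1) {
    throw new Error("OpenAI realtime models accept a single modality");
  }
  return JSON.stringify({ type: "session.start", config });
}

const frame = buildSessionStart({
  model: "openai/gpt-realtime-2",
  voice: "marin",
  instructions: "You are a concise voice assistant.",
  modalities: ["audio"],
  turn_detection: { type: "server_vad", silence_duration_ms: 500 }, // example value
  input_transcription: true,
});
```

Opper still validates the config server-side; a local check like this only fails earlier and with a message you control.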

Event vocabulary

All events are JSON over text WebSocket frames.

Client → server

Event	Purpose
`session.start`	First frame. Opens the session with config.
`session.update`	Update the config mid-session (provider-dependent — Gemini ignores this).
`audio.append`	Append base64-encoded PCM16 mic audio at the session’s input sample rate.
`audio.commit`	Mark end of speech (only needed when not using server VAD).
`audio.clear`	Discard buffered mic audio that hasn’t been committed yet.
`text.input`	Send a typed user message.
`response.create`	Force the model to respond now (when not using server VAD).
`response.cancel`	Cancel an in-flight response.
`tool.result`	Return a tool result the model asked for via `tool.call`.
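Microphone APIs usually hand you Float32 samples, while audio.append expects base64-encoded PCM16. The sketch below shows one way to do the conversion. Two things here are assumptions because the source does not state them: the payload field name `audio`, and little-endian byte order (the usual convention for PCM16).

```javascript
// Convert Float32 samples in [-1, 1] to 16-bit little-endian PCM bytes.
function floatToPcm16Bytes(samples) {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp out-of-range input
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return bytes;
}

// Base64-encode a byte array; works in browsers and in recent Node versions.
function bytesToBase64(bytes) {
  let binary = "";
  const CHUNK = 0x8000; // stay under the argument limit of fromCharCode
  for (let i = 0; i < bytes.length; i += CHUNK) {
    binary += String.fromCharCode(...bytes.subarray(i, i + CHUNK));
  }
  return btoa(binary);
}

// ASSUMPTION: the payload field is called `audio`; the source does not name it.
function audioAppendFrame(samples) {
  return JSON.stringify({
    type: "audio.append",
    audio: bytesToBase64(floatToPcm16Bytes(samples)),
  });
}
```

The samples must already be at the session’s input sample rate before they are encoded.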

Server → client

Event	Purpose
`session.started`	Confirmation that the upstream session is live. Carries `session_id`, `input_sample_rate`, `output_sample_rate`, `audio_format`.
`audio.delta`	Assistant audio chunk (base64-encoded PCM16). Play as it arrives.
`text.delta`	Streaming assistant text. Only fires if `output_transcription` is on (or the model emits text natively).
`transcript.committed`	User speech transcript, after VAD commits a turn. Only fires if `input_transcription` is on.
`speech.started` / `speech.stopped`	VAD signals for the user’s microphone. Use these to interrupt playback when the user barges in.
`response.started` / `response.completed`	Lifecycle markers for an assistant turn.
`tool.call`	Model wants to call a function. You must reply with a matching `tool.result`.
`session.terminating`	Server is closing the session. Carries a structured reason (see below).
`session.ended`	Final frame before the upstream connection closes.
`error`	Provider or protocol error mid-session.

Termination reasons on session.terminating.error.code: session_timeout, idle_timeout, balance_exhausted, project_spend_cap_hit, org_spend_cap_hit, billing_not_supported.
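A minimal sketch of a handler for the server events most clients care about: decoding audio.delta, dropping queued audio on barge-in, and recording the termination reason. The `audio` field name on audio.delta is an assumption (the source does not name it), as is little-endian byte order; `base64ToPcm16` and `onServerEvent` are hypothetical helpers.

```javascript
// Decode base64 PCM16 (assumed little-endian) into signed 16-bit samples.
function base64ToPcm16(b64) {
  const binary = atob(b64);
  const samples = new Int16Array(binary.length >> 1);
  for (let i = 0; i < samples.length; i++) {
    const lo = binary.charCodeAt(2 * i);
    const hi = binary.charCodeAt(2 * i + 1);
    const v = (hi << 8) | lo;
    samples[i] = v >= 0x8000 ? v - 0x10000 : v; // reinterpret as signed
  }
  return samples;
}

// state: { queue: Int16Array[], reason: string | null }
function onServerEvent(state, ev) {
  switch (ev.type) {
    case "audio.delta":
      // ASSUMPTION: the chunk is carried in a field named `audio`.
      state.queue.push(base64ToPcm16(ev.audio));
      break;
    case "speech.started":
      // Barge-in: the user is talking, so drop audio not yet played.
      state.queue.length = 0;
      break;
    case "session.terminating":
      state.reason = ev.error?.code ?? "unknown";
      break;
  }
  return state;
}
```

A real player would also stop whatever chunk is currently sounding when speech.started arrives; this sketch only clears the pending queue.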

Tool calls

Tools work the same way as in the Calls API — declare them in config.tools with a JSON Schema for parameters. When the model decides to call one, you get a tool.call event; reply with tool.result and the model continues the conversation. Opper handles the provider-specific wire format internally, so your handler is identical across upstreams.

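ws.onmessage = async (msg) => {
  const ev = JSON.parse(msg.data);
  switch (ev.type) {
    case "tool.call": {
      // runTool is your own dispatcher: look up ev.tool_name and execute it
      const result = await runTool(ev.tool_name, ev.tool_arguments);
      ws.send(JSON.stringify({
        type: "tool.result",
        tool_call_id: ev.tool_call_id,
        tool_result: result,
      }));
      break;
    }
    // ...other event types
  }
};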

While a tool is executing, the upstream model is paused — there will be a natural silence in the audio stream proportional to your tool’s latency. Plan UX accordingly (an affordance like “looking that up…”) for tools that take more than a second or two.
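Because the model stays paused until a result arrives, it helps to answer every tool.call, including the failing ones. The sketch below shows one way to do that. The tool registry, the `get_time` tool, and the `{ error: ... }` result shape are all hypothetical; only the tool.result field names come from the example above.

```javascript
// Hypothetical registry; the tool name and implementation are placeholders.
const tools = new Map([
  ["get_time", async () => ({ now: new Date().toISOString() })],
]);

// Builds the reply frame using the field names from the tool.call example.
function toolResultFrame(toolCallId, result) {
  return JSON.stringify({
    type: "tool.result",
    tool_call_id: toolCallId,
    tool_result: result,
  });
}

// Runs the requested tool and always sends exactly one tool.result.
async function handleToolCall(ev, send) {
  let result;
  try {
    const impl = tools.get(ev.tool_name);
    if (!impl) throw new Error(`unknown tool: ${ev.tool_name}`);
    result = await impl(ev.tool_arguments);
  } catch (err) {
    // ASSUMPTION: reporting failures as { error } is this sketch's convention.
    result = { error: String(err?.message ?? err) };
  }
  send(toolResultFrame(ev.tool_call_id, result));
}
```

In a live client, `send` would be `(frame) => ws.send(frame)`.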

Per-provider notes

The protocol is unified, but each upstream has quirks worth knowing.

OpenAI

Models: openai/gpt-realtime-2, openai/gpt-realtime, openai/gpt-4o-realtime.
Voices: alloy, ash, ballad, coral, echo, sage, shimmer, verse, marin, cedar (latest models).
Sample rate: 24 kHz symmetric (in/out).
Reasoning effort: gpt-realtime-2 accepts reasoning_effort and bills reasoning tokens. Other models ignore it.
Transcription: input transcription runs in a parallel pipeline upstream, so the user transcript can arrive after the assistant has started responding. Enabling input_transcription adds a separate per-minute bill line (input_transcription_ms) at the rate of the transcription model.
Transcription is set-at-start. input_transcription and input_transcription_model cannot be toggled via session.update on OpenAI; mid-session changes return an error. Request transcription at session.start or not at all. xAI and Gemini fold transcription into their per-minute audio rate and don’t have this restriction.

xAI

Models: xai/grok-voice-latest.
Voices: ara, eve, leo, rex, sal.
Sample rate: 24 kHz symmetric.
Billing: flat per-minute rates for audio in / audio out. Input and output transcription are folded into the per-minute rate, so toggling them on doesn’t cost extra.
Connect latency: xAI requires a three-step handshake (client-secret exchange, WS dial, session.updated). Expect ~1s before session.started lands.

Gemini

Models: gemini/gemini-3.1-flash-live-preview.
Voices: Puck, Charon, Kore, Fenrir, Aoede, Leda, Orus, Zephyr.
Sample rate: asymmetric, with 16 kHz input and 24 kHz output. Both rates are reported in session.started. Read input_sample_rate for mic capture and output_sample_rate for playback.
Tools: supported; translated to Gemini’s toolResponse format internally.
Mid-turn close: Gemini reports usage only at turn boundaries. If you close mid-sentence, Opper waits briefly for the trailing usage frame before tearing down.
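Since input rates differ by provider, mic audio captured at the device rate (often 44.1 or 48 kHz) needs converting to the input_sample_rate reported in session.started. Below is a minimal linear-interpolation resampler as a sketch; `resampleLinear` is a hypothetical helper, and it applies no anti-aliasing filter, so a production client may prefer a proper resampling library or the platform’s own audio pipeline.

```javascript
// Resample mono Float32 audio from fromRate to toRate by linear interpolation.
// Simplification: no low-pass filter is applied before downsampling.
function resampleLinear(input, fromRate, toRate) {
  if (fromRate === toRate) return Float32Array.from(input);
  const ratio = fromRate / toRate;
  const outLength = Math.floor(input.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;                        // position in the input
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

For example, with a 48 kHz capture and a session that reports an input_sample_rate of 16000, `resampleLinear(chunk, 48000, 16000)` yields one output sample for every three input samples.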

Session limits

Limit	Default	Trigger
Concurrent sessions per project	5	`429` at preflight
Max session duration	30 minutes	`session.terminating` with `session_timeout`
Idle timeout (no client traffic)	60 seconds	`session.terminating` with `idle_timeout`
Session-start grace period	10 seconds	WebSocket close if `session.start` never arrives

Billing

Realtime sessions bill incrementally as the conversation progresses. Per-model rates are published on GET /v3/models — look for audio_input, audio_output, audio_input_per_minute, audio_output_per_minute, and input_transcription_per_minute on the relevant model.

Tracing

Each session emits generation records visible in the Trace explorer, filterable by model or session_id.
