# Realtime protocol

> Full WebSocket protocol reference for two-way voice conversations with Opper.

The Opper Realtime API lets you build voice-to-voice applications. Your customer speaks, the model responds with audio in real time, and tool calls flow back and forth without you holding the connection together yourself. The same protocol works across OpenAI, xAI, and Gemini, so you switch upstream models by changing a string.

## How it works

You open a WebSocket to `wss://api.opper.ai/v3/realtime`, send a `session.start` frame that picks the model and configures the agent, then stream microphone audio in and play assistant audio out as it arrives. Opper resolves the provider from the model id, handles the upstream handshake (including provider quirks like xAI's client-secret exchange), meters usage as you go, and gives you one event vocabulary so your client code stays the same no matter which model you target.

```
┌──────────┐   audio.append    ┌──────────┐   provider-specific   ┌────────────┐
│  Client  │ ───────────────▶  │  Opper   │ ──────────────────▶   │  Upstream  │
│ (browser │                   │ Realtime │                       │  (OpenAI / │
│ or your  │ ◀───────────────  │  proxy   │ ◀──────────────────   │   xAI /    │
│ backend) │   audio.delta     │          │   normalized frames   │  Gemini)   │
└──────────┘                   └──────────┘                       └────────────┘
```

## Just trying it out?

The [Realtime quickstart](/build/realtime/quickstart) is the fastest way to a working session. It has one runnable code block per environment (server or browser), and you'll be live in under five minutes.

For a production-shaped example with browser UI, microphone capture, audio playback, tool calls, and provider switching, see the [brainstorm-time cookbook](https://github.com/opper-ai/opper-cookbook/tree/main/examples/brainstorm-time).

## Authentication

There are two patterns, depending on whether your client runs server-side or in a browser.

### Server-side: bearer token

Server-to-server clients (Node, Python, Go, anything that runs in your backend) connect directly with a project-scoped runtime API key in the `Authorization` header. The [quickstart](/build/realtime/quickstart) uses this pattern.

```typescript theme={null}
new WebSocket("wss://api.opper.ai/v3/realtime", {
  headers: { Authorization: `Bearer ${process.env.OPPER_API_KEY}` },
});
```

Management keys (`opmak-…`) are rejected. Only project-scoped runtime keys can open realtime sessions.

### Browser: ephemeral tickets

Browsers can't set an `Authorization` header on the native `WebSocket` constructor. To open a realtime session directly from a browser, your backend mints a single-use ticket and the browser redeems it on the upgrade request.

**Step 1: server-side mint.** Your backend calls `POST /v3/realtime-sessions` with its normal API key. The response carries a short-lived `client_secret`, an `expires_at` timestamp, and a `ws_url` for the browser to connect to.

```typescript theme={null}
// On your backend (e.g. /api/realtime-ticket)
const resp = await fetch("https://api.opper.ai/v3/realtime-sessions", {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${process.env.OPPER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    config: {
      model: "openai/gpt-realtime-2",
      voice: "marin",
      instructions: "You are a concise voice assistant.",
      tools: [/* ... */],
    },
    ttl_seconds: 60,
  }),
});
const { client_secret, expires_at, ws_url } = await resp.json();
// Return client_secret + ws_url to the browser
```

**Step 2: browser redeem.** The browser opens the WebSocket carrying the ticket in the `Sec-WebSocket-Protocol` subprotocol header. Prefer this form. Credentials in the URL query string end up in access logs, browser history, and `Referer` headers, while the subprotocol header is request-only and stays out of those places.

```typescript theme={null}
// In the browser, recommended
const ws = new WebSocket(
  "wss://api.opper.ai/v3/realtime",
  [`opper-ticket.${clientSecret}`],
);

ws.onopen = () => {
  // The ticket already carries the bound config. For fields that weren't
  // pre-bound at mint time, send them now:
  ws.send(JSON.stringify({
    type: "session.start",
    config: { modalities: ["audio"] },
  }));
};
```

A query-parameter fallback is also accepted for environments that can't set a subprotocol (uncommon: most clients support the second `WebSocket` constructor argument):

```typescript theme={null}
// Fallback only
const ws = new WebSocket(`wss://api.opper.ai/v3/realtime?ticket=${clientSecret}`);
```

**Ticket properties:**

* Single-use. Replays return `401`.
* Default TTL 60 seconds, max 5 minutes.
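
As a rough sketch of how a backend might keep its mint requests inside those TTL bounds, the helper below builds the request body and clamps `ttl_seconds` to the documented 5-minute maximum. `buildMintRequest` and the clamping policy are hypothetical. This page doesn't say how the server treats an over-long TTL, so clamping client-side is just a defensive choice.

```typescript
// Hypothetical helper, not part of the Opper API.
const DEFAULT_TTL_SECONDS = 60; // documented default
const MAX_TTL_SECONDS = 300; // documented maximum (5 minutes)

interface MintRequest {
  config: Record<string, unknown>;
  ttl_seconds: number;
}

function buildMintRequest(
  config: Record<string, unknown>,
  ttlSeconds: number = DEFAULT_TTL_SECONDS,
): MintRequest {
  if (!Number.isFinite(ttlSeconds) || ttlSeconds <= 0) {
    throw new RangeError("ttlSeconds must be a positive number");
  }
  // Whole seconds, at least 1, never above the documented maximum.
  const ttl = Math.min(Math.max(1, Math.floor(ttlSeconds)), MAX_TTL_SECONDS);
  return { config, ttl_seconds: ttl };
}
```

Because tickets are single-use, mint a fresh one for every connection attempt rather than caching the result.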

### Pre-binding for security

Any field you populate in `config` at mint time is locked for that ticket. The browser can't override it on `session.start`, and it can't change it later with `session.update` either. This is the main safety guarantee. A leaked ticket can only open the session your backend authorized. It can't pivot to a more expensive model or a different system prompt, and it can't unlock the field by updating it after start.

At minimum, bind `model`. Tighter setups also bind `instructions`, `tools`, and `voice`. Fields left at their zero value stay open for the browser to set in `session.start`.

To force a boolean off (for example, to stop the browser from enabling output transcription), list the field in `locked_fields`. Without it, a zero value in `config` doesn't lock anything, so the browser could still flip the bool on.

A ticket with `locked_fields` but no `config` is also valid: the listed fields are force-set to their zero value regardless of what the browser sends. Unknown field names are rejected at mint time with `400 Bad Request`.

```typescript theme={null}
// Forbid the browser from enabling output transcription
{
  config: {
    model: "openai/gpt-realtime-2",
    output_transcription: false,
  },
  locked_fields: ["output_transcription"],
}
```

```typescript theme={null}
// Server-side mint with strong binding. Browser cannot change any of these
{
  config: {
    model: "openai/gpt-realtime-2",
    voice: "marin",
    instructions: "...locked system prompt...",
    tools: [/* customer-approved tools only */],
    reasoning_effort: "low",
  },
  ttl_seconds: 30,
}
```
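
To make the locking rules concrete, here is a small model of how the effective config could be derived from a ticket and a browser request. It is an illustration of the rules described above, not Opper's implementation. `effectiveConfig`, `isZero`, and the exact definition of "zero value" are assumptions.

```typescript
type Config = Record<string, unknown>;

// Assumption: these values count as "zero" (unpopulated) at mint time.
function isZero(value: unknown): boolean {
  return (
    value === undefined || value === null || value === false ||
    value === 0 || value === "" ||
    (Array.isArray(value) && value.length === 0)
  );
}

function effectiveConfig(
  bound: Config,
  lockedFields: string[],
  requested: Config,
): Config {
  // A field is locked if it is populated at mint time or listed in locked_fields.
  const locked = new Set<string>(lockedFields);
  Object.keys(bound).forEach((key) => {
    if (!isZero(bound[key])) locked.add(key);
  });

  const result: Config = {};
  // Open fields: the browser's value is accepted.
  Object.keys(requested).forEach((key) => {
    if (!locked.has(key)) result[key] = requested[key];
  });
  // Locked fields: the minted value wins. A locked field with no minted
  // value is omitted here, which stands in for "forced to its zero value".
  locked.forEach((key) => {
    if (key in bound) result[key] = bound[key];
  });
  return result;
}
```

With `bound = { model: "openai/gpt-realtime-2", output_transcription: false }` and `locked_fields = ["output_transcription"]`, a browser request that tries to change `model` or enable `output_transcription` has no effect on either field, while an unbound field such as `voice` passes through.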

## Preflight rejections

Before the WebSocket upgrade, Opper checks credentials, balance, plan, and concurrency caps, and confirms that the provider endpoint is configured. Failures come back as HTTP status codes rather than opaque WebSocket closes, so you can react to them:

| Status                    | Meaning                                                            |
| ------------------------- | ------------------------------------------------------------------ |
| `401 Unauthorized`        | Missing key, non-runtime key, or project-less key                  |
| `402 Payment Required`    | Balance exhausted, spend cap hit, or plan doesn't support realtime |
| `429 Too Many Requests`   | Concurrent-session cap reached for this project                    |
| `503 Service Unavailable` | Realtime endpoint not configured for the requested provider        |
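
One way a server-side client might act on these statuses is sketched below. The action names and the backoff parameters are hypothetical. With the Node `ws` package, a rejected upgrade surfaces through the client's `unexpected-response` event. The browser `WebSocket` API does not expose the handshake status to scripts.

```typescript
// Hypothetical action labels for the four documented preflight statuses.
type PreflightAction =
  | "fix_credentials"
  | "fix_billing"
  | "retry_with_backoff"
  | "provider_unavailable"
  | "unexpected";

function classifyPreflight(status: number): PreflightAction {
  switch (status) {
    case 401:
      return "fix_credentials"; // missing, non-runtime, or project-less key
    case 402:
      return "fix_billing"; // balance, spend cap, or plan
    case 429:
      return "retry_with_backoff"; // concurrent-session cap; a slot may free up
    case 503:
      return "provider_unavailable"; // endpoint not configured for this provider
    default:
      return "unexpected";
  }
}

// Capped exponential backoff: 500 ms, 1 s, 2 s, ... up to capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** Math.max(0, attempt));
}
```

Only `429` is treated as worth retrying automatically here. The other statuses point at configuration or billing problems that a retry is unlikely to fix.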

## Session lifecycle

A session goes through four phases. Most of the time you only care about the middle two.

**1. Connect.** WebSocket upgrade. A successful response is `101 Switching Protocols`. After this the connection is open, but the agent isn't running yet.

**2. Configure (`session.start`).** Send this once as the first message after upgrade. The config selects the model, voice, system instructions, tool list, VAD parameters, and optional transcription toggles. Opper validates the config against the resolved model's capabilities and rejects anything unsupported before dialing the upstream. The same validation runs on every later `session.update`, so unsupported `modalities`, `voice`, or `reasoning_effort` values return an error mid-stream.

**3. Interact.** Stream audio in with `audio.append` (plus `audio.commit` when you aren't using server VAD), send text turns with `text.input`, receive `audio.delta` / `text.delta` frames, handle `tool.call` events, and return `tool.result`.

**4. Close.** Either side can close the WebSocket. On a client-initiated close, Opper performs a final billing flush and tears down the upstream connection. On a server-initiated close (caps, idle timeout, balance exhaustion), you'll see a closing event (`session.terminating`) with a structured reason before the connection ends.

## Configuring the session

Everything goes in `config` on the first `session.start` frame:

| Field                       | Type      | Notes                                                                                                                                               |
| --------------------------- | --------- | --------------------------------------------------------------------------------------------------------------------------------------------------- |
| `model`                     | string    | **Required.** Provider-prefixed model id, e.g. `openai/gpt-realtime-2`. See [Per-provider notes](#per-provider-notes).                              |
| `voice`                     | string    | Provider-specific voice name. Optional, model falls back to its default.                                                                            |
| `instructions`              | string    | System prompt for the agent.                                                                                                                        |
| `modalities`                | string\[] | Output modalities. OpenAI realtime accepts a single value (`["audio"]` or `["text"]`); xAI and Gemini accept either form. Default depends on model. |
| `turn_detection`            | object    | `{ type: "server_vad", threshold, prefix_padding_ms, silence_duration_ms }`. Defaults are server-side.                                              |
| `tools`                     | object\[] | Tool schema the model can call. Same shape as the regular Calls API tools.                                                                          |
| `reasoning_effort`          | string    | `gpt-realtime-2` only. `minimal` / `low` / `medium` / `high` / `xhigh`.                                                                             |
| `input_transcription`       | bool      | Surface `transcript.committed` events with the user's speech transcribed. Off by default.                                                           |
| `input_transcription_model` | string    | OpenAI only. Optional selector. Falls back to `gpt-4o-mini-transcribe`.                                                                             |
| `output_transcription`      | bool      | Surface `text.delta` events with the assistant's spoken words. Off by default.                                                                      |
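
If you want the compiler to catch config typos, the table maps onto a TypeScript type fairly directly. The sketch below is one possible typing, with a small builder for the `session.start` frame. The type names, the `sessionStartFrame` helper, and the client-side checks are assumptions. Opper validates the config itself, so the checks here only fail a little earlier.

```typescript
interface TurnDetection {
  type: "server_vad";
  threshold?: number;
  prefix_padding_ms?: number;
  silence_duration_ms?: number;
}

interface RealtimeConfig {
  model: string; // provider-prefixed, e.g. "openai/gpt-realtime-2"
  voice?: string;
  instructions?: string;
  modalities?: string[];
  turn_detection?: TurnDetection;
  tools?: object[];
  reasoning_effort?: "minimal" | "low" | "medium" | "high" | "xhigh";
  input_transcription?: boolean;
  input_transcription_model?: string;
  output_transcription?: boolean;
}

function sessionStartFrame(config: RealtimeConfig): string {
  if (!config.model.includes("/")) {
    throw new Error("model must be provider-prefixed, e.g. openai/gpt-realtime-2");
  }
  // Per the table, OpenAI realtime accepts a single output modality.
  if (
    config.model.startsWith("openai/") &&
    config.modalities !== undefined &&
    config.modalities.length !== 1
  ) {
    throw new Error("OpenAI realtime models accept exactly one modality");
  }
  return JSON.stringify({ type: "session.start", config });
}
```

With a ticket that pre-binds `model`, the browser's `session.start` can omit it, so `Partial<RealtimeConfig>` is the more fitting type on that side.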

## Event vocabulary

All events are JSON over text WebSocket frames.

### Client → server

| Event             | Purpose                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `session.start`   | First frame. Opens the session with config.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| `session.update`  | Update the config mid-session (provider-dependent; Gemini ignores this).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| `audio.append`    | Append base64-encoded PCM16 mic audio at the session's input sample rate.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| `audio.commit`    | Mark end of speech (only needed when not using server VAD).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| `audio.clear`     | Discard buffered mic audio that hasn't been committed yet.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| `text.input`      | Send a typed user message.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| `image.input`     | Send a still image as a user message. `image_url` accepts a base64 data URI (`data:image/png;base64,...`) on every vision-capable realtime model; OpenAI also accepts https URLs, while Gemini Live requires inline data and rejects https with a clear error. Optional `text` rides on the same message item so the model sees the question alongside the image. Optional `image_detail` (`"auto"` / `"low"` / `"high"`) is honored by OpenAI and ignored by Gemini. Only supported on realtime models whose capabilities include `vision` — the gateway responds with an `error` event otherwise. Image bytes are not stored in the session trace recording. |
| `response.create` | Force the model to respond now (when not using server VAD).                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| `response.cancel` | Cancel an in-flight response.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
| `tool.result`     | Return a tool result the model asked for via `tool.call`.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |

### Server → client

| Event                                     | Purpose                                                                                                                          |
| ----------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
| `session.started`                         | Confirmation that the upstream session is live. Carries `session_id`, `input_sample_rate`, `output_sample_rate`, `audio_format`. |
| `audio.delta`                             | Assistant audio chunk (base64-encoded PCM16). Play as it arrives.                                                                |
| `text.delta`                              | Streaming assistant text. Only fires if `output_transcription` is on (or the model emits text natively).                         |
| `transcript.committed`                    | User speech transcript, after VAD commits a turn. Only fires if `input_transcription` is on.                                     |
| `speech.started` / `speech.stopped`       | VAD signals for the user's microphone. Use these to interrupt playback when the user barges in.                                  |
| `response.started` / `response.completed` | Lifecycle markers for an assistant turn.                                                                                         |
| `tool.call`                               | Model wants to call a tool. You must reply with a matching `tool.result`.                                                        |
| `session.terminating`                     | Server is closing the session. Carries a structured reason (see below).                                                          |
| `session.ended`                           | Final frame before the upstream connection closes.                                                                               |
| `error`                                   | Provider or protocol error mid-session.                                                                                          |

Termination reasons on `session.terminating.error.code`: `session_timeout`, `idle_timeout`, `balance_exhausted`, `project_spend_cap_hit`, `org_spend_cap_hit`, `billing_not_supported`.
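
A client can branch on the termination code to decide what to do next. The policy below is only an illustration. `planAfterTermination` and its return labels are hypothetical, and treating the two timeout codes as safe to reconnect is an assumption, not documented behavior.

```typescript
interface SessionTerminatingEvent {
  type: "session.terminating";
  error: { code: string };
}

type TerminationPlan =
  | "reconnect"
  | "surface_billing_error"
  | "surface_unknown_error";

const TIMEOUT_CODES = ["session_timeout", "idle_timeout"];
const BILLING_CODES = [
  "balance_exhausted",
  "project_spend_cap_hit",
  "org_spend_cap_hit",
  "billing_not_supported",
];

function planAfterTermination(ev: SessionTerminatingEvent): TerminationPlan {
  const code = ev.error.code;
  if (TIMEOUT_CODES.indexOf(code) !== -1) return "reconnect";
  if (BILLING_CODES.indexOf(code) !== -1) return "surface_billing_error";
  // Unrecognized codes are surfaced instead of retried blindly.
  return "surface_unknown_error";
}
```

When reconnecting from a browser, mint a new ticket first, since tickets are single-use.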

## Tool calls

Tools work the same way as in the Calls API: declare them in `config.tools` with a JSON Schema for parameters. When the model decides to call one, you get a `tool.call` event. Reply with `tool.result` and the model continues the conversation. Opper handles the provider wire format internally, so your handler is the same across upstreams.

```typescript theme={null}
case "tool.call": {
  // Braces scope `result` to this case instead of the whole switch.
  const result = await runTool(ev.tool_name, ev.tool_arguments);
  ws.send(JSON.stringify({
    type: "tool.result",
    tool_call_id: ev.tool_call_id,
    tool_result: result,
  }));
  break;
}
```

While a tool is executing, the upstream model is paused. You'll hear silence in the audio stream proportional to your tool's latency. For tools that take more than a second or two, plan the UX around it with something like a "looking that up..." cue.
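
Since every `tool.call` needs a matching `tool.result`, it helps to make sure a reply goes out even when the tool is unknown or throws. The sketch below shows one way to do that with a registry of handlers. `buildToolResult`, the `{ error }` result shape, and accepting `tool_arguments` as either a JSON string or an object are assumptions. This page doesn't specify the argument encoding or an error convention.

```typescript
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown> | unknown;

interface ToolCallEvent {
  type: "tool.call";
  tool_call_id: string;
  tool_name: string;
  tool_arguments: unknown;
}

// Returns the serialized tool.result frame for a tool.call event.
async function buildToolResult(
  ev: ToolCallEvent,
  tools: Map<string, ToolHandler>,
): Promise<string> {
  let result: unknown;
  try {
    const handler = tools.get(ev.tool_name);
    if (!handler) throw new Error(`unknown tool: ${ev.tool_name}`);
    // Assumption: arguments may arrive as a JSON string or a parsed object.
    const args: Record<string, unknown> =
      typeof ev.tool_arguments === "string"
        ? JSON.parse(ev.tool_arguments)
        : ((ev.tool_arguments ?? {}) as Record<string, unknown>);
    result = await handler(args);
  } catch (err) {
    // Hypothetical error shape, so the model still gets a reply.
    result = { error: err instanceof Error ? err.message : String(err) };
  }
  return JSON.stringify({
    type: "tool.result",
    tool_call_id: ev.tool_call_id,
    tool_result: result,
  });
}
```

In a message handler this becomes `ws.send(await buildToolResult(ev, tools))`.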

## Per-provider notes

The protocol is the same across providers, but each upstream has quirks worth knowing.

<Tabs>
  <Tab title="OpenAI">
    **Models:** `openai/gpt-realtime-2`, `openai/gpt-realtime`.

    **Voices:** `alloy`, `ash`, `ballad`, `coral`, `echo`, `sage`, `shimmer`, `verse`, `marin`, `cedar` (latest models).

    **Sample rate:** 24 kHz symmetric (in/out).

    **Reasoning effort:** `gpt-realtime-2` accepts `reasoning_effort` and bills `reasoning` tokens. Other models ignore it.

    **Transcription:** input transcription runs in a parallel pipeline upstream, so the user transcript can arrive *after* the assistant has started responding. Enabling `input_transcription` adds a separate per-minute bill line (`input_transcription_ms`) at the rate of the transcription model.

    **Transcription is set-at-start.** `input_transcription` and `input_transcription_model` cannot be toggled via `session.update` on OpenAI. Mid-session changes return an error. Request transcription at `session.start` or not at all. xAI and Gemini fold transcription into their per-minute audio rate and don't have this restriction.
  </Tab>

  <Tab title="xAI">
    **Models:** `xai/grok-voice-latest`.

    **Voices:** `ara`, `eve`, `leo`, `rex`, `sal`.

    **Sample rate:** 24 kHz symmetric.

    **Billing:** flat per-minute rates for audio in / audio out. Input and output transcription are folded into the per-minute rate; toggling them on doesn't cost extra.

    **Connect latency:** xAI requires a three-step handshake (client-secret exchange, WS dial, `session.updated`). Expect \~1s before `session.started` lands.
  </Tab>

  <Tab title="Gemini">
    **Models:** `gemini/gemini-3.1-flash-live-preview`.

    **Voices:** `Puck`, `Charon`, `Kore`, `Fenrir`, `Aoede`, `Leda`, `Orus`, `Zephyr`.

    **Sample rate:** **asymmetric**. 16 kHz input, 24 kHz output. Both rates are reported in `session.started`. Read `input_sample_rate` for mic capture and `output_sample_rate` for playback.

    **Tools:** supported; translated to Gemini's `toolResponse` format internally.

    **Mid-turn close:** Gemini reports usage only at turn boundaries. If you close mid-sentence, Opper waits briefly for the trailing usage frame before tearing down.
  </Tab>
</Tabs>
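
Because input rates differ by provider (24 kHz for OpenAI and xAI, 16 kHz for Gemini) and browsers often capture at 44.1 or 48 kHz, you may need to resample mic audio to the reported `input_sample_rate`. The function below is a minimal linear-interpolation sketch. `resampleLinear` is a hypothetical helper, and it applies no anti-aliasing filter, so treat it as a starting point, not production-grade DSP.

```typescript
// Resample mono float samples from one rate to another by linear interpolation.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number,
): Float32Array {
  if (fromRate <= 0 || toRate <= 0) {
    throw new RangeError("sample rates must be positive");
  }
  if (fromRate === toRate || input.length === 0) return input.slice();

  const outLength = Math.max(1, Math.round((input.length * toRate) / fromRate));
  const out = new Float32Array(outLength);
  const step = fromRate / toRate; // input samples advanced per output sample

  for (let i = 0; i < outLength; i++) {
    const pos = i * step;
    const i0 = Math.min(Math.floor(pos), input.length - 1);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    // Blend the two nearest input samples.
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}
```

For example, `resampleLinear(chunk, 48000, started.input_sample_rate)` would turn a 48 kHz capture chunk into the rate the session expects, where `started` is the parsed `session.started` event.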

## Session limits

| Limit                            | Default    | Trigger                                          |
| -------------------------------- | ---------- | ------------------------------------------------ |
| Concurrent sessions per project  | 5          | `429` at preflight                               |
| Max session duration             | 30 minutes | `session.terminating` with `session_timeout`     |
| Idle timeout (no client traffic) | 60 seconds | `session.terminating` with `idle_timeout`        |
| Session-start grace period       | 10 seconds | WebSocket close if `session.start` never arrives |
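
If you want to warn users before a limit hits, a client can track the two timers locally. The class below is a sketch under several assumptions: the default limits from the table apply to your project, the session clock starts when you record `startedAt`, and any frame you send counts as client traffic. None of these is confirmed on this page, and `SessionClock` is a hypothetical name.

```typescript
const MAX_SESSION_MS = 30 * 60 * 1000; // documented default: 30 minutes
const IDLE_TIMEOUT_MS = 60 * 1000; // documented default: 60 seconds

class SessionClock {
  private readonly startedAt: number;
  private readonly maxSessionMs: number;
  private readonly idleTimeoutMs: number;
  private lastClientSendAt: number;

  constructor(
    startedAt: number,
    maxSessionMs: number = MAX_SESSION_MS,
    idleTimeoutMs: number = IDLE_TIMEOUT_MS,
  ) {
    this.startedAt = startedAt;
    this.maxSessionMs = maxSessionMs;
    this.idleTimeoutMs = idleTimeoutMs;
    this.lastClientSendAt = startedAt;
  }

  // Call whenever the client sends a frame.
  noteClientSend(now: number): void {
    this.lastClientSendAt = now;
  }

  // Milliseconds left before the idle timeout would fire.
  idleRemainingMs(now: number): number {
    return Math.max(0, this.lastClientSendAt + this.idleTimeoutMs - now);
  }

  // Milliseconds left before the max session duration is reached.
  sessionRemainingMs(now: number): number {
    return Math.max(0, this.startedAt + this.maxSessionMs - now);
  }
}
```

Pass `Date.now()` as `now` in real code. Taking the timestamp as a parameter keeps the class easy to test.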

## Billing

Realtime sessions bill incrementally as the conversation runs. Per-model rates are published on `GET /v3/models`. Look for `audio_input`, `audio_output`, `audio_input_per_minute`, `audio_output_per_minute`, and `input_transcription_per_minute` on the relevant model.

## Tracing

Each session emits generation records visible in the [Trace explorer](/control-plane/observe), filterable by `model` or `session_id`.

## See also

* [Realtime protocol reference](/v3-api-reference/realtime/protocol): event vocabulary quick-ref for the API reference section.
* [brainstorm-time cookbook](https://github.com/opper-ai/opper-cookbook/tree/main/examples/brainstorm-time): full working voice app with browser UI, tools, and provider switching.
* [Models](/capabilities/models): the full list of supported realtime model ids.