Audio

The gateway covers both directions of audio with two endpoints, both synchronous by default:

Text to speech — POST /v3/audio/speech turns text into spoken audio.
Speech to text — POST /v3/audio/transcriptions turns audio into a transcript.

Both are provider-agnostic: model selects the provider, a small set of fields is normalized for you, and anything model-specific goes in parameters. You can also clone a custom voice from a short sample and reuse it. For live, two-way voice conversations, use Realtime instead.

Text to speech

Pass a model and the input text. The response carries the audio inline as base64 and, by default, a stored file_id + presigned url.

curl -sX POST https://api.opper.ai/v3/audio/speech \
  -H "Authorization: Bearer $OPPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/tts-1",
    "input": "Hello from Opper.",
    "voice": "alloy",
    "format": "mp3"
  }'

import os, base64, requests

r = requests.post(
    "https://api.opper.ai/v3/audio/speech",
    headers={"Authorization": f"Bearer {os.environ['OPPER_API_KEY']}"},
    json={
        "model": "openai/tts-1",
        "input": "Hello from Opper.",
        "voice": "alloy",
        "format": "mp3",
    },
)
r.raise_for_status()
audio = r.json()["audio"]
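# Decode the inline base64 audio and write it out; the context manager closes the file.
with open("speech.mp3", "wb") as f:
    f.write(base64.b64decode(audio["b64_json"]))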
print("saved speech.mp3", "·", audio.get("file_id"))

Response

{
  "id": "spch_...",
  "model": "openai/tts-1",
  "created": 1716124800,
  "audio": {
    "b64_json": "SUQzBAAAAAA...",
    "mime_type": "audio/mpeg",
    "file_id": "file_...",
    "url": "https://...presigned..."
  },
  "usage": { "cost": 0.0009, "characters": 17 }
}

Field	What it does
`model`	Required. The TTS model, e.g. `openai/tts-1`.
`input`	Required. The text to speak.
`voice`	Provider voice; empty uses the model’s default. Validated against the model’s declared voices.
`format`	`mp3` (default), `wav`, `opus`, `aac`, `flac`, or `pcm`.
`speed`	`0.25`–`4.0` where the model supports it.
`store`	Persist the audio and return a `file_id` + `url`. Defaults to true; the stored file is kept until you delete it. Skipped on zero-data-retention projects.
`parameters`	Opaque per-provider passthrough.

Speech is billed per input character.
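
The field rules in the table can be checked on the client before a request is sent. The sketch below is a hypothetical helper (build_speech_payload is not part of any Opper SDK) that assembles the JSON body and rejects an unknown format or an out-of-range speed. The gateway remains the authority, particularly for which voices and speeds a given model supports.

```python
# Hypothetical client-side helper; the gateway still performs its own validation.
FORMATS = {"mp3", "wav", "opus", "aac", "flac", "pcm"}  # formats listed in the table above


def build_speech_payload(model, text, voice=None, fmt="mp3", speed=None,
                         store=None, parameters=None):
    """Assemble a JSON body for POST /v3/audio/speech, omitting unset optional fields."""
    if not model or not text:
        raise ValueError("model and input are required")
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    if speed is not None and not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")

    payload = {"model": model, "input": text, "format": fmt}
    if voice:
        payload["voice"] = voice
    if speed is not None:
        payload["speed"] = speed
    if store is not None:
        payload["store"] = store
    if parameters:
        payload["parameters"] = parameters
    return payload


payload = build_speech_payload(
    "openai/tts-1", "Hello from Opper.", voice="alloy", speed=1.25, store=False
)
print(payload)
```

Pass the result as json=payload to requests.post. With store set to false, expect only the inline base64 audio to be useful, since file_id and url are described as products of storing the file.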

Expressiveness

A voice can sound flat by default. How you add expression depends on the provider:

ElevenLabs exposes voice controls through parameters.voice_settings: lower stability for a more dynamic, emotional read, higher style to exaggerate delivery, and use_speaker_boost. The elevenlabs/eleven_v3 model also understands inline audio tags in the input text — [excited], [whispers], [sarcastic], [laughs] — for fine-grained emotion.
Mistral Voxtral has no style knobs; it mirrors the delivery of the voice it speaks as. For a lively result, use — or clone — a lively-sounding voice.

curl -sX POST https://api.opper.ai/v3/audio/speech \
  -H "Authorization: Bearer $OPPER_API_KEY" -H "Content-Type: application/json" \
  -d '{
    "model": "elevenlabs/eleven_multilingual_v2",
    "input": "We did it — the results are in!",
    "voice": "voice_abc123",
    "parameters": { "voice_settings": { "stability": 0.3, "style": 0.8, "use_speaker_boost": true } }
  }'

curl -sX POST https://api.opper.ai/v3/audio/speech \
  -H "Authorization: Bearer $OPPER_API_KEY" -H "Content-Type: application/json" \
  -d '{
    "model": "elevenlabs/eleven_v3",
    "input": "[excited] We did it! [whispers] and no one even noticed.",
    "voice": "voice_abc123"
  }'

Audio tags count toward billed input characters, and eleven_v3 bills at a higher rate — great for a hero voice, more than you need for bulk narration. Use GET /v3/audio/models to discover the exact parameters a model accepts (its params.tts.parameters list).
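
Because tags are billed like any other input character, it can help to see how much of a prompt is tag overhead. The following is a rough sketch only: the regular expression is an assumption about what a tag looks like (a short bracketed word or phrase), and the gateway's own character count is what determines the bill.

```python
import re

# Assumption: an audio tag is a bracketed run of letters/spaces, e.g. [excited].
# The exact tag grammar is not specified in this guide.
TAG = re.compile(r"\[[a-z ]+\]\s*", re.IGNORECASE)


def char_counts(text):
    """Return (sent, spoken): characters sent vs. characters left once tags are removed."""
    spoken = TAG.sub("", text)
    return len(text), len(spoken)


text = "[excited] We did it! [whispers] and no one even noticed."
sent, spoken = char_counts(text)
print(f"sent={sent} spoken={spoken} tag_overhead={sent - spoken}")
```

For bulk narration, a large tag_overhead relative to spoken is a hint that a cheaper model without tags may be the better fit.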

Custom voices (cloning)

Clone a voice from a short reference sample and reuse it by an opaque id. POST /v3/audio/voices takes a model (which selects the provider to clone on), an audio reference (a file_id, https URL, or base64 data-URI of a few seconds of speech), and optional name, languages, and gender fields. It returns a voice_... id that you pass as voice on /v3/audio/speech, exactly like a built-in voice.

# 1. Clone — returns { "id": "voice_...", "provider": "mistral", ... }
curl -sX POST https://api.opper.ai/v3/audio/voices \
  -H "Authorization: Bearer $OPPER_API_KEY" -H "Content-Type: application/json" \
  -d '{
    "model": "mistral/voxtral-mini-tts-2603",
    "name": "my-voice",
    "audio": "file_abc123",
    "languages": ["de"]
  }'

# 2. Speak with it — pass the voice_... id as `voice`
curl -sX POST https://api.opper.ai/v3/audio/speech \
  -H "Authorization: Bearer $OPPER_API_KEY" -H "Content-Type: application/json" \
  -d '{ "model": "mistral/voxtral-mini-tts-2603", "input": "Hallo!", "voice": "voice_..." }'

import os, requests

base = "https://api.opper.ai"
h = {"Authorization": f"Bearer {os.environ['OPPER_API_KEY']}"}

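# 1. Clone the voice from a stored reference sample.
voice_resp = requests.post(f"{base}/v3/audio/voices", headers=h, json={
    "model": "mistral/voxtral-mini-tts-2603",
    "name": "my-voice",
    "audio": "file_abc123",   # or an https URL / data-URI
    "languages": ["de"],
})
voice_resp.raise_for_status()  # fail here rather than on a missing "id" below
voice = voice_resp.json()

# 2. Speak with the clone by passing its id as `voice`.
speech_resp = requests.post(f"{base}/v3/audio/speech", headers=h, json={
    "model": "mistral/voxtral-mini-tts-2603",
    "input": "Hallo!",
    "voice": voice["id"],     # voice_...
})
speech_resp.raise_for_status()
speech = speech_resp.json()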

Response (create)

{
  "id": "voice_033Qq...",
  "object": "voice",
  "name": "my-voice",
  "provider": "mistral",
  "languages": ["de"],
  "created": 1716124800,
  "expires_at": 1718716800
}

Manage your voices — all scoped to your project:

Method	Path	What it does
`GET`	`/v3/audio/voices`	List your project’s cloned voices (newest first, paginated).
`GET`	`/v3/audio/voices/{id}`	Fetch one voice.
`DELETE`	`/v3/audio/voices/{id}`	Remove it — also deletes it at the provider.

Field (create)	What it does
`model`	Required. Selects the provider to clone on (e.g. `mistral/voxtral-mini-tts-2603`, `elevenlabs/eleven_multilingual_v2`).
`audio`	Required. The reference sample: a `file_id`, https URL, or base64 data-URI.
`name`	A human label for the voice.
`languages`	Language hints, e.g. `["de"]`.
`gender`	Optional voice gender hint.
`parameters`	Opaque per-provider passthrough for create-voice fields we don’t normalize.
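
When the reference sample is a local recording rather than a stored file or a URL, the audio field can carry it as a base64 data-URI. A minimal sketch, assuming the standard data-URI layout (data:<mime>;base64,<payload>); to_data_uri is a hypothetical helper, and which MIME types a provider accepts is not stated here.

```python
import base64


def to_data_uri(sample: bytes, mime_type: str = "audio/wav") -> str:
    """Encode raw audio bytes as a base64 data-URI."""
    encoded = base64.b64encode(sample).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Stand-in bytes so the example runs without a file. In practice, read your recording:
#   with open("reference.wav", "rb") as f:
#       sample = f.read()
sample = b"RIFF....WAVE"

body = {
    "model": "mistral/voxtral-mini-tts-2603",
    "name": "my-voice",
    "audio": to_data_uri(sample),
}
print(body["audio"][:32])
```

The body can then be sent as the JSON payload of POST /v3/audio/voices, in place of the file_id used in the examples above.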

A clone carries the voice, not a language. Cloning captures timbre and accent; the output language comes from the text you synthesize with a multilingual model — so one clone can speak any supported language. Recording in the target language gives the most native accent.

A few things to know:

Provider binding. A voice is bound to the provider you cloned it on and only works with that provider’s TTS models. Using it with another provider’s model returns a 400.
Isolation & privacy. Voices are scoped to your project — never visible to or usable by another project — and the reference sample is not stored.
Expiry. Some providers auto-delete clones after a while (Mistral keeps one for about 30 days). When they do, expires_at is returned, and once it lapses a speech request returns a clear “no longer available” error; create a new clone to continue. Providers that don’t expire voices omit expires_at.
Sample length. A few seconds is enough for Mistral Voxtral (zero-shot). ElevenLabs works best with ~1–2 minutes of clean audio (avoid more than ~3 minutes); recording quality matters more than length.
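
The provider-binding and expiry rules above can be checked locally before a speech request is sent. This is a sketch under two assumptions: model ids always take the provider/name form seen in the examples, and expires_at is a Unix timestamp in seconds like created. voice_problem is a hypothetical helper, not a gateway call.

```python
import time


def voice_problem(voice, model, now=None):
    """Return a reason the voice cannot be used with `model`, or None if it looks usable."""
    now = time.time() if now is None else now
    provider = model.split("/", 1)[0]  # "mistral/voxtral-..." -> "mistral"
    if voice["provider"] != provider:
        return f"voice is bound to {voice['provider']}, but the model is on {provider}"
    expires_at = voice.get("expires_at")  # omitted by providers that never expire voices
    if expires_at is not None and now >= expires_at:
        return "voice has expired; create a new clone"
    return None


# The create response shown earlier in this section.
voice = {
    "id": "voice_033Qq...",
    "provider": "mistral",
    "created": 1716124800,
    "expires_at": 1718716800,
}
print(voice_problem(voice, "elevenlabs/eleven_multilingual_v2", now=1716124800))
```

A local check like this only saves a round trip; the gateway still returns the 400 or the "no longer available" error if the voice is unusable.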

Speech to text

Pass a model and an audio source: a file_id from a previous generation or upload, an https URL, or a data-URI (max 25 MB decoded synchronously; up to 100 MB in async mode).

A file_id lets you transcribe audio you already have on Opper — a clip you uploaded, or speech you just generated with /v3/audio/speech — without re-sending the bytes. See Files for how files work, lifecycle, and storage quotas.
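
As an illustration of that round trip, the sketch below takes a response from /v3/audio/speech and builds the matching transcription body from its file_id. The helper name is hypothetical; the response shape follows the speech example earlier on this page.

```python
def transcription_payload_from_speech(speech_response, model="openai/whisper-1", language=None):
    """Build a POST /v3/audio/transcriptions body that reuses stored speech audio."""
    file_id = speech_response.get("audio", {}).get("file_id")
    if not file_id:
        # No file_id is expected when storing was disabled or skipped.
        raise ValueError("speech response has no file_id; was the audio stored?")
    payload = {"model": model, "audio": file_id}
    if language:
        payload["language"] = language
    return payload


speech_response = {
    "id": "spch_...",
    "audio": {"b64_json": "SUQzBAAAAAA...", "mime_type": "audio/mpeg", "file_id": "file_..."},
}
print(transcription_payload_from_speech(speech_response, language="en"))
```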

Transcribe a file_id

curl -sX POST https://api.opper.ai/v3/audio/transcriptions \
  -H "Authorization: Bearer $OPPER_API_KEY" -H "Content-Type: application/json" \
  -d '{ "model": "openai/whisper-1", "audio": "file_abc123" }'

Transcribe an https URL

curl -sX POST https://api.opper.ai/v3/audio/transcriptions \
  -H "Authorization: Bearer $OPPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/whisper-1",
    "audio": "https://example.com/meeting.mp3",
    "language": "en"
  }'

import os, requests

r = requests.post(
    "https://api.opper.ai/v3/audio/transcriptions",
    headers={"Authorization": f"Bearer {os.environ['OPPER_API_KEY']}"},
    json={
        "model": "openai/whisper-1",
        "audio": "https://example.com/meeting.mp3",
        "language": "en",
    },
)
r.raise_for_status()
print(r.json()["text"])

Response

{
  "id": "trsc_...",
  "model": "openai/whisper-1",
  "created": 1716124800,
  "text": "Full transcript here...",
  "language": "en",
  "duration": 42.5,
  "segments": [
    { "start": 0.0, "end": 3.2, "text": "Full transcript here..." }
  ],
  "usage": { "cost": 0.004, "seconds": 42.5 }
}

Field	What it does
`model`	Required. The transcription model, e.g. `openai/whisper-1`.
`audio`	Required. A `file_id`, https URL, or data-URI to transcribe.
`language`	ISO-639-1 hint (e.g. `en`); the provider validates it.
`prompt`	Context or vocabulary hint to bias the transcript.
`diarize`	Request speaker labels on segments. Rejected on models that don’t support diarization.
`stream`	Return the transcript live as Server-Sent Events. Supported where `params.stt.stream` is true; a 400 otherwise. Mutually exclusive with `async`. See below.
`async`	Run as a background job for long recordings — returns `202` with a status URL to poll instead of blocking. See below.
`parameters`	Opaque per-provider passthrough.

Transcription is billed per provider-reported audio duration. Segment and word timestamps are returned when the model provides them.
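
When segments are present, they can be turned into a readable, timestamped listing. A minimal sketch that assumes only the start, end, and text keys shown in the response above; format_timestamp and format_segments are hypothetical helpers, and the code falls back to the plain text when a model returns no segments.

```python
def format_timestamp(seconds):
    """Render seconds as MM:SS.s, rounding to tenths first so 59.96 becomes 01:00.0."""
    tenths = round(seconds * 10)
    minutes, rem = divmod(tenths, 600)
    return f"{minutes:02d}:{rem / 10:04.1f}"


def format_segments(transcript):
    """Return one line per segment, or the whole text if there are no segments."""
    segments = transcript.get("segments") or []
    if not segments:
        return [transcript.get("text", "")]
    return [
        f"[{format_timestamp(s['start'])} - {format_timestamp(s['end'])}] {s['text']}"
        for s in segments
    ]


transcript = {
    "text": "Full transcript here...",
    "segments": [{"start": 0.0, "end": 3.2, "text": "Full transcript here..."}],
}
for line in format_segments(transcript):
    print(line)
```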

Live transcript (streaming)

Set stream: true to receive the transcript as it’s produced — as Server-Sent Events instead of one JSON body — for a live transcript while the audio is processed. The stream emits transcript.text.delta chunks, then a terminal transcript.text.done with the full transcript and usage, then data: [DONE]. Streaming is supported on models whose params.stt.stream is true (discover them via GET /v3/audio/models) — e.g. mistral/voxtral-mini-2602; other models return a 400. stream and async are mutually exclusive.

# -N keeps curl from buffering, so events print as they arrive
curl -N -sX POST https://api.opper.ai/v3/audio/transcriptions \
  -H "Authorization: Bearer $OPPER_API_KEY" -H "Content-Type: application/json" \
  -d '{ "model": "mistral/voxtral-mini-2602", "audio": "file_abc123", "stream": true }'

import os, json, requests

with requests.post(
    "https://api.opper.ai/v3/audio/transcriptions",
    headers={"Authorization": f"Bearer {os.environ['OPPER_API_KEY']}"},
    json={"model": "mistral/voxtral-mini-2602", "audio": "file_abc123", "stream": True},
    stream=True,
) as r:
    for line in r.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        event = json.loads(data)
        if event["type"] == "transcript.text.delta":
            print(event["delta"], end="", flush=True)
        elif event["type"] == "transcript.text.done":
            print("\n\n", event["usage"])

Event stream

data: {"type":"transcript.text.delta","delta":"Good morning and "}
data: {"type":"transcript.text.delta","delta":"welcome!"}
data: {"type":"transcript.text.done","id":"trsc_...","text":"Good morning and welcome!","language":"en","duration":101,"usage":{"cost":0.00505,"seconds":101}}
data: [DONE]

Billing is identical to the synchronous call (per provider-reported audio duration). For long recordings that don’t need a live transcript, use async instead; for two-way live voice, use Realtime.
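
The event handling can also be separated from the HTTP call, which makes it easy to exercise against a recorded stream. The sketch below parses the sample event stream shown above; iter_sse_events and collect_transcript are hypothetical names, and only the data: lines documented here are handled (other SSE fields such as event: or id: are ignored).

```python
import json


def iter_sse_events(lines):
    """Yield decoded JSON events from SSE lines, stopping at the [DONE] sentinel."""
    for line in lines:
        if isinstance(line, bytes):  # requests' iter_lines() yields bytes
            line = line.decode("utf-8")
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            return
        yield json.loads(data)


def collect_transcript(lines):
    """Return (text assembled from deltas, the terminal done event or None)."""
    deltas, done = [], None
    for event in iter_sse_events(lines):
        if event["type"] == "transcript.text.delta":
            deltas.append(event["delta"])
        elif event["type"] == "transcript.text.done":
            done = event
    return "".join(deltas), done


recorded = [
    'data: {"type":"transcript.text.delta","delta":"Good morning and "}',
    'data: {"type":"transcript.text.delta","delta":"welcome!"}',
    'data: {"type":"transcript.text.done","id":"trsc_...","text":"Good morning and welcome!",'
    '"language":"en","duration":101,"usage":{"cost":0.00505,"seconds":101}}',
    "data: [DONE]",
]
text, done = collect_transcript(recorded)
print(text, done["usage"])
```

With a live request, pass r.iter_lines() from a requests.post(..., stream=True) response in place of recorded.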

Long recordings (async)

A synchronous request is a good fit up to roughly an hour of audio. For longer recordings, set async: true: the request returns immediately with a 202 and a status_url. Poll it until the job is completed, then fetch the transcript from the returned url. Async accepts the same audio sources as the synchronous call (file_id, https URL, or data-URI). Async also raises the decoded-audio limit from 25 MB to 100 MB (the synchronous cap stays 25 MB). To reach the full 100 MB, pass a file_id or https URL — a data-URI that large would need a ~133 MB request body. This lines up with the Files per-file limit, so a stored file_id transcribes in full.
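
The size arithmetic behind those limits can be sketched as follows. Treating MB as 1,000,000 bytes is an assumption (the text does not say whether limits are decimal or binary), and pick_mode is a hypothetical helper that looks only at size, not at recording length.

```python
MB = 1_000_000  # assumption: decimal megabytes
SYNC_LIMIT = 25 * MB
ASYNC_LIMIT = 100 * MB


def base64_size(n_bytes):
    """Length in characters of the padded base64 encoding of n_bytes."""
    return 4 * ((n_bytes + 2) // 3)


def pick_mode(decoded_bytes):
    """Choose 'sync' or 'async' from the decoded audio size alone."""
    if decoded_bytes <= SYNC_LIMIT:
        return "sync"
    if decoded_bytes <= ASYNC_LIMIT:
        return "async"
    raise ValueError("audio exceeds the 100 MB async limit")


print(pick_mode(60 * MB), base64_size(100 * MB) / MB)
```

Base64 expands data by a factor of 4/3, which is where the roughly 133 MB request body for a 100 MB data-URI comes from. A small file that holds more than about an hour of audio may still be better served by async.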

Submit + poll

# 1. Submit — returns { "id": "gen_...", "status_url": ".../v3/artifacts/{id}/status" }
curl -sX POST https://api.opper.ai/v3/audio/transcriptions \
  -H "Authorization: Bearer $OPPER_API_KEY" -H "Content-Type: application/json" \
  -d '{ "model": "mistral/voxtral-mini-2602", "audio": "file_abc123", "diarize": true, "async": true }'

# 2. Poll — 202 while processing, then 200 with a presigned url to the transcript JSON
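# JOB_ID holds the "id" returned by step 1; quoting the URL keeps curl from treating braces as a glob
curl -s "https://api.opper.ai/v3/artifacts/$JOB_ID/status" \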
  -H "Authorization: Bearer $OPPER_API_KEY"

mistral/voxtral-mini-2602 is an EU-hosted batch transcription model (speaker diarization, word timestamps, up to 3 h per request) — a good fit for the async flow.
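
The poll step can be wrapped in a small loop. This is a sketch rather than a reference client: it relies only on the documented behaviour (202 while processing, 200 when done), takes the HTTP call as a function so it can be tested offline, and uses a hypothetical "status" field in the fake responses. The polling interval and timeout are arbitrary choices.

```python
import time


def poll_until_done(get_status, interval=5.0, timeout=3600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call get_status() until it stops returning 202, then return the final body."""
    deadline = clock() + timeout
    while True:
        status_code, body = get_status()
        if status_code == 200:
            return body
        if status_code != 202:
            raise RuntimeError(f"unexpected status {status_code}: {body}")
        if clock() >= deadline:
            raise TimeoutError("transcription job did not finish in time")
        sleep(interval)


# Offline demonstration with canned responses. Against the gateway, get_status would be
# something like:
#   def get_status():
#       r = requests.get(status_url, headers=h)
#       return r.status_code, r.json()
canned = iter([
    (202, {"status": "processing"}),
    (202, {"status": "processing"}),
    (200, {"status": "completed", "url": "https://example.com/transcript.json"}),
])
result = poll_until_done(lambda: next(canned), sleep=lambda seconds: None)
print(result["url"])
```

Once the loop returns, fetch the transcript JSON from the returned url.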

Discover models

GET /v3/audio/models lists the audio models available, tagged tts or stt, with their voices and formats. Each model’s params.tts / params.stt block also carries a parameters list of the native passthrough keys it accepts (e.g. ElevenLabs’ exact output_format preset, or a language_code), so you can discover the model-specific knobs for parameters rather than guessing:

# all audio models, or filter by type
curl -s "https://api.opper.ai/v3/audio/models?type=tts" \
  -H "Authorization: Bearer $OPPER_API_KEY"
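
Once the list is fetched, picking out capabilities is a matter of walking each model's params block. The sketch below assumes a shape that this page does not spell out: a list of objects, each with an id and a params mapping containing stt.stream and tts.parameters. The sample entries and the parameter names in them are illustrative only, so check a real response before relying on these keys.

```python
def streaming_stt_models(models):
    """Ids of models whose params.stt.stream is true."""
    return [
        m["id"]
        for m in models
        if ((m.get("params") or {}).get("stt") or {}).get("stream") is True
    ]


def passthrough_keys(model, kind="tts"):
    """The native passthrough parameters a model declares for `kind` ('tts' or 'stt')."""
    block = (model.get("params") or {}).get(kind) or {}
    return list(block.get("parameters") or [])


# Hypothetical response entries, for illustration only.
models = [
    {"id": "mistral/voxtral-mini-2602", "params": {"stt": {"stream": True}}},
    {"id": "openai/whisper-1", "params": {"stt": {"stream": False}}},
    {"id": "elevenlabs/eleven_v3", "params": {"tts": {"parameters": ["voice_settings"]}}},
]
print(streaming_stt_models(models))
print(passthrough_keys(models[2]))
```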

What’s next

Realtime voice

Live, two-way voice over WebSocket.

Models

Which models do speech and transcription.

Control Plane

Govern providers, regions, and spend on every call.

Video

Generate video from a prompt or image.
