- Text to speech: `POST /v3/audio/speech` turns text into spoken audio.
- Speech to text: `POST /v3/audio/transcriptions` turns audio into a transcript.

`model` selects the provider, a small set of fields is normalized for you, and anything model-specific goes in `parameters`. For live, two-way voice conversations, use Realtime instead.
Text to speech
Pass a `model` and the `input` text. The response carries the audio inline as base64 and, by default, a stored `file_id` and a presigned `url`.
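As a rough sketch of handling that response, the snippet below decodes the inline audio and picks up the stored file reference. The field name `audio` for the base64 payload is an assumption (the text above names only `file_id` and `url`), and the sample body is made up for illustration.

```python
import base64
import json
from typing import Optional, Tuple


def extract_audio(response_body: str) -> Tuple[bytes, Optional[str], Optional[str]]:
    """Return (audio_bytes, file_id, url) from a speech response body."""
    data = json.loads(response_body)
    # "audio" is an ASSUMED name for the inline base64 field.
    audio_bytes = base64.b64decode(data["audio"])
    # file_id and url may be missing if the audio was not stored.
    return audio_bytes, data.get("file_id"), data.get("url")


# Hypothetical response body, for illustration only.
sample = json.dumps({
    "audio": base64.b64encode(b"fake-mp3-bytes").decode("ascii"),
    "file_id": "file_123",
    "url": "https://example.com/audio.mp3",
})
audio, file_id, url = extract_audio(sample)
```

Writing `audio` to disk with `open("speech.mp3", "wb")` would then give a playable file, provided the request used the default `mp3` format.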
| Field | What it does |
|---|---|
| `model` | Required. The TTS model, e.g. `openai/tts-1`. |
| `input` | Required. The text to speak. |
| `voice` | Provider voice; empty uses the model’s default. Validated against the model’s declared voices. |
| `format` | `mp3` (default), `wav`, `opus`, `aac`, `flac`, or `pcm`. |
| `speed` | 0.25–4.0 where the model supports it. |
| `store` | Persist the audio and return a `file_id` and `url`. Defaults to `true`; respects retention rules. |
| `parameters` | Opaque per-provider passthrough. |
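The fields above could be assembled into a request along these lines. This is a minimal sketch, not an official client: `BASE_URL`, the Bearer authorization header, and both helper names are assumptions, and the local checks only mirror the documented format list and speed range. Nothing is sent over the network; the code only builds the request object.

```python
import json
import urllib.request

# ASSUMPTIONS: placeholder host and Bearer auth; substitute your real values.
BASE_URL = "https://api.example.com"
ALLOWED_FORMATS = {"mp3", "wav", "opus", "aac", "flac", "pcm"}


def build_speech_body(model, text, voice="", fmt="mp3", speed=None,
                      store=True, parameters=None):
    """Check the documented fields locally and return the JSON body."""
    if not model or not text:
        raise ValueError("model and input are required")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if speed is not None and not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    body = {"model": model, "input": text, "format": fmt, "store": store}
    if voice:  # an empty voice means "use the model's default"
        body["voice"] = voice
    if speed is not None:
        body["speed"] = speed
    if parameters:  # passed through untouched to the provider
        body["parameters"] = parameters
    return body


def build_speech_request(body, api_key):
    """Wrap the body in a POST request object without sending it."""
    return urllib.request.Request(
        BASE_URL + "/v3/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


body = build_speech_body("openai/tts-1", "Hello there.", speed=1.25)
request = build_speech_request(body, api_key="YOUR_API_KEY")
```

Passing `request` to `urllib.request.urlopen` would perform the call. Voice validation is left to the server here, since the set of valid voices depends on the model.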
Speech to text
Pass a `model` and an audio source: a `file_id` from a previous generation or upload, an https URL, or a data-URI (max 25 MB decoded).
A `file_id` lets you transcribe audio you already have on Opper (a clip you uploaded, or speech you just generated with `/v3/audio/speech`) without re-sending the bytes. See Files for how files work, lifecycle, and storage quotas.
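To make the three source types concrete, here is a small sketch that wraps raw bytes in a data-URI and tells the accepted forms apart. Both helper names are hypothetical, the default MIME type is an assumption, and "25 MB" is read as 25 × 1024 × 1024 bytes, which the text above does not confirm.

```python
import base64

# ASSUMPTION: "25 MB decoded" is interpreted as 25 MiB.
MAX_DATA_URI_BYTES = 25 * 1024 * 1024


def to_data_uri(audio_bytes: bytes, mime_type: str = "audio/mpeg") -> str:
    """Encode raw audio as a base64 data-URI, enforcing the size limit."""
    if len(audio_bytes) > MAX_DATA_URI_BYTES:
        raise ValueError("audio exceeds the 25 MB data-URI limit")
    encoded = base64.b64encode(audio_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def classify_audio_source(audio: str) -> str:
    """Return 'data_uri', 'https_url', or 'file_id' for an audio value."""
    if audio.startswith("data:"):
        return "data_uri"
    if audio.startswith("https://"):
        return "https_url"
    if audio.startswith("http://"):
        raise ValueError("only https URLs are accepted")
    # Anything else is treated as a file_id from an upload or generation.
    return "file_id"


uri = to_data_uri(b"abc")
kinds = [classify_audio_source(s)
         for s in (uri, "https://example.com/clip.mp3", "file_123")]
```

For audio that already lives on the platform, passing the `file_id` avoids the encoding step and the size limit check altogether.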
| Field | What it does |
|---|---|
| `model` | Required. The transcription model, e.g. `openai/whisper-1`. |
| `audio` | Required. A `file_id`, https URL, or data-URI to transcribe. |
| `language` | ISO-639-1 hint (e.g. `en`); the provider validates it. |
| `prompt` | Context or vocabulary hint to bias the transcript. |
| `diarize` | Request speaker labels on segments. Rejected on models that don’t support diarization. |
| `parameters` | Opaque per-provider passthrough. |
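A transcription body built from these fields might look like the sketch below. The helper name is hypothetical, and the two-letter check on `language` is only a local sanity check on the ISO-639-1 shape; the provider still does the real validation. The `file_id` value is a made-up placeholder.

```python
import json


def build_transcription_body(model, audio, language=None, prompt=None,
                             diarize=False, parameters=None):
    """Return the JSON body for POST /v3/audio/transcriptions."""
    if not model or not audio:
        raise ValueError("model and audio are required")
    if language is not None:
        # ISO-639-1 codes are two letters, e.g. "en".
        if len(language) != 2 or not language.isalpha():
            raise ValueError("language should be a two-letter ISO-639-1 code")
        language = language.lower()
    body = {"model": model, "audio": audio}
    if language:
        body["language"] = language
    if prompt:
        body["prompt"] = prompt
    if diarize:  # only send when requested; unsupported models reject it
        body["diarize"] = True
    if parameters:
        body["parameters"] = parameters
    return body


body = build_transcription_body(
    "openai/whisper-1",
    "file_123",  # placeholder file_id
    language="en",
    prompt="Opper, diarization",
)
payload = json.dumps(body)
```

Leaving `diarize` out of the body unless it is requested keeps the same helper usable with models that would reject the flag.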
Discover models
`GET /v3/audio/models` lists the available audio models, each tagged `tts` or `stt`, along with their voices and formats.
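Once the list is fetched, it can be filtered by tag. The response shape is not shown here, so the sketch below assumes each entry is an object with `id`, `type`, `voices`, and `formats` keys; those key names and the sample entries are hypothetical and should be adjusted to the real payload.

```python
def models_by_kind(models, kind):
    """Return the ids of models whose (assumed) 'type' tag equals kind."""
    return [m["id"] for m in models if m.get("type") == kind]


def supports_voice(models, model_id, voice):
    """Check whether a model declares the given voice."""
    for m in models:
        if m["id"] == model_id:
            return voice in m.get("voices", [])
    return False


# Hypothetical payload, for illustration only; the real shape may differ.
sample_models = [
    {"id": "openai/tts-1", "type": "tts",
     "voices": ["alloy"], "formats": ["mp3", "wav"]},
    {"id": "openai/whisper-1", "type": "stt",
     "voices": [], "formats": []},
]

tts_ids = models_by_kind(sample_models, "tts")
stt_ids = models_by_kind(sample_models, "stt")
has_alloy = supports_voice(sample_models, "openai/tts-1", "alloy")
```

Checking the declared voices before a speech call mirrors the server-side validation of `voice` and catches a typo before the request is made.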
What’s next
Realtime voice
Live, two-way voice over WebSocket.
Models
Which models do speech and transcription.
Control Plane
Govern providers, regions, and spend on every call.
Video
Generate video from a prompt or image.