The gateway covers both directions of audio with two synchronous endpoints:
  • Text to speech: POST /v3/audio/speech turns text into spoken audio.
  • Speech to text: POST /v3/audio/transcriptions turns audio into a transcript.
Both are provider-agnostic: model selects the provider, a small set of fields is normalized for you, and anything model-specific goes in parameters. For live, two-way voice conversations, use Realtime instead.

Text to speech

Pass a model and the input text. The response carries the audio inline as base64 and, by default, a stored file_id + presigned url.
curl -sX POST https://api.opper.ai/v3/audio/speech \
  -H "Authorization: Bearer $OPPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/tts-1",
    "input": "Hello from Opper.",
    "voice": "alloy",
    "format": "mp3"
  }'
Response
{
  "id": "spch_...",
  "model": "openai/tts-1",
  "created": 1716124800,
  "audio": {
    "b64_json": "SUQzBAAAAAA...",
    "mime_type": "audio/mpeg",
    "file_id": "file_...",
    "url": "https://...presigned..."
  },
  "usage": { "cost": 0.0009, "characters": 17 }
}
| Field | What it does |
| --- | --- |
| model | Required. The TTS model, e.g. openai/tts-1. |
| input | Required. The text to speak. |
| voice | Provider voice; empty uses the model’s default. Validated against the model’s declared voices. |
| format | mp3 (default), wav, opus, aac, flac, or pcm. |
| speed | 0.25 to 4.0, where the model supports it. |
| store | Persist the audio and return a file_id + url. Defaults to true; respects retention rules. |
| parameters | Opaque per-provider passthrough. |
Speech is billed per input character.
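The request fields and the inline base64 audio can be handled client-side with a few lines of code. The following is a minimal Python sketch, not an official client: the helper names `build_speech_request` and `save_speech_audio` are hypothetical, and the checks only mirror the field table above (the gateway remains the authority on what a given model accepts).

```python
import base64
from pathlib import Path

# Formats and speed range as listed in the field table above.
ALLOWED_FORMATS = {"mp3", "wav", "opus", "aac", "flac", "pcm"}
MIN_SPEED, MAX_SPEED = 0.25, 4.0


def build_speech_request(model, text, voice="", audio_format="mp3", speed=None):
    """Assemble a JSON body for POST /v3/audio/speech, failing fast on bad values."""
    if not model or not text:
        raise ValueError("model and input are both required")
    if audio_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {audio_format!r}")
    body = {"model": model, "input": text, "format": audio_format}
    if voice:  # leaving voice out lets the model use its default
        body["voice"] = voice
    if speed is not None:
        if not MIN_SPEED <= speed <= MAX_SPEED:
            raise ValueError(f"speed must be between {MIN_SPEED} and {MAX_SPEED}")
        body["speed"] = speed
    return body


def save_speech_audio(response, path):
    """Decode audio.b64_json from a speech response and write the bytes to path."""
    audio_bytes = base64.b64decode(response["audio"]["b64_json"])
    Path(path).write_bytes(audio_bytes)
    return len(audio_bytes)
```

Serialize the body with `json.dumps` and send it with any HTTP client; once the JSON response is parsed into a dict, `save_speech_audio(response, "hello.mp3")` writes the clip to disk without a second download from the presigned url.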

Speech to text

Pass a model and an audio source: a file_id from a previous generation or upload, an https URL, or a data-URI (max 25 MB decoded).
A file_id lets you transcribe audio you already have on Opper — a clip you uploaded, or speech you just generated with /v3/audio/speech — without re-sending the bytes. See Files for how files work, lifecycle, and storage quotas.
Transcribe a file_id
curl -sX POST https://api.opper.ai/v3/audio/transcriptions \
  -H "Authorization: Bearer $OPPER_API_KEY" -H "Content-Type: application/json" \
  -d '{ "model": "openai/whisper-1", "audio": "file_abc123" }'
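Chaining the two endpoints is mostly a matter of carrying the file_id across. As a rough sketch in Python, assuming the speech response has already been parsed into a dict (the helper name `transcription_request_from_speech` is hypothetical):

```python
def transcription_request_from_speech(speech_response, stt_model, language=None):
    """Build a /v3/audio/transcriptions body that reuses stored speech audio."""
    file_id = speech_response.get("audio", {}).get("file_id")
    if not file_id:
        # A file_id is only returned when the audio was stored (store defaults to true).
        raise ValueError("speech response has no audio.file_id")
    body = {"model": stt_model, "audio": file_id}
    if language:
        body["language"] = language  # ISO-639-1 hint, e.g. "en"
    return body
```

Because only the identifier travels in the request, the audio bytes are never re-uploaded.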
Transcribe a URL
curl -sX POST https://api.opper.ai/v3/audio/transcriptions \
  -H "Authorization: Bearer $OPPER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/whisper-1",
    "audio": "https://example.com/meeting.mp3",
    "language": "en"
  }'
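The third source type, a data-URI, suits short local clips. Here is a minimal Python sketch of building one; the helper name `audio_data_uri` is hypothetical, and the size check assumes the conservative reading of "25 MB" as 25,000,000 decoded bytes, which the source does not spell out.

```python
import base64

# Assumption: treat "25 MB" as 25,000,000 bytes, the stricter of the two common readings.
MAX_DECODED_BYTES = 25_000_000


def audio_data_uri(data, mime_type="audio/mpeg"):
    """Encode raw audio bytes as a base64 data-URI for the "audio" field."""
    if len(data) > MAX_DECODED_BYTES:
        raise ValueError(
            f"audio is {len(data)} bytes; the limit is {MAX_DECODED_BYTES} decoded"
        )
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The limit applies to the decoded audio, so the check runs on the raw bytes rather than on the longer base64 string. Larger recordings are better passed as a file_id or an https URL.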
Response
{
  "id": "trsc_...",
  "model": "openai/whisper-1",
  "created": 1716124800,
  "text": "Full transcript here...",
  "language": "en",
  "duration": 42.5,
  "segments": [
    { "start": 0.0, "end": 3.2, "text": "Full transcript here..." }
  ],
  "usage": { "cost": 0.004, "seconds": 42.5 }
}
| Field | What it does |
| --- | --- |
| model | Required. The transcription model, e.g. openai/whisper-1. |
| audio | Required. A file_id, https URL, or data-URI to transcribe. |
| language | ISO-639-1 hint (e.g. en); the provider validates it. |
| prompt | Context or vocabulary hint to bias the transcript. |
| diarize | Request speaker labels on segments. Rejected on models that don’t support diarization. |
| parameters | Opaque per-provider passthrough. |
Transcription is billed per provider-reported audio duration. Segment and word timestamps are returned when the model provides them.
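Since segments are only present when the model provides them, client code should tolerate their absence. The sketch below, in Python, renders a parsed transcription response as timestamped lines and falls back to the plain text otherwise. The helper names are hypothetical, and only the `start`, `end`, and `text` keys shown in the sample response are used.

```python
def format_timestamp(seconds):
    """Render a second count as MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    minutes, remainder_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(remainder_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def render_segments(transcript):
    """Return one line per segment, or the full text when no segments exist."""
    segments = transcript.get("segments") or []
    if not segments:
        return [transcript.get("text", "")]
    return [
        f"[{format_timestamp(seg['start'])} - {format_timestamp(seg['end'])}] {seg['text']}"
        for seg in segments
    ]
```

For the sample response above, `render_segments` yields a single line beginning with `[00:00.000 - 00:03.200]`.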

Discover models

GET /v3/audio/models lists the audio models available, tagged tts or stt, with their voices and formats:
# all audio models, or filter by type
curl -s "https://api.opper.ai/v3/audio/models?type=tts" \
  -H "Authorization: Bearer $OPPER_API_KEY"
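The same call can be prepared from Python with the standard library. This is a sketch under assumptions: the helper name `models_request` is hypothetical, and `type=stt` is inferred from the tts/stt tags rather than shown in the source.

```python
import urllib.parse
import urllib.request

MODELS_URL = "https://api.opper.ai/v3/audio/models"


def models_request(api_key, model_type=None):
    """Prepare a GET request for the audio model list, optionally filtered by type."""
    url = MODELS_URL
    if model_type is not None:
        if model_type not in ("tts", "stt"):  # assumption: these are the only tags
            raise ValueError(f"unknown model type: {model_type!r}")
        url += "?" + urllib.parse.urlencode({"type": model_type})
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )
```

Pass the result to `urllib.request.urlopen` and parse the body with `json.load`. The response shape is not documented here, so inspect it before relying on particular keys.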

What’s next

  • Realtime voice: Live, two-way voice over WebSocket.
  • Models: Which models do speech and transcription.
  • Control Plane: Govern providers, regions, and spend on every call.
  • Video: Generate video from a prompt or image.