Multimodality

The gateway isn’t text-only. The same endpoint and API key reach models that see images and PDFs, generate images, speak and transcribe, and produce video — each with the routing, governance, and tracing you get on every call.

Modalities

Vision & PDFs

Send images and documents into a model as message content, with optional structured output.

Images

Generate and edit images with POST /v3/images. Sora, GPT Image, Imagen, and more.

Audio

Text-to-speech and speech-to-text with POST /v3/audio/speech and POST /v3/audio/transcriptions.

Video

Generate video from a prompt or reference image with POST /v3/videos.

Realtime voice

Two-way voice over WebSocket. OpenAI, xAI, and Gemini behind one protocol.

Models

The full catalog, with each model’s input and output modalities marked.

Discovering what a model can do

Each modality has a discovery endpoint that reports the models available and their capabilities, so you don’t have to hardcode a list:

| Endpoint | Returns |
| --- | --- |
| `GET /v3/images/models` | Models for `POST /v3/images`, with sizes, aspect ratios, and edit support. |
| `GET /v3/audio/models` | Speech (`tts`) and transcription (`stt`) models, with voices and formats. |
| `GET /v3/videos/models` | Models for `POST /v3/videos`, with resolutions, aspect ratios, and max duration. |

```shell
curl -s "https://api.opper.ai/v3/images/models" \
  -H "Authorization: Bearer $OPPER_API_KEY"
```
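The same request shape applies to the other two discovery endpoints. As a minimal sketch, the curl call above can be wrapped in a small helper that takes the modality as an argument. The function names `discovery_url` and `list_models` are made up for illustration and are not part of the API. The response schema is not described on this page, so the helper prints the body unparsed.

```shell
# Base URL taken from the curl example above.
BASE_URL="https://api.opper.ai"

# Build the discovery URL for one modality: images, audio, or videos.
# (Helper name is hypothetical; only the three paths come from the table.)
discovery_url() {
  case "$1" in
    images|audio|videos)
      printf '%s/v3/%s/models\n' "$BASE_URL" "$1"
      ;;
    *)
      echo "unknown modality: $1" >&2
      return 1
      ;;
  esac
}

# Fetch the raw discovery response for one modality.
# Expects OPPER_API_KEY to be set in the environment.
list_models() {
  _url=$(discovery_url "$1") || return 1
  curl -s "$_url" -H "Authorization: Bearer $OPPER_API_KEY"
}
```

With this in place, `list_models audio` or `list_models videos` would issue the corresponding `GET` request, and an unsupported argument fails before any network call is made.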

Input, output, and storage

Generation endpoints (/v3/images, /v3/audio/speech, /v3/videos) return the result inline as base64 by default and also persist a copy to Files, which gives you a reusable file_id and a presigned url. That stored output can be fed straight into a later call without re-encoding: an image into a video generation, or an audio file_id into a transcription. Stored outputs are kept until you delete them; on zero-data-retention projects, persistence is skipped.
