Modalities
Vision & PDFs
Send images and documents into a model as message content, with optional structured output.
Images
Generate and edit images with POST /v3/images. Sora, GPT Image, Imagen, and more.
Audio
Text to speech and speech to text with POST /v3/audio/speech and /v3/audio/transcriptions.
Video
Generate video from a prompt or reference image with POST /v3/videos.
Realtime voice
Two-way voice over WebSocket. OpenAI, xAI, and Gemini behind one protocol.
Models
The full catalog, with each model’s input and output modalities marked.
Discovering what a model can do
Each modality has a discovery endpoint that reports the models available and their capabilities, so you don’t have to hardcode a list:

| Endpoint | Returns |
|---|---|
| GET /v3/images/models | Models for POST /v3/images, with sizes, aspect ratios, and edit support. |
| GET /v3/audio/models | Speech (tts) and transcription (stt) models, with voices and formats. |
| GET /v3/videos/models | Models for POST /v3/videos, with resolutions, aspect ratios, and max duration. |
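
A minimal sketch of calling a discovery endpoint from Python, using only the standard library. The three paths come from the table above. The base URL, the bearer-token `Authorization` header, and the helper names `build_discovery_request` and `fetch_models` are assumptions made for illustration. The response schema is not described here, so the sketch returns the parsed JSON unchanged rather than guessing at field names.

```python
import json
import urllib.request

# Discovery paths taken from the table above.
DISCOVERY_PATHS = {
    "images": "/v3/images/models",
    "audio": "/v3/audio/models",
    "videos": "/v3/videos/models",
}


def build_discovery_request(base_url: str, api_key: str, modality: str) -> urllib.request.Request:
    """Build (but do not send) a GET request for one modality's model list."""
    if modality not in DISCOVERY_PATHS:
        raise ValueError(f"unknown modality: {modality!r}")
    url = base_url.rstrip("/") + DISCOVERY_PATHS[modality]
    # Assumption: bearer-token auth. Substitute whatever scheme your account uses.
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {api_key}", "Accept": "application/json"},
        method="GET",
    )


def fetch_models(base_url: str, api_key: str, modality: str):
    """Send the request and return the parsed JSON body unchanged."""
    req = build_discovery_request(base_url, api_key, modality)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

Splitting request construction from sending keeps the URL and header logic easy to check without network access. `fetch_models("https://api.example.com", key, "videos")` would then perform the actual call against a placeholder host.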
Input, output, and storage
Generation endpoints (/v3/images, /v3/audio/speech, /v3/videos) return the result inline as base64 by default and also persist a copy to Files, so you get a reusable file_id and a presigned url. That stored output can be fed straight into a later call without re-encoding: for example, an image into a video generation, or an audio file_id into a transcription. Persistence respects your retention rules and is skipped on zero-data-retention projects.
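
Because persistence can be skipped, client code may want to handle both cases when chaining calls. The sketch below shows one way to pick a reference for a later call. It assumes the generation response is a JSON object in which `file_id` appears when a copy was stored. The key `b64_json` for the inline payload is a hypothetical name, as are the helpers `reusable_reference` and `inline_bytes`; check the actual response schema before relying on any of them.

```python
import base64
from typing import Optional, Tuple


def reusable_reference(result: dict) -> Tuple[str, str]:
    """Return (kind, value) describing how to pass this output to a later call.

    Prefers the stored file_id and falls back to the inline base64 payload
    when persistence was skipped (for example on a zero-data-retention project).
    """
    file_id = result.get("file_id")
    if file_id:
        return ("file_id", file_id)
    inline = result.get("b64_json")  # hypothetical key for the inline payload
    if inline:
        return ("base64", inline)
    raise ValueError("result has neither a file_id nor inline data")


def inline_bytes(result: dict) -> Optional[bytes]:
    """Decode the inline base64 payload, or return None if it is absent."""
    inline = result.get("b64_json")  # hypothetical key, as above
    return base64.b64decode(inline) if inline else None
```

Preferring `file_id` avoids re-sending the payload, while the base64 fallback keeps the same code path working on projects where nothing is stored.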