messages array, with a richer content field — no separate endpoint. Vision and PDF are model capabilities, so you send the media to a regular chat model that supports them.
Not every model does. Filter the catalog by capability — or call GET /v3/models?capability=vision (images) or ?capability=pdf (documents) — to find ones that accept each. The Claude, Gemini, and GPT families support both.
Images
Two ways to send an image: a hosted URL or inline base64.PDFs
PDFs work the same way. The model reads both the text and any embedded images (charts, diagrams, scanned pages).Python
Free text or structured output
| Need | Reach for |
|---|---|
| Show an image and ask a free-text question about it | A plain message (this page) |
| Extract structured fields from an image or PDF (a receipt, an invoice, a form) | Structured output with response_format |
| Run a multi-turn conversation about an uploaded document | A plain message (this page) |
| Batch process documents into a database | Structured output |
response_format when you want typed JSON out of an image or PDF. Leave it off when the model just needs to talk about the media.
What’s next
Structured output
Multimodal input with typed JSON output.
Conversations
Multi-turn chat. Works with image and PDF messages too.
Models
Which models accept which input types.