Control Plane features are in early access and need to be turned on per account. Contact support@opper.ai if you’re interested.

Observe runs an LLM judge over your generations and writes the score back to the span. You define one or more Observe rules. Each rule picks a judge tier, a sampling strategy, the kind of score you want (continuous or binary), and the scope it applies to. Results show up on the span in the trace view next to latency and cost.

Rules

Each Observe rule is one independent judge. You can keep a single org-wide rule, run more specialized rules on specific functions, or do both. Rules with overlapping scope all fire — there’s no priority resolution. Open platform.opper.ai and navigate to Controls → Observe to add or edit rules.

Configure

Each rule has the following fields.

Judge

  • Fast — cheapest, quickest. Good while you’re tuning.
  • Balanced (default) — recommended for most rules.
  • Thorough — highest-quality judgment at higher cost; use for rules where decisions are expensive.

Sample

  • All — every generation is evaluated.
  • Rate — 1 in N generations. Useful on high-volume functions where evaluating every call is overkill.
  • Adaptive — up to N per window (1h or 24h). Evaluates up to the cap, then tapers — 50% at 2× volume, 25% at 4× — so spend stays predictable under traffic spikes.
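
The Rate and Adaptive strategies can be read as per-generation sampling probabilities. The sketch below is illustrative only and is not Opper's implementation: every function name is hypothetical, and the inverse taper (cap divided by window volume) is an assumption, chosen because it reproduces the two documented points of 50% at 2× and 25% at 4× the cap.

```python
import random


def rate_probability(n: int) -> float:
    """Rate sampling: evaluate roughly 1 in n generations."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / n


def adaptive_probability(seen_in_window: int, cap: int) -> float:
    """Adaptive sampling under an ASSUMED taper of cap / volume.

    seen_in_window is the number of generations observed so far in the
    current window (1h or 24h); cap is the rule's N.
    """
    if cap < 1:
        raise ValueError("cap must be at least 1")
    if seen_in_window <= cap:
        return 1.0  # below the cap, every generation is evaluated
    return cap / seen_in_window  # 0.5 at 2x the cap, 0.25 at 4x


def should_evaluate(probability: float, rng: random.Random) -> bool:
    """Draw once against the sampling probability."""
    return rng.random() < probability
```

With a cap of 100, `adaptive_probability(200, 100)` gives 0.5 and `adaptive_probability(400, 100)` gives 0.25. How the real taper behaves between and beyond those two points is not stated in this page.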

Score type

  • Score — a continuous 0–1 number. You set a threshold (“Flag below X”, 0–1, step 0.05); generations below it are marked failing. Criteria is optional — leave it blank for general quality, or click Add custom criteria to write your own.
  • Binary — a strict 0/1 verdict (“0 means the generation didn’t pass”). No threshold; criteria is required. Example: “Return 1 if the response cites a source, otherwise 0.”
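
The pass/fail logic for the two score types can be sketched as follows. The function name and the string labels `"score"` and `"binary"` are hypothetical and do not come from an Opper API. The sketch reads "Flag below X" as strictly below, so a score equal to the threshold passes.

```python
from typing import Optional


def is_flagged(score: float, score_type: str, threshold: Optional[float] = None) -> bool:
    """Return True when a generation counts as failing under the rule."""
    if score_type == "binary":
        # Binary verdicts are strictly 0 or 1; 0 means the generation did not pass.
        if score not in (0, 1):
            raise ValueError("binary score must be 0 or 1")
        return score == 0
    if score_type == "score":
        if threshold is None:
            raise ValueError("a Score rule needs a threshold")
        if not 0.0 <= score <= 1.0:
            raise ValueError("score must be between 0 and 1")
        # "Flag below X": only scores strictly below the threshold fail.
        return score < threshold
    raise ValueError(f"unknown score type: {score_type!r}")
```

For example, with a threshold of 0.5 a score of 0.4 is flagged and a score of 0.5 is not.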

Criteria

Free-text instructions (up to 4096 characters) telling the judge what to score on. Required for Binary; optional for Score (defaults to a general quality evaluation).
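
The field constraints above can be collected into a small validation sketch. `ObserveRuleDraft` and `validation_errors` are hypothetical names used only for illustration; the platform performs its own validation and may word or order its checks differently. The 0.05 threshold step is left out here on the assumption that it is a property of the UI control.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_CRITERIA_CHARS = 4096  # limit stated for the Criteria field


@dataclass
class ObserveRuleDraft:
    score_type: str                    # "score" or "binary" (hypothetical labels)
    criteria: str = ""                 # free-text instructions for the judge
    threshold: Optional[float] = None  # "Flag below X"; Score rules only


def validation_errors(rule: ObserveRuleDraft) -> List[str]:
    """Collect every constraint the draft violates; an empty list means valid."""
    errors: List[str] = []
    if rule.score_type not in ("score", "binary"):
        errors.append("Score type must be 'score' or 'binary'")
    if len(rule.criteria) > MAX_CRITERIA_CHARS:
        errors.append("Criteria is limited to 4096 characters")
    if rule.score_type == "binary":
        if not rule.criteria.strip():
            errors.append("Binary rules require criteria")
        if rule.threshold is not None:
            errors.append("Binary rules take no threshold")
    if rule.score_type == "score":
        if rule.threshold is None or not 0.0 <= rule.threshold <= 1.0:
            errors.append("Score rules need a threshold between 0 and 1")
    return errors
```

A Score rule with blank criteria and a threshold of 0.7 passes, while a Binary rule with blank criteria yields one error.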

Scope

Each rule applies to one of three scopes:
  • Organization — every project and every function in your org.
  • Projects — one or more specific projects.
  • Functions — specific functions inside a project.
Rules at different scopes can coexist. A typical pattern is a Thorough Binary rule on one critical function plus a Fast Score rule across the whole org.
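
Because overlapping rules all fire with no priority resolution, the set of rules that applies to a generation is the set of rules whose scope contains it. The sketch below models that. The `Rule` class, its fields, and the project and function names are hypothetical, not Opper's data model.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Rule:
    name: str
    scope: str                                # "organization", "projects", or "functions"
    projects: FrozenSet[str] = frozenset()    # used by project and function scopes
    functions: FrozenSet[str] = frozenset()   # used by function scope only


def rule_applies(rule: Rule, project: str, function: str) -> bool:
    """True when the generation's project and function fall inside the rule's scope."""
    if rule.scope == "organization":
        return True
    if rule.scope == "projects":
        return project in rule.projects
    if rule.scope == "functions":
        return project in rule.projects and function in rule.functions
    raise ValueError(f"unknown scope: {rule.scope!r}")


def matching_rules(rules: List[Rule], project: str, function: str) -> List[Rule]:
    """Every matching rule fires; nothing is dropped in favor of a narrower rule."""
    return [r for r in rules if rule_applies(r, project, function)]
```

With an org-wide rule and a function-level rule on `answer` in project `support`, a generation from that function matches both rules, and each would produce its own Observe event.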

In traces

When a rule fires on a generation, the span carries:
  • A score gauge in the span header.
  • A short written observation under the header.
  • A collapsible Scorer Breakdown with per-scorer scores and (for rubric-style scorers) per-criteria pass/fail with explanations.
  • An entry on the span event timeline labeled Observe (eye icon) with status Passed or Flagged. When the rule has a name, that name leads the entry; a scope badge (Org-level / Project-level / Function-level) sits next to each event so you can see which rule fired.

[Figure: Observation on a trace span]

The traces table also has a scores column so you can scan a list of generations and spot low scores at a glance.

In playground

Observe rules apply to playground calls when the Project controls toggle in the task sidebar is on. The Observe results don’t render inline in the playground output — to see the score gauge, observation, scorer breakdown, and Observe event for a playground run, click the trace ↗ link in the output footer. To test a function as if no controls were in place, switch Project controls off for the run.

Start with All sampling on a Score rule while you’re tuning a new function. Switch to Rate or Adaptive once volume picks up.

Use Binary when the question is yes/no, for example “did the response cite a source” or “did it return valid JSON”. Use Score when you want a graded signal and a tunable threshold.