Evaluations
Evaluation is a critical part of building reliable LLM applications. It helps you verify your function behaves correctly across different inputs, compare performance between different models and prompts, track quality over time as you make changes, and build confidence in your LLM features before or during deployment.
Opper provides two types of evaluation:
- Online Evaluation: Automatically runs on every function call in real time, providing immediate feedback on output quality and correctness.
- Offline Evaluation: Runs against your curated datasets, allowing you to systematically test your function against known test cases and ensure consistent behavior across important scenarios.
Together, these evaluation methods help you maintain high quality standards: catching issues as they happen while also verifying reliability across critical test cases.
Using Evaluation in Your Code
The Opper SDKs include support for creating custom evaluators and running evaluations directly in your code. This enables you to programmatically evaluate your LLM outputs using metrics tailored to your specific use case.
Creating Custom Evaluators
Custom evaluators are functions that take an LLM result and return metrics. Here's a simple example:
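The exact evaluator interface depends on the SDK you use; the sketch below is plain Python, and the `Metric` container is an assumed shape for illustration rather than an official SDK type.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    """Illustrative metric container; the SDK may expose its own metric type."""
    dimension: str      # aspect being measured, e.g. "length" or "format"
    value: float        # score, normalized here to 0.0-1.0
    comment: str = ""   # optional human-readable explanation

def length_evaluator(result: str, max_words: int = 100) -> list[Metric]:
    """Deterministic evaluator: check that the output stays within a word budget."""
    word_count = len(result.split())
    if word_count <= max_words:
        score = 1.0
    else:
        score = max(0.0, 1.0 - (word_count - max_words) / max_words)
    return [
        Metric(
            dimension="length",
            value=score,
            comment=f"{word_count} words (budget: {max_words})",
        )
    ]
```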
Creating LLM-based Evaluators
You can also create evaluators that use LLM calls for more subjective evaluations:
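The sketch below reuses the `Metric` container from the previous example. The `judge` parameter is a hypothetical stand-in for whatever LLM call you make (for example, an Opper function call that returns the judge model's text); its name and signature are assumptions for illustration.

```python
import json

def tone_evaluator(result: str, judge) -> list[Metric]:
    """LLM-as-judge evaluator: `judge` is any callable that takes a prompt string
    and returns the model's text response (hypothetical helper, not an SDK API)."""
    prompt = (
        "Rate the following response for professional tone on a scale from 0 to 1.\n"
        'Reply with JSON only: {"score": <float>, "reason": "<short explanation>"}\n\n'
        f"Response:\n{result}"
    )
    raw = judge(prompt)
    parsed = json.loads(raw)  # assumes the judge reliably returns valid JSON
    return [
        Metric(
            dimension="tone",
            value=float(parsed["score"]),
            comment=parsed.get("reason", ""),
        )
    ]
```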
Running Evaluations
Here's a simple example of running evaluators on an LLM response:
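Continuing the sketches above, a minimal runner could look like the following; how you obtain `response` (for example, from an Opper function call) and where you report the resulting metrics are up to your application.

```python
def run_evaluation(result: str, evaluators) -> dict[str, list[Metric]]:
    """Apply a list of evaluator callables to one LLM result and group metrics by dimension."""
    metrics: dict[str, list[Metric]] = {}
    for evaluator in evaluators:
        for metric in evaluator(result):
            metrics.setdefault(metric.dimension, []).append(metric)
    return metrics

# Example usage with the length evaluator defined earlier.
response = "Quarterly revenue grew 12% year over year, driven by new enterprise deals."
for dimension, items in run_evaluation(response, [length_evaluator]).items():
    for metric in items:
        print(f"{dimension}: {metric.value:.2f} ({metric.comment})")
```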
Integration with Tracing
Evaluations are automatically associated with their corresponding trace spans, making it easy to see evaluation results in the context of your application's execution flow. This integration provides:
- A complete picture of function performance across multiple calls
- The ability to drill down into specific traces to understand evaluation results in context
- Easy access to all metrics from a specific evaluation run
See Tracing for more information on how to use traces in your application.
Best Practices
When implementing evaluations in your application, consider these best practices:
- Define clear success criteria: Determine what "good" looks like for your specific application and create evaluators that measure those aspects.
- Combine multiple evaluators: Use multiple evaluators to measure different aspects of your function's performance (correctness, tone, format, etc.).
- Start with simple metrics: Begin with simple, deterministic evaluators before moving to more complex LLM-based evaluations.
- Use evaluations for regression testing: Run evaluations before and after making changes to ensure quality doesn't degrade (see the sketch after this list).
- Include examples from real user interactions: Add problematic outputs from real usage to your test dataset to ensure your function handles edge cases correctly.
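As a rough illustration of the regression-testing practice above, the sketch below builds on the `run_evaluation` helper from the earlier example; the tolerance value and the way you capture outputs before and after a change are assumptions made for the example.

```python
def regression_check(old_output: str, new_output: str, evaluators,
                     tolerance: float = 0.05) -> bool:
    """Compare average metric scores before and after a change and flag
    regressions larger than `tolerance`."""
    def average_score(output: str) -> float:
        metrics = run_evaluation(output, evaluators)
        values = [m.value for items in metrics.values() for m in items]
        return sum(values) / len(values) if values else 0.0

    baseline = average_score(old_output)
    candidate = average_score(new_output)
    print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
    return candidate >= baseline - tolerance
```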
More Examples
For more examples and patterns, check out:
Offline Evaluation
Offline evaluations let you systematically test your function against a dataset of input/output pairs (see Datasets).

Each dataset entry contains an input to test and the expected output to compare against. When you run an evaluation, Opper:
- Executes your function on each input in the dataset
- Compares the function's output to the expected output
- Generates metrics on how well the outputs match expectations
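Conceptually, a run amounts to the loop below. This is only an illustration of what the platform does on your behalf, not code you need to write; `call_function` and `compare` are hypothetical helpers.

```python
def offline_evaluation(dataset, call_function, compare):
    """Conceptual sketch of an offline evaluation run.

    dataset:       iterable of (input, expected_output) pairs
    call_function: runs your LLM function on one input and returns its output
    compare:       scores how closely an output matches the expected output (0.0-1.0)
    """
    scores = []
    for entry_input, expected_output in dataset:
        actual_output = call_function(entry_input)               # 1. execute the function
        scores.append(compare(actual_output, expected_output))   # 2. compare and score
    overall = sum(scores) / len(scores) if scores else 0.0       # 3. aggregate metrics
    return overall, scores
```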
Run an evaluation
To run an evaluation:
- Navigate to a function's evaluation tab
- Click "Run Evaluation" - this will be pre-filled with your current function configuration
- Optionally modify the instructions, model, or other settings to test different combinations
- Click run to start the evaluation against your dataset
The evaluation process typically takes between 30 seconds and a few minutes, depending on your dataset size.
Inspecting a run
Once complete, you'll see:
- An overall score showing how well the outputs matched expectations across all entries
- Individual results for each dataset entry showing:
  - The model's actual output
  - The expected output
  - A detailed comparison highlighting any differences
  - The evaluation score for that specific entry
This detailed breakdown helps you identify exactly where your function is performing well and where it may need improvements.
Online Evaluation
Online evaluation automatically assesses every function call in real time. Using AI, it evaluates how well each output aligns with the function's instructions, input schema, and examples. This provides instant feedback on output quality and surfaces potential issues immediately.
When calls receive low evaluation scores, they make excellent candidates for your test dataset. Adding these edge cases with corrected outputs helps build comprehensive test coverage and guides improvements to your model selection and instructions.

The platform aggregates online evaluation scores to show your function's real-world performance. You can also view AI-generated suggestions for improving your instructions on the function overview page, which you can validate through offline evaluation.