Evaluations
Evaluation is a critical part of building reliable LLM applications. It helps you verify your function behaves correctly across different inputs, compare performance between different models and prompts, track quality over time as you make changes, and build confidence in your LLM features before or during deployment.
Opper provides two types of evaluation:
- Online Evaluation: Automatically runs on every function call in real-time, providing immediate feedback on output quality and correctness.
- Offline Evaluation: Runs against your curated datasets, allowing you to systematically test your function against known test cases and ensure consistent behavior across important scenarios.
Together, these evaluation methods help you maintain high quality standards - catching issues as they happen while also verifying reliability across critical test cases.
Offline Evaluation
Offline evaluations let you systematically test your function against a dataset of input/output pairs (see Datasets).
Each dataset entry contains an input to test and the expected output to compare against. When you run an evaluation, Opper:
- Executes your function on each input in the dataset
- Compares the function's output to the expected output
- Generates metrics on how well the outputs match expectations
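Conceptually, the run works like the loop sketched below. This is purely illustrative Python, not Opper's SDK; `my_function`, `score_output`, and `dataset` are hypothetical placeholders for your function, a scoring routine, and your dataset entries.

```python
# Minimal sketch of an offline evaluation loop (illustrative only; not Opper's SDK).
# `my_function`, `score_output`, and `dataset` are hypothetical placeholders.

def run_offline_evaluation(my_function, score_output, dataset):
    """Execute the function on every dataset entry and score each output."""
    results = []
    for entry in dataset:  # each entry holds an input and an expected output
        actual = my_function(entry["input"])
        score = score_output(actual, entry["expected_output"])  # e.g. a match score in [0, 1]
        results.append({
            "input": entry["input"],
            "expected": entry["expected_output"],
            "actual": actual,
            "score": score,
        })
    overall = sum(r["score"] for r in results) / len(results) if results else 0.0
    return {"overall_score": overall, "results": results}
```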
Run an evaluation
To run an evaluation:
- Navigate to a function's evaluation tab
- Click "Run Evaluation" - this will be pre-filled with your current function configuration
- Optionally modify the instructions, model, or other settings to test different combinations
- Click run to start the evaluation against your dataset
The evaluation process typically takes from 30 seconds to a few minutes, depending on your dataset size.
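The settings you can adjust before starting the run amount to an evaluation configuration. The dictionary below is a hypothetical illustration of what such a configuration captures; the field names and values are assumptions, not Opper's actual schema.

```python
# Hypothetical evaluation configuration; field names and values are illustrative,
# not Opper's actual schema.
evaluation_config = {
    "function": "extract_invoice_data",       # the function under test
    "dataset": "invoice-golden-set",          # curated input/expected-output pairs
    "model": "gpt-4o",                        # override the model to compare alternatives
    "instructions": "Extract the invoice total, currency, and due date as JSON.",
}
```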
Inspecting a run
Once complete, you'll see:
- An overall score showing how well the outputs matched expectations across all entries
- Individual results for each dataset entry showing:
  - The model's actual output
  - The expected output
  - A detailed comparison highlighting any differences
  - The evaluation score for that specific entry
This detailed breakdown helps you identify exactly where your function is performing well and where it may need improvements.
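If you export or reconstruct the per-entry results (using the hypothetical structure from the earlier sketch), surfacing the weakest cases is a simple sort by score. Low-scoring entries point to where instructions or model choice may need adjustment.

```python
# Continuing the hypothetical results structure from the earlier sketch:
# sort entries by score to surface the cases that need attention first.
def weakest_entries(evaluation, threshold=0.7, limit=5):
    """Return the lowest-scoring dataset entries below a chosen threshold."""
    low = [r for r in evaluation["results"] if r["score"] < threshold]
    return sorted(low, key=lambda r: r["score"])[:limit]
```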
Online Evaluation
Online evaluation automatically assesses every function call in real-time. Using AI, it evaluates how well each output aligns with the function's instructions, input schema, and examples. This provides instant feedback on output quality and surfaces potential issues immediately.
When calls receive low evaluation scores, they make excellent candidates for your test dataset. Adding these edge cases with corrected outputs helps build comprehensive test coverage and guides improvements to your model selection and instructions.
The platform aggregates online evaluation scores to show your function's real-world performance. You can also view AI-generated suggestions for improving your instructions on the function overview page, which you can validate through offline evaluation.
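As a rough illustration of the feedback loop described above, the sketch below promotes a low-scoring online call into the offline dataset once a corrected output has been reviewed. The helper and field names are hypothetical, not part of Opper's SDK.

```python
# Hypothetical sketch of the feedback loop: low-scoring online calls become
# offline test cases once a corrected output is supplied. Not Opper's SDK.

LOW_SCORE_THRESHOLD = 0.5

def promote_low_scoring_call(call, corrected_output, dataset):
    """Add a weak real-world call to the offline dataset with a corrected output."""
    if call["online_score"] < LOW_SCORE_THRESHOLD:
        dataset.append({
            "input": call["input"],
            "expected_output": corrected_output,  # reviewed and fixed by a human
        })
    return dataset
```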