Tests and evals
Test and evaluate task completions
Since LLMs are fundamentally probabilistic, completions typically require some degree of quality assurance. Opper offers a variety of ways to do this, from saving custom metrics on completions to testing alternative models.
How it works
Tests and evaluations are essentially about measuring the quality of task completions. Since tasks are declarative and highly structured, they are much easier to evaluate than conventional, prompt-only LLM completions. In addition to enforcing schemas, which encourages requirements to be defined up front, Opper performs an automatic AI-based observation on every completion. You can also attach custom metrics to completions or spans.
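To make this concrete, here is a minimal sketch of a schema-backed task using the Python SDK. The schema, task name, and input are made up for illustration, and the exact `opper.call` signature and return value may differ between SDK versions, so treat them as assumptions and check the SDK reference:

```python
from pydantic import BaseModel, Field
from opperai import Opper

opper = Opper()  # assumes OPPER_API_KEY is set in the environment


# Hypothetical output schema: requirements are encoded as typed fields,
# which gives evaluations a concrete structure to check against.
class SupportReply(BaseModel):
    answer: str = Field(description="A short, polite reply to the customer")
    is_refund_related: bool = Field(description="Whether the ticket concerns a refund")


# Constraining the completion to the schema via `output_type` is an assumption
# about the SDK; some versions return the parsed result directly, others a
# (result, response) tuple.
result = opper.call(
    name="draft_support_reply",
    instructions="Draft a short, polite reply to the support ticket.",
    input="Customer asks why their refund has not arrived.",
    output_type=SupportReply,
)
```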
Built-in observations
The platform automatically performs an observation of the quality of each task completion. The result is available in the tracing view next to every task completion, typically within 1-10 seconds, and consists of a paragraph-long summary together with a score from 0-100.
The built-in observation is essentially a task in itself, completing into a structured result: a summary of the completion's quality and a score from 0-100.
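The exact schema is internal to the platform, but from what is surfaced in the tracing view its shape can be pictured roughly like this (an illustrative sketch, not the actual model):

```python
from pydantic import BaseModel, Field


# Illustrative only: the real observation schema is defined by the platform.
class Observation(BaseModel):
    observation: str = Field(description="Paragraph summarizing the quality of the completion")
    score: int = Field(ge=0, le=100, description="Overall quality score, 0-100")
```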
Custom metrics
You may also attach custom metrics to task completions or spans.
Here is a very simple example:
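A minimal sketch, assuming the Python SDK: a metric is described by a dimension, a numeric value, and an optional comment, and is attached to a span id obtained from an earlier call. The method name below is an assumption and may differ between SDK versions, so verify it against the SDK reference:

```python
from opperai import Opper

opper = Opper()  # assumes OPPER_API_KEY is set in the environment

# Hypothetical placeholder: the id of the completion or span to annotate,
# e.g. taken from the response of a previous opper.call.
span_id = "..."

# Attach a custom metric to the span. The method name is an assumption;
# check the SDK reference for the exact call in your version.
opper.span_metrics.create_metric(
    span_id=span_id,
    dimension="user_thumbs_up",  # what is being measured
    value=1.0,                   # numeric value, e.g. 1.0 for positive feedback
    comment="User marked the reply as helpful",
)
```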
Run dataset evaluation
To test new models, prompts, or other configuration changes, you can use a task's dataset to run evaluations.
To run a task with an alternative configuration, open the dashboard, navigate to the function you want to test, go to evaluations, and press “Run”. You will be presented with options for changing the current configuration, including the model and the prompt.