Since LLMs are fundamentally probabilistic, some effort typically needs to go into quality assurance of completions. Opper offers a variety of ways to test them, from attaching custom metrics to running evaluations against task datasets.

How it works

Tests and evaluations are essentially about measuring the quality of a task completion. Since tasks are declarative and highly structured, they are much easier to evaluate than conventional LLM completions based only on prompts. In addition to enforcing schemas, which encourages stating requirements up front, Opper performs automatic AI-based observations on every completion. You can also attach custom metrics to completions or spans.

Built-in observations

The platform automatically performs an observation of the quality of every task completion. The results are available in the tracing view next to each task completion, and the observation appears within 1-10 seconds. You will see a paragraph-long summary observation and a score from 0 to 100.

The built-in observation is essentially a task in itself that completes into the following schema:

from pydantic import BaseModel, Field

class Score(BaseModel):
    thoughts: str = Field(
        description="Thoughts on how to evaluate the response",
    )
    observations: str = Field(
        description="Observations about the operation and the response",
    )
    correct: bool = Field(
        description="Did the model succeed at handling the task or not?",
    )
    score: float = Field(
        description="A value between 0 and 100 reflecting the quality of the response given the instructions, input and expected output",
    )

The better the input and output schemas are annotated, the clearer the task is and the better the evaluation performs. We highly recommend putting effort into specifying what great looks like. For example, a question-answering task could describe its fields as shown below.
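
The schemas below are purely illustrative (the names and descriptions are not part of the Opper API); the point is the pattern of descriptive Field annotations that spell out what a good completion looks like, which gives the built-in observation concrete criteria to judge against:

from pydantic import BaseModel, Field

class QuestionInput(BaseModel):
    question: str = Field(description="The user's question, phrased in plain language")
    context: str = Field(description="Source text that the answer must be grounded in")

class AnswerOutput(BaseModel):
    answer: str = Field(
        description="A concise answer of one or two sentences, using only facts from the context"
    )
    supported: bool = Field(
        description="True only if every claim in the answer is backed by the context"
    )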

Custom metrics

You may also attach custom metrics to task completions or spans.

Here is a very simple example:

from opperai import Opper
import os

def main():
    opper = Opper(http_bearer=os.getenv("OPPER_API_KEY"))

    result = opper.call(
        name="answer-question",
        instructions="Answer the question as concisely as possible",
        input="What is the capital of France?",
    )

    # Perform a simple evaluation: the answer is concise if it is a single word
    is_concise = 1 if len(result.message.split(" ")) == 1 else 0

    if is_concise:
        print("The answer is concise")
    else:
        print("The answer is not concise")

    # Attach the evaluation result to the completion's span as a custom metric
    opper.span_metrics.create_metric(
        span_id=result.span_id,
        dimension="is_concise",
        value=is_concise,
        comment="Evaluated if the answer is concise (1=True, 0=False)"
    )

if __name__ == "__main__":
    main()
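
The same span_metrics.create_metric call can record signals that come from outside the code path, such as end-user feedback. A minimal sketch, assuming you have stored the span_id from an earlier completion (record_user_feedback is a hypothetical helper, not part of the SDK):

import os
from opperai import Opper

opper = Opper(http_bearer=os.getenv("OPPER_API_KEY"))

def record_user_feedback(span_id: str, thumbs_up: bool) -> None:
    # Store the end user's thumbs up/down as a 0/1 metric on the completion's span
    opper.span_metrics.create_metric(
        span_id=span_id,
        dimension="user_feedback",
        value=1 if thumbs_up else 0,
        comment="Thumbs up/down collected from the end user (1=up, 0=down)"
    )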

Run dataset evaluation

To test new models, prompts or other configuration, you can use a task's dataset to run evaluations.

To run a task with an alternative configuration, go to the dashboard, navigate to the function you want to test, open evaluations and press "run". You will then be presented with options for changing the current configuration, including the model and prompt.