Since LLMs are fundamentally probabilistic, some effort is typically needed to perform quality assurance on completions. Opper offers a variety of ways to do this, from saving custom metrics to testing new models against datasets.
How it works
Tests and evaluations are essentially about measuring the quality of task completions. Since tasks are declarative and highly structured, they are much easier to evaluate than conventional LLM completions driven only by prompts. In addition to enforcing schemas, which encourages requirements to be defined up front, Opper performs automatic AI-based observations on every completion. You can also attach custom metrics to completions or spans.
Built-in observations
The platform automatically performs an observation on the quality of every task completion. The results are available in the tracing view next to each completion, typically appearing within 1-10 seconds. Each observation consists of a paragraph-length summary and a score from 0-100.
The built-in observation is essentially a task in itself, completing into the following structure:
from pydantic import BaseModel, Field

class Score(BaseModel):
    thoughts: str = Field(
        description="Thoughts on how to evaluate the response",
    )
    observations: str = Field(
        description="Observations about the operation and the response",
    )
    correct: bool = Field(
        description="Did the model succeed at handling the task or not?",
    )
    score: float = Field(
        description="A value between 0 and 100 reflecting the quality of the response given the instructions, input and expected output",
    )
The better the input and output schemas are annotated, the clearer the task is and the better the evaluation will perform. We highly recommend putting effort into specifying what a great result looks like.
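For illustration, here is a minimal sketch of what a well-annotated output schema might look like for a hypothetical summarization task (the ArticleSummary model and its fields are purely illustrative, not part of the Opper API):

from pydantic import BaseModel, Field

class ArticleSummary(BaseModel):
    # Rich field descriptions tell both the model and the built-in
    # observation what a great completion looks like.
    headline: str = Field(
        description="A single-sentence headline of at most 12 words",
    )
    key_points: list[str] = Field(
        description="3-5 bullet points covering the most important facts, each under 20 words",
    )
    tone: str = Field(
        description="One of 'neutral', 'positive' or 'negative', describing the tone of the summary",
    )

Schemas annotated like this give the built-in observation concrete requirements to score the completion against.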
Custom metrics
You may also attach custom metrics to task completions or spans.
Here is a very simple example:
from opperai import Opper
import os

def main():
    opper = Opper(http_bearer=os.getenv("OPPER_API_KEY"))

    result = opper.call(
        name="answer-question",
        instructions="Answer the question as concisely as possible",
        input="What is the capital of France?",
    )

    # Perform a simple evaluation: is the answer a single word?
    is_one_word = 1 if len(result.message.split(" ")) == 1 else 0
    if is_one_word:
        print("The answer is concise")
    else:
        print("The answer is not concise")

    # Attach the result as a metric on the completion's span
    opper.span_metrics.create_metric(
        span_id=result.span_id,
        dimension="is_concise",
        value=is_one_word,
        comment="Evaluated if the answer is concise (1=True, 0=False)",
    )

if __name__ == "__main__":
    main()
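The same pattern scales to small regression-style checks. The following sketch runs a few hand-written test cases through the same task and records a correctness metric on each span; the test cases and the "is_correct" dimension are illustrative, while the SDK calls are the same ones used in the example above:

from opperai import Opper
import os

# Illustrative test cases (not part of the Opper API)
TEST_CASES = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "What is the capital of Japan?", "expected": "Tokyo"},
]

def main():
    opper = Opper(http_bearer=os.getenv("OPPER_API_KEY"))

    for case in TEST_CASES:
        result = opper.call(
            name="answer-question",
            instructions="Answer the question as concisely as possible",
            input=case["question"],
        )

        # Score 1 if the expected answer appears in the completion, else 0
        is_correct = 1 if case["expected"].lower() in result.message.lower() else 0

        opper.span_metrics.create_metric(
            span_id=result.span_id,
            dimension="is_correct",
            value=is_correct,
            comment=f"Expected '{case['expected']}' in the answer (1=True, 0=False)",
        )

if __name__ == "__main__":
    main()

Because each metric is attached to a specific span, the results stay tied to the exact completions that produced them.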
Run dataset evaluation
To test new models, prompts or other configurations, you can use a task's datasets to run evaluations.
To run a task with an alternative configuration, open the dashboard, navigate to the function you want to test, go to evaluations and press “run”. You will be presented with options for changing the current configuration, including the model and prompt.