Coming soon. Observe is currently in development. Stay tuned for updates.
Observe gives you real-time visibility into how your AI systems are performing — tracking latency, costs, errors, and model behavior across all your calls. It combines tracing, metrics, and automated quality evaluation into a single view.

Tracing

Hierarchical tracing lets you follow requests through your entire AI pipeline. Every task completion is automatically attached to a span, and spans can be connected to build traces that capture complex multi-step workflows like agents and chatbots.
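Conceptually, a trace is just a tree of named spans. The sketch below is purely illustrative — plain dataclasses, not the Opper SDK — showing the parent/child structure a multi-step agent workflow produces:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Span:
    """Illustrative span: a named node in a trace tree (not the SDK type)."""
    name: str
    children: List["Span"] = field(default_factory=list)

    def add_child(self, name: str) -> "Span":
        child = Span(name)
        self.children.append(child)
        return child

# A trace for a hypothetical agent workflow: one root span
# with a child span per step.
trace = Span("handle-user-request")
trace.add_child("plan-steps")
trace.add_child("search-knowledge-base")
trace.add_child("generate-answer")
```

In the real platform, each task completion is attached to a span like these automatically, and connecting them yields the full trace view.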

Automated Evaluation

The platform automatically evaluates the quality of every task completion. The results are available in the tracing view next to each task completion, typically within 1-10 seconds, as a paragraph-length summary observation and a score from 0 to 100.
Observation
The built-in observation completes into the following schema:
from pydantic import BaseModel, Field

class Score(BaseModel):
    thoughts: str = Field(
        description="Thoughts on how to evaluate the response",
    )
    observations: str = Field(
        description="Observations about the operation and the response",
    )
    correct: bool = Field(
        description="Did the model succeed at handling the task or not?",
    )
    score: float = Field(
        description="A value between 0 and 100 reflecting the quality of the response given the instructions, input and expected output",
    )
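For illustration, a finished observation can be thought of as an instance of this model. The field values below are invented — real observations are produced by the platform:

```python
from pydantic import BaseModel, Field

class Score(BaseModel):
    thoughts: str = Field(description="Thoughts on how to evaluate the response")
    observations: str = Field(description="Observations about the operation and the response")
    correct: bool = Field(description="Did the model succeed at handling the task or not?")
    score: float = Field(
        description="A value between 0 and 100 reflecting the quality of the response"
    )

# Invented example values, shaped like what the evaluation returns
score = Score(
    thoughts="The task asks for a concise factual answer.",
    observations="The response is correct and a single word.",
    correct=True,
    score=92.0,
)
```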
The better the input and output schemas are annotated, the clearer the task is and the better the evaluation will perform. We highly recommend putting effort into specifying what great output looks like.
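As an example of such annotation, here is a hypothetical pair of input and output schemas (the names and descriptions are invented) where the Field descriptions spell out what a great response looks like:

```python
from pydantic import BaseModel, Field

class SupportReplyInput(BaseModel):
    customer_message: str = Field(description="The customer's question, verbatim")
    product: str = Field(description="Which product the question concerns, e.g. 'billing'")

class SupportReplyOutput(BaseModel):
    reply: str = Field(
        description=(
            "2-3 sentences, polite and direct. Answers the question first, "
            "then offers one concrete next step. No marketing language."
        )
    )

# The descriptions travel with the schema, giving the evaluator
# an explicit definition of quality to judge against.
out = SupportReplyOutput(reply="You can update your card under Settings > Billing.")
```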

Custom Metrics

You may also attach custom metrics to task completions or spans.
Metric
from opperai import Opper
from pydantic import BaseModel, Field
import os

def main():
    opper = Opper(http_bearer=os.getenv("OPPER_API_KEY"))

    result = opper.call(
        name="answer-question",
        instructions="Answer the question as concisely as possible",
        input="What is the capital of France?"
    )

    # Perform a simple evaluation
    is_one_word = 1 if len(result.message.split(" ")) == 1 else 0

    opper.span_metrics.create_metric(
        span_id=result.span_id,
        dimension="is_concise",
        value=is_one_word,
        comment="Evaluated if the answer is concise (1=True, 0=False)"
    )

if __name__ == "__main__":
    main()
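Metric values are not limited to 0/1: the binary is_concise check above generalizes to graded scores. Below is a minimal local sketch of a 0-100 conciseness metric; the scoring curve is an invented example, not part of the platform:

```python
def conciseness_score(answer: str, target_words: int = 5) -> float:
    """Map answer length to a 0-100 score: full marks at or under
    target_words, decaying linearly to 0 at four times the target."""
    words = len(answer.split())
    if words <= target_words:
        return 100.0
    over = words - target_words
    span = 3 * target_words  # score reaches 0 at 4x the target length
    return max(0.0, 100.0 * (1 - over / span))
```

The returned value could then be passed as value= to span_metrics.create_metric in the same way as the 0/1 example above.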

Dataset Evaluation

To test new models, prompts, or other configuration changes, you can run evaluations against a task's datasets.
Dataset Evaluation
To run a task with an alternative configuration, open the dashboard, navigate to the function you want to test, go to evaluations, and press “run”. You will be presented with options for changing the current configuration, including model and prompt.