Replies: 1 comment
Some ideas on how to evaluate:

### Evaluation Strategies

#### 1. Pre-defined Evaluators

We provide a set of pre-defined evaluators that users can select and apply to their agents.
#### 2. User-Defined Evaluators (Console LLM-as-Judge)

Users define custom evaluators by modifying prompts and context via placeholders in the AMP Console, at two levels:

- Span-level evaluators
- Trace-level evaluators
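To make the placeholder idea concrete, here is a minimal sketch of how a console-defined judge prompt might be rendered before it is sent to the judge model. The placeholder names (`{input}`, `{output}`) and the template shape are illustrative assumptions, not the actual AMP Console syntax:

```python
# Hypothetical span-level LLM-as-judge template; placeholder names are
# illustrative, not the real AMP Console placeholder syntax.
SPAN_JUDGE_PROMPT = """\
You are grading a single agent step.
User input: {input}
Agent output: {output}
Answer PASS or FAIL, then give a one-line reason."""

def render_judge_prompt(template: str, **fields: str) -> str:
    """Fill the console placeholders with values from the recorded span."""
    return template.format(**fields)

prompt = render_judge_prompt(
    SPAN_JUDGE_PROMPT,
    input="What is 2 + 2?",
    output="4",
)
print(prompt)
```

A trace-level evaluator would work the same way, except the placeholders would be filled from the whole trace (all spans) rather than a single step.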
#### 3. User-Imported Evaluators (Code-First)

For complex logic, we provide an SDK-driven approach that lets users write and import custom evaluation scripts.

Process of writing evaluator scripts:
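As a sketch of what a code-first evaluator script could look like: the `Span`, `Trace`, and `EvalResult` types below are stand-ins I am assuming for illustration, not the real AMP SDK surface. The point is the shape of the contract: a function that takes a trace and returns a score plus a reason.

```python
# Illustrative shape of a code-first evaluator script. The Trace/Span/
# EvalResult types are assumptions standing in for the real SDK types.
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    output: str

@dataclass
class Trace:
    spans: list[Span] = field(default_factory=list)

@dataclass
class EvalResult:
    score: float
    reason: str

def tool_success_rate(trace: Trace) -> EvalResult:
    """Trace-level evaluator: fraction of tool spans with non-empty output."""
    tool_spans = [s for s in trace.spans if s.name.startswith("tool.")]
    if not tool_spans:
        return EvalResult(score=1.0, reason="no tool calls to grade")
    ok = sum(1 for s in tool_spans if s.output.strip())
    return EvalResult(
        score=ok / len(tool_spans),
        reason=f"{ok}/{len(tool_spans)} tool calls returned output",
    )

trace = Trace(spans=[
    Span("tool.search", "results..."),
    Span("tool.fetch", ""),
    Span("llm.answer", "done"),
])
print(tool_success_rate(trace))  # score 0.5: one of two tool calls succeeded
```

Keeping evaluators as pure functions over a trace makes them easy to unit-test locally before they are ever run against production data.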
#### Importing Evaluator Scripts to the Platform

Once a script is written and verified, it must be registered with the AMP Platform to automate the evaluation workflow.
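Registration would presumably attach metadata telling the platform how and where to run the script. The manifest below is purely a sketch; every field name is an assumption about what such a registration payload might contain, not the real AMP Platform API:

```python
# Hypothetical registration manifest; all field names are illustrative
# assumptions, not the actual AMP Platform registration schema.
import json

manifest = {
    "name": "tool_success_rate",                      # evaluator identifier
    "level": "trace",                                 # span | trace
    "entrypoint": "evaluators/tool_success_rate.py",  # script to run
    "runtime": "python3.11",                          # execution environment
    "sampling_rate": 0.1,  # fraction of production traces to evaluate
}
print(json.dumps(manifest, indent=2))
```

A sampling rate matters here because running every evaluator on every production trace is exactly the kind of risk to production workloads the Problem section warns about.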
Key Considerations:
## Problem

Agents and AI-driven workflows are inherently non-deterministic. Without a structured, measurable evaluation process, teams cannot:
While multiple evaluation frameworks exist today, integrating them into real production agents, or embedding them naturally into the software development lifecycle, is still clumsy, fragmented, and largely manual. Integrating evaluation directly into production workloads can also be risky. This is where a platform must step in: to make agent evaluation a first-class, repeatable capability that works seamlessly both during development and in production.
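The non-determinism point is why a single evaluation run is not enough: the same agent on the same input can pass one run and fail the next, so a platform needs repeated runs and aggregate metrics. A minimal sketch, using a stub in place of a real agent:

```python
# Why one run is not a measurement: a stub agent that answers correctly
# only 70% of the time, evaluated over many seeded runs.
import random

def flaky_agent(question: str, rng: random.Random) -> str:
    # Stand-in for a real non-deterministic agent.
    return "4" if rng.random() < 0.7 else "5"

def pass_rate(n_runs: int, seed: int = 0) -> float:
    """Aggregate pass rate over repeated runs of the same input."""
    rng = random.Random(seed)
    passes = sum(
        flaky_agent("What is 2 + 2?", rng) == "4" for _ in range(n_runs)
    )
    return passes / n_runs

print(pass_rate(1000))  # roughly 0.7; any single run would report 0.0 or 1.0
```

Reporting the aggregate (and its variance) rather than a single pass/fail is what makes evaluation results comparable across agent versions.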
## What We’re Looking For

We want concrete ideas around how evaluations should work end-to-end on the platform, not just individual features:

- Define a clear, opinionated process for running evaluations
- Design a UX that makes evaluation usable
- Clarify the evaluator surface area
- Define where evaluations can run