feat(eval): iteration tracking, termination taxonomy, and cross-run regression detection #335
Status: Open
Labels: enhancement
Summary
Add support for evaluating multi-iteration agent loops with: expected vs actual iteration counts, termination reason classification, and cross-run trend analysis over N runs.
What already exists in AgentV
- Trials (`TrialsConfig`): run the same test multiple times with `pass_at_k`, `mean`, and `confidence_interval` strategies. This covers statistical robustness of scoring.
- Compare (`agentv compare`): two-run delta with win/loss/tie classification and mean delta.
- Trace metrics: `durationMs`, `costUsd`, `tokenUsage`, `toolCallsByName` per eval.
What's still missing
Trials measure scoring variance (same test, N runs). What loop-based orchestrators need is measuring agent efficiency within a single run:
- Iteration tracking within a run — How many loop iterations did the agent take? Was that more or less than expected?
- Termination reason taxonomy — Why did the loop end? (completion signal, max iterations, timeout, cost limit, thrashing, consecutive failures)
- N-way trend analysis — Compare scores across 3+ runs for regression detection, not just pairwise
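The termination reason taxonomy above could be sketched roughly as follows. This is an illustrative sketch only: the names (`TerminationReason`, `LoopState`, `classifyTermination`) and the failure threshold are assumptions, not existing AgentV API.

```typescript
// Hypothetical taxonomy of loop termination reasons, matching the list above.
type TerminationReason =
  | "completion_signal"
  | "max_iterations"
  | "timeout"
  | "cost_limit"
  | "thrashing"
  | "consecutive_failures";

// Illustrative snapshot of one loop-based run.
interface LoopState {
  iterations: number;
  maxIterations: number;
  elapsedMs: number;
  timeoutMs: number;
  costUsd: number;
  costLimitUsd: number;
  consecutiveFailures: number; // assumed limit of 3 below
  sawCompletionSignal: boolean;
}

// Checks are ordered: an explicit completion signal wins over resource limits.
// Returns null while the loop should keep running.
function classifyTermination(s: LoopState): TerminationReason | null {
  if (s.sawCompletionSignal) return "completion_signal";
  if (s.consecutiveFailures >= 3) return "consecutive_failures";
  if (s.iterations >= s.maxIterations) return "max_iterations";
  if (s.elapsedMs >= s.timeoutMs) return "timeout";
  if (s.costUsd >= s.costLimitUsd) return "cost_limit";
  return null;
}
```

Recording the first matching reason in the result JSONL is what makes "why did the loop end?" queryable across runs.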
What this looks like in AgentV
```yaml
tests:
  - id: fizzbuzz-tdd
    criteria: Implement FizzBuzz with TDD
    input: ...
    execution:
      completion_signal: TESTS_PASSING  # NEW: detect this in output → mark success
      expected_iterations: 5            # NEW: baseline expectation
    assert:
      - type: iteration_efficiency
        threshold: 2.0                  # actual/expected ratio cap
```

Result schema additions:

```json
{"test_id":"fizzbuzz","iterations":7,"expected_iterations":5,"iteration_delta":2,"termination_reason":"completion_signal"}
```

Trend command:

```shell
agentv trend .agentv/results/  # NEW: analyze N results for regression
```

Architecture alignment
- `expected_iterations` and `completion_signal` are optional fields on the test spec (non-breaking)
- `termination_reason` is a new optional field in result JSONL (non-breaking)
- `iteration_efficiency` is a new assertion type (via #320: feat(sdk): assertion plugin registry — replace hardcoded switch/case with extensible registration)
- `agentv trend` extends the existing `agentv compare` pattern to N-way
- Trials already handle multi-attempt; this handles multi-iteration within one attempt
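To make the `iteration_efficiency` assertion concrete, here is a minimal sketch of how the actual/expected ratio could be scored against the `threshold` cap. The shapes (`AssertionResult`, `evaluateIterationEfficiency`) and the linear scoring curve are assumptions; the real assertion would register through the plugin registry proposed in #320.

```typescript
// Hypothetical result shape for one assertion evaluation.
interface AssertionResult {
  pass: boolean;
  score: number;   // 1.0 at or under expectations, decaying to 0 at the cap
  message: string;
}

// threshold is the max allowed actual/expected ratio (e.g. 2.0);
// assumed > 1 so the linear decay below is well defined.
function evaluateIterationEfficiency(
  iterations: number,
  expectedIterations: number,
  threshold: number,
): AssertionResult {
  const ratio = iterations / expectedIterations;
  const pass = ratio <= threshold;
  // Linear score: 1.0 when ratio <= 1, 0.0 when ratio >= threshold.
  const score = Math.max(0, Math.min(1, (threshold - ratio) / (threshold - 1)));
  return {
    pass,
    score,
    message: `iterations=${iterations} expected=${expectedIterations} ratio=${ratio.toFixed(2)} (cap ${threshold})`,
  };
}
```

With the example spec above (`expected_iterations: 5`, `threshold: 2.0`), a run that takes 7 iterations passes with a degraded score, while a run that takes 12 fails outright.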
Research sources
- ralph-orchestrator: `TaskResult` struct with `iterations`, `expected_iterations`, `iteration_delta`, and a termination taxonomy (10 reasons)
- copilot-swarm-orchestrator: `RunMetrics`, `AnalyticsLog.compareWithHistory()`
AgentV Studio Surface (2026-03-27)
This issue now includes a dashboard visualization surface as part of the AgentV Studio platform (#788).
Objective (clarified)
- Core engine: iteration tracking within runs, termination reason classification, N-way trend analysis (`agentv trend`)
- Studio UI: regression alert feed, regression timeline with git correlation, auto-clustering of similar regressions
Design Latitude
- Regression detection algorithm (threshold-based vs. statistical significance)
- Alert severity classification (how much score drop constitutes a regression?)
- Git correlation depth (commit hash only vs. showing file diffs)
- Alert persistence (in-memory vs. stored in history repo)
- Clustering approach for similar regressions (string matching vs. embedding-based)
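To illustrate the simplest point in this design space, here is a threshold-based detector over N chronological runs (the first option listed above). `RunScore`, `detectRegressions`, and the `minDrop`/`window` defaults are all illustrative assumptions; `agentv trend` could equally use a statistical-significance test.

```typescript
// One test's mean score in one run (hypothetical shape).
interface RunScore {
  runId: string;  // e.g. result file name or commit hash
  score: number;
}

interface Regression {
  testId: string;
  fromRun: string;
  toRun: string;
  drop: number;
}

// Flags run i as a regression when its score falls at least `minDrop`
// below the mean of up to `window` immediately preceding runs.
function detectRegressions(
  testId: string,
  runs: RunScore[],  // ordered oldest -> newest
  minDrop = 0.1,
  window = 3,
): Regression[] {
  const found: Regression[] = [];
  for (let i = 1; i < runs.length; i++) {
    const prior = runs.slice(Math.max(0, i - window), i);
    const baseline = prior.reduce((sum, r) => sum + r.score, 0) / prior.length;
    const drop = baseline - runs[i].score;
    if (drop >= minDrop) {
      found.push({ testId, fromRun: runs[i - 1].runId, toRun: runs[i].runId, drop });
    }
  }
  return found;
}
```

Using a rolling baseline rather than only the previous run keeps a single noisy run from masking a real drop, which matters once N grows past pairwise comparison.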
Acceptance Signals
- `expected_iterations` and `completion_signal` fields work in test specs
- `termination_reason` field appears in result JSONL
- `iteration_efficiency` assertion type evaluates the actual/expected ratio
- `agentv trend` command analyzes N results for regression detection
- Studio UI: alert feed shows regressions with severity indicators
- Studio UI: regression timeline chart with git commit annotations
- Studio UI: similar regressions are auto-clustered across tests
- Studio UI: clicking a regression links to the root cause explorer (#786: feat(dashboard): root cause explorer — trace-driven failure diagnosis)
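For the `completion_signal` signal in particular, the detection could be as small as the sketch below. The function name and line-anchored matching rule are assumptions, not spec; they only illustrate that the signal should match a whole output line so that superstrings do not trigger success.

```typescript
// Hypothetical check: did the configured completion signal appear in the
// agent's output on its own line (e.g. TESTS_PASSING)?
function sawCompletionSignal(output: string, signal: string): boolean {
  // Escape regex metacharacters so arbitrary signal strings are safe.
  const escaped = signal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // Anchor to line boundaries so e.g. "TESTS_PASSING_SOON" does not match.
  return new RegExp(`^\\s*${escaped}\\s*$`, "m").test(output);
}
```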
Non-Goals
- Predictive regression forecasting (detection only)
- External notification systems (Slack, PagerDuty — future enhancement)
- Automated rollback of code changes based on regressions
Dependencies
- #563 (feat: AgentV Studio — eval management platform with historical trends, quality gates, and orchestration): provides the dashboard surface
- #320 (feat(sdk): assertion plugin registry — replace hardcoded switch/case with extensible registration): needed for `iteration_efficiency`
- #788 (project: AgentV Studio — eval management platform with quality gates, orchestration, and analysis): AgentV Studio tracking epic