
feat(eval): iteration tracking, termination taxonomy, and cross-run regression detection #335

@christso

Description

Summary

Add support for evaluating multi-iteration agent loops with: expected vs actual iteration counts, termination reason classification, and cross-run trend analysis over N runs.

What already exists in AgentV

  • Trials (TrialsConfig): Run the same test multiple times with pass_at_k, mean, confidence_interval strategies. This covers statistical robustness of scoring.
  • Compare (agentv compare): Two-run delta with win/loss/tie classification and mean delta.
  • Trace metrics: durationMs, costUsd, tokenUsage, toolCallsByName per eval.
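AgentV's trial-strategy implementations aren't shown here; for `pass_at_k`, the standard unbiased estimator is the likely intent. A sketch under that assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    sampled runs passes, given c passing runs out of n total trials.
    Assumes AgentV follows the standard estimator; not confirmed here."""
    if n - c < k:
        return 1.0  # fewer failures than samples: some sample must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # 0.3
```

`mean` and `confidence_interval` would be straightforward aggregates over the same per-trial scores.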

What's still missing

Trials measure scoring variance (same test, N runs). What loop-based orchestrators need is measuring agent efficiency within a single run:

  1. Iteration tracking within a run — How many loop iterations did the agent take? Was that more or less than expected?
  2. Termination reason taxonomy — Why did the loop end? (completion signal, max iterations, timeout, cost limit, thrashing, consecutive failures)
  3. N-way trend analysis — Compare scores across 3+ runs for regression detection, not just pairwise
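The termination taxonomy could be a simple enum over the six reasons named above (ralph-orchestrator's taxonomy has 10; the remaining reasons are not listed in this issue). A minimal sketch:

```python
from enum import Enum

class TerminationReason(str, Enum):
    # Only the reasons named in this issue; the full taxonomy may be larger.
    COMPLETION_SIGNAL = "completion_signal"        # agent emitted the expected signal
    MAX_ITERATIONS = "max_iterations"              # loop cap reached without success
    TIMEOUT = "timeout"                            # wall-clock budget exhausted
    COST_LIMIT = "cost_limit"                      # spend budget exhausted
    THRASHING = "thrashing"                        # repeated non-progressing edits
    CONSECUTIVE_FAILURES = "consecutive_failures"  # N failed iterations in a row

print(TerminationReason.COMPLETION_SIGNAL.value)  # completion_signal
```

Deriving from `str` keeps the values directly serializable into the result JSONL.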

What this looks like in AgentV

tests:
  - id: fizzbuzz-tdd
    criteria: Implement FizzBuzz with TDD
    input: ...
    execution:
      completion_signal: TESTS_PASSING  # NEW: detect this in output → mark success
      expected_iterations: 5            # NEW: baseline expectation

assert:
  - type: iteration_efficiency
    threshold: 2.0                      # actual/expected ratio cap

Result schema additions:

{"test_id":"fizzbuzz-tdd","iterations":7,"expected_iterations":5,"iteration_delta":2,"termination_reason":"completion_signal"}
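The `iteration_efficiency` assertion then reduces to a ratio check against the result record. The function signature below is hypothetical, but the field names match the schema above:

```python
import json

def iteration_efficiency(result: dict, threshold: float = 2.0) -> bool:
    """Pass when actual/expected iterations stay under the cap.
    The assertion hook's real signature in AgentV is an assumption."""
    ratio = result["iterations"] / result["expected_iterations"]
    return ratio <= threshold

record = json.loads(
    '{"test_id":"fizzbuzz-tdd","iterations":7,"expected_iterations":5,'
    '"iteration_delta":2,"termination_reason":"completion_signal"}'
)
print(iteration_efficiency(record))  # True: 7/5 = 1.4 <= 2.0
```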

Trend command:

agentv trend .agentv/results/   # NEW: analyze N results for regression
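The detection algorithm is left open under Design Latitude below; a threshold-based sketch over the result JSONL is one option (the `score` field name and chronological-by-filename ordering are assumptions):

```python
import json
from pathlib import Path

def load_scores(results_dir: str) -> dict[str, list[float]]:
    """Collect per-test score series across result JSONL files,
    ordered by filename (assumed to sort chronologically)."""
    series: dict[str, list[float]] = {}
    for path in sorted(Path(results_dir).glob("*.jsonl")):
        for line in path.read_text().splitlines():
            rec = json.loads(line)
            series.setdefault(rec["test_id"], []).append(rec["score"])
    return series

def detect_regressions(series: dict[str, list[float]],
                       window: int = 3, drop: float = 0.1) -> dict:
    """Flag tests whose latest score falls more than `drop` below the
    mean of the previous `window` runs. Purely threshold-based; a
    statistical-significance test is the other design-latitude option."""
    flagged = {}
    for test_id, scores in series.items():
        if len(scores) < window + 1:
            continue  # not enough history for a baseline
        baseline = sum(scores[-window - 1:-1]) / window
        if baseline - scores[-1] > drop:
            flagged[test_id] = (baseline, scores[-1])
    return flagged
```

N-way comparison here is just a longer series per test, which is what distinguishes `trend` from the pairwise `agentv compare`.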

Architecture alignment

Research source

  • ralph-orchestrator: TaskResult struct with iterations, expected_iterations, iteration_delta, termination taxonomy (10 reasons)
  • copilot-swarm-orchestrator: RunMetrics, AnalyticsLog.compareWithHistory()

AgentV Studio Surface (2026-03-27)

This issue now includes a dashboard visualization surface as part of the AgentV Studio platform (#788).

Objective (clarified)

  1. Core engine: Iteration tracking within runs, termination reason classification, N-way trend analysis (agentv trend)
  2. Studio UI: Regression alert feed, regression timeline with git correlation, auto-clustering of similar regressions

Design Latitude

  • Regression detection algorithm (threshold-based vs. statistical significance)
  • Alert severity classification (how much score drop constitutes a regression?)
  • Git correlation depth (commit hash only vs. showing file diffs)
  • Alert persistence (in-memory vs. stored in history repo)
  • Clustering approach for similar regressions (string matching vs. embedding-based)
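For the last point, a minimal string-matching sketch (the simpler of the two options; an embedding-based approach would replace `SequenceMatcher` with vector similarity):

```python
from difflib import SequenceMatcher

def cluster_regressions(messages: list[str],
                        threshold: float = 0.75) -> list[list[str]]:
    """Greedy single-pass clustering: each message joins the first
    cluster whose representative (first member) is similar enough,
    else it starts a new cluster. Threshold choice is an assumption."""
    clusters: list[list[str]] = []
    for msg in messages:
        for cluster in clusters:
            if SequenceMatcher(None, msg, cluster[0]).ratio() >= threshold:
                cluster.append(msg)
                break
        else:
            clusters.append([msg])
    return clusters
```

Greedy first-fit keeps this O(n * clusters) and order-dependent, which is probably acceptable for an alert feed but worth revisiting if cluster quality matters.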

Acceptance Signals

  • expected_iterations and completion_signal fields work in test specs
  • termination_reason field appears in result JSONL
  • iteration_efficiency assertion type evaluates actual/expected ratio
  • agentv trend command analyzes N results for regression detection
  • Studio UI: alert feed shows regressions with severity indicators
  • Studio UI: regression timeline chart with git commit annotations
  • Studio UI: similar regressions are auto-clustered across tests
  • Studio UI: clicking a regression links to root cause explorer (feat(dashboard): root cause explorer — trace-driven failure diagnosis #786)

Non-Goals

  • Predictive regression forecasting (detection only)
  • External notification systems (Slack, PagerDuty — future enhancement)
  • Automated rollback of code changes based on regressions

Dependencies

Metadata

Assignees: none
Labels: enhancement (New feature or request)
Status: Backlog
Milestone: none