
feat(eval): iteration tracking, termination taxonomy, and cross-run regression detection #335

@christso

Description

Summary

Add support for evaluating multi-iteration agent loops with: expected vs actual iteration counts, termination reason classification, and cross-run trend analysis over N runs.

What already exists in AgentV

  • Trials (TrialsConfig): Run the same test multiple times with pass_at_k, mean, confidence_interval strategies. This covers statistical robustness of scoring.
  • Compare (agentv compare): Two-run delta with win/loss/tie classification and mean delta.
  • Trace metrics: durationMs, costUsd, tokenUsage, toolCallsByName per eval.
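AgentV's trial-strategy implementations aren't shown here; for `pass_at_k`, the standard unbiased estimator is the likely intent. A sketch under that assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    sampled runs passes, given c passing runs out of n total trials.
    Assumes AgentV follows the standard estimator; not confirmed here."""
    if n - c < k:
        return 1.0  # fewer failures than samples: some sample must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(10, 3, 1))  # 0.3
```

`mean` and `confidence_interval` would be straightforward aggregates over the same per-trial scores.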

What's still missing

Trials measure scoring variance (same test, N runs). What loop-based orchestrators need is measuring agent efficiency within a single run:

  1. Iteration tracking within a run — How many loop iterations did the agent take? Was that more or less than expected?
  2. Termination reason taxonomy — Why did the loop end? (completion signal, max iterations, timeout, cost limit, thrashing, consecutive failures)
  3. N-way trend analysis — Compare scores across 3+ runs for regression detection, not just pairwise
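The termination taxonomy could be a simple enum over the six reasons named above (ralph-orchestrator's taxonomy has 10; the remaining reasons are not listed in this issue). A minimal sketch:

```python
from enum import Enum

class TerminationReason(str, Enum):
    # Only the reasons named in this issue; the full taxonomy may be larger.
    COMPLETION_SIGNAL = "completion_signal"        # agent emitted the expected signal
    MAX_ITERATIONS = "max_iterations"              # loop cap reached without success
    TIMEOUT = "timeout"                            # wall-clock budget exhausted
    COST_LIMIT = "cost_limit"                      # spend budget exhausted
    THRASHING = "thrashing"                        # repeated non-progressing edits
    CONSECUTIVE_FAILURES = "consecutive_failures"  # N failed iterations in a row

print(TerminationReason.COMPLETION_SIGNAL.value)  # completion_signal
```

Deriving from `str` keeps the values directly serializable into the result JSONL.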

What this looks like in AgentV

tests:
  - id: fizzbuzz-tdd
    criteria: Implement FizzBuzz with TDD
    input: ...
    execution:
      completion_signal: TESTS_PASSING  # NEW: detect this in output → mark success
      expected_iterations: 5            # NEW: baseline expectation

assert:
  - type: iteration_efficiency
    threshold: 2.0                      # actual/expected ratio cap

Result schema additions:

{"test_id":"fizzbuzz-tdd","iterations":7,"expected_iterations":5,"iteration_delta":2,"termination_reason":"completion_signal"}
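The `iteration_efficiency` assertion then reduces to a ratio check against the result record. The function signature below is hypothetical, but the field names match the schema above:

```python
import json

def iteration_efficiency(result: dict, threshold: float = 2.0) -> bool:
    """Pass when actual/expected iterations stay under the cap.
    The assertion hook's real signature in AgentV is an assumption."""
    ratio = result["iterations"] / result["expected_iterations"]
    return ratio <= threshold

record = json.loads(
    '{"test_id":"fizzbuzz-tdd","iterations":7,"expected_iterations":5,'
    '"iteration_delta":2,"termination_reason":"completion_signal"}'
)
print(iteration_efficiency(record))  # True: 7/5 = 1.4 <= 2.0
```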

Trend command:

agentv trend .agentv/results/   # NEW: analyze N results for regression
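The detection algorithm is left open under Design Latitude below; a threshold-based sketch over the result JSONL is one option (the `score` field name and chronological-by-filename ordering are assumptions):

```python
import json
from pathlib import Path

def load_scores(results_dir: str) -> dict[str, list[float]]:
    """Collect per-test score series across result JSONL files,
    ordered by filename (assumed to sort chronologically)."""
    series: dict[str, list[float]] = {}
    for path in sorted(Path(results_dir).glob("*.jsonl")):
        for line in path.read_text().splitlines():
            rec = json.loads(line)
            series.setdefault(rec["test_id"], []).append(rec["score"])
    return series

def detect_regressions(series: dict[str, list[float]],
                       window: int = 3, drop: float = 0.1) -> dict:
    """Flag tests whose latest score falls more than `drop` below the
    mean of the previous `window` runs. Purely threshold-based; a
    statistical-significance test is the other design-latitude option."""
    flagged = {}
    for test_id, scores in series.items():
        if len(scores) < window + 1:
            continue  # not enough history for a baseline
        baseline = sum(scores[-window - 1:-1]) / window
        if baseline - scores[-1] > drop:
            flagged[test_id] = (baseline, scores[-1])
    return flagged
```

N-way comparison here is just a longer series per test, which is what distinguishes `trend` from the pairwise `agentv compare`.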

Architecture alignment

Research source

  • ralph-orchestrator: TaskResult struct with iterations, expected_iterations, iteration_delta, termination taxonomy (10 reasons)
  • copilot-swarm-orchestrator: RunMetrics, AnalyticsLog.compareWithHistory()

AgentV Studio Surface (2026-03-27)

This issue now includes a dashboard visualization surface as part of the AgentV Studio platform (#788).

Objective (clarified)

  1. Core engine: Iteration tracking within runs, termination reason classification, N-way trend analysis (agentv trend)
  2. Studio UI: Regression alert feed, regression timeline with git correlation, auto-clustering of similar regressions

Design Latitude

  • Regression detection algorithm (threshold-based vs. statistical significance)
  • Alert severity classification (how much score drop constitutes a regression?)
  • Git correlation depth (commit hash only vs. showing file diffs)
  • Alert persistence (in-memory vs. stored in history repo)
  • Clustering approach for similar regressions (string matching vs. embedding-based)
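For the last point, a minimal string-matching sketch (the simpler of the two options; an embedding-based approach would replace `SequenceMatcher` with vector similarity):

```python
from difflib import SequenceMatcher

def cluster_regressions(messages: list[str],
                        threshold: float = 0.75) -> list[list[str]]:
    """Greedy single-pass clustering: each message joins the first
    cluster whose representative (first member) is similar enough,
    else it starts a new cluster. Threshold choice is an assumption."""
    clusters: list[list[str]] = []
    for msg in messages:
        for cluster in clusters:
            if SequenceMatcher(None, msg, cluster[0]).ratio() >= threshold:
                cluster.append(msg)
                break
        else:
            clusters.append([msg])
    return clusters
```

Greedy first-fit keeps this O(n * clusters) and order-dependent, which is probably acceptable for an alert feed but worth revisiting if cluster quality matters.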

Acceptance Signals

  • expected_iterations and completion_signal fields work in test specs
  • termination_reason field appears in result JSONL
  • iteration_efficiency assertion type evaluates actual/expected ratio
  • agentv trend command analyzes N results for regression detection
  • Studio UI: alert feed shows regressions with severity indicators
  • Studio UI: regression timeline chart with git commit annotations
  • Studio UI: similar regressions are auto-clustered across tests
  • Studio UI: clicking a regression links to root cause explorer (feat(dashboard): root cause explorer — trace-driven failure diagnosis #786)

Non-Goals

  • Predictive regression forecasting (detection only)
  • External notification systems (Slack, PagerDuty — future enhancement)
  • Automated rollback of code changes based on regressions

Dependencies

Metadata

Assignees: none
Labels: enhancement (New feature or request)
Status: Backlog
Milestone: none