Stability-aware evaluation: multi-run sampling + variance/disagreement penalty aggregation #117
Closed as not planned
Labels: enhancement
Description
Problem
AgentV evaluations that involve stochastic agents and/or non-deterministic evaluators can be unstable run-to-run. Today, it is hard to:
- quantify reliability (variance/instability) as a first-class metric
- compare agents/models fairly when scores are noisy
- gate CI on stability-aware aggregates rather than a single run
OpenCode Bench addresses this for coding benchmarks via (a) multiple isolated episodes and (b) variance penalties when judges disagree. AgentV is not coding-specific, but these two primitives generalize to any agentic eval.
Proposed Feature (single cohesive feature)
Add framework-level primitives for stability-aware evaluation:
- Multi-run evaluation (sampling) with episode isolation
  - Allow running the same eval case N times (fresh state per run where applicable).
  - Persist per-run results in the output (e.g., run index, seed if present, per-run score details).
- Variance/disagreement penalty as an aggregation primitive (see the sketch below)
  - Provide an optional aggregation mode that penalizes instability across runs and/or across judges.
  - The penalty should be configurable (lambda, which dimensions contribute, etc.).

These are tightly coupled: multi-run creates the distribution; the variance penalty consumes it. They can be implemented as separate internal building blocks, but should ship together as one user-facing capability.
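As a rough illustration, a simple mean-minus-lambda-times-dispersion form would satisfy this; the sketch below assumes population standard deviation as the instability measure, and the function name and default lambda are placeholders rather than a committed design:

```python
import statistics

def penalized_score(run_scores: list[float], lam: float = 0.5) -> dict:
    """Sketch: variance-penalized aggregate over N runs of one case.

    lam is the configurable penalty weight; population std dev is one
    of several reasonable instability measures.
    """
    mean = statistics.fmean(run_scores)
    # A single run, or a fully deterministic pipeline, has zero
    # dispersion, so the final score equals the base score.
    std = statistics.pstdev(run_scores) if len(run_scores) > 1 else 0.0
    return {
        "base_score": mean,
        "instability": std,
        "final_score": mean - lam * std,
    }
```

Subtracting a scaled dispersion term keeps the deterministic case penalty-free, which lines up with the acceptance criteria below.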
Scope / UX Sketch
- CLI option to repeat runs, e.g. `--runs N` (name TBD).
- Output includes per-run records plus an aggregate summary.
- Aggregation outputs include: mean/base score, variance/instability, final score after penalty, and optionally simple confidence intervals.
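To make the output concrete, one possible shape for the per-run records plus the aggregate summary is sketched below; all field names are placeholders and the numbers are purely illustrative:

```python
# Hypothetical output for one case evaluated with --runs 3.
# Field names are placeholders, not a committed schema.
case_result = {
    "runs": [
        {"run_index": 0, "seed": 101, "score": 0.82},
        {"run_index": 1, "seed": 102, "score": 0.78},
        {"run_index": 2, "seed": 103, "score": 0.90},
    ],
    "aggregate": {
        "base_score": 0.833,   # mean of per-run scores
        "instability": 0.050,  # population std dev across runs
        "final_score": 0.808,  # base_score - lambda * instability (lambda = 0.5)
        "confidence_interval": [0.74, 0.93],  # optional; values illustrative
    },
}
```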
Non-goals
- Not prescribing domain-specific metrics (those remain evaluators/wrappers/plugins).
- Not building dashboards/notifications (wrappers can handle reporting).
References
- OpenCode Bench concepts: multi-episode isolation + variance penalty for judge disagreement.
- AgentV wrapper pattern for metrics/threshold gates: examples/showcase/export-screening (confusion matrix + policy-weighted overall).
Acceptance Criteria
- Can run the same dataset case multiple times and see all runs in output.
- Can compute an aggregate score that penalizes instability (configurable).
- Deterministic behavior when agent/evaluators are deterministic (variance = 0, no penalty).
- Documentation/example showing how wrappers can consume per-run exports to compute custom stability gates (e.g., p95 score meets a threshold; variance is below a threshold).
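For the last criterion, a wrapper-side gate over per-run exports might look like the following minimal sketch; the thresholds, percentile convention, and input format are all assumptions:

```python
import statistics

def stability_gate(run_scores: list[float],
                   p95_floor: float = 0.7,
                   max_variance: float = 0.01) -> bool:
    """Pass only if the p95 score clears a floor and run-to-run
    variance stays under a ceiling. Assumes at least two runs."""
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    p95 = statistics.quantiles(run_scores, n=20, method="inclusive")[18]
    variance = statistics.pvariance(run_scores)
    return p95 >= p95_floor and variance <= max_variance
```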
Related
- Showcase: strengthen export-screening CI gating (multi-sample runs + stability-aware thresholds) #118
- Showcase: evaluator conformance harness (compatibility + consistency fixtures, CI gate) #119