Stability-aware evaluation: multi-run sampling + variance/disagreement penalty aggregation #117
Closed as not planned
Labels: enhancement
Description
Problem
AgentV evaluations that involve stochastic agents and/or non-deterministic evaluators can be unstable run-to-run. Today, it is hard to:
- quantify reliability (variance/instability) as a first-class metric
- compare agents/models fairly when scores are noisy
- gate CI on stability-aware aggregates rather than a single run
OpenCode Bench addresses this for coding benchmarks via (a) multiple isolated episodes and (b) variance penalties when judges disagree. AgentV is not coding-specific, but these two primitives generalize to any agentic eval.
Proposed Feature (single cohesive feature)
Add framework-level primitives for stability-aware evaluation:
- Multi-run evaluation (sampling) with episode isolation
  - Allow running the same eval case N times (fresh state per run where applicable).
  - Persist per-run results in the output (e.g., run index, seed if present, per-run score details).
- Variance/disagreement penalty as an aggregation primitive (see the sketch below)
  - Provide an optional aggregation mode that penalizes instability across runs and/or across judges.
  - The penalty should be configurable (lambda, which dimensions contribute, etc.).

These are tightly coupled: multi-run creates the distribution; the variance penalty consumes it. They can be implemented as separate internal building blocks, but should ship together as one user-facing capability.
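As a rough illustration, a simple mean-minus-lambda-times-dispersion form would satisfy this; the sketch below assumes population standard deviation as the instability measure, and the function name and default lambda are placeholders rather than a committed design:

```python
import statistics

def penalized_score(run_scores: list[float], lam: float = 0.5) -> dict:
    """Sketch: variance-penalized aggregate over N runs of one case.

    lam is the configurable penalty weight; population std dev is one
    of several reasonable instability measures.
    """
    mean = statistics.fmean(run_scores)
    # A single run, or a fully deterministic pipeline, has zero
    # dispersion, so the final score equals the base score.
    std = statistics.pstdev(run_scores) if len(run_scores) > 1 else 0.0
    return {
        "base_score": mean,
        "instability": std,
        "final_score": mean - lam * std,
    }
```

Subtracting a scaled dispersion term keeps the deterministic case penalty-free, which lines up with the acceptance criteria below.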
Scope / UX Sketch
- CLI option to repeat runs, e.g. `--runs N` (name TBD).
- Output includes per-run records plus an aggregate summary.
- Aggregation outputs include: mean/base score, variance/instability, final score after penalty, and optionally simple confidence intervals.
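To make the output concrete, one possible shape for the per-run records plus the aggregate summary is sketched below; all field names are placeholders and the numbers are purely illustrative:

```python
# Hypothetical output for one case evaluated with --runs 3.
# Field names are placeholders, not a committed schema.
case_result = {
    "runs": [
        {"run_index": 0, "seed": 101, "score": 0.82},
        {"run_index": 1, "seed": 102, "score": 0.78},
        {"run_index": 2, "seed": 103, "score": 0.90},
    ],
    "aggregate": {
        "base_score": 0.833,   # mean of per-run scores
        "instability": 0.050,  # population std dev across runs
        "final_score": 0.808,  # base_score - lambda * instability (lambda = 0.5)
        "confidence_interval": [0.74, 0.93],  # optional; values illustrative
    },
}
```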
Non-goals
- Not prescribing domain-specific metrics (those remain evaluators/wrappers/plugins).
- Not building dashboards/notifications (wrappers can handle reporting).
References
- OpenCode Bench concepts: multi-episode isolation + variance penalty for judge disagreement.
- AgentV wrapper pattern for metrics/threshold gates: examples/showcase/export-screening (confusion matrix + policy-weighted overall).
Acceptance Criteria
- Can run the same dataset case multiple times and see all runs in output.
- Can compute an aggregate score that penalizes instability (configurable).
- Deterministic behavior when agent/evaluators are deterministic (variance = 0, no penalty).
- Documentation/example showing how wrappers can consume per-run exports to compute custom stability gates (e.g., p95 score meets a threshold; variance is below a threshold).
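For the last criterion, a wrapper-side gate over per-run exports might look like the following minimal sketch; the thresholds, percentile convention, and input format are all assumptions:

```python
import statistics

def stability_gate(run_scores: list[float],
                   p95_floor: float = 0.7,
                   max_variance: float = 0.01) -> bool:
    """Pass only if the p95 score clears a floor and run-to-run
    variance stays under a ceiling. Assumes at least two runs."""
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile.
    p95 = statistics.quantiles(run_scores, n=20, method="inclusive")[18]
    variance = statistics.pvariance(run_scores)
    return p95 >= p95_floor and variance <= max_variance
```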
Related
- Showcase: strengthen export-screening CI gating (multi-sample runs + stability-aware thresholds) #118
- Showcase: evaluator conformance harness (compatibility + consistency fixtures, CI gate) #119