feat(eval): composable quality gates with auto-remediation triggers #334

@christso

Summary

Add severity levels and auto-remediation triggers to AgentV's existing quality gate primitives.

What already exists in AgentV

  • Required evaluators: required: boolean | number on any evaluator — makes it a pass/fail gate
  • Required rubric items: required_min_score: number on rubric items — minimum 0-10 score to pass
  • Score ranges: score_ranges on rubrics for banded scoring (0-10 scale)
  • Negation: negate: boolean to invert scores
  • Composite evaluators: Combine multiple evaluators with weighted aggregation

What's still missing

AgentV has binary pass/fail gating but lacks:

  1. Severity levels: Distinguish error (blocks the eval) from warning (informational) and info (logged only). Currently every gate result is either pass or fail.
  2. Auto-remediation hooks: When a gate fails, trigger a follow-up action (re-run with a fix prompt, add a remediation step). Currently a failed gate only fails the eval and nothing else happens.
  3. Reusable gate library: Shareable, composable quality gate definitions (scaffold detection, duplicate code, hardcoded config, README accuracy).
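
To make item 3 concrete, here is a rough sketch of what the logic inside a reusable duplicate-code gate might look like. It is only an illustration: `find_duplicate_blocks` and `gate_score` are hypothetical names, the window size is an arbitrary choice, and the actual input/output contract of a code_judge script is not specified in this issue, so the sketch stops at a pure function.

```python
def find_duplicate_blocks(source: str, window: int = 4) -> list[tuple[int, int]]:
    """Return (first_line, repeat_line) pairs, 1-indexed, where a run of
    `window` consecutive non-blank lines reappears verbatim later on."""
    lines = [ln.strip() for ln in source.splitlines()]
    seen: dict[tuple[str, ...], int] = {}
    duplicates: list[tuple[int, int]] = []
    for start in range(len(lines) - window + 1):
        block = tuple(lines[start:start + window])
        if any(not ln for ln in block):
            continue  # ignore windows that contain blank lines
        first = seen.setdefault(block, start)
        # Only count repeats that do not overlap the first occurrence.
        if start >= first + window:
            duplicates.append((first + 1, start + 1))
    return duplicates


def gate_score(source: str, window: int = 4) -> float:
    """Hypothetical gate verdict: 1.0 when no duplicate block is found, else 0.0."""
    return 0.0 if find_duplicate_blocks(source, window) else 1.0
```

Keeping the check as a pure function would let the same logic be reused from any wrapper script, whatever the final code_judge contract turns out to be.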

What this looks like in AgentV

assert:
  - name: no_scaffold_defaults
    type: code_judge
    script: ./gates/scaffold-check.py
    severity: warning             # NEW: warning|error|info (default: error)

  - name: no_duplicate_blocks
    type: code_judge
    script: ./gates/duplicate-check.py
    severity: error               # blocks eval

  - name: readme_accuracy
    type: code_judge
    script: ./gates/readme-verify.py
    severity: info                # logged only
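
A minimal sketch of how the three severities above could feed into the overall verdict, assuming only failed gates with severity error block the eval. `GateResult` and `summarize` are hypothetical names for illustration and are not part of AgentV's existing API.

```python
from dataclasses import dataclass
from typing import Literal

Severity = Literal["error", "warning", "info"]


@dataclass
class GateResult:
    name: str
    passed: bool
    severity: Severity = "error"  # default keeps today's pass/fail behavior


def summarize(results: list[GateResult]) -> dict:
    """Split failed gates by severity; only error-level failures fail the eval."""
    failed = [r for r in results if not r.passed]
    blocking = [r.name for r in failed if r.severity == "error"]
    warnings = [r.name for r in failed if r.severity == "warning"]
    info = [r.name for r in failed if r.severity == "info"]
    return {
        "verdict": "fail" if blocking else "pass",
        "blocking": blocking,
        "warnings": warnings,  # surfaced in output, do not change the verdict
        "info": info,          # logged only
    }
```

With the config above, a failing no_scaffold_defaults gate would show up under warnings while the verdict stays pass; a failing no_duplicate_blocks gate would flip the verdict to fail.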

Architecture alignment

  • severity is an optional field on any evaluator config (non-breaking, default: error preserves current behavior)
  • Severity maps to result JSONL: errors affect the verdict, warnings appear in the output but don't fail the eval, and info results are logged only
  • Auto-remediation is a post-eval hook (separate concern from eval itself)
  • Reusable gates are just code_judge scripts — could ship as an agentv-gates package
  • Extends the existing required field semantics: an evaluator with severity: warning and required: true is required but non-blocking
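
One possible shape for the post-eval hook, sketched under loud assumptions: the JSONL field names (`name`, `passed`, `severity`) and the idea of registering one callable per gate name are made up for illustration, since neither the result schema nor the hook format is settled in this issue.

```python
import json
from typing import Callable, Iterable


def failed_gates(jsonl_lines: Iterable[str]) -> list[dict]:
    """Parse result records (one JSON object per line) and keep the failures.
    The "passed" field is an assumed name, not AgentV's confirmed schema."""
    records = (json.loads(line) for line in jsonl_lines if line.strip())
    return [r for r in records if not r.get("passed", True)]


def run_post_eval_hooks(
    jsonl_lines: Iterable[str],
    hooks: dict[str, Callable[[dict], str]],
) -> list[str]:
    """Run after the eval has finished: call the registered hook, if any,
    for each failed gate and collect a description of what was triggered."""
    actions = []
    for record in failed_gates(jsonl_lines):
        hook = hooks.get(record["name"])
        if hook is not None:
            actions.append(hook(record))
    return actions
```

Because the hook only reads the finished results, it stays a separate concern from the eval itself. Whether warning-level failures should also trigger remediation is left open here.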

Research source

  • copilot-swarm-orchestrator: quality-gates.yaml with scaffoldDefaults, duplicateBlocks, hardcodedConfig, readmeClaims, testIsolation; gracefulDegradation: true
  • ralph-orchestrator: backpressure philosophy — deterministic checks as first-class evaluation

AgentV Studio Surface (2026-03-27)

This issue now includes a dashboard management surface as part of the AgentV Studio platform (#788).

Objective (clarified)

  1. Core engine: Add severity levels (error/warning/info) and auto-remediation triggers to evaluator configs
  2. Studio UI: Dashboard-driven gate configuration, visual threshold editor, alert routing, one-click remediation

Design Latitude

  • Auto-remediation hook format (shell command, eval rerun, webhook)
  • Gate definition storage (YAML config vs. Studio-managed JSON)
  • Whether gate library ships as built-in gates or a separate agentv-gates package
  • Visual threshold editor implementation (slider vs. numeric input with histogram overlay)
  • Alert routing destinations (Studio feed, webhook, email)

Acceptance Signals

  • severity: warning|error|info field works on any evaluator config
  • Warnings appear in output but do not fail the eval
  • Auto-remediation hooks trigger on gate failure
  • Studio UI: gates are listable, creatable, editable, deletable
  • Studio UI: threshold editor shows historical score distribution
  • Studio UI: one-click remediation triggers rerun or mutation from gate failure alert
  • Gate compliance history visible in Studio
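
For the threshold editor signal, the data behind a historical score distribution could be computed roughly as follows. This is a sketch only: `score_histogram` and `pass_rate_at` are hypothetical helpers, and the 0-10 range is taken from the rubric scale mentioned earlier in this issue.

```python
def score_histogram(scores: list[float], bins: int = 10,
                    lo: float = 0.0, hi: float = 10.0) -> list[int]:
    """Count scores per equal-width bin over [lo, hi]; hi lands in the last bin."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for s in scores:
        if not lo <= s <= hi:
            continue  # skip out-of-range scores rather than guessing a bin
        idx = min(int((s - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def pass_rate_at(scores: list[float], threshold: float) -> float:
    """Fraction of historical scores that would pass at the given threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)
```

A slider or numeric input could call `pass_rate_at` as the threshold moves, giving a preview of how many past runs would have passed before the new value is saved.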

Non-Goals

  • Real-time gate evaluation during streaming (gates evaluate after run completes)
  • Gate marketplace or sharing across organizations
  • Replacing YAML-based gate configuration (Studio is an additional surface, not a replacement)

Dependencies

Metadata

  • Assignees: No one assigned
  • Labels: enhancement
  • Status: Backlog
  • Milestone: No milestone