
docs: separate execution quality from trigger quality in eval guidance #566

@christso

Description


Objective

Explicitly document the distinction between execution-quality evaluation ("does the skill help when it is loaded?") and trigger-quality evaluation ("does the system load the skill when it should?") in AgentV's docs and roadmap. Naming the two concerns prevents users from overloading evaluation configs and clarifies scope for future work.

Architecture Boundary

docs-examples

Context

Both Anthropic's skill-creator and Tessl treat trigger optimization as a separate evaluation track from task execution quality. Anthropic's skill-creator has dedicated tooling for trigger evaluation: repeated trigger trials, train/test splits, held-out model selection, and description optimization.
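To make the separation concrete, the trigger-evaluation workflow described above (repeated trials, a train/test split, and description selection on held-out prompts) could be sketched roughly as below. This is an illustrative Python sketch only: `would_trigger`, `trigger_rate`, and all prompt data are hypothetical stand-ins, not APIs from AgentV or skill-creator.

```python
import random

def would_trigger(skill_description: str, prompt: str) -> bool:
    """Stand-in for asking a model whether `prompt` should load the skill.
    A real implementation would query a model (and so be noisy); this fake
    is deterministic keyword matching, for illustration only."""
    return any(word in prompt.lower() for word in skill_description.lower().split())

def trigger_rate(skill_description: str, prompts: list[str], trials: int = 5) -> float:
    """Fraction of (prompt, trial) pairs in which the skill triggers.
    Repeated trials matter in practice because real trigger decisions vary
    run to run; with the deterministic fake above they do not."""
    hits = 0
    for prompt in prompts:
        for _ in range(trials):
            if would_trigger(skill_description, prompt):
                hits += 1
    return hits / (len(prompts) * trials)

# Train/test split: tune the description on `train`, report on `heldout`.
prompts = ["convert this csv to json", "summarize the meeting notes",
           "parse the csv export", "draft a release announcement"]
random.seed(0)
random.shuffle(prompts)
split = len(prompts) // 2
train, heldout = prompts[:split], prompts[split:]

# Pick the candidate description with the best trigger rate on the
# training prompts, then report its rate on the held-out prompts.
candidates = ["csv json conversion", "file format conversion csv"]
best = max(candidates, key=lambda d: trigger_rate(d, train))
print(f"held-out trigger rate for {best!r}: {trigger_rate(best, heldout):.2f}")
```

Note how none of this resembles an execution-quality harness: the metric is a trigger rate over many cheap trials, and the thing being optimized is the skill's description, not its instructions.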

AgentV's current eval guidance naturally emphasizes execution quality. Without explicitly calling out trigger quality as a separate concern, users may try to overload execution eval configs with trigger-detection logic, or expect AgentV to handle skill discovery optimization that belongs in a different product surface.

Design Latitude

  • Choose where this distinction is documented (roadmap doc, architecture doc, existing eval guide, or new conceptual guide)
  • May be as simple as a "Concepts" or "Evaluation Types" section in the docs
  • Choose how to frame trigger quality as a future direction without over-promising
  • May reference the skill-creator's trigger evaluation approach as industry context

Acceptance Signals

  • AgentV docs explicitly name "execution quality" and "trigger quality" as distinct evaluation concerns
  • The docs explain why they are different problems (noisy vs. deterministic measurement, different optimization surfaces)
  • Current AgentV eval tooling is positioned as execution-quality evaluation
  • Trigger-quality evaluation is framed as a future direction, not a current gap
  • The docs include a clear statement that execution eval configs should not be used for trigger evaluation
  • The agentv-eval-builder skill reference card is updated to reflect this distinction (per CLAUDE.md Documentation Updates guidelines)

Non-Goals

  • Building trigger-evaluation CLI commands or runtime features
  • Adding trigger trial tooling, train/test splits, or description optimizers
  • Creating a skill marketplace or discovery system
  • Changing the current eval schema or config format

Research Basis
