docs: separate execution quality from trigger quality in eval guidance #566
Description
Objective
Explicitly document the distinction between execution-quality evaluation ("does the skill help when loaded?") and trigger-quality evaluation ("does the system load the skill when it should?") in AgentV's docs and roadmap, preventing overloaded evaluation configs and clarifying scope for future work.
Architecture Boundary
docs-examples
Context
Both Anthropic's skill-creator and Tessl treat trigger optimization as a separate evaluation track from task execution quality. Anthropic's skill-creator has dedicated tooling for trigger evaluation: repeated trigger trials, train/test splits, held-out model selection, and description optimization.
AgentV's current eval guidance naturally emphasizes execution quality. Without explicitly calling out trigger quality as a separate concern, users may try to overload execution eval configs with trigger-detection logic, or expect AgentV to handle skill discovery optimization that belongs in a different product surface.
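The trigger-evaluation workflow described above can be sketched as a statistical measurement problem, which is exactly why it does not fit an execution eval config. Everything in this sketch is hypothetical and for illustration only: the `would_trigger` keyword heuristic stands in for a model's actual skill-routing decision, and the query set is invented. It only shows why repeated trials and a held-out split are the natural shape of the problem:

```python
import random

def would_trigger(description: str, query: str, noise: float = 0.1) -> bool:
    # Hypothetical stand-in for a model's skill-routing decision.
    # Keyword overlap plus injected noise models the nondeterminism
    # that makes trigger evaluation a repeated-trial problem.
    overlap = set(description.lower().split()) & set(query.lower().split())
    base = len(overlap) > 0
    return base if random.random() > noise else not base

def trigger_rate(description: str, queries: list[str],
                 trials: int = 20, seed: int = 0) -> float:
    # "Repeated trigger trials": run each query many times and average,
    # because a single trial says little about a noisy router.
    random.seed(seed)
    hits = sum(
        would_trigger(description, q)
        for q in queries
        for _ in range(trials)
    )
    return hits / (len(queries) * trials)

# Train/test split: a description optimizer would tune wording against
# the train queries, then report the rate on the held-out test queries
# to avoid overfitting the description to specific phrasings.
queries = [
    "evaluate my agent's eval config",
    "help me write unit tests",
    "optimize skill trigger descriptions",
    "refactor this function",
]
random.seed(42)
random.shuffle(queries)
train, test = queries[:2], queries[2:]
rate = trigger_rate("skill eval trigger optimization", test)
print(f"held-out trigger rate: {rate:.2f}")
```

Contrast this with execution-quality evaluation, where the skill is already loaded and the question is whether the task output improves; none of the machinery above applies there.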
Design Latitude
- Choose where this distinction is documented (roadmap doc, architecture doc, existing eval guide, or new conceptual guide)
- May be as simple as a "Concepts" or "Evaluation Types" section in the docs
- Choose how to frame trigger-quality as a future direction without over-promising
- May reference the skill-creator's trigger evaluation approach as industry context
Acceptance Signals
- AgentV docs explicitly name "execution quality" and "trigger quality" as distinct evaluation concerns
- The docs explain why they are different problems (trigger behavior is noisy and model-dependent, while execution evals run against a deterministically loaded skill; the two have different optimization surfaces)
- Current AgentV eval tooling is positioned as execution-quality evaluation
- Trigger-quality evaluation is framed as a future direction, not a current gap
- The docs include a clear statement that execution eval configs should not be used for trigger evaluation
- The `agentv-eval-builder` skill reference card is updated to reflect this distinction (per CLAUDE.md Documentation Updates guidelines)
Non-Goals
- Building trigger-evaluation CLI commands or runtime features
- Adding trigger trial tooling, train/test splits, or description optimizers
- Creating a skill marketplace or discovery system
- Changing the current eval schema or config format