feat(providers,evaluators): claude-cli provider + trigger-judge evaluator #597
Merged
Deploying agentv with
| Latest commit: | d42610c |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://2360774d.agentv.pages.dev |
| Branch Preview URL: | https://feat-593-skill-trigger-eval.agentv.pages.dev |
5f3fc45 to c4cbdea
… evaluator (#593)
- Add ClaudeCliProvider that spawns `claude -p` as a subprocess, streams output via --output-format stream-json --include-partial-messages, and extracts tool calls, token usage, and cost from stream events
- Rename existing SDK provider class to ClaudeSdkProvider (claude-sdk.ts) with kind 'claude-sdk' for explicit opt-in to the Agent SDK path
- Register 'claude' and 'claude-cli' as aliases for ClaudeCliProvider; 'claude-sdk' maps to ClaudeSdkProvider
- Add 'claude-cli' and 'claude-sdk' to ProviderKind, AGENT_PROVIDER_KINDS, KNOWN_PROVIDERS, and ResolvedTarget union
- Add TriggerJudgeEvaluator that checks whether the agent invoked a named skill by scanning tool calls for Skill invocations (args.skill match) or skill file reads (.claude/commands/, .claude/skills/)
- Register trigger-judge in evaluator parser, schema, builtin registry, and EvaluatorConfig union
- Regenerate eval-schema.json to include trigger-judge schema
- Add unit tests for trigger-judge evaluator and claude provider aliases
--output-format stream-json requires --verbose when using -p (--print) mode. Without it the CLI exits with code 1 immediately. Also adds E2E tests validating output, tokenUsage, durationMs, and log file emission parity between claude-cli and claude-sdk providers.
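The invocation described in these commit messages can be sketched roughly as follows. This is an illustration, not the PR's code: the function names `buildClaudeArgs`, `childEnv`, and `parseStreamLines` are hypothetical, and the parser assumes the stream arrives as newline-delimited JSON. Only the flags and the `CLAUDECODE` variable come from the PR text.

```typescript
// Hypothetical helpers; only the CLI flags and CLAUDECODE are taken from the PR text.

// Build the argument list for `claude -p` with streaming JSON output.
function buildClaudeArgs(prompt: string): string[] {
  return [
    "-p", prompt,
    "--output-format", "stream-json",
    "--verbose", // per the commit message, -p with stream-json exits with code 1 without it
    "--include-partial-messages",
  ];
}

// Copy the environment without CLAUDECODE so a nested session is allowed.
function childEnv(
  env: Record<string, string | undefined>,
): Record<string, string | undefined> {
  const copy = { ...env };
  delete copy["CLAUDECODE"];
  return copy;
}

// Assumption: one JSON object per line. Blank or unparsable lines are skipped.
function parseStreamLines(chunk: string): unknown[] {
  const events: unknown[] = [];
  for (const line of chunk.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    try {
      events.push(JSON.parse(trimmed));
    } catch {
      // Not a complete JSON object; ignore it in this sketch.
    }
  }
  return events;
}
```

A real provider would pass these arguments and this environment to a subprocess API such as Node's `child_process.spawn` and buffer partial lines between chunks; both steps are omitted here to keep the sketch side-effect free.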
…ges/ example
Removes TriggerJudgeEvaluator from core built-ins (violates Principles 1 & 2: Claude-Code-specific, expressible as a code-judge script) and adds:
- packages/core/src/evaluation/registry/judge-discovery.ts: new discoverJudges() function, mirroring discoverAssertions() but scans .agentv/judges/
- Wired discoverJudges into orchestrator alongside discoverAssertions
- Exported discoverJudges from core public API and registry/index.ts
- examples/features/agent-skills-evals/.agentv/judges/trigger-judge.ts: reference implementation as a code-judge script using defineCodeJudge
- Regenerated eval-schema.json (trigger-judge removed from EvaluatorSchema union)
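The trigger check itself is small enough to show as a plain function. The sketch below is not the reference implementation and does not use `defineCodeJudge`, whose signature is not shown in this PR. The `ToolCall` shape, the `file_path` argument name, and the rule that a skill file path must contain the skill name are all assumptions made for illustration.

```typescript
// Assumed tool-call shape; the real agentv types may differ.
interface ToolCall {
  tool: string;
  args?: Record<string, unknown>;
}

// Directories named in the PR as locations of skill files.
const SKILL_DIRS = [".claude/commands/", ".claude/skills/"];

// Returns true if any tool call invoked the skill directly or read one of its files.
function skillTriggered(toolCalls: ToolCall[], skill: string): boolean {
  return toolCalls.some((call) => {
    if (call.tool === "Skill") {
      // Direct invocation: args.skill must match the expected skill name.
      return call.args?.["skill"] === skill;
    }
    if (call.tool === "Read") {
      // Assumption: the path is in args.file_path and contains the skill name.
      const path = call.args?.["file_path"];
      return (
        typeof path === "string" &&
        SKILL_DIRS.some((dir) => path.includes(dir)) &&
        path.includes(skill)
      );
    }
    return false;
  });
}
```

Keeping the check as a pure function over recorded tool calls is what makes it expressible as a script under `.agentv/judges/` rather than a core built-in.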
Align with the global rename from PR #604.
d42610c to 5008ce5
Summary
Closes #593
- `claude-cli` provider (`ClaudeCliProvider`): spawns `claude -p` as a subprocess with `--output-format stream-json --include-partial-messages`, parses streaming events to extract tool calls, token usage, and cost, and strips the `CLAUDECODE` env var to allow nested sessions
- `claude` is now an alias for `claude-cli` (the new default); both resolve to `ClaudeCliProvider`
- `claude-sdk` provider (`ClaudeSdkProvider`): the existing `ClaudeProvider` renamed and made explicitly opt-in; kind changed to `'claude-sdk'`, id prefix updated accordingly
- `trigger-judge` evaluator (`TriggerJudgeEvaluator`): checks post-hoc whether the agent invoked a named skill by scanning `response.toolCalls` for `Skill` tool invocations (matching `args.skill`) or `Read` calls loading files from `.claude/commands/` or `.claude/skills/`
- `ProviderKind`, `AGENT_PROVIDER_KINDS`, `KNOWN_PROVIDERS`, and `ResolvedTarget` updated with `claude-cli` and `claude-sdk`
- `TriggerJudgeEvaluatorConfig` added to `EvaluatorConfig` union and `EVALUATOR_KIND_VALUES`
- `eval-schema.json` regenerated to include the new `trigger-judge` schema

Risk
Low: purely additive. Existing `claude` targets continue to work (they now route to the subprocess provider instead of the SDK provider). The old SDK provider is available at `claude-sdk`. No existing YAML or API changes required.
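The alias routing described in the summary amounts to a lookup table. The sketch below is a hypothetical illustration: `resolveClaudeProvider` and the use of class names as strings are not from the PR, and the real registry presumably maps kinds to provider constructors.

```typescript
// Hypothetical registry sketch; string names stand in for the real provider classes.
type ProviderImpl = "ClaudeCliProvider" | "ClaudeSdkProvider";

const CLAUDE_PROVIDERS = new Map<string, ProviderImpl>([
  ["claude", "ClaudeCliProvider"], // alias: existing `claude` targets now use the CLI provider
  ["claude-cli", "ClaudeCliProvider"],
  ["claude-sdk", "ClaudeSdkProvider"], // explicit opt-in to the Agent SDK path
]);

// Returns the provider for a target kind, or undefined for kinds not covered here.
function resolveClaudeProvider(kind: string): ProviderImpl | undefined {
  return CLAUDE_PROVIDERS.get(kind);
}
```

Under this mapping an existing target with kind `claude` needs no YAML change but runs through the subprocess provider; switching back to the SDK path means setting the kind to `claude-sdk`.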