feat: skill-trigger evaluation + claude-cli provider (subprocess-based)

## Objective

Add first-class skill-trigger evaluation to AgentV, using a `claude-cli` provider that runs `claude -p` as a subprocess. `claude` becomes an alias for `claude-cli` (the new default); the existing SDK-based provider becomes `claude-sdk`.

Side effect: fixes #521 (SDK has its own internal timeout that fires at ~2 mins, overriding AgentV's `timeout_seconds`; direct subprocess control removes that layer).

## Context

The current `claude` provider uses `@anthropic-ai/claude-agent-sdk` and exposes completed `assistant` messages with extracted tool calls. This is sufficient for execution-quality evaluation but misses two things skill-creator provides natively:

1. **Mid-stream trigger detection** — skill-creator streams `--include-partial-messages` and exits early on `content_block_start` when it sees a `Skill` or `Read` tool call. Fast, low-cost binary signal without waiting for the full response.
2. **Raw subprocess control** — `claude -p` subprocess is harness-consistent with the way skills are actually used interactively. The SDK adds abstraction that can hide behaviour visible to users in real sessions.

skill-creator's trigger detection pattern (reference implementation):
- Creates a temporary `.claude/commands/<skill>-<uuid>.md` file before the run
- Runs `claude -p <query> --output-format stream-json --include-partial-messages`
- Strips `CLAUDECODE` env var to allow nested sessions (AgentV already does this in the SDK provider)
- Watches for `content_block_start` with `Skill`/`Read` tool name containing the injected skill name
- Cleans up command file after run

Workspace setup for trigger evals uses the existing `before_all` hook pattern (`npx allagents`) — no new mechanism needed.

## Proposed Design

### Provider naming

| Provider name | Implementation | Notes |
|--------------|----------------|-------|
| `claude` | alias for `claude-cli` | default; existing configs continue to work |
| `claude-cli` | `claude -p` subprocess | new default implementation |
| `claude-sdk` | `@anthropic-ai/claude-agent-sdk` | explicit opt-in — note: already registered as an alias in `PROVIDER_ALIASES` pointing to the current SDK impl; this just formalises it |

### Provider: `claude-cli`

- Spawns `claude -p` as a subprocess
- Passes workspace `cwd`, model, system prompt, max turns via CLI flags
- Streams `--output-format stream-json --include-partial-messages` for event-level access
- Returns `ProviderResponse` in the same shape as the SDK provider (output, toolCalls, tokenUsage, costUsd)
- Strips `CLAUDECODE` env var (same as current SDK provider)

### Evaluator: `trigger-judge`

New evaluator type alongside `tool-trajectory` in `eval-file.schema.ts`:
- Checks whether the agent invoked the `Skill` tool (or `Read` on a skill path) during the run
- Takes a `skill` parameter (skill name to check for)
- Returns pass/fail with the triggering tool call as evidence
- Can detect early via stream events with `claude-cli`, or post-hoc from `toolCalls` with `claude-sdk`

```yaml
evaluators:
  - type: trigger-judge
    skill: my-skill-name
```

## Key Files

**Track 1 — Provider (do first):**
- `packages/core/src/evaluation/providers/claude.ts` → rename to `claude-sdk.ts` (keep class, update `kind`)
- `packages/core/src/evaluation/providers/claude-cli.ts` → new file, subprocess implementation
- `packages/core/src/evaluation/providers/types.ts` — add `claude-cli` to `AGENT_PROVIDER_KINDS`; `claude-sdk` already in `PROVIDER_ALIASES`, point `claude` alias to `claude-cli`
- `packages/core/src/evaluation/providers/index.ts` — update registration (line 90: `.register('claude', ...)`)
- `packages/core/src/evaluation/providers/targets.ts` — update provider resolution (line 789, case `'claude'`)
- `packages/core/src/evaluation/validation/targets-validator.ts` — update validation (line 218)

**Track 2 — Evaluator (parallel, no dependency on Track 1):**
- `packages/core/src/evaluation/validation/eval-file.schema.ts` — add `trigger-judge` schema alongside `tool-trajectory` (line 140)
- Evaluator runner (wherever `tool-trajectory` is evaluated — follow the same pattern)

## Architecture Boundary

`core-runtime` — both tracks are in `packages/core/`.

## Design Latitude

- Whether `claude-cli` is a fully separate class or shares a base class with `claude-sdk`
- Whether mid-stream early-exit optimisation is implemented or deferred (post-hoc `toolCalls` check is correct and simpler; early-exit is a perf improvement)
- Whether `trigger-judge` is a standalone evaluator or an extension to `tool-trajectory`

## Acceptance Signals

- `provider: claude` in targets.yaml resolves to `claude-cli` behaviour
- `provider: claude-sdk` opts in to the previous SDK behaviour
- `provider: claude-cli` works explicitly
- An eval with `type: trigger-judge, skill: my-skill` correctly detects when Claude invokes the named skill
- `claude-cli` produces equivalent `ProviderResponse` as `claude-sdk` for execution-quality evals
- Skill trigger eval runs end-to-end: `before_all` installs skill → query → pass/fail trigger result
- Tests cover trigger-judge evaluator logic and provider alias resolution

## Non-Goals

- Trigger-description optimization loop (skill-creator's domain)
- Removing `claude-sdk` — stays as explicit opt-in
- Non-Claude harness trigger detection (Copilot, Gemini, etc. have no skills concept)

## Related

- #521 bug: claude provider times out after 2 mins — fixed as side effect
- #569 tracking: Anthropic skill-creator eval framework alignment
- #566 docs: separate execution quality from trigger quality (closed — this implements what #566 deferred)
- Anthropic skill-creator `run_eval.py` — reference implementation for trigger detection


Provider name	Implementation	Notes
`claude`	alias for `claude-cli`	default; existing configs continue to work
`claude-cli`	`claude -p` subprocess	new default implementation
`claude-sdk`	`@anthropic-ai/claude-agent-sdk`	explicit opt-in — note: already registered as an alias in `PROVIDER_ALIASES` pointing to the current SDK impl; this just formalises it

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: skill-trigger evaluation + claude-cli provider (subprocess-based) #593

Objective

Context

Proposed Design

Provider naming

Provider: `claude-cli`

Evaluator: `trigger-judge`

Key Files

Architecture Boundary

Design Latitude

Acceptance Signals

Non-Goals

Related

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

feat: skill-trigger evaluation + claude-cli provider (subprocess-based) #593

Description

Objective

Context

Proposed Design

Provider naming

Provider: claude-cli

Evaluator: trigger-judge

Key Files

Architecture Boundary

Design Latitude

Acceptance Signals

Non-Goals

Related

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Provider: `claude-cli`

Evaluator: `trigger-judge`