
feat: skill-eval companion artifacts (grading, timing, benchmark)#579

Merged
christso merged 1 commit into main from feat/565-companion-artifacts on Mar 14, 2026


@christso
Collaborator

Closes #565

Changes

  • New ArtifactWriter module following existing JsonlWriter pattern
  • Produces grading/<test-id>.json, timing.json, benchmark.json from JSONL
  • --artifacts <dir> CLI flag on agentv eval run
  • JSONL parser handles snake_case keys from existing output files
  • 29 tests for artifact generation, schema compatibility, and file I/O
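
The snake_case handling in the JSONL parser could look roughly like the sketch below. This is an illustration in Python, not the ArtifactWriter code from this PR: the camelCase target form, the function names, and the `test_id` field are assumptions made for the example (only `token_usage` appears in the description above).

```python
import json
import re


def snake_to_camel(key: str) -> str:
    """Convert a snake_case key such as 'token_usage' to 'tokenUsage'."""
    return re.sub(r"_([a-z0-9])", lambda m: m.group(1).upper(), key)


def normalize_keys(value):
    """Recursively rename dict keys; lists are walked, scalars pass through."""
    if isinstance(value, dict):
        return {snake_to_camel(k): normalize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_keys(v) for v in value]
    return value


def parse_jsonl(text: str) -> list:
    """Parse one JSON object per non-blank line and normalize its keys."""
    return [
        normalize_keys(json.loads(line))
        for line in text.splitlines()
        if line.strip()
    ]


# Hypothetical input line; 'test_id' is an assumed field name.
sample = '{"test_id": "t1", "token_usage": {"input": 10, "output": 5}}\n\n'
records = parse_jsonl(sample)
```

Keys that are already camelCase pass through unchanged, so a parser written this way accepts both old and new output files.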

Artifact Schemas

grading/<test-id>.json — Per-test grading with expectations (text/passed/evidence), summary, execution_metrics, and AgentV extensions (evaluators, workspace_changes, conversation)

timing.json — Aggregate duration_ms, total_tokens, and token_usage (input/output)
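
A minimal sketch of how such an aggregate might be computed, assuming each JSONL result carries its own `duration_ms` and `token_usage` and that `total_tokens` is the sum of input and output tokens. Both assumptions, and the helper name `build_timing`, are illustrative and not confirmed by the PR.

```python
def build_timing(results: list) -> dict:
    """Sum duration and token usage across all results (assumed record shape)."""
    input_tokens = sum(r.get("token_usage", {}).get("input", 0) for r in results)
    output_tokens = sum(r.get("token_usage", {}).get("output", 0) for r in results)
    return {
        "duration_ms": sum(r.get("duration_ms", 0) for r in results),
        # Assumption: total_tokens is input + output.
        "total_tokens": input_tokens + output_tokens,
        "token_usage": {"input": input_tokens, "output": output_tokens},
    }


timing = build_timing([
    {"duration_ms": 1200, "token_usage": {"input": 100, "output": 40}},
    {"duration_ms": 800, "token_usage": {"input": 50, "output": 10}},
    {"duration_ms": 500},  # a result without token data contributes zero tokens
])
```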

benchmark.json — Cross-test statistics per target with mean/stddev for pass_rate, time_seconds, tokens, tool_calls, cost_usd
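
The per-target statistics can be sketched as follows. The input field names (`target`, `passed`, `duration_ms`, `total_tokens`) and the use of population standard deviation are assumptions for illustration; the PR does not say which stddev variant ArtifactWriter uses. `tool_calls` and `cost_usd` would follow the same pattern and are left out for brevity.

```python
from collections import defaultdict
from statistics import mean, pstdev


def summarize(values: list) -> dict:
    """Mean and population standard deviation of a non-empty list."""
    return {"mean": mean(values), "stddev": pstdev(values)}


def build_benchmark(results: list) -> dict:
    """Group results by target and summarize each metric (assumed record shape)."""
    by_target = defaultdict(list)
    for r in results:
        by_target[r["target"]].append(r)
    return {
        target: {
            "pass_rate": summarize([1.0 if r["passed"] else 0.0 for r in rows]),
            "time_seconds": summarize([r["duration_ms"] / 1000 for r in rows]),
            "tokens": summarize([r["total_tokens"] for r in rows]),
        }
        for target, rows in by_target.items()
    }


benchmark = build_benchmark([
    {"target": "a", "passed": True, "duration_ms": 1000, "total_tokens": 100},
    {"target": "a", "passed": False, "duration_ms": 3000, "total_tokens": 300},
    {"target": "b", "passed": True, "duration_ms": 2000, "total_tokens": 50},
])
```

With a single result per target the population stddev is 0; a sample stddev (`statistics.stdev`) would instead raise an error for one data point, which is one reason to pick the variant deliberately.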

Interoperability

Shared fields (expectations[].text/passed/evidence, summary, run_summary) use the same names and types as Anthropic's skill-creator. AgentV-specific fields (evaluators, workspace_changes, conversation, per_evaluator_summary) are additive.

Add ArtifactWriter module that produces grading/<test>.json, timing.json,
and benchmark.json from existing JSONL eval results. Includes --artifacts
CLI flag for eval run command.

- Grading artifacts map per-evaluator hits/misses to skill-creator's
  expectations/evidence format with AgentV extensions (evaluators,
  workspace_changes, conversation)
- Timing artifact aggregates duration and token usage across all results
- Benchmark artifact computes per-target statistics (mean/stddev) for
  pass_rate, time, tokens, tool_calls, and cost
- JSONL parser handles snake_case keys from existing output files
- 29 tests covering artifact generation, schema compatibility, and I/O
- Schemas are supersets of Anthropic skill-creator conventions
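
The hits/misses mapping described in the first bullet can be sketched as below. This is a hedged illustration, not the code in this PR: the evaluator record shape (`name`, `hits`, `misses`), the wording of the `evidence` strings, and the `summary` keys (`passed`, `failed`, `total`, `pass_rate`) are all assumptions.

```python
def build_grading(evaluator_results: list) -> dict:
    """Map per-evaluator hits/misses to expectations with text/passed/evidence."""
    expectations = []
    for ev in evaluator_results:
        for text in ev.get("hits", []):
            expectations.append({
                "text": text,
                "passed": True,
                "evidence": f"hit reported by {ev['name']}",  # assumed wording
            })
        for text in ev.get("misses", []):
            expectations.append({
                "text": text,
                "passed": False,
                "evidence": f"miss reported by {ev['name']}",  # assumed wording
            })
    passed = sum(1 for e in expectations if e["passed"])
    total = len(expectations)
    return {
        "expectations": expectations,
        # Assumed summary keys; the real schema may differ.
        "summary": {
            "passed": passed,
            "failed": total - passed,
            "total": total,
            "pass_rate": passed / total if total else 0.0,
        },
    }


grading = build_grading([
    {"name": "rubric", "hits": ["cites the source"], "misses": ["lists next steps"]},
    {"name": "format", "hits": ["valid JSON output"]},
])
```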

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@cloudflare-workers-and-pages

Deploying agentv with Cloudflare Pages

Latest commit: 391e4fd
Status: ✅  Deploy successful!
Preview URL: https://2f6d16ef.agentv.pages.dev
Branch Preview URL: https://feat-565-companion-artifacts.agentv.pages.dev


@christso christso marked this pull request as ready for review March 14, 2026 05:36
@christso christso merged commit bf74717 into main Mar 14, 2026
1 check passed
@christso christso deleted the feat/565-companion-artifacts branch March 14, 2026 05:40