
Fix eval yaml parser #4

Merged
christso merged 4 commits into main from fix/eval-parsing on Nov 11, 2025


Conversation

@christso
Collaborator

No description provided.

@christso changed the title from "Fix eval parser" to "Fix eval yaml parser" on Nov 11, 2025
- Add noExternal config to tsup to bundle workspace dependency
- Bump version to 0.1.4
- Fixes ERR_MODULE_NOT_FOUND when installing via npm
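
A minimal sketch of what the `noExternal` change above could look like in a `tsup.config.ts`. The entry path, output format, and the package name `@example/core` are placeholders and are not taken from this repository; only the `noExternal` option itself is named in the commit.

```typescript
// tsup.config.ts (illustrative; paths and package names are hypothetical)
import { defineConfig } from "tsup";

export default defineConfig({
  entry: ["src/index.ts"], // hypothetical entry point
  format: ["esm"], // assumed output format
  // Packages listed here are bundled into the output instead of being
  // left as runtime imports that must be resolved from node_modules.
  noExternal: ["@example/core"], // hypothetical workspace dependency
});
```

tsup normally leaves declared dependencies external, so a workspace-only package that is not published alongside the CLI would plausibly be missing after an `npm install`, which matches the `ERR_MODULE_NOT_FOUND` symptom described above.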
@christso merged commit 2937f92 into main on Nov 11, 2025
@christso deleted the fix/eval-parsing branch on November 11, 2025 03:45
christso added a commit that referenced this pull request Mar 6, 2026
…e to CLAUDE.md

Remove ErrorRetry interface, errorRetries field on EvaluationResult,
and retry tracking code — no industry precedent, and retry count
can be added later if needed. Add YAGNI as design principle #4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
christso added a commit that referenced this pull request Mar 6, 2026
…y tracking (#442)

* feat(eval): add retry-errors, fail_on_error tolerance, and error retry tracking (#433, #434, #435)

Implements three follow-up features from #431 execution status classification:
- --retry-errors <jsonl>: re-run only execution_error test cases from a previous output
- execution.fail_on_error config: true (halt on first), false (never halt), or 0.0-1.0 threshold
- errorRetries field on EvaluationResult to track transient errors retried during provider invocation
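
As a rough illustration of the `--retry-errors` idea in the first bullet, the sketch below splits a previous JSONL output into test IDs to re-run and results to keep. The function name `partitionPreviousRun` and the `executionStatus` field name are assumptions made for this example; only `testId`, `score`, and the `execution_error` status value appear in this PR's text.

```typescript
// Hypothetical shape of one line in a previous run's JSONL output.
interface ResultLine {
  testId: string;
  score: number;
  executionStatus?: string; // assumed field name
}

// Split a previous run into IDs to retry and results to carry over.
function partitionPreviousRun(jsonl: string): { retryIds: string[]; kept: ResultLine[] } {
  const retryIds: string[] = [];
  const kept: ResultLine[] = [];
  for (const line of jsonl.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // skip blank lines
    const parsed = JSON.parse(trimmed) as Partial<ResultLine>;
    if (typeof parsed.testId !== "string") continue; // unusable without an ID
    if (parsed.executionStatus === "execution_error") {
      retryIds.push(parsed.testId); // re-run only errored cases
      continue;
    }
    if (typeof parsed.score !== "number") continue; // lightweight validation
    kept.push(parsed as ResultLine);
  }
  return { retryIds, kept };
}
```

A caller would then run only the cases in `retryIds` and merge the fresh results with `kept`.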

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add unit tests for extractFailOnError config parser

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add retry-errors, fail_on_error, and errorRetries documentation

Updates eval-schema.json, SKILL.md, running-evals.mdx, and eval-files.mdx
with documentation for the three new features from #433, #434, #435.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: add fail_on_error to Zod schema and regenerate eval-schema.json

The eval-schema-sync test requires the Zod schema to be the source of
truth. Adds FailOnErrorSchema to ExecutionSchema and regenerates the
JSON schema to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: format eval-schema.json with Biome

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address code review feedback for fail_on_error implementation

- Rewrite threshold test to exercise actual ratio math (succeed →
  succeed → fail → fail → fail triggers halt at 3/5=0.60 > 0.5)
- Fix docs range notation from 0.0-1.0 to >0.0-1.0 (exclusive of 0)
- Add concurrency best-effort note to docs
- Add comment explaining why 0 is excluded from numeric thresholds
- Add lightweight validation (testId + score) in loadNonErrorResults
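
The ratio math in the first bullet can be checked with a small sketch. It assumes the halt condition is a running error ratio compared with strict greater-than against the threshold, as the `3/5=0.60 > 0.5` note suggests; `firstHaltIndex` is a hypothetical helper, not a function from this codebase.

```typescript
// Return the zero-based index of the case at which the running error
// ratio first exceeds the threshold, or -1 if it never does.
// `outcomes[i]` is true when case i succeeded, false when it errored.
function firstHaltIndex(outcomes: boolean[], threshold: number): number {
  let errors = 0;
  for (let i = 0; i < outcomes.length; i++) {
    if (!outcomes[i]) errors++;
    // Ratio of errors over cases completed so far (strict comparison).
    if (errors / (i + 1) > threshold) return i;
  }
  return -1;
}

// succeed, succeed, fail, fail, fail with a 0.5 threshold:
// running ratios are 0/1, 0/2, 1/3, 2/4, 3/5, so the halt is at the fifth case.
const haltAt = firstHaltIndex([true, true, false, false, false], 0.5);
```

After the fourth case the ratio is exactly 0.5, which does not exceed the threshold, so the halt only triggers on the fifth case.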

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: simplify fail_on_error to boolean-only (remove ratio threshold)

Align with industry standards (promptfoo, braintrust) by keeping
fail_on_error as a simple true/false toggle. The numeric ratio
threshold (0.0-1.0) was YAGNI — post-hoc analysis of JSONL output
is sufficient for error ratio decisions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix Biome formatting in config-loader

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* refactor: remove errorRetries tracking (YAGNI) and add YAGNI principle to CLAUDE.md

Remove ErrorRetry interface, errorRetries field on EvaluationResult,
and retry tracking code — no industry precedent, and retry count
can be added later if needed. Add YAGNI as design principle #4.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* style: fix trailing blank lines

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>