eval(routine_eval): tighter judge, per-fixture artifacts, large-model lift #67
Merged
Compiler eval was scoring legitimate compilations as failures and missing
real failures. Four classes of changes:
1. Judge prompt fixes (eval/routine_eval/user_proxy.py):
- keyword_placement: must_have_for_steps now describes target *elements*
(not specific routine steps). Judge looks across all routine steps for
the must_have target — handles the merged-step pattern where one step
covers multiple interactions and the Keywords line addresses any one
of them. Distinguishes "no Keywords line" from "Keywords line on a
different valid target in the same merged step" so reasoning text
doesn't dismiss valid tokens as bogus.
- asking_behavior: now asymmetric — only penalizes missing required
topics. Extra questions, including overlap with `forbidden`, do NOT
reduce the score. Erring on the side of asking more is acceptable
compiler behaviour; under-asking is the real failure mode.
- `acceptable_tokens` is treated as illustrative, not exhaustive: any
token plausibly distinctive per priority rules 1–7 passes.
2. Fixture corrections:
- finviz_filter_clear: Performance view tab now accepts both
`Performance` (visible text, rule 7 — actually the more distinctive
token since `view-tab` class is shared by every view tab on the page)
and `view-tab`.
- github-trending-contenteditable-question: removed `lake-title` (Yuque
title input) from must_have_for_steps; text-input targets are
disambiguated by the act of typing into them after focus, so a
Keywords token is nice-to-have rather than load-bearing. Rewrote
raw_intention.md as first-person user voice (was mixing fixture
engineering meta with user intent — judge LLM was being primed by
"this fixture exercises…" language).
- techforum_count_ambiguous: removed `main-search` from
must_have_for_steps for the same reason (input fields aren't
load-bearing keyword targets).
3. Per-fixture artifacts in eval/output (evaluate_routine_compile.py +
user_proxy.py):
- Default output layout now mirrors the main eval:
`eval/output/<timestamp>_compiler_eval/<compile_alias>/`.
- Per fixture: `traces/<fixture_id>_compiler_trace.json` (full compiler
conversation dump copied from ~/.openbrowser/compiler_traces/) and
`judges/<fixture_id>_judge.json` (judge prompt + raw tool-call args +
parsed scores + reasoning).
- JudgmentResult gains `prompt` and `raw_args` fields (excluded from
to_dict() so the canonical regression report stays diffable).
- FixtureRunResult captures trace_path from the SSE complete payload.
4. agent-sdk bump (pyproject.toml + uv.lock): pulls in
system_prompt_compiler.j2 changes — typed-text classification under
rule 4 (catches the contenteditable instruction-paste failure mode)
and a pre-write ambiguity-enumeration template in Workflow step 4
(mechanical procedure for smaller models that were rationalizing past
the prior soft rule).
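
The asymmetric asking_behavior rule from item 1 can be sketched in plain Python. This is only an illustration: the real judge is an LLM prompt in user_proxy.py, and the function name, the argument shapes, and the proportional scoring below are all assumptions rather than the eval's actual code.

```python
from typing import Iterable


def asking_behavior_score(
    asked: Iterable[str],
    required: Iterable[str],
    forbidden: Iterable[str] = (),
) -> float:
    """Hypothetical asymmetric score: only missing required topics cost points."""
    asked_set = set(asked)
    required_set = set(required)
    if not required_set:
        # Nothing was required, so there is nothing to under-ask.
        return 1.0
    covered = required_set & asked_set
    # `forbidden` is accepted for symmetry with the fixture schema but is
    # deliberately ignored: extra questions, even overlapping ones, are free.
    _ = forbidden
    return len(covered) / len(required_set)
```

Under this sketch, a compiler that asks both required topics plus a `forbidden` one still scores 1.0, while one that skips a required topic is penalized in proportion to what it missed.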
Net effect on the qwen-plus regression baselines: average pass rate across
qwen35plus-fast and qwen36plus-fast lifts from 50% to 67% (refreshed
canonical reports include the latest run results). The remaining failures
in both models are real compiler issues — primarily skipping the
position-vs-identity question on gh-trending under variance — not judge
artifacts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Compiler eval was scoring legitimate compilations as failures and missing real ones. This PR ships:
- Judge prompt fixes (`eval/routine_eval/user_proxy.py`): `keyword_placement` now treats `must_have_for_steps` as describing target elements and looks across all steps (handles merged-step routines); `asking_behavior` is now asymmetric and only penalizes missing required topics, not extra ones; `acceptable_tokens` is treated as illustrative, not exhaustive.
- Per-fixture artifacts in `eval/output` (matches main-eval layout): `<timestamp>_compiler_eval/<compile_alias>/{traces,judges}/`. `traces/` copies the full compiler conversation dump; `judges/` records the judge prompt + raw tool-call args + parsed scores + reasoning. Lets you re-read both sides of any judge call without re-running the eval.
- agent-sdk bump (`pyproject.toml` + `uv.lock`) pulls in `system_prompt_compiler.j2` changes: a "Typed text in rich-text / contenteditable bodies" sub-bullet that classifies typed body content into static / pasted-reference / agent-instruction categories (catches the gh-trending instruction-paste failure mode), and a pre-write ambiguity-enumeration template in Workflow step 4 (mechanical procedure for smaller models that were rationalizing past the prior soft rule).

Companion commit in agent-sdk: `9b289cd3` on `open-browser` branch.
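
A minimal sketch of how the per-fixture artifact layout and the diffable report view could fit together. The `JudgmentResult` fields other than `prompt` and `raw_args`, and the helper names `artifact_paths` and `write_judge_artifact`, are hypothetical and chosen only for illustration; the file-name patterns follow the commit message.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class JudgmentResult:
    # Assumed shape: only `prompt` and `raw_args` are named in this PR.
    scores: dict
    reasoning: str
    prompt: str = ""
    raw_args: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        """Canonical report view: drop prompt/raw_args so reports stay diffable."""
        data = asdict(self)
        data.pop("prompt")
        data.pop("raw_args")
        return data


def artifact_paths(run_dir: Path, fixture_id: str) -> tuple[Path, Path]:
    """Return (trace_path, judge_path) under <timestamp>_compiler_eval/<compile_alias>/."""
    trace = run_dir / "traces" / f"{fixture_id}_compiler_trace.json"
    judge = run_dir / "judges" / f"{fixture_id}_judge.json"
    return trace, judge


def write_judge_artifact(run_dir: Path, fixture_id: str, result: JudgmentResult) -> Path:
    """Write the full judge record, including the fields to_dict() omits."""
    _, judge_path = artifact_paths(run_dir, fixture_id)
    judge_path.parent.mkdir(parents=True, exist_ok=True)
    judge_path.write_text(json.dumps(asdict(result), indent=2))
    return judge_path
```

Keeping the verbose fields out of `to_dict()` while still writing them to `judges/` means the regression report diff stays small, yet the full prompt and raw arguments remain available for any single judge call.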
Eval impact
Refreshed canonical reports (`compile_evaluation_report_qwen{35,36}plus-fast.json`) reflect the latest run with all changes in place.

Remaining failures are real compiler weaknesses (most commonly: skipping the position-vs-identity question on gh-trending under variance), not judge artifacts.
Test plan
🤖 Generated with Claude Code