eval(routine_eval): tighter judge, per-fixture artifacts, large-model lift #67

Merged: softpudding merged 1 commit into main from feat/compiler-eval-judge-improvements on Apr 25, 2026
@softpudding

Summary

Compiler eval was scoring legitimate compilations as failures and missing real failures. This PR ships:

  • Judge prompt fixes (eval/routine_eval/user_proxy.py): keyword_placement now treats must_have_for_steps as describing target elements and looks across all steps, which handles merged-step routines; asking_behavior is now asymmetric and penalizes only missing required topics, not extra ones; acceptable_tokens is treated as illustrative, not exhaustive.
  • Fixture corrections: finviz Performance tab accepts visible text `Performance` (more distinctive than the shared `view-tab` class); text-input targets removed from `must_have_for_steps` on Yuque title and TechForum search (inputs are disambiguated by the act of typing); rewrote gh-trending `raw_intention.md` as first-person user voice (was mixing fixture-engineering meta with intent).
  • Per-fixture artifacts in eval/output (matches main-eval layout): <timestamp>_compiler_eval/<compile_alias>/{traces,judges}/. traces/ copies the full compiler conversation dump; judges/ records the judge prompt + raw tool-call args + parsed scores + reasoning. Lets you re-read both sides of any judge call without re-running the eval.
  • agent-sdk bump (pyproject.toml + uv.lock) pulls in system_prompt_compiler.j2 changes: a "Typed text in rich-text / contenteditable bodies" sub-bullet that classifies typed body content into static / pasted-reference / agent-instruction categories (catches the gh-trending instruction-paste failure mode), and a pre-write ambiguity-enumeration template in Workflow step 4 (mechanical procedure for smaller models that were rationalizing past the prior soft rule).
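The asymmetric asking_behavior rule described above can be illustrated with a small deterministic sketch. The real judge is an LLM prompted from user_proxy.py, so this is not the actual implementation; the function name, the 0 to 1 score scale, and the set-based topic representation are all assumptions made for illustration.

```python
def asking_behavior_score(
    asked: set[str],
    required: set[str],
    forbidden: frozenset[str] = frozenset(),
) -> float:
    """Hypothetical scorer: fraction of required topics the compiler asked about.

    Extra questions are ignored, including ones that overlap `forbidden`,
    so over-asking never lowers the score. Only missing required topics do.
    """
    # `forbidden` is accepted but deliberately unused: the rule is asymmetric.
    if not required:
        return 1.0
    covered = asked & required
    return len(covered) / len(required)


# Over-asking (even a "forbidden" topic) keeps full marks.
full = asking_behavior_score(
    {"position_vs_identity", "login"},
    {"position_vs_identity"},
    frozenset({"login"}),
)
# Under-asking is what gets penalized.
partial = asking_behavior_score({"count"}, {"count", "position_vs_identity"})
```

Under these assumptions `full` is 1.0 and `partial` is 0.5, which matches the stated intent that under-asking is the failure mode worth penalizing.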

Companion commit in agent-sdk: `9b289cd3` on `open-browser` branch.

Eval impact

Refreshed canonical reports (compile_evaluation_report_qwen{35,36}plus-fast.json) reflect the latest run with all changes in place:

| Model | Pass rate before | Pass rate now |
| --- | --- | --- |
| qwen35plus-fast | 1/3 (33%) | 2/3 (67%) |
| qwen36plus-fast | 2/3 (67%) | 2/3 (67%) |
| average | 50% | 67% |
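For reference, the average row is the unweighted mean of the two per-model pass rates. A minimal sketch of that arithmetic, using only the counts reported here (the dictionary layout and function name are illustrative, not taken from the report JSON):

```python
# Pass counts as (passed, total), copied from the reported results.
results = {
    "qwen35plus-fast": {"before": (1, 3), "now": (2, 3)},
    "qwen36plus-fast": {"before": (2, 3), "now": (2, 3)},
}


def mean_pass_rate(run: str) -> float:
    """Unweighted mean of per-model pass rates for the given run label."""
    rates = [passed / total for passed, total in (r[run] for r in results.values())]
    return sum(rates) / len(rates)


before_pct = round(mean_pass_rate("before") * 100)  # (1/3 + 2/3) / 2 -> 50
now_pct = round(mean_pass_rate("now") * 100)        # (2/3 + 2/3) / 2 -> 67
```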

Remaining failures are real compiler weaknesses (most commonly: skipping the position-vs-identity question on gh-trending under variance), not judge artifacts.

Test plan

  • CI green
  • Spot-check that `uv sync` against the new agent-sdk rev resolves cleanly
  • Optionally re-run `eval/routine_eval/evaluate_routine_compile.py --all --compile-alias qwen36plus-fast` to confirm new layout populates `eval/output/<timestamp>_compiler_eval/qwen36plus-fast/{traces,judges}/`
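One way to spot-check the new layout after a run is a small script like the one below. It is only a sketch: `missing_artifacts` is a hypothetical helper, not part of the repo, and it assumes the per-fixture file names given in the commit message (`traces/<fixture_id>_compiler_trace.json` and `judges/<fixture_id>_judge.json`).

```python
from pathlib import Path


def missing_artifacts(alias_dir: Path, fixture_ids: list[str]) -> list[Path]:
    """Return the expected trace/judge files that do not exist under alias_dir.

    alias_dir is the <timestamp>_compiler_eval/<compile_alias>/ directory.
    """
    expected: list[Path] = []
    for fixture_id in fixture_ids:
        expected.append(alias_dir / "traces" / f"{fixture_id}_compiler_trace.json")
        expected.append(alias_dir / "judges" / f"{fixture_id}_judge.json")
    # Keep only the paths that are absent (or are not regular files).
    return [path for path in expected if not path.is_file()]
```

An empty return value means every listed fixture has both its trace and its judge record on disk.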

🤖 Generated with Claude Code

… lift

Compiler eval was scoring legitimate compilations as failures and missing
real failures. Four classes of changes:

1. Judge prompt fixes (eval/routine_eval/user_proxy.py):
   - keyword_placement: must_have_for_steps now describes target *elements*
     (not specific routine steps). Judge looks across all routine steps for
     the must_have target — handles the merged-step pattern where one step
     covers multiple interactions and the Keywords line addresses any one
     of them. Distinguishes "no Keywords line" from "Keywords line on a
     different valid target in the same merged step" so reasoning text
     doesn't dismiss valid tokens as bogus.
   - asking_behavior: now asymmetric — only penalizes missing required
     topics. Extra questions, including overlap with `forbidden`, do NOT
     reduce the score. Erring on the side of asking more is acceptable
     compiler behaviour; under-asking is the real failure mode.
   - acceptable_tokens: treated as illustrative, not exhaustive. Any token
     plausibly distinctive per priority rules 1–7 passes.

2. Fixture corrections:
   - finviz_filter_clear: Performance view tab now accepts both
     `Performance` (visible text, rule 7 — actually the more distinctive
     token since `view-tab` class is shared by every view tab on the page)
     and `view-tab`.
   - github-trending-contenteditable-question: removed `lake-title` (Yuque
     title input) from must_have_for_steps; text-input targets are
     disambiguated by the act of typing into them after focus, so a
     Keywords token is nice-to-have rather than load-bearing. Rewrote
     raw_intention.md as first-person user voice (was mixing fixture
     engineering meta with user intent — judge LLM was being primed by
     "this fixture exercises…" language).
   - techforum_count_ambiguous: removed `main-search` from
     must_have_for_steps for the same reason (input fields aren't
     load-bearing keyword targets).

3. Per-fixture artifacts in eval/output (evaluate_routine_compile.py +
   user_proxy.py):
   - Default output layout now mirrors the main eval:
     `eval/output/<timestamp>_compiler_eval/<compile_alias>/`.
   - Per fixture: `traces/<fixture_id>_compiler_trace.json` (full compiler
     conversation dump copied from ~/.openbrowser/compiler_traces/) and
     `judges/<fixture_id>_judge.json` (judge prompt + raw tool-call args +
     parsed scores + reasoning).
   - JudgmentResult gains `prompt` and `raw_args` fields (excluded from
     to_dict() so the canonical regression report stays diffable).
   - FixtureRunResult captures trace_path from the SSE complete payload.

4. agent-sdk bump (pyproject.toml + uv.lock): pulls in
   system_prompt_compiler.j2 changes — typed-text classification under
   rule 4 (catches the contenteditable instruction-paste failure mode)
   and a pre-write ambiguity-enumeration template in Workflow step 4
   (mechanical procedure for smaller models that were rationalizing past
   the prior soft rule).
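The JudgmentResult change in item 3 (new `prompt` and `raw_args` fields that stay out of `to_dict()`) could look roughly like the sketch below. Only the class name, the two new field names, and the `to_dict()` exclusion come from this commit; the `scores` and `reasoning` fields, the metadata flag, and `to_artifact()` are assumptions made for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class JudgmentResult:
    scores: dict[str, float]  # assumed field: parsed per-criterion scores
    reasoning: str            # assumed field: judge's explanation
    # Debug-only fields, flagged so to_dict() can skip them.
    prompt: str = field(default="", metadata={"report": False})
    raw_args: dict[str, Any] = field(default_factory=dict, metadata={"report": False})

    def to_dict(self) -> dict[str, Any]:
        """Canonical report view: omits debug fields so reports stay diffable."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if f.metadata.get("report", True)
        }

    def to_artifact(self) -> dict[str, Any]:
        """Hypothetical full view for the per-fixture judges/ file."""
        return {f.name: getattr(self, f.name) for f in fields(self)}
```

Keeping the verbose prompt and raw tool-call arguments out of the canonical view means a regression report changes only when scores or reasoning change, while the per-fixture judge file can still record everything.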

Net effect on the qwen-plus regression baselines: average pass rate across
qwen35plus-fast and qwen36plus-fast lifts from 50% to 67% (refreshed
canonical reports include the latest run results). The remaining failures
in both models are real compiler issues — primarily skipping the
position-vs-identity question on gh-trending under variance — not judge
artifacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@softpudding softpudding merged commit 4f70a0d into main Apr 25, 2026
4 checks passed