eval(routine_eval): tighter judge, per-fixture artifacts, large-model lift by softpudding · Pull Request #67 · softpudding/OpenBrowser

softpudding · 2026-04-25T07:40:35Z

Summary

Compiler eval was scoring legitimate compilations as failures and missing real failures. This PR ships:

Judge prompt fixes (eval/routine_eval/user_proxy.py): keyword_placement now treats must_have_for_steps as describing target elements and searches all steps for them (which handles merged-step routines); asking_behavior is now asymmetric and penalizes only missing required topics, not extra ones; acceptable_tokens is treated as illustrative, not exhaustive.
Fixture corrections: finviz Performance tab accepts visible text `Performance` (more distinctive than the shared `view-tab` class); text-input targets removed from `must_have_for_steps` on Yuque title and TechForum search (inputs are disambiguated by the act of typing); rewrote gh-trending `raw_intention.md` as first-person user voice (was mixing fixture-engineering meta with intent).
Per-fixture artifacts in eval/output (matches main-eval layout): <timestamp>_compiler_eval/<compile_alias>/{traces,judges}/. traces/ copies the full compiler conversation dump; judges/ records the judge prompt + raw tool-call args + parsed scores + reasoning. Lets you re-read both sides of any judge call without re-running the eval.
agent-sdk bump (pyproject.toml + uv.lock) pulls in system_prompt_compiler.j2 changes: a "Typed text in rich-text / contenteditable bodies" sub-bullet that classifies typed body content into static / pasted-reference / agent-instruction categories (catches the gh-trending instruction-paste failure mode), and a pre-write ambiguity-enumeration template in Workflow step 4 (mechanical procedure for smaller models that were rationalizing past the prior soft rule).
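The asymmetric asking_behavior rule can be illustrated with a small sketch. The real judge is an LLM prompted from user_proxy.py, so the function below is a hypothetical stand-in for the rule rather than the PR's code; `asking_behavior_score` and the topic strings are assumptions made for illustration.

```python
def asking_behavior_score(required: set[str], asked: set[str]) -> float:
    """Fraction of required topics the compiler asked about.

    Hypothetical helper: extra questions never lower the score,
    only missing required topics do.
    """
    if not required:
        return 1.0
    covered = required & asked
    return len(covered) / len(required)


# Illustrative topic names, not taken from the fixtures.
required = {"position-vs-identity", "repo count"}
forbidden = {"login credentials"}

# Extra questions, even ones that overlap `forbidden`, leave the score untouched.
assert asking_behavior_score(required, required | forbidden) == 1.0
# Skipping a required topic is what costs points.
assert asking_behavior_score(required, {"repo count"}) == 0.5
```

Scoring this way encodes the stance that over-asking is acceptable while under-asking is the failure mode worth catching.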

Companion commit in agent-sdk: `9b289cd3` on `open-browser` branch.

Eval impact

Refreshed canonical reports (compile_evaluation_report_qwen{35,36}plus-fast.json) reflect the latest run with all changes in place:

Model	Pass rate before	Pass rate now
qwen35plus-fast	1/3 (33%)	2/3 (67%)
qwen36plus-fast	2/3 (67%)	2/3 (67%)
average	50%	67%

Remaining failures are real compiler weaknesses (most commonly: skipping the position-vs-identity question on gh-trending under variance), not judge artifacts.
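For reference, the average row of the table above is the unweighted mean of the two per-model pass rates. A minimal sketch of that arithmetic, with hypothetical helper names `mean_pass_rate` and `as_percent`:

```python
from fractions import Fraction

# Per-model pass rates copied from the table above.
before = {"qwen35plus-fast": Fraction(1, 3), "qwen36plus-fast": Fraction(2, 3)}
now = {"qwen35plus-fast": Fraction(2, 3), "qwen36plus-fast": Fraction(2, 3)}


def mean_pass_rate(rates: dict[str, Fraction]) -> Fraction:
    """Unweighted mean across models (each model ran the same 3 fixtures)."""
    return sum(rates.values()) / len(rates)


def as_percent(rate: Fraction) -> str:
    """Round to a whole percent, as the table does."""
    return f"{round(float(rate) * 100)}%"


print(as_percent(mean_pass_rate(before)))  # 50%
print(as_percent(mean_pass_rate(now)))     # 67%
```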

Test plan

CI green
Spot-check that `uv sync` against the new agent-sdk rev resolves cleanly
Optionally re-run `eval/routine_eval/evaluate_routine_compile.py --all --compile-alias qwen36plus-fast` to confirm new layout populates `eval/output/<timestamp>_compiler_eval/qwen36plus-fast/{traces,judges}/`
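The last check can be scripted. The sketch below is not part of the PR; `summarize_artifacts` is a hypothetical helper, and it assumes the timestamp prefix sorts lexicographically so that the last directory is the newest run.

```python
from pathlib import Path


def summarize_artifacts(output_root: Path, compile_alias: str) -> dict[str, list[str]]:
    """List the JSON files under traces/ and judges/ for the newest run.

    Assumption: run directories are named <timestamp>_compiler_eval and the
    timestamp sorts lexicographically in chronological order.
    """
    runs = sorted(p for p in output_root.glob("*_compiler_eval") if p.is_dir())
    if not runs:
        raise FileNotFoundError(f"no *_compiler_eval run under {output_root}")
    alias_dir = runs[-1] / compile_alias
    return {
        sub: sorted(f.name for f in (alias_dir / sub).glob("*.json"))
        for sub in ("traces", "judges")
    }
```

Calling `summarize_artifacts(Path("eval/output"), "qwen36plus-fast")` after a run should return non-empty lists for both `traces` and `judges` if the new layout populated as expected.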

🤖 Generated with Claude Code

… lift

Compiler eval was scoring legitimate compilations as failures and missing real ones. Four classes of changes:

1. Judge prompt fixes (eval/routine_eval/user_proxy.py):
   - keyword_placement: must_have_for_steps now describes target *elements* (not specific routine steps). Judge looks across all routine steps for the must_have target, which handles the merged-step pattern where one step covers multiple interactions and the Keywords line addresses any one of them. Distinguishes "no Keywords line" from "Keywords line on a different valid target in the same merged step" so reasoning text doesn't dismiss valid tokens as bogus.
   - asking_behavior: now asymmetric; only penalizes missing required topics. Extra questions, including overlap with `forbidden`, do NOT reduce the score. Erring on the side of asking more is acceptable compiler behaviour; under-asking is the real failure mode.
   - acceptable_tokens: treated as illustrative, not exhaustive; any token plausibly distinctive per priority rules 1–7 passes.

2. Fixture corrections:
   - finviz_filter_clear: Performance view tab now accepts both `Performance` (visible text, rule 7; actually the more distinctive token since the `view-tab` class is shared by every view tab on the page) and `view-tab`.
   - github-trending-contenteditable-question: removed `lake-title` (Yuque title input) from must_have_for_steps; text-input targets are disambiguated by the act of typing into them after focus, so a Keywords token is nice-to-have rather than load-bearing. Rewrote raw_intention.md as first-person user voice (was mixing fixture engineering meta with user intent; judge LLM was being primed by "this fixture exercises…" language).
   - techforum_count_ambiguous: removed `main-search` from must_have_for_steps for the same reason (input fields aren't load-bearing keyword targets).

3. Per-fixture artifacts in eval/output (evaluate_routine_compile.py + user_proxy.py):
   - Default output layout now mirrors the main eval: `eval/output/<timestamp>_compiler_eval/<compile_alias>/`.
   - Per fixture: `traces/<fixture_id>_compiler_trace.json` (full compiler conversation dump copied from ~/.openbrowser/compiler_traces/) and `judges/<fixture_id>_judge.json` (judge prompt + raw tool-call args + parsed scores + reasoning).
   - JudgmentResult gains `prompt` and `raw_args` fields (excluded from to_dict() so the canonical regression report stays diffable).
   - FixtureRunResult captures trace_path from the SSE complete payload.

4. agent-sdk bump (pyproject.toml + uv.lock): pulls in system_prompt_compiler.j2 changes: typed-text classification under rule 4 (catches the contenteditable instruction-paste failure mode) and a pre-write ambiguity-enumeration template in Workflow step 4 (mechanical procedure for smaller models that were rationalizing past the prior soft rule).

Net effect on the qwen-plus regression baselines: average pass rate across qwen35plus-fast and qwen36plus-fast lifts from 50% to 67% (refreshed canonical reports include the latest run results). The remaining failures in both models are real compiler issues, primarily skipping the position-vs-identity question on gh-trending under variance, not judge artifacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
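One way to keep debug-only fields out of a serialized report is to mark them in dataclass field metadata. The sketch below is illustrative only and is not the PR's implementation: `prompt`, `raw_args` and `to_dict()` come from the commit message, while `scores`, `reasoning` and the `report` metadata key are assumptions.

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class JudgmentResult:
    # Assumed report fields; the real class may differ.
    scores: dict[str, float]
    reasoning: str
    # Debug-only fields: available on the object and written to judges/,
    # but skipped by to_dict() so the canonical report stays diffable.
    prompt: str = field(default="", metadata={"report": False})
    raw_args: dict[str, Any] = field(default_factory=dict, metadata={"report": False})

    def to_dict(self) -> dict[str, Any]:
        """Serialize only the fields that belong in the regression report."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if f.metadata.get("report", True)
        }


result = JudgmentResult(
    scores={"keyword_placement": 1.0},
    reasoning="Keywords line targets the Performance tab.",
    prompt="(full judge prompt)",
    raw_args={"keyword_placement": 1.0},
)
print(result.to_dict())  # contains scores and reasoning, but not prompt or raw_args
```

Keeping the long, run-specific prompt text out of the report means diffs between runs show only score and reasoning changes.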

softpudding merged commit 4f70a0d into main Apr 25, 2026
4 checks passed

softpudding merged 1 commit into main from feat/compiler-eval-judge-improvements
