fix: pixel-coord alignment, scroll/click/keyboard reliability + eval refresh by softpudding · Pull Request #69 · softpudding/OpenBrowser

softpudding · 2026-05-13T00:18:50Z

Summary

Stack of action-side fixes for the Qwen agent: pixel-coord alignment, scroll normalization + post-scroll settle, click no-op detection with overlay screenshot, keyboard select-all/clear on macOS, pixel-confirm preview without zoom-crop, drive move-picker drilldown.
Eval harness: --tests subset support for the runner; frontend now surfaces reasoning_content / thinking_blocks on event cards.
Refreshes eval/evaluation_report.json with a new full run across both flash models (qwen3.5-flash 82.9%, qwen3.6-flash 91.4%, aggregate 87.1% on 70 runs). The plus models were dropped from this run because both the qwen35plus and qwen36plus aliases resolved to the same dashscope/qwen3.5-plus and shared one DashScope quota bucket; the second alias run died on LLMRateLimitError (20 of 35 tests) mid-run.

Test plan

CI: lint / typecheck / build
CI: unit tests
Manual: run `uv run python eval/evaluate_browser_agent.py --tests bluebook_simple --model-alias qwen36flash` to smoke-test the runner and the report format.
Verify eval/evaluation_report.json parses and renders in any dashboard / consumer that reads it.

🤖 Generated with Claude Code

…ards

When an assistant turn has no tool call and empty content but non-empty reasoning (e.g. qwen-flash thinking-only responses), the timeline showed a mystery empty "AGENT / Role: assistant" card with no clue why. The SSE-payload whitelist in normalizeFrontendEvent was dropping the fields even after the visualizer added them. Carry reasoning_content and thinking_blocks through the visualizer for both MessageEvent and ActionEvent, pass them through the normalizer, and render a collapsed grey Reasoning/Thinking expander on the card.

Also bump agent-sdk pin to 3799d1cf so qwen3-coder-style XML tool calls that arrive in reasoning_content get recovered into structured tool calls instead of stalling the agent loop on empty messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
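The whitelist behaviour described in this commit can be sketched as follows. This is a language-neutral illustration written in Python, not the project's normalizeFrontendEvent; every field name other than reasoning_content and thinking_blocks, and both function names, are assumptions made for the example.

```python
from typing import Any, Dict, Mapping

# Hypothetical whitelist: only the last two names come from the commit message.
ALLOWED_FIELDS = (
    "kind",
    "role",
    "content",
    "tool_call",
    "reasoning_content",
    "thinking_blocks",
)


def normalize_frontend_event(payload: Mapping[str, Any]) -> Dict[str, Any]:
    """Copy only whitelisted keys, so any field missing from the list is dropped."""
    return {key: payload[key] for key in ALLOWED_FIELDS if key in payload}


def is_reasoning_only(event: Mapping[str, Any]) -> bool:
    """True when the card has reasoning to show but no content and no tool call."""
    has_body = bool(event.get("content")) or bool(event.get("tool_call"))
    has_reasoning = bool(event.get("reasoning_content")) or bool(
        event.get("thinking_blocks")
    )
    return has_reasoning and not has_body
```

With the two reasoning fields on the whitelist, a thinking-only turn survives normalization and a renderer can decide to show the expander instead of an empty card.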

…al size

The confirmation preview used to crop and rescale the viewport around the click target. That gave the agent a screenshot whose coordinate system did not match every other screenshot in the conversation, so a "retarget" reply could carry pixels picked from zoom space, which is wrong. Return the full viewport screenshot instead, with the existing yellow target box and orange candidate outlines drawn on the live DOM (which the screenshot picks up naturally) plus a canvas-side fail-safe in device-pixel space. The agent now confirms the marked element, or emits a fresh coordinate from the same coordinate system it sees everywhere else; extension-side detection no longer constrains the retarget, since fresh-pixel estimates remain valid. Update both small- and big-model mouse_tool.j2 prompts to describe the preview in affirmative terms (no "zoomed crop" wording) and to invite either a candidate-center retarget or a fresh-pixel estimate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
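To illustrate why a zoomed crop is a different coordinate system, here is a minimal sketch of the mapping a consumer would need to translate a crop pixel back to the viewport. The function name, the crop origin, and the zoom factor are all hypothetical; the commit removes the crop rather than adding a translation like this.

```python
from typing import Tuple

Point = Tuple[float, float]


def crop_to_viewport(point: Point, crop_origin: Point, zoom: float) -> Point:
    """Map a pixel picked on a zoomed crop back to full-viewport pixels.

    crop_origin is the viewport position of the crop's top-left corner and
    zoom is the rescale factor applied to the crop (both assumed values).
    """
    x, y = point
    origin_x, origin_y = crop_origin
    return (origin_x + x / zoom, origin_y + y / zoom)
```

A pixel at (200, 100) on a 2x crop whose corner sits at (500, 200) refers to viewport point (600, 250), so using (200, 100) directly would miss the target. With a full-viewport preview the mapping is the identity (origin (0, 0), zoom 1.0) and no translation is needed.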

The drive eval's Move-items dialog rendered every folder as a single flat scrollable list, forcing the agent to scan ~30 path-labelled rows to find a known nested target. Replace with a real drill-down: breadcrumb header at the top (clickable segments navigate up) and a list of direct child folders for the current location (click to drill in). The current breadcrumb tail is the destination, so 'Move items' commits to wherever you've navigated to. Add a --tests flag to evaluate_browser_agent.py that filters the all-tests scheduler to a named subset, so rerunning a handful of failing cases doesn't burn the whole benchmark slot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
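A subset filter of this kind could look roughly like the sketch below. It assumes `--tests` accepts one or more space-separated names; the real flag's syntax, the helper names, and every test name other than bluebook_simple are assumptions.

```python
import argparse
from typing import List, Optional, Sequence

# Hypothetical test registry; only "bluebook_simple" appears in the PR text.
ALL_TESTS = ["bluebook_simple", "drive_move_nested", "drive_rename"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run browser-agent evals")
    parser.add_argument(
        "--tests",
        nargs="+",
        default=None,
        metavar="NAME",
        help="run only the named tests (default: run all)",
    )
    return parser


def select_tests(
    all_tests: Sequence[str], requested: Optional[Sequence[str]]
) -> List[str]:
    """Return the requested subset in scheduler order, rejecting unknown names."""
    if not requested:
        return list(all_tests)
    unknown = [name for name in requested if name not in all_tests]
    if unknown:
        raise ValueError("unknown test(s): " + ", ".join(unknown))
    wanted = set(requested)
    return [name for name in all_tests if name in wanted]
```

Rejecting unknown names up front avoids a silent empty run when a test name is mistyped.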

Several keyboard paths failed on macOS so the agent couldn't replace text in a focused field, which broke rename flows across the eval:

- Agents reached for Ctrl+A from training, which on macOS is "go to start of line" (Emacs binding), not select-all. Remap Control+<shortcut-key> -> Meta+<shortcut-key> on the host side so the intent lands correctly, and surface the swap in the observation message so the agent learns what actually fired. Add a `literal: true` field on KeyboardAction to bypass the remap when the agent really wants the raw Control combination.
- CDP `Input.dispatchKeyEvent` doesn't trigger Chromium's built-in select-all accelerator even when given Meta+a (the comment on the existing `clear` action already noted this). Add an `ensureSelectAllOnActive` JS fallback that runs after the key event for any Meta+a or Control+a press and forces the visual selection on the focused input/textarea/contenteditable, so a following `type` replaces instead of appending.
- The `clear` action was silently dead in production: the extension and command model existed, but the server processor's dispatch had no branch for KeyboardClearCommand, so every clear invocation raised "Unknown command type". Add the missing dispatch.
- Replace a frozen-Observation mutation in the clear branch (`obs.success = False`) with `model_copy(update=...)`, which is the pydantic-v2 idiom for "modify a frozen model".

Update the small- and big-model keyboard prompts to teach the macOS shape affirmatively (Meta+a/c/v/x/z as the command shortcuts) and to mention the auto-translation and `literal` escape hatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
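The Control-to-Meta remap with its `literal` escape hatch might be sketched like this. The `KeyPress` type, the function name, the note wording, and the exact set of remapped keys are assumptions; the key set is borrowed from the a/c/v/x/z shortcuts the prompts mention.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Assumed remap set, taken from the shortcuts named in the prompt update.
MAC_SHORTCUT_KEYS = frozenset("acvxz")


@dataclass(frozen=True)
class KeyPress:
    modifiers: Tuple[str, ...]
    key: str
    literal: bool = False  # True bypasses the remap


def remap_for_macos(press: KeyPress) -> Tuple[KeyPress, Optional[str]]:
    """Swap Control for Meta on shortcut keys; return the press and a note.

    The note is None when nothing was changed, so a caller can append it to
    the observation only when a swap actually happened.
    """
    if press.literal or "Control" not in press.modifiers:
        return press, None
    if press.key.lower() not in MAC_SHORTCUT_KEYS:
        return press, None
    swapped = tuple("Meta" if m == "Control" else m for m in press.modifiers)
    note = f"Control+{press.key} was sent as Meta+{press.key} (macOS shortcut)"
    return KeyPress(swapped, press.key, press.literal), note
```

Returning the note alongside the remapped press keeps the translation visible to the agent instead of hiding it in the host layer.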

…ed no-op radius

- Echo click/move/drag/confirm observation coordinates in the agent's own space (Qwen [0,1000]) so the value matches what the model emitted.
- Refuse keyboard type when nothing editable is focused and surface a warning analogous to the click no-op message; annotate successful type with the target field id.
- When a click no-op produces no nearby interactables at 30px, re-probe at 100px then 300px and tag each hint with its distance, so the agent always has somewhere concrete to re-aim at.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
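The widening re-probe in the last bullet can be sketched as below. The radii are the ones named in the commit message; the `Hint` type, the function name, and the use of centre-to-click Euclidean distance are assumptions.

```python
import math
from typing import List, NamedTuple, Sequence, Tuple

Point = Tuple[float, float]

PROBE_RADII_PX = (30, 100, 300)  # radii quoted in the commit message


class Hint(NamedTuple):
    label: str
    center: Point
    distance_px: float


def nearby_hints(
    click: Point,
    interactables: Sequence[Tuple[str, Point]],
    radii: Sequence[int] = PROBE_RADII_PX,
) -> List[Hint]:
    """Return hints from the smallest radius that finds any interactable."""
    for radius in radii:
        hits = []
        for label, center in interactables:
            distance = math.dist(click, center)
            if distance <= radius:
                hits.append(Hint(label, center, round(distance, 1)))
        if hits:
            # Closest candidates first, each tagged with its distance.
            return sorted(hits, key=lambda hint: hint.distance_px)
    return []
```

Stopping at the first radius that yields a hit keeps the hint list short when something is close by, and only falls back to distant candidates when the click landed in genuinely empty space.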

…LED message

- Mouse scroll now actively polls window.scrollY until it stabilizes before screenshotting; a single 1000px wheel on a page with `scroll-behavior: smooth` animates 500-900ms and was returning a mid-animation blank viewport. Capped at 1.5s with a 200ms paint settle after.
- Gate preview now includes the previewed target element's structured identity (Target: <div> "4" → center=(584, 257)) on the same line as the candidate list, so the agent can read what the yellow box is before deciding whether to confirm or re-aim.
- Observation renderer at base.py:444 now emits `**Action**: {message}` on FAILED observations too; previously the keyboard no-focus warning ("Click into the target input field first…") was dropped because the failure renderer only emitted Status + Error, and the agent saw a meaningless `Status: FAILED, Error: None`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
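The poll-until-stable wait from the first bullet could be sketched as follows. The 1.5s cap and 200ms paint settle come from the commit message; the 50ms poll interval, the "three unchanged reads" stability rule, and all names are assumptions, and the real logic runs in the extension's JavaScript rather than in Python.

```python
import time
from typing import Callable


def wait_for_scroll_settle(
    read_scroll_y: Callable[[], float],
    *,
    timeout_s: float = 1.5,       # cap from the commit message
    paint_settle_s: float = 0.2,  # paint settle from the commit message
    poll_s: float = 0.05,         # assumed poll interval
    stable_polls: int = 3,        # assumed stability rule
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> float:
    """Poll the scroll position until it stops changing or the cap is hit."""
    deadline = now() + timeout_s
    last = read_scroll_y()
    unchanged = 0
    while unchanged < stable_polls and now() < deadline:
        sleep(poll_s)
        current = read_scroll_y()
        if current == last:
            unchanged += 1
        else:
            unchanged = 0
            last = current
    sleep(paint_settle_s)  # give the compositor time to paint the final frame
    return last
```

Injecting `sleep` and `now` keeps the sketch testable with a fake clock, and the deadline guarantees the wait terminates even on a page that never stops scrolling.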

Qwen models emit clicks/moves in [0, 1000] normalized space, but scroll amount was still being treated as raw CSS pixels, so the same number that means "viewport center" for a click meant "500 actual pixels" for a scroll, an inconsistency the agent had to mentally compensate for.

- Denormalize the scroll amount against the axis-relevant viewport dimension before dispatching the wheel event (vh for up/down, vw for left/right). Non-Qwen models pass through unchanged.
- Echo the agent's input amount back in the observation message ("Scrolled down by 500", no "px") so action and observation use the same space.
- Update the mouse tool prompts and the `amount` field description to teach the [0, 1000] semantic: 1000 ≈ one full viewport.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
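A minimal sketch of the denormalization described above, assuming the server knows the viewport size in CSS pixels and whether the model uses the normalized space. The function names, the `normalized` flag, and the rounding choice are assumptions.

```python
def denormalize_scroll(
    amount: float,
    direction: str,
    viewport_w: int,
    viewport_h: int,
    *,
    normalized: bool = True,
) -> int:
    """Convert an agent scroll amount to CSS pixels for the wheel event.

    With normalized=True the amount is in [0, 1000], where 1000 is roughly one
    full viewport along the scroll axis. Other models pass through unchanged.
    """
    if not normalized:
        return int(amount)
    if direction in ("up", "down"):
        extent = viewport_h
    elif direction in ("left", "right"):
        extent = viewport_w
    else:
        raise ValueError(f"unknown scroll direction: {direction!r}")
    return round(amount / 1000 * extent)


def scroll_message(amount: float, direction: str) -> str:
    """Echo the agent's own number (no unit) so action and observation agree."""
    return f"Scrolled {direction} by {amount}"
```

On a 1920x1080 viewport, an amount of 500 scrolling down becomes 540 CSS pixels, while the observation still reads "Scrolled down by 500".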

The agent's scroll amount is normalized [0, 1000]; the server denormalizes to CSS pixels before constructing MouseScrollCommand. MouseScrollCommand.amount was capped at le=1000, so on a 1080-tall viewport the denormalized 1080 hit Pydantic validation — the resulting error message ("Input should be <= 1000 [input_value=1080]") leaked the CSS-pixel number back into the agent's observation, breaking the "agent only ever sees [0, 1000]" contract. Raise the wire-type cap to 20000 (way beyond any real viewport) so the denormalization path can never overflow into a validation error. The field is documented as internal — agents talk to mouse.scroll which stays in [0, 1000]. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
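The overflow can be reproduced with a plain-Python stand-in for the upper-bound constraint. This is not the Pydantic model itself; the function names are hypothetical, and only the two cap values and the 1080-pixel example come from the commit message.

```python
OLD_CAP_PX = 1000   # previous upper bound quoted in the commit message
NEW_CAP_PX = 20000  # raised upper bound quoted in the commit message


def validate_wire_amount(amount_px: int, cap_px: int) -> int:
    """Stand-in for an upper-bound field constraint on the wire type."""
    if amount_px > cap_px:
        raise ValueError(f"Input should be <= {cap_px} [input_value={amount_px}]")
    return amount_px


def full_viewport_px(viewport_extent_px: int) -> int:
    """CSS pixels for the largest agent amount (1000) along one axis."""
    return round(1000 / 1000 * viewport_extent_px)
```

A full-viewport scroll on a 1080-tall viewport denormalizes to 1080, which fails the old cap and passes the new one, so the agent-facing [0, 1000] range never surfaces a CSS-pixel validation error.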

Scrolling at the bottom of a page, at an edge in the unscrollable direction, or on a non-scrollable region dispatches the wheel event successfully but doesn't move the viewport. The previous observation said "Scrolled down by N" regardless, so the agent saw an apparent success and assumed progress had been made, often looping its scroll without checking.

- performMouseScroll now snapshots scrollX/Y before the wheel event and re-reads after waitForScrollSettle returns. If the position didn't change, it probes documentElement scrollHeight/innerHeight to label the edge ("already at the top/bottom/left/right of the page") so the agent gets a specific hint, not a generic stall.
- Server scroll dispatch reads detail.moved; on false it surfaces a warning ("Scroll N had no effect — <reason>. Try a different region, the opposite direction, or a different navigation.")
- Missing `moved` field is treated as success so older builds during a rolling upgrade don't false-warn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
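The edge labelling and the backward-compatible `moved` handling can be sketched together. The `moved` field and the edge phrases come from the commit message; the `reason` key, the function signatures, the fallback reason text, and the exact warning wording (paraphrased here) are assumptions.

```python
from typing import Any, Mapping, Optional


def edge_reason(
    direction: str,
    scroll_x: float,
    scroll_y: float,
    scroll_w: float,
    scroll_h: float,
    inner_w: float,
    inner_h: float,
) -> Optional[str]:
    """Name the page edge that blocks a scroll, or None if no edge explains it."""
    if direction == "up" and scroll_y <= 0:
        return "already at the top of the page"
    if direction == "down" and scroll_y + inner_h >= scroll_h:
        return "already at the bottom of the page"
    if direction == "left" and scroll_x <= 0:
        return "already at the left of the page"
    if direction == "right" and scroll_x + inner_w >= scroll_w:
        return "already at the right of the page"
    return None


def scroll_observation(amount: float, direction: str, detail: Mapping[str, Any]) -> str:
    """Build the agent-facing message from the extension's scroll detail."""
    # A missing "moved" key counts as success so older builds don't false-warn.
    if detail.get("moved", True):
        return f"Scrolled {direction} by {amount}"
    reason = detail.get("reason") or "the page did not move"
    return (
        f"Scroll {amount} had no effect: {reason}. Try a different region, "
        "the opposite direction, or a different navigation."
    )
```

Defaulting `moved` to true is what makes the change safe during a rolling upgrade: a server that understands the field still behaves as before when talking to an extension that does not send it.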

…etection

Close the design gap on empty-space clicks: the agent now both reads a "no DOM change" warning with nearby-element coordinates AND sees the orange-dashed candidate outlines in the screenshot it observes.

- _draw_no_op_overlay now returns the post-overlay screenshot URL. _overlay_screenshot_into_result swaps it into result_dict["data"]["screenshot"], so the agent's observation image shows the highlighted candidates instead of the pre-overlay snapshot.
- Tighten the MutationObserver in performPixelClick to skip idempotent attribute mutations (oldValue === newValue), which a doc-level click handler doing `el.style.display = 'none'` on every click would otherwise spam.
- Tighten the selection-change probe: count "selection changed" only when a range is selected or the caret landed in a real editable context (input/textarea/contenteditable). A click on body whitespace resolves the caret to some text node but is not an agent-meaningful state change.
- Surface mutations/active_changed/scroll_changed/selection_changed from performPixelClick alongside triggered_anything; useful for future debugging without re-instrumenting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
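The two tightened checks can be expressed as small predicates. This is a Python illustration of logic that lives in the extension's JavaScript; the record shape is an assumption (a DOM MutationRecord exposes `oldValue` but not a new value, so `newValue` here stands for the attribute's current value read from the target), and the function names are hypothetical.

```python
from typing import Any, Mapping, Optional

EDITABLE_TAGS = frozenset({"input", "textarea"})


def is_effective_mutation(record: Mapping[str, Any]) -> bool:
    """Ignore attribute writes that left the value unchanged."""
    if record.get("type") != "attributes":
        return True
    # "newValue" is an assumed field holding the attribute's current value.
    return record.get("oldValue") != record.get("newValue")


def selection_counts(
    range_selected: bool,
    caret_tag: Optional[str],
    caret_in_contenteditable: bool = False,
) -> bool:
    """Count a selection change only for a real range or an editable caret."""
    if range_selected or caret_in_contenteditable:
        return True
    return caret_tag is not None and caret_tag.lower() in EDITABLE_TAGS
```

A click handler that re-applies `display: none` on every click produces records where old and new values match, so it no longer counts as evidence that the click did something.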

Drop dashscope/qwen3.5-plus from evaluation_report.json — the plus slot was contaminated by a DashScope hourly-quota exhaustion mid-run (20 of 35 second-alias tests died with LLMRateLimitError). Keeping only the two flash models gives a clean comparison: qwen3.5-flash 82.9% (29/35), qwen3.6-flash 91.4% (32/35), aggregate 87.1% (61/70). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
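The quoted percentages follow directly from the pass counts, as this small check shows. The dictionary layout and function names are illustrative and do not reflect the structure of evaluation_report.json.

```python
from typing import Dict, Tuple

# (passed, total) per model, as quoted in the commit message.
RESULTS: Dict[str, Tuple[int, int]] = {
    "qwen3.5-flash": (29, 35),
    "qwen3.6-flash": (32, 35),
}


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage rounded to one decimal place."""
    return round(100 * passed / total, 1)


def aggregate(results: Dict[str, Tuple[int, int]]) -> Tuple[int, int]:
    """Sum passes and totals across models before computing the overall rate."""
    passed = sum(p for p, _ in results.values())
    total = sum(t for _, t in results.values())
    return passed, total
```

Summing counts before dividing gives 61/70, which rounds to 87.1%, matching the aggregate in the report.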

Pre-commit auto-formatted spillover from the scroll/click/keyboard fixes in this stack. No behavior changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

softpudding and others added 12 commits May 9, 2026 13:39

style: apply pre-commit formatting (black + prettier)

ca0e764


softpudding merged commit 9ce62be into main May 13, 2026
4 checks passed

softpudding merged 12 commits into main from eval/full-run-20260513