fix: pixel-coord alignment, scroll/click/keyboard reliability + eval refresh#69

Merged
softpudding merged 12 commits into
mainfrom
eval/full-run-20260513
May 13, 2026

Conversation

@softpudding

Owner

Summary

  • Stack of action-side fixes for the Qwen agent: pixel-coord alignment, scroll normalization + post-scroll settle, click no-op detection with overlay screenshot, keyboard select-all/clear on macOS, pixel-confirm preview without zoom-crop, drive move-picker drilldown.
  • Eval harness: --tests subset support for the runner; frontend now surfaces reasoning_content / thinking_blocks on event cards.
  • Refreshes eval/evaluation_report.json with a full run across both flash models (qwen3.5-flash 82.9%, qwen3.6-flash 91.4%, aggregate 87.1% over 70 runs). The plus models were dropped from this run because the qwen35plus and qwen36plus aliases both resolved to dashscope/qwen3.5-plus and shared one DashScope quota bucket, so the second alias's run died mid-flow on LLMRateLimitError (20 of 35 tests).

Test plan

  • CI: lint / typecheck / build
  • CI: unit tests
  • Manual: run `uv run python eval/evaluate_browser_agent.py --tests bluebook_simple --model-alias qwen36flash` to smoke-test the runner and the report format.
  • Verify eval/evaluation_report.json parses and renders in any dashboard / consumer that reads it.

🤖 Generated with Claude Code

softpudding and others added 12 commits May 9, 2026 13:39
…ards

When an assistant turn has no tool call and empty content but non-empty
reasoning (e.g. qwen-flash thinking-only responses), the timeline showed
an unexplained empty "AGENT / Role: assistant" card. The SSE-payload
whitelist in normalizeFrontendEvent was dropping the reasoning fields
even after the visualizer added them.

Carry reasoning_content and thinking_blocks through the visualizer for
both MessageEvent and ActionEvent, pass them through the normalizer,
and render a collapsed grey Reasoning/Thinking expander on the card.

Also bump agent-sdk pin to 3799d1cf so qwen3-coder-style XML tool calls
that arrive in reasoning_content get recovered into structured tool
calls instead of stalling the agent loop on empty messages.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…al size

The confirmation preview used to crop and rescale the viewport around
the click target. That gave the agent a screenshot whose coordinate
system did not match every other screenshot in the conversation, so a
"retarget" reply could carry pixels picked in zoom space and land in
the wrong place.

Return the full viewport screenshot instead, with the existing yellow
target box and orange candidate outlines drawn on the live DOM (which
the screenshot picks up naturally) plus a canvas-side fail-safe in
device-pixel space. The agent now confirms the marked element, or
emits a fresh coordinate from the same coordinate system it sees
everywhere else — extension-side detection no longer constrains the
retarget, since fresh-pixel estimates remain valid.

Update both small- and big-model mouse_tool.j2 prompts to describe the
preview in affirmative terms (no "zoomed crop" wording) and to invite
either a candidate-center retarget or a fresh-pixel estimate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The drive eval's Move-items dialog rendered every folder as a single flat
scrollable list, forcing the agent to scan ~30 path-labelled rows to find
a known nested target. Replace with a real drill-down: breadcrumb header
at the top (clickable segments navigate up) and a list of direct child
folders for the current location (click to drill in). The current
breadcrumb tail is the destination, so 'Move items' commits to wherever
you've navigated to.

Add a --tests flag to evaluate_browser_agent.py that filters the
all-tests scheduler to a named subset, so rerunning a handful of
failing cases doesn't burn the whole benchmark slot.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
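
As a rough illustration of the `--tests` subset filter described in the commit above, the sketch below keeps suite order and rejects unknown names. The test names, the `select_tests` helper, and the space-separated argument shape are assumptions for illustration, not the runner's actual code.

```python
import argparse

# Hypothetical test names; the real suite is defined by the eval runner.
ALL_TESTS = ["bluebook_simple", "drive_move", "drive_rename"]


def select_tests(all_tests, requested):
    """Return the requested subset in suite order, or every test if none given."""
    if not requested:
        return list(all_tests)
    unknown = sorted(set(requested) - set(all_tests))
    if unknown:
        raise ValueError(f"unknown tests: {', '.join(unknown)}")
    wanted = set(requested)
    return [name for name in all_tests if name in wanted]


def parse_args(argv):
    parser = argparse.ArgumentParser()
    # Assumed shape: zero or more names after a single --tests flag.
    parser.add_argument("--tests", nargs="*", default=[])
    return parser.parse_args(argv)


args = parse_args(["--tests", "drive_rename", "bluebook_simple"])
selected = select_tests(ALL_TESTS, args.tests)
```

Failing loudly on an unknown name is a design choice in this sketch: a typo in a rerun request would otherwise silently schedule nothing.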
Several keyboard paths failed on macOS so the agent couldn't replace
text in a focused field, which broke rename flows across the eval:

- Agents reach for Ctrl+A out of training habit, but on macOS that is
  "go to start of line" (Emacs binding), not select-all. Remap
  Control+<shortcut-key> -> Meta+<shortcut-key> on the host side so the
  intent lands correctly, and surface the swap in the observation
  message so the agent learns what actually fired. Add a `literal: true`
  field on KeyboardAction to bypass the remap when the agent really
  wants the raw Control combination.

- CDP `Input.dispatchKeyEvent` doesn't trigger Chromium's built-in
  select-all accelerator even when given Meta+a (the comment on the
  existing `clear` action already noted this). Add an
  `ensureSelectAllOnActive` JS fallback that runs after the key event
  for any Meta+a or Control+a press and forces the visual selection on
  the focused input/textarea/contenteditable, so a following `type`
  replaces instead of appending.

- The `clear` action was silently dead in production: the extension and
  command model existed, but the server processor's dispatch had no
  branch for KeyboardClearCommand, so every clear invocation raised
  "Unknown command type". Add the missing dispatch.

- Replace a frozen-Observation mutation in the clear branch
  (`obs.success = False`) with `model_copy(update=...)`, which is the
  pydantic-v2 idiom for "modify a frozen model".

Update the small- and big-model keyboard prompts to teach the macOS
shape affirmatively (Meta+a/c/v/x/z as the command shortcuts) and to
mention the auto-translation and `literal` escape hatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
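
The Control-to-Meta translation and its `literal` bypass from the commit above could look roughly like this. It is a minimal sketch: the function name, the `+`-separated combo format, and the exact set of remapped keys are assumptions (the set is taken from the Meta+a/c/v/x/z shortcuts the prompts mention).

```python
# Assumption: only these letter shortcuts are translated.
SHORTCUT_KEYS = {"a", "c", "v", "x", "z"}


def remap_for_macos(combo, literal=False):
    """Return (combo_to_send, note); note is None when nothing was swapped."""
    if literal:
        # The agent explicitly asked for the raw Control combination.
        return combo, None
    parts = combo.split("+")
    modifiers, key = parts[:-1], parts[-1]
    if "Control" in modifiers and key.lower() in SHORTCUT_KEYS:
        swapped = ["Meta" if m == "Control" else m for m in modifiers]
        new_combo = "+".join(swapped + [key])
        # The note is what would be surfaced in the observation message.
        return new_combo, f"{combo} was sent as {new_combo} on macOS"
    return combo, None
```

Returning the note alongside the combo keeps the swap visible to the agent, which matches the stated goal of letting it learn what actually fired.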
…ed no-op radius

- Echo click/move/drag/confirm observation coordinates in the agent's own
  space (Qwen [0,1000]) so the value matches what the model emitted.
- Refuse keyboard type when nothing editable is focused and surface a
  warning analogous to the click no-op message; annotate successful type
  with the target field id.
- When a click no-op produces no nearby interactables at 30px, re-probe
  at 100px then 300px and tag each hint with its distance, so the agent
  always has somewhere concrete to re-aim at.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
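
The widening no-op probe from the commit above (30px, then 100px, then 300px, with each hint tagged by distance) can be sketched as follows. The data shapes and the `nearby_hints` name are hypothetical; only the three radii come from the commit message.

```python
import math

PROBE_RADII_PX = (30, 100, 300)


def nearby_hints(click, elements, radii=PROBE_RADII_PX):
    """Return hints from the smallest radius that contains any element.

    click is (x, y); elements is a list of (label, (x, y)) centers.
    """
    cx, cy = click
    measured = [
        (math.hypot(x - cx, y - cy), label, (x, y))
        for label, (x, y) in elements
    ]
    for radius in radii:
        hits = sorted(m for m in measured if m[0] <= radius)
        if hits:
            return [
                {"label": label, "center": center, "distance_px": round(dist)}
                for dist, label, center in hits
            ]
    # Nothing within the largest radius: no hint can be offered.
    return []


hints = nearby_hints((100, 100), [("Save", (100, 180)), ("Cancel", (100, 350))])
```

Here the 30px probe finds nothing, the 100px probe finds "Save" at 80px, and "Cancel" at 250px is never reported because the search stops at the first radius with a hit.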
…LED message

- Mouse scroll now actively polls window.scrollY until it stabilizes
  before screenshotting. A single 1000px wheel on a page with
  `scroll-behavior: smooth` animates for 500-900ms, and the screenshot
  was returning a mid-animation blank viewport. Polling is capped at
  1.5s, followed by a 200ms paint settle.
- Gate preview now includes the previewed target element's structured
  identity (Target: <div> "4" → center=(584, 257)) on the same line as
  the candidate list, so the agent can read what the yellow box is
  before deciding whether to confirm or re-aim.
- Observation renderer at base.py:444 now emits `**Action**: {message}`
  on FAILED observations too; previously the keyboard no-focus warning
  ("Click into the target input field first…") was dropped because the
  failure renderer only emitted Status + Error, and the agent saw a
  meaningless `Status: FAILED, Error: None`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qwen models emit clicks/moves in [0, 1000] normalized space, but scroll
amount was still being treated as raw CSS pixels — so the same number
that means "viewport center" for a click meant "500 actual pixels" for
a scroll, an inconsistency the agent had to mentally compensate for.

- Denormalize the scroll amount against the axis-relevant viewport
  dimension before dispatching the wheel event (vh for up/down, vw
  for left/right). Non-Qwen models pass through unchanged.
- Echo the agent's input amount back in the observation message
  ("Scrolled down by 500", no "px") so action and observation use the
  same space.
- Update the mouse tool prompts and the `amount` field description
  to teach the [0, 1000] semantic: 1000 ≈ one full viewport.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
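
A minimal sketch of the denormalization described in the commit above, assuming a hypothetical `denormalize_scroll` helper and an `is_qwen` flag standing in for however the server identifies the model family:

```python
def denormalize_scroll(amount, direction, viewport_w, viewport_h, is_qwen=True):
    """Convert a [0, 1000] scroll amount into CSS pixels for the wheel event."""
    if not is_qwen:
        # Non-Qwen models already speak CSS pixels; pass through unchanged.
        return amount
    # Vertical scrolls scale against viewport height, horizontal against width.
    axis = viewport_h if direction in ("up", "down") else viewport_w
    return round(amount / 1000 * axis)


def scroll_message(amount, direction):
    """Echo the agent's own number back, with no unit attached."""
    return f"Scrolled {direction} by {amount}"
```

On a 1920x1080 viewport, `denormalize_scroll(500, "down", 1920, 1080)` gives 540 and an amount of 1000 gives 1080, which is one full viewport height, while the observation still reads "Scrolled down by 500" or "Scrolled down by 1000".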
The agent's scroll amount is normalized [0, 1000]; the server
denormalizes to CSS pixels before constructing MouseScrollCommand.
MouseScrollCommand.amount was capped at le=1000, so on a 1080-tall
viewport the denormalized 1080 hit Pydantic validation — the resulting
error message ("Input should be <= 1000 [input_value=1080]") leaked the
CSS-pixel number back into the agent's observation, breaking the
"agent only ever sees [0, 1000]" contract.

Raise the wire-type cap to 20000 (way beyond any real viewport) so the
denormalization path can never overflow into a validation error. The
field is documented as internal — agents talk to mouse.scroll which
stays in [0, 1000].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scrolling at the bottom of a page, at an edge in the unscrollable
direction, or on a non-scrollable region dispatches the wheel event
successfully but doesn't move the viewport. The previous observation
said "Scrolled down by N" regardless — the agent saw an apparent
success and assumed progress had been made, often looping its scroll
without checking.

- performMouseScroll now snapshots scrollX/Y before the wheel event
  and re-reads after waitForScrollSettle returns. If the position
  didn't change, it probes documentElement scrollHeight/innerHeight
  to label the edge ("already at the top/bottom/left/right of the
  page") so the agent gets a specific hint, not a generic stall.
- Server scroll dispatch reads detail.moved; on false it surfaces a
  warning ("Scroll N had no effect — <reason>. Try a different
  region, the opposite direction, or a different navigation.")
- Missing `moved` field is treated as success so older builds during
  a rolling upgrade don't false-warn.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
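
The server-side handling in the commit above can be sketched as below. `detail` and `moved` are named in the commit message; the `reason` key, the function names, the vertical-only edge check, and the paraphrased warning wording are assumptions.

```python
def edge_reason(direction, scroll_y, inner_height, scroll_height):
    """Label why a vertical scroll could not move (horizontal omitted here)."""
    if direction == "down" and scroll_y + inner_height >= scroll_height:
        return "already at the bottom of the page"
    if direction == "up" and scroll_y <= 0:
        return "already at the top of the page"
    return "this region does not scroll"


def describe_scroll(amount, direction, detail):
    """Build the observation text from the extension's scroll result."""
    # A missing `moved` field counts as success so older extension builds
    # do not trigger false warnings during a rolling upgrade.
    if detail.get("moved", True):
        return f"Scrolled {direction} by {amount}"
    reason = detail.get("reason", "the viewport did not move")
    return (
        f"Scroll {amount} had no effect: {reason}. Try a different region, "
        "the opposite direction, or a different navigation."
    )
```

Defaulting to success on a missing field trades a possible missed warning for compatibility, which is the same trade the commit describes.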
…etection

Close the design gap on empty-space clicks: the agent now both reads a
"no DOM change" warning with nearby-element coordinates AND sees the
orange-dashed candidate outlines in the screenshot it observes.

- _draw_no_op_overlay now returns the post-overlay screenshot URL.
  _overlay_screenshot_into_result swaps it into result_dict["data"][
  "screenshot"], so the agent's observation image shows the highlighted
  candidates instead of the pre-overlay snapshot.
- Tighten the MutationObserver in performPixelClick to skip idempotent
  attribute mutations (oldValue === newValue), which a doc-level click
  handler doing `el.style.display = 'none'` on every click would
  otherwise spam.
- Tighten the selection-change probe: count "selection changed" only
  when a range is selected or the caret landed in a real editable
  context (input/textarea/contenteditable). A click on body whitespace
  resolves the caret to some text node but is not an agent-meaningful
  state change.
- Surface mutations/active_changed/scroll_changed/selection_changed
  from performPixelClick alongside triggered_anything; useful for
  future debugging without re-instrumenting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
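
The two tightened probes from the commit above run in the extension's JavaScript; this Python sketch only mirrors their logic. The record shape (a dict carrying both `old_value` and `new_value`) and the context labels are hypothetical.

```python
EDITABLE_CONTEXTS = {"input", "textarea", "contenteditable"}


def meaningful_mutations(records):
    """Drop attribute mutations that rewrote a value with the same value."""
    kept = []
    for rec in records:
        is_attr = rec["type"] == "attributes"
        if is_attr and rec.get("old_value") == rec.get("new_value"):
            continue  # idempotent write, e.g. display re-set to 'none'
        kept.append(rec)
    return kept


def selection_changed(range_selected, caret_context):
    """Count a selection change only for a real range or an editable caret."""
    return bool(range_selected) or caret_context in EDITABLE_CONTEXTS


records = [
    {"type": "attributes", "old_value": "display: none", "new_value": "display: none"},
    {"type": "attributes", "old_value": "", "new_value": "display: none"},
    {"type": "childList"},
]
kept = meaningful_mutations(records)
```

In this example only the second and third records survive, so a handler that rewrites the same style on every click no longer makes an empty-space click look like it triggered something.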
Drop dashscope/qwen3.5-plus from evaluation_report.json — the plus
slot was contaminated by a DashScope hourly-quota exhaustion mid-run
(20 of 35 second-alias tests died with LLMRateLimitError). Keeping
only the two flash models gives a clean comparison: qwen3.5-flash
82.9% (29/35), qwen3.6-flash 91.4% (32/35), aggregate 87.1% (61/70).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
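
The percentages in the commit above follow directly from the pass counts it quotes. A short check, with the dictionary layout chosen only for illustration:

```python
# (passed, total) per model, as quoted in the commit message.
results = {
    "qwen3.5-flash": (29, 35),
    "qwen3.6-flash": (32, 35),
}


def pass_rate(passed, total):
    """Pass rate as a percentage rounded to one decimal place."""
    return round(100 * passed / total, 1)


per_model = {name: pass_rate(p, t) for name, (p, t) in results.items()}
total_passed = sum(p for p, _ in results.values())
total_runs = sum(t for _, t in results.values())
aggregate = pass_rate(total_passed, total_runs)
```

This yields 82.9 and 91.4 for the two models and 87.1 for the aggregate over 61 of 70 runs.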
Pre-commit auto-formatted spillover from the scroll/click/keyboard
fixes in this stack. No behavior changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@softpudding softpudding merged commit 9ce62be into main May 13, 2026
4 checks passed
