fix: pixel-coord alignment, scroll/click/keyboard reliability + eval refresh #69
Merged
…ards
When an assistant turn has no tool call and empty content but non-empty reasoning (e.g. qwen-flash thinking-only responses), the timeline showed a mystery empty "AGENT / Role: assistant" card with no clue why. The SSE-payload whitelist in normalizeFrontendEvent was dropping the fields even after the visualizer added them.
Carry reasoning_content and thinking_blocks through the visualizer for both MessageEvent and ActionEvent, pass them through the normalizer, and render a collapsed grey Reasoning/Thinking expander on the card.
Also bump agent-sdk pin to 3799d1cf so qwen3-coder-style XML tool calls that arrive in reasoning_content get recovered into structured tool calls instead of stalling the agent loop on empty messages.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…al size
The confirmation preview used to crop and rescale the viewport around the click target. That gave the agent a screenshot whose coordinate system did not match every other screenshot in the conversation, so a "retarget" reply could land on pixels picked from zoom space, which is wrong.
Return the full viewport screenshot instead, with the existing yellow target box and orange candidate outlines drawn on the live DOM (which the screenshot picks up naturally) plus a canvas-side fail-safe in device-pixel space. The agent now confirms the marked element, or emits a fresh coordinate from the same coordinate system it sees everywhere else. Extension-side detection no longer constrains the retarget, since fresh-pixel estimates remain valid.
Update both small- and big-model mouse_tool.j2 prompts to describe the preview in affirmative terms (no "zoomed crop" wording) and to invite either a candidate-center retarget or a fresh-pixel estimate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The drive eval's Move-items dialog rendered every folder as a single flat scrollable list, forcing the agent to scan ~30 path-labelled rows to find a known nested target. Replace with a real drill-down: breadcrumb header at the top (clickable segments navigate up) and a list of direct child folders for the current location (click to drill in). The current breadcrumb tail is the destination, so 'Move items' commits to wherever you've navigated to.
Add a --tests flag to evaluate_browser_agent.py that filters the all-tests scheduler to a named subset, so rerunning a handful of failing cases doesn't burn the whole benchmark slot.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
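A minimal sketch of how a `--tests` filter of this kind could be wired up with argparse. The helper names `parse_args` and `select_tests` and the space-separated multi-name syntax are assumptions for illustration, not the runner's actual code.

```python
import argparse


def parse_args(argv):
    """Parse runner arguments; --tests is optional and takes one or more names."""
    parser = argparse.ArgumentParser(description="Browser-agent eval runner (sketch)")
    parser.add_argument("--tests", nargs="+", default=None,
                        help="Run only the named tests instead of the full suite")
    parser.add_argument("--model-alias", default=None)
    return parser.parse_args(argv)


def select_tests(all_tests, requested):
    """Return the subset of all_tests named in requested, preserving suite order."""
    if not requested:
        return list(all_tests)  # no filter given: run everything
    unknown = sorted(set(requested) - set(all_tests))
    if unknown:
        # Fail early so a typo does not silently run zero tests.
        raise ValueError(f"Unknown test name(s): {', '.join(unknown)}")
    wanted = set(requested)
    return [name for name in all_tests if name in wanted]
```

Rejecting unknown names up front is a design choice in this sketch; a runner could equally warn and continue.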
Several keyboard paths failed on macOS so the agent couldn't replace text in a focused field, which broke rename flows across the eval:
- Agents reached for Ctrl+A from training, which on macOS is "go to start of line" (Emacs binding), not select-all. Remap Control+<shortcut-key> -> Meta+<shortcut-key> on the host side so the intent lands correctly, and surface the swap in the observation message so the agent learns what actually fired. Add a `literal: true` field on KeyboardAction to bypass the remap when the agent really wants the raw Control combination.
- CDP `Input.dispatchKeyEvent` doesn't trigger Chromium's built-in select-all accelerator even when given Meta+a (the comment on the existing `clear` action already noted this). Add an `ensureSelectAllOnActive` JS fallback that runs after the key event for any Meta+a or Control+a press and forces the visual selection on the focused input/textarea/contenteditable, so a following `type` replaces instead of appending.
- The `clear` action was silently dead in production: the extension and command model existed, but the server processor's dispatch had no branch for KeyboardClearCommand, so every clear invocation raised "Unknown command type". Add the missing dispatch.
- Replace a frozen-Observation mutation in the clear branch (`obs.success = False`) with `model_copy(update=...)`, which is the pydantic-v2 idiom for "modify a frozen model".
Update the small- and big-model keyboard prompts to teach the macOS shape affirmatively (Meta+a/c/v/x/z as the command shortcuts) and to mention the auto-translation and `literal` escape hatch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
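The Control-to-Meta remap described above can be sketched roughly as follows. The combo string format (`Control+a`), the function name `remap_for_macos`, and the wording of the note are assumptions; the shortcut-key set is taken from the Meta+a/c/v/x/z list in the prompt update.

```python
# Hypothetical sketch: the real remap lives in the host-side keyboard handler.
SHORTCUT_KEYS = {"a", "c", "v", "x", "z"}


def remap_for_macos(combo, literal=False):
    """Return (combo_to_send, note); note is None when nothing was swapped."""
    parts = combo.split("+")
    key = parts[-1].lower()
    modifiers = parts[:-1]
    # literal=True is the escape hatch: send the raw Control combination.
    if literal or "Control" not in modifiers or key not in SHORTCUT_KEYS:
        return combo, None
    swapped = ["Meta" if m == "Control" else m for m in modifiers]
    new_combo = "+".join(swapped + [parts[-1]])
    # The note is what would be surfaced in the observation message.
    note = f"Translated {combo} to {new_combo} for macOS"
    return new_combo, note
```

Returning the note alongside the combo keeps the translation visible to the agent instead of happening silently.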
…ed no-op radius
- Echo click/move/drag/confirm observation coordinates in the agent's own space (Qwen [0,1000]) so the value matches what the model emitted.
- Refuse keyboard type when nothing editable is focused and surface a warning analogous to the click no-op message; annotate successful type with the target field id.
- When a click no-op produces no nearby interactables at 30px, re-probe at 100px then 300px and tag each hint with its distance, so the agent always has somewhere concrete to re-aim at.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
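The widening no-op probe (30px, then 100px, then 300px) might look something like the sketch below. The function name `nearby_hints` and the `(label, (x, y))` element shape are illustrative assumptions; only the three radii come from the commit message.

```python
import math

PROBE_RADII_PX = (30, 100, 300)  # radii named in the commit message


def nearby_hints(click, elements, radii=PROBE_RADII_PX):
    """Return (label, distance_px) hints from the smallest radius with any hit.

    click is an (x, y) pair; elements is a list of (label, (x, y)) pairs.
    Both shapes are assumptions made for this sketch.
    """
    distances = [(label, math.dist(click, center)) for label, center in elements]
    for radius in radii:
        hits = [(label, round(d)) for label, d in distances if d <= radius]
        if hits:
            # Closest candidates first, each tagged with its distance.
            return sorted(hits, key=lambda hit: hit[1])
    return []  # nothing interactable even at the widest radius
```

Stopping at the first radius that yields a hit keeps the hint list short when something is close by.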
…LED message
- Mouse scroll now actively polls window.scrollY until it stabilizes
before screenshotting; a single 1000px wheel on a page with
`scroll-behavior: smooth` animates 500-900ms and was returning a
mid-animation blank viewport. Capped at 1.5s with a 200ms paint
settle after.
- Gate preview now includes the previewed target element's structured
identity (Target: <div> "4" → center=(584, 257)) on the same line as
the candidate list, so the agent can read what the yellow box is
before deciding whether to confirm or re-aim.
- Observation renderer at base.py:444 now emits `**Action**: {message}`
on FAILED observations too; previously the keyboard no-focus warning
("Click into the target input field first…") was dropped because the
failure renderer only emitted Status + Error, and the agent saw a
meaningless `Status: FAILED, Error: None`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
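The scroll-settle polling in the first bullet above can be sketched in Python as follows. The real logic runs in the extension against `window.scrollY`; here `read_scroll_y` stands in for that read, and the poll interval and the "three equal reads" stability rule are assumptions. Only the 1.5s cap and the 200ms paint settle come from the commit message.

```python
import time


def wait_for_scroll_settle(read_scroll_y, timeout_s=1.5, poll_s=0.05,
                           stable_polls=3, settle_s=0.2, sleep=time.sleep):
    """Poll read_scroll_y() until it stops changing or timeout_s elapses.

    Returns the last observed scroll position.
    """
    last = read_scroll_y()
    stable = 0
    waited = 0.0
    while waited < timeout_s and stable < stable_polls:
        sleep(poll_s)
        waited += poll_s
        current = read_scroll_y()
        # Count consecutive unchanged reads; any movement resets the count.
        stable = stable + 1 if current == last else 0
        last = current
    sleep(settle_s)  # let the page paint before the screenshot is taken
    return last
```

Passing `sleep` as a parameter is only there so the loop can be exercised without real delays.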
Qwen models emit clicks/moves in [0, 1000] normalized space, but scroll
amount was still being treated as raw CSS pixels — so the same number
that means "viewport center" for a click meant "500 actual pixels" for
a scroll, an inconsistency the agent had to mentally compensate for.
- Denormalize the scroll amount against the axis-relevant viewport
dimension before dispatching the wheel event (vh for up/down, vw
for left/right). Non-Qwen models pass through unchanged.
- Echo the agent's input amount back in the observation message
("Scrolled down by 500", no "px") so action and observation use the
same space.
- Update the mouse tool prompts and the `amount` field description
to teach the [0, 1000] semantic: 1000 ≈ one full viewport.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
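A minimal sketch of the denormalization step, assuming a helper named `denormalize_scroll` and a boolean `normalized` flag to distinguish Qwen-style models (both names are hypothetical):

```python
def denormalize_scroll(amount, direction, viewport_w, viewport_h, normalized=True):
    """Convert a [0, 1000] scroll amount to CSS pixels along the scroll axis."""
    if not normalized:
        return amount  # non-Qwen models pass through unchanged
    if direction in ("up", "down"):
        axis = viewport_h  # vertical scroll scales against viewport height
    elif direction in ("left", "right"):
        axis = viewport_w  # horizontal scroll scales against viewport width
    else:
        raise ValueError(f"Unknown scroll direction: {direction!r}")
    return round(amount / 1000 * axis)
```

On a 1920x1080 viewport, `denormalize_scroll(500, "down", 1920, 1080)` gives 540 CSS pixels, and an amount of 1000 gives 1080, one full viewport height.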
The agent's scroll amount is normalized [0, 1000]; the server
denormalizes to CSS pixels before constructing MouseScrollCommand.
MouseScrollCommand.amount was capped at le=1000, so on a 1080-tall
viewport the denormalized 1080 hit Pydantic validation — the resulting
error message ("Input should be <= 1000 [input_value=1080]") leaked the
CSS-pixel number back into the agent's observation, breaking the
"agent only ever sees [0, 1000]" contract.
Raise the wire-type cap to 20000 (way beyond any real viewport) so the
denormalization path can never overflow into a validation error. The
field is documented as internal — agents talk to mouse.scroll which
stays in [0, 1000].
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Scrolling at the bottom of a page, at an edge in the unscrollable
direction, or on a non-scrollable region dispatches the wheel event
successfully but doesn't move the viewport. The previous observation
said "Scrolled down by N" regardless — the agent saw an apparent
success and assumed progress had been made, often looping its scroll
without checking.
- performMouseScroll now snapshots scrollX/Y before the wheel event
and re-reads after waitForScrollSettle returns. If the position
didn't change, it probes documentElement scrollHeight/innerHeight
to label the edge ("already at the top/bottom/left/right of the
page") so the agent gets a specific hint, not a generic stall.
- Server scroll dispatch reads detail.moved; on false it surfaces a
warning ("Scroll N had no effect — <reason>. Try a different
region, the opposite direction, or a different navigation.")
- Missing `moved` field is treated as success so older builds during
a rolling upgrade don't false-warn.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
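The server-side handling of `moved` can be sketched as below. The function name `scroll_observation`, the dict-shaped `detail`, and the `reason` key are assumptions, and the warning text is paraphrased rather than copied from the implementation.

```python
def scroll_observation(amount, direction, detail):
    """Build the agent-facing message from the extension's scroll result.

    detail is a dict; its exact payload shape is an assumption in this sketch.
    """
    # A missing "moved" field means an older build: treat it as success
    # so a rolling upgrade does not produce false warnings.
    moved = detail.get("moved", True)
    if moved:
        return f"Scrolled {direction} by {amount}"
    reason = detail.get("reason", "the page did not move")
    return (f"Scroll {amount} had no effect: {reason}. Try a different region, "
            "the opposite direction, or a different navigation.")
```

Defaulting `moved` to `True` is what keeps older extension builds from triggering the warning.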
…etection
Close the design gap on empty-space clicks: the agent now both reads a "no DOM change" warning with nearby-element coordinates AND sees the orange-dashed candidate outlines in the screenshot it observes.
- _draw_no_op_overlay now returns the post-overlay screenshot URL. _overlay_screenshot_into_result swaps it into result_dict["data"]["screenshot"], so the agent's observation image shows the highlighted candidates instead of the pre-overlay snapshot.
- Tighten the MutationObserver in performPixelClick to skip idempotent attribute mutations (oldValue === newValue), which a doc-level click handler doing `el.style.display = 'none'` on every click would otherwise spam.
- Tighten the selection-change probe: count "selection changed" only when a range is selected or the caret landed in a real editable context (input/textarea/contenteditable). A click on body whitespace resolves the caret to some text node but is not an agent-meaningful state change.
- Surface mutations/active_changed/scroll_changed/selection_changed from performPixelClick alongside triggered_anything; useful for future debugging without re-instrumenting.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
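The idempotent-mutation filter can be illustrated with a simplified Python sketch. The real check runs inside the extension's MutationObserver callback; the record shape used here (dicts with `type`, `old_value`, `new_value`) is an assumption made so the idea fits in a few lines.

```python
def meaningful_mutations(records):
    """Drop attribute mutations whose value did not actually change.

    Each record is a dict with a "type" key and, for attribute records,
    "old_value" and "new_value" keys (an illustrative shape only).
    """
    kept = []
    for rec in records:
        if rec["type"] == "attributes" and rec.get("old_value") == rec.get("new_value"):
            continue  # idempotent write, e.g. display set to 'none' again
        kept.append(rec)
    return kept
```

With this filter, a click counts as having triggered something only if at least one record survives.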
Drop dashscope/qwen3.5-plus from evaluation_report.json — the plus slot was contaminated by a DashScope hourly-quota exhaustion mid-run (20 of 35 second-alias tests died with LLMRateLimitError). Keeping only the two flash models gives a clean comparison: qwen3.5-flash 82.9% (29/35), qwen3.6-flash 91.4% (32/35), aggregate 87.1% (61/70). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
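The reported percentages follow directly from the pass counts; a short check, using only the figures quoted above (the helper name `pass_rate` is hypothetical):

```python
def pass_rate(passed, total):
    """Return the pass rate as a percentage string with one decimal place."""
    return f"{passed / total * 100:.1f}%"


# Pass counts quoted in the commit message: (passed, total) per model.
results = {"qwen3.5-flash": (29, 35), "qwen3.6-flash": (32, 35)}

for model, (passed, total) in results.items():
    print(model, pass_rate(passed, total))

total_passed = sum(passed for passed, _ in results.values())
total_runs = sum(total for _, total in results.values())
print("aggregate", pass_rate(total_passed, total_runs))
```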
Pre-commit auto-formatted spillover from the scroll/click/keyboard fixes in this stack. No behavior changes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- `--tests` subset support for the runner; frontend now surfaces `reasoning_content`/`thinking_blocks` on event cards.
- `eval/evaluation_report.json` with a fresh full-run across both flash models (qwen3.5-flash 82.9%, qwen3.6-flash 91.4%, aggregate 87.1% on 70 runs). Plus-side was dropped from this run because both `qwen35plus` and `qwen36plus` aliases resolved to the same `dashscope/qwen3.5-plus` and shared one DashScope quota bucket; the second alias run died on `LLMRateLimitError` (20 of 35 tests) mid-flow.
Test plan
- `uv run python eval/evaluate_browser_agent.py --tests bluebook_simple --model-alias qwen36flash` to smoke-test the runner + the report format.
- `eval/evaluation_report.json` parses and renders in any dashboard / consumer that reads it.
🤖 Generated with Claude Code