feat(agent): pixel-level pixel-paradigm with coordinate:[x,y] schema#68
Merged
Conversation
Replace the highlight + element_id paradigm with a human-like pixel control loop: the agent sees a clean screenshot with a visible cursor sprite and drives the page via a virtual mouse and keyboard. Same toolset for fresh tasks and routine replay; the legacy highlight / element-interaction modules stay on disk for non-agent flows but are no longer exposed. Tools (live agent surface): tab, mouse, keyboard, dialog. - mouse: move (eased lerp), click (in-place — must move first), drag, scroll, reset. Coordinates in Qwen-VL [0,1000] normalized space; the server denormalizes to CSS pixels via the captured viewport. - keyboard: type (one char at a time, real keydown/keypress/input events for ASCII printables, fallback insertText for CJK/emoji), press (named keys + modifiers; Enter/Tab/Space carry text so keypress fires and form-submit works), clear (Ctrl+A → Backspace). - tab: clean screenshots with the cursor in-frame on every action; refresh/view/back/forward auto-fill the active tab_id. Cursor sprite is a 36x36 white-and-black arrow with a red dot and pulsing red ring at the click point, injected via preCaptureScript so it lands in the captured frame even after navigation. Schema: extend MouseClickCommand with optional x/y (now ignored), add MouseDragCommand, drop le=1280/le=720 bounds on MouseMoveCommand, add live_mode flag to BaseCommand for extension routing. Pixel commands finally reach the wire — added MouseDragCommand routing in CommandProcessor.execute and case handlers in the extension switch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eaks Rewrite the mouse and keyboard tool prompts and Field descriptions to describe only the canonical use, instead of warning the model away from shapes that the executor already silently handles. - mouse: drop "click does not take coordinates" / "MUST NOT supply x/y" language. Field descriptions now state the affirmative use only. Examples already only show the in-place form. - keyboard: lead with "focus an input first" as a positive rule, drop the per-character / "convenience wrapper" implementation notes. Add a small_model `type into a field` pattern that pairs with mouse. - mouse scroll (small_model): document that the wheel hits the container under the cursor — `move` first to scroll an inner panel. - mouse_tool.py: revert the validator that rejected x/y on click; the executor already drops them silently. The schema reflects only what the agent should learn to send. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ompts)
Picks up the rewrite of system_prompt_{large,small}.j2 in
softpudding/agent-sdk@37227545 — drops element_id/highlight/replay
language and aligns the top-level system prompts with this repo's
mouse + keyboard tools.
uv.lock regenerated by `uv sync`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…file picker
Two pixel-paradigm bug fixes for the live agent.
1. Mouse/keyboard CDP events were sluggish because Chrome throttles the
agent's background tab while the user works in their own. After
`chrome.debugger.attach`, send `Emulation.setFocusEmulationEnabled` and
`Page.setWebLifecycleState({state: 'active'})` so the renderer runs at
foreground priority — CDP-only, never changes OS-level tab focus, never
interrupts the user. Re-asserted on every attach call so SPA navigation
doesn't drop the flags.
2. Native `<select>` dropdowns and OS file pickers don't render into CDP
screenshots, so a click on one left the agent blind. Now `mouse_click`
hit-tests the cursor (walking through `<label>.control` for styled
file inputs); on a hit, the native click is suppressed, the element is
tagged with `data-ob-pending-form-target`, and the click observation
surfaces the option list (for `<select>`) or input metadata (for
`<input type=file>`). The agent then completes the action via two new
tools: `select_option(values)` matches by `value` → exact label →
case-insensitive substring; `upload_file(paths)` uses
`DOM.setFileInputFiles`. Rich error messages on `option_not_found` /
`no_pending_*` so the agent can self-correct.
System prompts (large + small) updated in agent-sdk to introduce the new
tools and a NATIVE_FORM_CONTROLS note; mirrored into the venv copy used
by the running server.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ile prompts) Picks up the system-prompt updates that introduce the two new tools and the NATIVE_FORM_CONTROLS guidance for native <select> / file pickers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…didates Before dispatching a pixel-paradigm `mouse(click)` or `mouse(drag)`, the executor probes the click point for the hit element and any interactables within a 30-CSS-px edge-distance window. When two or more interactables sit near the click, the action enters a confirmation stage: the extension renders a zoomed crop with a YELLOW box on the hit control plus orange dashed outlines on nearby candidates within 140 CSS px center-distance. The agent commits via `mouse(action="confirm")` or re-aims using the candidate centers (Qwen [0,1000] space) listed in the message. Implementation: - New extension command `analyze_pixel_targets` reuses the highlight detection engine to compute hit + neighborhood + verdict. - New extension command `render_pixel_confirm` produces the zoomed crop with bright orange dashed outlines on neighbors. - `BrowserExecutor` tracks the virtual cursor on the server side so the in-place pixel `click` knows where to gate. `mouse_click_pixel` / `mouse_drag_pixel` pendings reuse the existing 2PC machinery. - Non-confirm actions taken while a pixel pending is set count as an implicit rejection: the rejected candidates are appended to the new observation's message so course correction has them fresh. - Candidate rendering reuses `_format_highlighted_element_lines` so the agent reads the same descriptor-first phrasing the element paradigm produced; bare `<tag>` fallback surfaces a truncated HTML snippet. - `mouse_move` / `reset_mouse` post-action capture waits 150 ms so the cursor sprite's CSS transition completes before the screenshot. - Tool prompts (`mouse_tool.j2`) describe the confirmation in the affirmative voice; system prompt update ships via the bumped agent-sdk pin. Bumps agent-sdk pin to 2ea1956a so the system prompt teaches the agent about the pixel-click preview shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the rewritten system_prompt_large/small that drop highlight, element_interaction, element_id, BLUE/YELLOW staging, and the corner-badge label rule. The runtime hasn't exposed those tools for a while, so the prompt residue was the last source of `highlight`-tool hallucinations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…observation
- mouse `click` now accepts optional `x, y`. With both supplied the
cursor moves there and clicks in one step; the gate runs against the
new cursor. Without coords it still clicks at the cursor's current
position for hover-then-click flows. Half-supplied coords are
rejected so the agent re-emits cleanly.
- The dense-neighborhood gate now paints its overlay on the live page
DOM in addition to the screenshot crop: a single absolute-positioned
div per box (yellow solid for the hit, orange dashed for nearby
candidates) lives inside one overlay container, so a human watching
the browser sees what the agent sees and removing the container
cleans everything up. Drawing as floating divs (not via `outline`
on the element itself) avoids stacking with the page's own focus or
hover ring, fixing the double-yellow rectangle on Xiaohongshu's
favorite button.
- The new `clear_pixel_overlay` command fires before any non-confirm
mouse action and at the top of `_commit_pending_pixel_action` so
the overlay disappears whether the agent confirms or re-aims, and
the post-commit screenshot is unhighlighted.
- Tool/observation prompts updated for the new shapes: `click {x, y}`
shown as the canonical form; gated-preview observation is one line
("Click previewed on the yellow target. Confirm to commit, or
re-emit `click` ...") with the verification step taught in the
system prompt's gated-preview block instead of an in-page banner.
- Non-gated click message includes the actual click point in CSS
pixels.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… targets Arms a MutationObserver in `performMouseClick` before dispatch, reads the count ~250 ms after the click, and returns `triggered_anything: false` when no non-cursor mutations were recorded — i.e. the click did nothing the page reacted to. The observer filters out the cursor sprite (`#__ob_cursor__`) and the pixel-confirm overlay so neither their post-click refreshes nor their teardown counts as page activity. DOM mutations cover CSS-class flips, aria toggles, child appends, and text changes, so a like-heart toggle that only swaps an SVG color still registers as a real click. When the server sees `triggered_anything: false`, the click observation gets a one-line warning plus the same nearby-interactable candidate block the dense-gate preview uses (descriptor + center in [0,1000] space), so the agent has somewhere concrete to re-aim instead of looping on the same dead pixel. Validated on the Gmail Finance Follow-up eval test that previously hit the 660 s wall-clock cap stuck in a click-Escape-click loop on the reply-draft modal: with the warning in place the agent re-aims on each no-op, finishes the workflow in 258 s, and scores 8.0/8.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rlays extension-side Two follow-ups to the no-op-click warning: 1. **Visualize candidates on the live page.** When `triggered_anything: false` comes back from the click probe, the server now also fires `render_pixel_confirm` (mode `pixel_miss`, no banner) so the same orange-dashed boxes the agent reads as text in its observation are painted onto the live page DOM. A human watching the browser sees exactly what alternatives the agent is being told to consider. Serialized candidates gained a `bbox_css` field so the confirm path can draw the overlay without re-running the gate. 2. **Tear overlays down extension-side.** Cleanup moved from the server (which only fired before mouse actions) into the extension's pixel-action dispatcher, so every incoming agent action — mouse, keyboard, select, upload — clears any leftover overlay before running. `_execute_mouse_action` and `_commit_pending_pixel_action` no longer send their own `clear_pixel_overlay` (the local cleanup covers both gated-preview and no-op-warning overlays uniformly). Also fix a leak in the Gmail eval mock: the "Create a new label" input had `placeholder="Finance/Board-Prep"` — the exact label the test asks the agent to create — and no `autocomplete="off"`, so Chrome also surfaced the value as a history suggestion. Replaced the placeholder with a generic `e.g. Marketing/Q4-Launch` and disabled autocomplete. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hat doesn't false-alarm on focus `keyboard clear` previously sent Ctrl+A + Backspace via two CDP key events. On macOS Cmd is the select-all modifier; Ctrl+A no-ops, the Backspace deletes one character, and the SUCCESS message lies. Replace with a dedicated `keyboard_clear` command that runs JS against `document.activeElement` (input/textarea value setter, contenteditable textContent, role=textbox) and dispatches input+change so framework- controlled widgets observe the reset. Reports SUCCESS only when the field actually ended up empty; otherwise surfaces the focused element descriptor so the agent can recover. The post-click no-op probe only counted DOM mutations, so clicking into an input fired a false "no DOM change" warning — :focus is a CSS state that updates `document.activeElement` without mutating the tree. Extend the probe to also capture activeElement, scrollY, and Selection state at arm time and diff at read time. Trivial body/null transitions are filtered so initial-load focus shifts don't false-positive in the other direction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er only set fields
The previous prompt taught coordinates with `{x, y}` set notation and
`(x, y)` tuple notation. Plus-tier models read those literally and
emitted `{"x": [519, 669], "y": [519, 669]}` arrays — 110 events across
50 conversations on the last full-eval, the biggest single error
cluster. Replace x/y/x2/y2 scalar fields with `coordinate`,
`start_coordinate`, `end_coordinate` arrays so the agent's natural
representation matches the schema instead of fighting it.
Make `action` default to "move" so a bare `{"coordinate":[x,y]}` works
when flash models forget the action key (14 events, 100% flash on the
prior eval). Tighten the prompt examples so every snippet is a complete
JSON envelope `{ "action": "...", "coordinate": [...] }` with no
ambient prose like `mouse click {x, y}` that small models picked up as
a tool-name pattern (`Tool 'click' not found` — 65 events, 100% flash).
Document right-click and double-click explicitly so agents stop
inventing `"action": "right_click"` / `"double_click"` literals.
Override OpenBrowserAction.visualize so persisted ActionEvent text only
contains fields the agent actually set (model_fields_set + action).
Stops `text:null, key:null, modifiers:[], kind:KeyboardAction` and
`start_coordinate:null, end_coordinate:null, button:left, count:1,
direction:down, amount:300, steps:10` from polluting the rendered
arguments shown to humans, the compiler agent, and any condense path.
Mirrors mouse-tool prompt edits in the agent-sdk system prompts
(system_prompt_{large,small}.j2) and the venv copy.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pts)
Picks up agent-sdk commit that drops `{x,y}`/`(x,y)` set/tuple notation
from the system prompts in favor of explicit `coordinate: [x, y]` arrays
and drag's `start_coordinate`/`end_coordinate`. Pairs with the in-repo
MouseAction schema rewrite (4ce0f41).
Refresh evaluation_report.json with the post-fix run (20260507_092809):
255 → 36 tool-call validation errors (86% reduction); pattern E
`x:[a,b]` 110 → 1 events (99%). Aggregate pass rate 75.0% → 78.6% vs
main; net +35.7 task-score across the four `*-fast` aliases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI surfaced two issues: 1. Pre-commit (black + extension prettier) wanted to reformat several files touched by recent commits. Apply the formatter pass. 2. Three unit tests still asserted the pre-pixel-paradigm contract (`['tab', 'highlight', 'element_interaction', ...]` toolset; tab prompt mentioning `default highlight element_type:"any" page 1`). Update them to the current pixel-paradigm shape: `['tab', 'mouse', 'keyboard', 'dialog', 'select_option', 'upload_file', ...]` toolset and the new "virtual cursor visible" contract from the rewritten tab prompt. 499 pytest pass / 4 skipped locally. No production code changed beyond the formatter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Pixel-level browser interaction (virtual cursor, pixel-click density gate, gated preview-confirm overlay) plus the
coordinate: [x, y]schema refactor that reduces tool-call validation errors by 86% (255 → 36 events on 140-run benchmark).coordinate/start_coordinate/end_coordinatearray fields replacex,y,x2,y2scalars onMouseAction. Pattern E (x: [a, b]array form, 110 events on the prior eval) is now ~0.actiondefaults to"move"so a bare{ "coordinate": [x, y] }slides the cursor; eliminates pattern C (missing-action).text:null, key:null, modifiers:[], kind:KeyboardActionclutter in observations / DB / condense paths.keyboard clearrewritten as JS-based reset ondocument.activeElement(input/textarea/contenteditable/role=textbox); prior Ctrl+A + Backspace silently no-opped on macOS.document.activeElement, scroll position, and Selection — stops false "no DOM change" warnings on focus-only clicks into inputs.Eval results (140 runs, 35 tests × 4 -fast models, 2026-05-07_092809)
*3.6-plus pass rate is flat but task score is +11.0 (more partial credit on harder tests). One DashScope infra failure on gmail_inbox_cleanup; once re-run, expected pass rate 77.1% (27/35).
Top per-test wins vs main: staybnb_book +13.0 (3.6-plus), github_issue_triage_deep +8.5 (3.5-plus), gmail_vendor_escalation +8.2 (3.6-plus), drive_project_reorg +7.5 (3.6-plus). 27 improvements ≥+1.0; 15 hard regressions (mostly Theme T2 preview-confirm time-budget on dense vertical UIs like Gmail inbox).
Pairs with agent-sdk commit
1ac8fff4(coordinate-as-array prompts).Test plan
python eval/evaluate_browser_agent.pyacross 4-fastaliases — 110/140 passcoordinate=[x,y], in-place click, drag withstart_coordinate/end_coordinate, reject[1]/[1,2,3]/[-5,100]/[1,1500]model_fields_setrendering: only agent-emitted fields appear inAction.visualizenpm run devsucceeds, dev-reload pushed to Chromekeyboard_clearsmoke-tested via Pydantic + extension command wiring🤖 Generated with Claude Code