feat(agent): pixel-level pixel-paradigm with coordinate:[x,y] schema by softpudding · Pull Request #68 · softpudding/OpenBrowser

softpudding · 2026-05-07T04:50:47Z

Summary

Pixel-level browser interaction (virtual cursor, pixel-click density gate, gated preview-confirm overlay) plus the coordinate: [x, y] schema refactor that reduces tool-call validation errors by 86% (255 → 36 events on 140-run benchmark).

New coordinate/start_coordinate/end_coordinate array fields replace x,y,x2,y2 scalars on MouseAction. Pattern E (x: [a, b] array form, 110 events on the prior eval) is now ~0.
action defaults to "move" so a bare { "coordinate": [x, y] } slides the cursor; eliminates pattern C (missing-action).
Action visualization renders only fields the agent actually set — no more text:null, key:null, modifiers:[], kind:KeyboardAction clutter in observations / DB / condense paths.
keyboard clear rewritten as JS-based reset on document.activeElement (input/textarea/contenteditable/role=textbox); prior Ctrl+A + Backspace silently no-opped on macOS.
Click no-op probe extended beyond DOM mutations to also check document.activeElement, scroll position, and Selection — stops false "no DOM change" warnings on focus-only clicks into inputs.
Pixel pipeline: virtual-cursor sprite, pixel-click density gate with descriptor-rich nearby-candidate overlays, no-op-click warning system, gated preview-confirm cycle for dense neighborhoods.

Eval results (140 runs, 35 tests × 4 -fast models, 2026-05-07_092809)

Model	Main pass	This branch	Δ
qwen3.5-plus	85.7% (30/35)	88.6% (31/35)	+2.9pp
qwen3.6-plus	74.3% (26/35)	74.3% (26/35)	0.0pp*
qwen3.5-flash	60.0% (21/35)	65.7% (23/35)	+5.7pp
qwen3.6-flash	80.0% (28/35)	85.7% (30/35)	+5.7pp
Aggregate	75.0% (105/140)	78.6% (110/140)	+3.6pp

*3.6-plus pass rate is flat but task score is +11.0 (more partial credit on harder tests). One DashScope infra failure on gmail_inbox_cleanup; once re-run, expected pass rate 77.1% (27/35).

Top per-test wins vs main: staybnb_book +13.0 (3.6-plus), github_issue_triage_deep +8.5 (3.5-plus), gmail_vendor_escalation +8.2 (3.6-plus), drive_project_reorg +7.5 (3.6-plus). 27 improvements ≥+1.0; 15 hard regressions (mostly Theme T2 preview-confirm time-budget on dense vertical UIs like Gmail inbox).

Pairs with agent-sdk commit 1ac8fff4 (coordinate-as-array prompts).

Test plan

Full benchmark run: python eval/evaluate_browser_agent.py across 4 -fast aliases — 110/140 pass
Schema parse smoke tests: coordinate=[x,y], in-place click, drag with start_coordinate/end_coordinate, reject [1]/[1,2,3]/[-5,100]/[1,1500]
model_fields_set rendering: only agent-emitted fields appear in Action.visualize
Extension build clean: npm run dev succeeds, dev-reload pushed to Chrome
keyboard_clear smoke-tested via Pydantic + extension command wiring
Reviewer to re-run on a clean Chrome profile if any test variance is suspected

🤖 Generated with Claude Code

Replace the highlight + element_id paradigm with a human-like pixel control loop: the agent sees a clean screenshot with a visible cursor sprite and drives the page via a virtual mouse and keyboard. Same toolset for fresh tasks and routine replay; the legacy highlight / element-interaction modules stay on disk for non-agent flows but are no longer exposed. Tools (live agent surface): tab, mouse, keyboard, dialog. - mouse: move (eased lerp), click (in-place — must move first), drag, scroll, reset. Coordinates in Qwen-VL [0,1000] normalized space; the server denormalizes to CSS pixels via the captured viewport. - keyboard: type (one char at a time, real keydown/keypress/input events for ASCII printables, fallback insertText for CJK/emoji), press (named keys + modifiers; Enter/Tab/Space carry text so keypress fires and form-submit works), clear (Ctrl+A → Backspace). - tab: clean screenshots with the cursor in-frame on every action; refresh/view/back/forward auto-fill the active tab_id. Cursor sprite is a 36x36 white-and-black arrow with a red dot and pulsing red ring at the click point, injected via preCaptureScript so it lands in the captured frame even after navigation. Schema: extend MouseClickCommand with optional x/y (now ignored), add MouseDragCommand, drop le=1280/le=720 bounds on MouseMoveCommand, add live_mode flag to BaseCommand for extension routing. Pixel commands finally reach the wire — added MouseDragCommand routing in CommandProcessor.execute and case handlers in the extension switch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eaks Rewrite the mouse and keyboard tool prompts and Field descriptions to describe only the canonical use, instead of warning the model away from shapes that the executor already silently handles. - mouse: drop "click does not take coordinates" / "MUST NOT supply x/y" language. Field descriptions now state the affirmative use only. Examples already only show the in-place form. - keyboard: lead with "focus an input first" as a positive rule, drop the per-character / "convenience wrapper" implementation notes. Add a small_model `type into a field` pattern that pairs with mouse. - mouse scroll (small_model): document that the wheel hits the container under the cursor — `move` first to scroll an inner panel. - mouse_tool.py: revert the validator that rejected x/y on click; the executor already drops them silently. The schema reflects only what the agent should learn to send. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ompts) Picks up the rewrite of system_prompt_{large,small}.j2 in softpudding/agent-sdk@37227545 — drops element_id/highlight/replay language and aligns the top-level system prompts with this repo's mouse + keyboard tools. uv.lock regenerated by `uv sync`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…file picker Two pixel-paradigm bug fixes for the live agent. 1. Mouse/keyboard CDP events were sluggish because Chrome throttles the agent's background tab while the user works in their own. After `chrome.debugger.attach`, send `Emulation.setFocusEmulationEnabled` and `Page.setWebLifecycleState({state: 'active'})` so the renderer runs at foreground priority — CDP-only, never changes OS-level tab focus, never interrupts the user. Re-asserted on every attach call so SPA navigation doesn't drop the flags. 2. Native `<select>` dropdowns and OS file pickers don't render into CDP screenshots, so a click on one left the agent blind. Now `mouse_click` hit-tests the cursor (walking through `<label>.control` for styled file inputs); on a hit, the native click is suppressed, the element is tagged with `data-ob-pending-form-target`, and the click observation surfaces the option list (for `<select>`) or input metadata (for `<input type=file>`). The agent then completes the action via two new tools: `select_option(values)` matches by `value` → exact label → case-insensitive substring; `upload_file(paths)` uses `DOM.setFileInputFiles`. Rich error messages on `option_not_found` / `no_pending_*` so the agent can self-correct. System prompts (large + small) updated in agent-sdk to introduce the new tools and a NATIVE_FORM_CONTROLS note; mirrored into the venv copy used by the running server. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ile prompts) Picks up the system-prompt updates that introduce the two new tools and the NATIVE_FORM_CONTROLS guidance for native <select> / file pickers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…didates Before dispatching a pixel-paradigm `mouse(click)` or `mouse(drag)`, the executor probes the click point for the hit element and any interactables within a 30-CSS-px edge-distance window. When two or more interactables sit near the click, the action enters a confirmation stage: the extension renders a zoomed crop with a YELLOW box on the hit control plus orange dashed outlines on nearby candidates within 140 CSS px center-distance. The agent commits via `mouse(action="confirm")` or re-aims using the candidate centers (Qwen [0,1000] space) listed in the message. Implementation: - New extension command `analyze_pixel_targets` reuses the highlight detection engine to compute hit + neighborhood + verdict. - New extension command `render_pixel_confirm` produces the zoomed crop with bright orange dashed outlines on neighbors. - `BrowserExecutor` tracks the virtual cursor on the server side so the in-place pixel `click` knows where to gate. `mouse_click_pixel` / `mouse_drag_pixel` pendings reuse the existing 2PC machinery. - Non-confirm actions taken while a pixel pending is set count as an implicit rejection: the rejected candidates are appended to the new observation's message so course correction has them fresh. - Candidate rendering reuses `_format_highlighted_element_lines` so the agent reads the same descriptor-first phrasing the element paradigm produced; bare `<tag>` fallback surfaces a truncated HTML snippet. - `mouse_move` / `reset_mouse` post-action capture waits 150 ms so the cursor sprite's CSS transition completes before the screenshot. - Tool prompts (`mouse_tool.j2`) describe the confirmation in the affirmative voice; system prompt update ships via the bumped agent-sdk pin. Bumps agent-sdk pin to 2ea1956a so the system prompt teaches the agent about the pixel-click preview shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Picks up the rewritten system_prompt_large/small that drop highlight, element_interaction, element_id, BLUE/YELLOW staging, and the corner-badge label rule. The runtime hasn't exposed those tools for a while, so the prompt residue was the last source of `highlight`-tool hallucinations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…observation - mouse `click` now accepts optional `x, y`. With both supplied the cursor moves there and clicks in one step; the gate runs against the new cursor. Without coords it still clicks at the cursor's current position for hover-then-click flows. Half-supplied coords are rejected so the agent re-emits cleanly. - The dense-neighborhood gate now paints its overlay on the live page DOM in addition to the screenshot crop: a single absolute-positioned div per box (yellow solid for the hit, orange dashed for nearby candidates) lives inside one overlay container, so a human watching the browser sees what the agent sees and removing the container cleans everything up. Drawing as floating divs (not via `outline` on the element itself) avoids stacking with the page's own focus or hover ring, fixing the double-yellow rectangle on Xiaohongshu's favorite button. - The new `clear_pixel_overlay` command fires before any non-confirm mouse action and at the top of `_commit_pending_pixel_action` so the overlay disappears whether the agent confirms or re-aims, and the post-commit screenshot is unhighlighted. - Tool/observation prompts updated for the new shapes: `click {x, y}` shown as the canonical form; gated-preview observation is one line ("Click previewed on the yellow target. Confirm to commit, or re-emit `click` ...") with the verification step taught in the system prompt's gated-preview block instead of an in-page banner. - Non-gated click message includes the actual click point in CSS pixels. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… targets Arms a MutationObserver in `performMouseClick` before dispatch, reads the count ~250 ms after the click, and returns `triggered_anything: false` when no non-cursor mutations were recorded — i.e. the click did nothing the page reacted to. The observer filters out the cursor sprite (`#__ob_cursor__`) and the pixel-confirm overlay so neither their post-click refreshes nor their teardown counts as page activity. DOM mutations cover CSS-class flips, aria toggles, child appends, and text changes, so a like-heart toggle that only swaps an SVG color still registers as a real click. When the server sees `triggered_anything: false`, the click observation gets a one-line warning plus the same nearby-interactable candidate block the dense-gate preview uses (descriptor + center in [0,1000] space), so the agent has somewhere concrete to re-aim instead of looping on the same dead pixel. Validated on the Gmail Finance Follow-up eval test that previously hit the 660 s wall-clock cap stuck in a click-Escape-click loop on the reply-draft modal: with the warning in place the agent re-aims on each no-op, finishes the workflow in 258 s, and scores 8.0/8.0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rlays extension-side Two follow-ups to the no-op-click warning: 1. **Visualize candidates on the live page.** When `triggered_anything: false` comes back from the click probe, the server now also fires `render_pixel_confirm` (mode `pixel_miss`, no banner) so the same orange-dashed boxes the agent reads as text in its observation are painted onto the live page DOM. A human watching the browser sees exactly what alternatives the agent is being told to consider. Serialized candidates gained a `bbox_css` field so the confirm path can draw the overlay without re-running the gate. 2. **Tear overlays down extension-side.** Cleanup moved from the server (which only fired before mouse actions) into the extension's pixel-action dispatcher, so every incoming agent action — mouse, keyboard, select, upload — clears any leftover overlay before running. `_execute_mouse_action` and `_commit_pending_pixel_action` no longer send their own `clear_pixel_overlay` (the local cleanup covers both gated-preview and no-op-warning overlays uniformly). Also fix a leak in the Gmail eval mock: the "Create a new label" input had `placeholder="Finance/Board-Prep"` — the exact label the test asks the agent to create — and no `autocomplete="off"`, so Chrome also surfaced the value as a history suggestion. Replaced the placeholder with a generic `e.g. Marketing/Q4-Launch` and disabled autocomplete. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hat doesn't false-alarm on focus `keyboard clear` previously sent Ctrl+A + Backspace via two CDP key events. On macOS Cmd is the select-all modifier; Ctrl+A no-ops, the Backspace deletes one character, and the SUCCESS message lies. Replace with a dedicated `keyboard_clear` command that runs JS against `document.activeElement` (input/textarea value setter, contenteditable textContent, role=textbox) and dispatches input+change so framework- controlled widgets observe the reset. Reports SUCCESS only when the field actually ended up empty; otherwise surfaces the focused element descriptor so the agent can recover. The post-click no-op probe only counted DOM mutations, so clicking into an input fired a false "no DOM change" warning — :focus is a CSS state that updates `document.activeElement` without mutating the tree. Extend the probe to also capture activeElement, scrollY, and Selection state at arm time and diff at read time. Trivial body/null transitions are filtered so initial-load focus shifts don't false-positive in the other direction. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…er only set fields The previous prompt taught coordinates with `{x, y}` set notation and `(x, y)` tuple notation. Plus-tier models read those literally and emitted `{"x": [519, 669], "y": [519, 669]}` arrays — 110 events across 50 conversations on the last full-eval, the biggest single error cluster. Replace x/y/x2/y2 scalar fields with `coordinate`, `start_coordinate`, `end_coordinate` arrays so the agent's natural representation matches the schema instead of fighting it. Make `action` default to "move" so a bare `{"coordinate":[x,y]}` works when flash models forget the action key (14 events, 100% flash on the prior eval). Tighten the prompt examples so every snippet is a complete JSON envelope `{ "action": "...", "coordinate": [...] }` with no ambient prose like `mouse click {x, y}` that small models picked up as a tool-name pattern (`Tool 'click' not found` — 65 events, 100% flash). Document right-click and double-click explicitly so agents stop inventing `"action": "right_click"` / `"double_click"` literals. Override OpenBrowserAction.visualize so persisted ActionEvent text only contains fields the agent actually set (model_fields_set + action). Stops `text:null, key:null, modifiers:[], kind:KeyboardAction` and `start_coordinate:null, end_coordinate:null, button:left, count:1, direction:down, amount:300, steps:10` from polluting the rendered arguments shown to humans, the compiler agent, and any condense path. Mirrors mouse-tool prompt edits in the agent-sdk system prompts (system_prompt_{large,small}.j2) and the venv copy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…pts) Picks up agent-sdk commit that drops `{x,y}`/`(x,y)` set/tuple notation from the system prompts in favor of explicit `coordinate: [x, y]` arrays and drag's `start_coordinate`/`end_coordinate`. Pairs with the in-repo MouseAction schema rewrite (4ce0f41). Refresh evaluation_report.json with the post-fix run (20260507_092809): 255 → 36 tool-call validation errors (86% reduction); pattern E `x:[a,b]` 110 → 1 events (99%). Aggregate pass rate 75.0% → 78.6% vs main; net +35.7 task-score across the four `*-fast` aliases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI surfaced two issues: 1. Pre-commit (black + extension prettier) wanted to reformat several files touched by recent commits. Apply the formatter pass. 2. Three unit tests still asserted the pre-pixel-paradigm contract (`['tab', 'highlight', 'element_interaction', ...]` toolset; tab prompt mentioning `default highlight element_type:"any" page 1`). Update them to the current pixel-paradigm shape: `['tab', 'mouse', 'keyboard', 'dialog', 'select_option', 'upload_file', ...]` toolset and the new "virtual cursor visible" contract from the rewritten tab prompt. 499 pytest pass / 4 skipped locally. No production code changed beyond the formatter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

softpudding and others added 14 commits April 29, 2026 20:03

softpudding merged commit be449e6 into main May 7, 2026
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(agent): pixel-level pixel-paradigm with coordinate:[x,y] schema#68

feat(agent): pixel-level pixel-paradigm with coordinate:[x,y] schema#68
softpudding merged 14 commits into
mainfrom
feat/exp-pixel-control

softpudding commented May 7, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

softpudding commented May 7, 2026

Summary

Eval results (140 runs, 35 tests × 4 -fast models, 2026-05-07_092809)

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant