Skip to content

feat(agent): pixel-level pixel-paradigm with coordinate:[x,y] schema#68

Merged
softpudding merged 14 commits into
mainfrom
feat/exp-pixel-control
May 7, 2026
Merged

feat(agent): pixel-level pixel-paradigm with coordinate:[x,y] schema#68
softpudding merged 14 commits into
mainfrom
feat/exp-pixel-control

Conversation

@softpudding

Copy link
Copy Markdown
Owner

Summary

Pixel-level browser interaction (virtual cursor, pixel-click density gate, gated preview-confirm overlay) plus the coordinate: [x, y] schema refactor that reduces tool-call validation errors by 86% (255 → 36 events on 140-run benchmark).

  • New coordinate/start_coordinate/end_coordinate array fields replace x,y,x2,y2 scalars on MouseAction. Pattern E (x: [a, b] array form, 110 events on the prior eval) is now ~0.
  • action defaults to "move" so a bare { "coordinate": [x, y] } slides the cursor; eliminates pattern C (missing-action).
  • Action visualization renders only fields the agent actually set — no more text:null, key:null, modifiers:[], kind:KeyboardAction clutter in observations / DB / condense paths.
  • keyboard clear rewritten as JS-based reset on document.activeElement (input/textarea/contenteditable/role=textbox); prior Ctrl+A + Backspace silently no-opped on macOS.
  • Click no-op probe extended beyond DOM mutations to also check document.activeElement, scroll position, and Selection — stops false "no DOM change" warnings on focus-only clicks into inputs.
  • Pixel pipeline: virtual-cursor sprite, pixel-click density gate with descriptor-rich nearby-candidate overlays, no-op-click warning system, gated preview-confirm cycle for dense neighborhoods.

Eval results (140 runs, 35 tests × 4 -fast models, 2026-05-07_092809)

Model Main pass This branch Δ
qwen3.5-plus 85.7% (30/35) 88.6% (31/35) +2.9pp
qwen3.6-plus 74.3% (26/35) 74.3% (26/35) 0.0pp*
qwen3.5-flash 60.0% (21/35) 65.7% (23/35) +5.7pp
qwen3.6-flash 80.0% (28/35) 85.7% (30/35) +5.7pp
Aggregate 75.0% (105/140) 78.6% (110/140) +3.6pp

*3.6-plus pass rate is flat but task score is +11.0 (more partial credit on harder tests). One DashScope infra failure on gmail_inbox_cleanup; once re-run, expected pass rate 77.1% (27/35).

Top per-test wins vs main: staybnb_book +13.0 (3.6-plus), github_issue_triage_deep +8.5 (3.5-plus), gmail_vendor_escalation +8.2 (3.6-plus), drive_project_reorg +7.5 (3.6-plus). 27 improvements ≥+1.0; 15 hard regressions (mostly Theme T2 preview-confirm time-budget on dense vertical UIs like Gmail inbox).

Pairs with agent-sdk commit 1ac8fff4 (coordinate-as-array prompts).

Test plan

  • Full benchmark run: python eval/evaluate_browser_agent.py across 4 -fast aliases — 110/140 pass
  • Schema parse smoke tests: coordinate=[x,y], in-place click, drag with start_coordinate/end_coordinate, reject [1]/[1,2,3]/[-5,100]/[1,1500]
  • model_fields_set rendering: only agent-emitted fields appear in Action.visualize
  • Extension build clean: npm run dev succeeds, dev-reload pushed to Chrome
  • keyboard_clear smoke-tested via Pydantic + extension command wiring
  • Reviewer to re-run on a clean Chrome profile if any test variance is suspected

🤖 Generated with Claude Code

softpudding and others added 14 commits April 29, 2026 20:03
Replace the highlight + element_id paradigm with a human-like pixel
control loop: the agent sees a clean screenshot with a visible cursor
sprite and drives the page via a virtual mouse and keyboard. Same
toolset for fresh tasks and routine replay; the legacy highlight /
element-interaction modules stay on disk for non-agent flows but are
no longer exposed.

Tools (live agent surface): tab, mouse, keyboard, dialog.
- mouse: move (eased lerp), click (in-place — must move first), drag,
  scroll, reset. Coordinates in Qwen-VL [0,1000] normalized space; the
  server denormalizes to CSS pixels via the captured viewport.
- keyboard: type (one char at a time, real keydown/keypress/input
  events for ASCII printables, fallback insertText for CJK/emoji),
  press (named keys + modifiers; Enter/Tab/Space carry text so
  keypress fires and form-submit works), clear (Ctrl+A → Backspace).
- tab: clean screenshots with the cursor in-frame on every action;
  refresh/view/back/forward auto-fill the active tab_id.

Cursor sprite is a 36x36 white-and-black arrow with a red dot and
pulsing red ring at the click point, injected via preCaptureScript so
it lands in the captured frame even after navigation.

Schema: extend MouseClickCommand with optional x/y (now ignored), add
MouseDragCommand, drop le=1280/le=720 bounds on MouseMoveCommand, add
live_mode flag to BaseCommand for extension routing. Pixel commands
finally reach the wire — added MouseDragCommand routing in
CommandProcessor.execute and case handlers in the extension switch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eaks

Rewrite the mouse and keyboard tool prompts and Field descriptions to
describe only the canonical use, instead of warning the model away from
shapes that the executor already silently handles.

- mouse: drop "click does not take coordinates" / "MUST NOT supply x/y"
  language. Field descriptions now state the affirmative use only.
  Examples already only show the in-place form.
- keyboard: lead with "focus an input first" as a positive rule, drop
  the per-character / "convenience wrapper" implementation notes. Add a
  small_model `type into a field` pattern that pairs with mouse.
- mouse scroll (small_model): document that the wheel hits the
  container under the cursor — `move` first to scroll an inner panel.
- mouse_tool.py: revert the validator that rejected x/y on click; the
  executor already drops them silently. The schema reflects only what
  the agent should learn to send.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ompts)

Picks up the rewrite of system_prompt_{large,small}.j2 in
softpudding/agent-sdk@37227545 — drops element_id/highlight/replay
language and aligns the top-level system prompts with this repo's
mouse + keyboard tools.

uv.lock regenerated by `uv sync`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…file picker

Two pixel-paradigm bug fixes for the live agent.

1. Mouse/keyboard CDP events were sluggish because Chrome throttles the
   agent's background tab while the user works in their own. After
   `chrome.debugger.attach`, send `Emulation.setFocusEmulationEnabled` and
   `Page.setWebLifecycleState({state: 'active'})` so the renderer runs at
   foreground priority — CDP-only, never changes OS-level tab focus, never
   interrupts the user. Re-asserted on every attach call so SPA navigation
   doesn't drop the flags.

2. Native `<select>` dropdowns and OS file pickers don't render into CDP
   screenshots, so a click on one left the agent blind. Now `mouse_click`
   hit-tests the cursor (walking through `<label>.control` for styled
   file inputs); on a hit, the native click is suppressed, the element is
   tagged with `data-ob-pending-form-target`, and the click observation
   surfaces the option list (for `<select>`) or input metadata (for
   `<input type=file>`). The agent then completes the action via two new
   tools: `select_option(values)` matches by `value` → exact label →
   case-insensitive substring; `upload_file(paths)` uses
   `DOM.setFileInputFiles`. Rich error messages on `option_not_found` /
   `no_pending_*` so the agent can self-correct.

System prompts (large + small) updated in agent-sdk to introduce the new
tools and a NATIVE_FORM_CONTROLS note; mirrored into the venv copy used
by the running server.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ile prompts)

Picks up the system-prompt updates that introduce the two new tools and
the NATIVE_FORM_CONTROLS guidance for native <select> / file pickers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…didates

Before dispatching a pixel-paradigm `mouse(click)` or `mouse(drag)`, the
executor probes the click point for the hit element and any interactables
within a 30-CSS-px edge-distance window. When two or more interactables
sit near the click, the action enters a confirmation stage: the extension
renders a zoomed crop with a YELLOW box on the hit control plus orange
dashed outlines on nearby candidates within 140 CSS px center-distance.
The agent commits via `mouse(action="confirm")` or re-aims using the
candidate centers (Qwen [0,1000] space) listed in the message.

Implementation:
- New extension command `analyze_pixel_targets` reuses the highlight
  detection engine to compute hit + neighborhood + verdict.
- New extension command `render_pixel_confirm` produces the zoomed crop
  with bright orange dashed outlines on neighbors.
- `BrowserExecutor` tracks the virtual cursor on the server side so the
  in-place pixel `click` knows where to gate. `mouse_click_pixel` /
  `mouse_drag_pixel` pendings reuse the existing 2PC machinery.
- Non-confirm actions taken while a pixel pending is set count as an
  implicit rejection: the rejected candidates are appended to the new
  observation's message so course correction has them fresh.
- Candidate rendering reuses `_format_highlighted_element_lines` so the
  agent reads the same descriptor-first phrasing the element paradigm
  produced; bare `<tag>` fallback surfaces a truncated HTML snippet.
- `mouse_move` / `reset_mouse` post-action capture waits 150 ms so the
  cursor sprite's CSS transition completes before the screenshot.
- Tool prompts (`mouse_tool.j2`) describe the confirmation in the
  affirmative voice; system prompt update ships via the bumped
  agent-sdk pin.

Bumps agent-sdk pin to 2ea1956a so the system prompt teaches the agent
about the pixel-click preview shape.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up the rewritten system_prompt_large/small that drop highlight,
element_interaction, element_id, BLUE/YELLOW staging, and the
corner-badge label rule. The runtime hasn't exposed those tools for a
while, so the prompt residue was the last source of `highlight`-tool
hallucinations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…observation

- mouse `click` now accepts optional `x, y`. With both supplied the
  cursor moves there and clicks in one step; the gate runs against the
  new cursor. Without coords it still clicks at the cursor's current
  position for hover-then-click flows. Half-supplied coords are
  rejected so the agent re-emits cleanly.

- The dense-neighborhood gate now paints its overlay on the live page
  DOM in addition to the screenshot crop: a single absolute-positioned
  div per box (yellow solid for the hit, orange dashed for nearby
  candidates) lives inside one overlay container, so a human watching
  the browser sees what the agent sees and removing the container
  cleans everything up. Drawing as floating divs (not via `outline`
  on the element itself) avoids stacking with the page's own focus or
  hover ring, fixing the double-yellow rectangle on Xiaohongshu's
  favorite button.

- The new `clear_pixel_overlay` command fires before any non-confirm
  mouse action and at the top of `_commit_pending_pixel_action` so
  the overlay disappears whether the agent confirms or re-aims, and
  the post-commit screenshot is unhighlighted.

- Tool/observation prompts updated for the new shapes: `click {x, y}`
  shown as the canonical form; gated-preview observation is one line
  ("Click previewed on the yellow target. Confirm to commit, or
  re-emit `click` ...") with the verification step taught in the
  system prompt's gated-preview block instead of an in-page banner.

- Non-gated click message includes the actual click point in CSS
  pixels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… targets

Arms a MutationObserver in `performMouseClick` before dispatch, reads
the count ~250 ms after the click, and returns `triggered_anything:
false` when no non-cursor mutations were recorded — i.e. the click did
nothing the page reacted to. The observer filters out the cursor
sprite (`#__ob_cursor__`) and the pixel-confirm overlay so neither
their post-click refreshes nor their teardown counts as page activity.
DOM mutations cover CSS-class flips, aria toggles, child appends, and
text changes, so a like-heart toggle that only swaps an SVG color
still registers as a real click.

When the server sees `triggered_anything: false`, the click
observation gets a one-line warning plus the same nearby-interactable
candidate block the dense-gate preview uses (descriptor + center in
[0,1000] space), so the agent has somewhere concrete to re-aim
instead of looping on the same dead pixel.

Validated on the Gmail Finance Follow-up eval test that previously
hit the 660 s wall-clock cap stuck in a click-Escape-click loop on
the reply-draft modal: with the warning in place the agent re-aims
on each no-op, finishes the workflow in 258 s, and scores 8.0/8.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rlays extension-side

Two follow-ups to the no-op-click warning:

1. **Visualize candidates on the live page.** When `triggered_anything:
   false` comes back from the click probe, the server now also fires
   `render_pixel_confirm` (mode `pixel_miss`, no banner) so the same
   orange-dashed boxes the agent reads as text in its observation are
   painted onto the live page DOM. A human watching the browser sees
   exactly what alternatives the agent is being told to consider.
   Serialized candidates gained a `bbox_css` field so the confirm path
   can draw the overlay without re-running the gate.

2. **Tear overlays down extension-side.** Cleanup moved from the server
   (which only fired before mouse actions) into the extension's
   pixel-action dispatcher, so every incoming agent action — mouse,
   keyboard, select, upload — clears any leftover overlay before
   running. `_execute_mouse_action` and `_commit_pending_pixel_action`
   no longer send their own `clear_pixel_overlay` (the local cleanup
   covers both gated-preview and no-op-warning overlays uniformly).

Also fix a leak in the Gmail eval mock: the "Create a new label" input
had `placeholder="Finance/Board-Prep"` — the exact label the test
asks the agent to create — and no `autocomplete="off"`, so Chrome
also surfaced the value as a history suggestion. Replaced the
placeholder with a generic `e.g. Marketing/Q4-Launch` and disabled
autocomplete.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hat doesn't false-alarm on focus

`keyboard clear` previously sent Ctrl+A + Backspace via two CDP key
events. On macOS Cmd is the select-all modifier; Ctrl+A no-ops, the
Backspace deletes one character, and the SUCCESS message lies. Replace
with a dedicated `keyboard_clear` command that runs JS against
`document.activeElement` (input/textarea value setter, contenteditable
textContent, role=textbox) and dispatches input+change so framework-
controlled widgets observe the reset. Reports SUCCESS only when the
field actually ended up empty; otherwise surfaces the focused element
descriptor so the agent can recover.

The post-click no-op probe only counted DOM mutations, so clicking into
an input fired a false "no DOM change" warning — :focus is a CSS state
that updates `document.activeElement` without mutating the tree. Extend
the probe to also capture activeElement, scrollY, and Selection state
at arm time and diff at read time. Trivial body/null transitions are
filtered so initial-load focus shifts don't false-positive in the other
direction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er only set fields

The previous prompt taught coordinates with `{x, y}` set notation and
`(x, y)` tuple notation. Plus-tier models read those literally and
emitted `{"x": [519, 669], "y": [519, 669]}` arrays — 110 events across
50 conversations on the last full-eval, the biggest single error
cluster. Replace x/y/x2/y2 scalar fields with `coordinate`,
`start_coordinate`, `end_coordinate` arrays so the agent's natural
representation matches the schema instead of fighting it.

Make `action` default to "move" so a bare `{"coordinate":[x,y]}` works
when flash models forget the action key (14 events, 100% flash on the
prior eval). Tighten the prompt examples so every snippet is a complete
JSON envelope `{ "action": "...", "coordinate": [...] }` with no
ambient prose like `mouse click {x, y}` that small models picked up as
a tool-name pattern (`Tool 'click' not found` — 65 events, 100% flash).
Document right-click and double-click explicitly so agents stop
inventing `"action": "right_click"` / `"double_click"` literals.

Override OpenBrowserAction.visualize so persisted ActionEvent text only
contains fields the agent actually set (model_fields_set + action).
Stops `text:null, key:null, modifiers:[], kind:KeyboardAction` and
`start_coordinate:null, end_coordinate:null, button:left, count:1,
direction:down, amount:300, steps:10` from polluting the rendered
arguments shown to humans, the compiler agent, and any condense path.

Mirrors mouse-tool prompt edits in the agent-sdk system prompts
(system_prompt_{large,small}.j2) and the venv copy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pts)

Picks up agent-sdk commit that drops `{x,y}`/`(x,y)` set/tuple notation
from the system prompts in favor of explicit `coordinate: [x, y]` arrays
and drag's `start_coordinate`/`end_coordinate`. Pairs with the in-repo
MouseAction schema rewrite (4ce0f41).

Refresh evaluation_report.json with the post-fix run (20260507_092809):
255 → 36 tool-call validation errors (86% reduction); pattern E
`x:[a,b]` 110 → 1 events (99%). Aggregate pass rate 75.0% → 78.6% vs
main; net +35.7 task-score across the four `*-fast` aliases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI surfaced two issues:
1. Pre-commit (black + extension prettier) wanted to reformat several
   files touched by recent commits. Apply the formatter pass.
2. Three unit tests still asserted the pre-pixel-paradigm contract
   (`['tab', 'highlight', 'element_interaction', ...]` toolset; tab
   prompt mentioning `default highlight element_type:"any" page 1`).
   Update them to the current pixel-paradigm shape: `['tab', 'mouse',
   'keyboard', 'dialog', 'select_option', 'upload_file', ...]` toolset
   and the new "virtual cursor visible" contract from the rewritten
   tab prompt.

499 pytest pass / 4 skipped locally. No production code changed beyond
the formatter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@softpudding softpudding merged commit be449e6 into main May 7, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant