Structured highlight descriptors + routine-replay attention decay fixes + label placement hardening by softpudding · Pull Request #65 · softpudding/OpenBrowser

softpudding · 2026-04-19T11:45:18Z

Summary

Three themes in this branch:

Highlight / element-descriptor pipeline: the LLM now sees a compact structured descriptor per highlighted element instead of raw outerHTML, with a corner-badge label overlay that is always "above" (viewport-top exception), defers on collision, and clamps its x-offset inside the element's width.
Routine-replay attention hardening: the replay prompt no longer duplicates the routine identity into the user turn, and an explicit token cap now makes the condenser fire on small models that advertise a 1M context. SDK pinned to versions that carry the matching agent-side fixes.
Eval harness isolation + new 35-task × 4-model benchmark: each test now spawns its own eval/server.py on an OS-assigned port, eliminating cross-conversation event commingling and mid-conversation sessionId rotation bugs that were silently suppressing scores on --parallel > 1 runs. Fresh benchmark snapshot checked in.

Headline commits

1. `3b052cf` — Harden routine replay against long-context attention decay

frontend/index.html::buildRoutinePrompt sends only the SOP markdown; routine name/goal/framing lives in the ROUTINE_REPLAY system-prompt block.
server/agent/browser_condenser.py gains SMALL_MODEL_TOKEN_OVERRIDES; seeded with qwen3.5-flash → 100k so condensation actually fires.
Pinned agent-sdk to 66ed257b (task_tracker plan pinning in routine_replay mode, small-model confirmation-reasoning gate, refocused condenser prompt).
Tests: new test_browser_condenser.py covers override/substring/fallback/configure paths.
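
The override lookup described above could look roughly like the following. This is a minimal sketch, not the code in server/agent/browser_condenser.py: the function name `resolve_max_tokens`, the `configured` parameter, the precedence order (explicit configuration, then substring override, then advertised context), and the reading of "100k" as 100,000 are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical table; the PR only states that qwen3.5-flash is seeded at 100k.
SMALL_MODEL_TOKEN_OVERRIDES = {"qwen3.5-flash": 100_000}


def resolve_max_tokens(
    model: str, advertised_context: int, configured: Optional[int] = None
) -> int:
    """Pick the token budget at which condensation should trigger."""
    if configured is not None:
        # Assumed precedence: an explicitly configured cap wins.
        return configured
    name = model.lower()
    for key, cap in SMALL_MODEL_TOKEN_OVERRIDES.items():
        if key in name:
            # Substring match, so provider-prefixed names also hit the override.
            return cap
    # Fallback: trust the context window the model advertises.
    return advertised_context
```

With this shape, a model that advertises a 1M window but degrades long before it still gets condensed at the override, while unlisted models keep their advertised budget.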

2. `e445430` — Bump agent-sdk to `c92a185a`

Picks up the SMALL_MODEL_GUIDANCE / LARGE_MODEL_GUIDANCE → <ACTION_PROTOCOL> tag rename so the model-tier identity no longer leaks into rendered prompts.

3. `0bc4c16` — Persist highlights across commands + outline for better coverage

Keep injected highlights until the next user-visible command so users can see what was highlighted.
Element borders switch from inset box-shadow to outline with negative offset — opaque full-bleed children no longer hide the border.
Labels move into document-coordinate space so they scroll with the element.

4. `728eafa` — Corner-badge label overlay for unambiguous element binding

4-side placement replaced by a strict top-or-bottom corner badge anchored at the element's top-left, touching the bbox edge. Sideways placements are disabled.
When a label fits neither above nor below, the element defers to a later page rather than being placed ambiguously; total_pages absorbs overflow.
Opaque darker-shade fill + 11 px font so badges recede into the visual hierarchy.

5. `f915ae4` — Labels always "above", defer on collision; tolerate shared-border bbox overlap

getFeasiblePositions returns ['above'] (or ['below'] at the viewport top). Collisions defer to a later page instead of flipping sides.
bboxesPartiallyOverlap requires ≥ 3 px on both axes before treating an intersection as an overlap → 1–2 px shared-border artefacts no longer force pagination.
Label-vs-neighbor uses strict bboxesIntersect so labels touching an adjacent element's top edge at a shared row border are no longer collisions.
On https://finviz.com/screener.ashx?v=121: 7 pages → 4 pages; all 6 filter tabs land on page 1.
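
To make the difference between the two checks concrete, here is a small Python sketch of the geometry. The real helpers live in the extension's TypeScript; the `BBox` shape, the snake_case names, and the `min_px` parameter are assumptions for illustration, and only the 3 px threshold and the strict-intersection rule come from the commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    x: float
    y: float
    width: float
    height: float


def _overlap_extent(a: BBox, b: BBox) -> tuple:
    """Return the overlap length on each axis (zero or negative means none)."""
    ox = min(a.x + a.width, b.x + b.width) - max(a.x, b.x)
    oy = min(a.y + a.height, b.y + b.height) - max(a.y, b.y)
    return ox, oy


def bboxes_intersect(a: BBox, b: BBox) -> bool:
    """Strict check: any positive-area intersection counts; touching edges do not."""
    ox, oy = _overlap_extent(a, b)
    return ox > 0 and oy > 0


def bboxes_partially_overlap(a: BBox, b: BBox, min_px: float = 3) -> bool:
    """Tolerant check: require at least min_px of overlap on BOTH axes."""
    ox, oy = _overlap_extent(a, b)
    return ox >= min_px and oy >= min_px
```

Using the Finviz x-ranges quoted in the commit message (Fundamental 754..852, Technical 851..928) with a made-up shared row, the two tabs intersect by 1 px but do not count as overlapping, so they can share a highlight page.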

6. `ea8cf32` — Emit structured element descriptors instead of raw HTML

The "Highlighted Elements" LLM observation now emits one line per element built in the page world from the live DOM:
```
id(type): <tag> "text" · attr=val … flags
```
<select> appends an indented options: block listing every <option> in full.
Anonymous nodes harvest up to 3 filtered semantic class tokens and an icon hint (<use xlink:href>, <img alt>, [aria-label]).
InteractiveElement.html stays internal; the descriptor is the single source of truth on the wire.
Shared extractor (extension/src/commands/element-descriptor.injected.js) is inlined into both highlight-detection and drag-drop scripts.
Server-side formatter (server/agent/tools/base.py::_format_highlighted_element_lines) is shared between the main list and drag-and-drop inner-elements.
Prompts updated: big_model/highlight_tool.j2, big_model/element_interaction_tool.j2, and select_element description in server/models/commands.py.
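
As a rough illustration of the one-line format, the sketch below assembles a descriptor from already-extracted fields. It is not `_format_highlighted_element_lines`: the function name, the argument names, the `clickable` type, and the exact separators between attributes and flags are assumptions; only the overall `id(type): <tag> "text" · attr=val … flags` shape comes from the PR.

```python
from typing import Dict, Iterable, Optional


def format_descriptor(
    el_id: str,
    el_type: str,
    tag: str,
    text: str = "",
    attrs: Optional[Dict[str, str]] = None,
    flags: Iterable[str] = (),
) -> str:
    """Render one highlighted element as a single descriptor line."""
    line = f"{el_id}({el_type}): <{tag}>"
    if text:
        line += f' "{text}"'
    if attrs:
        # Assumed layout: attributes follow a middle-dot separator.
        line += " · " + " ".join(f"{k}={v}" for k, v in attrs.items())
    flag_list = list(flags)
    if flag_list:
        line += " " + " ".join(flag_list)
    return line
```

For example, `format_descriptor("R6Y", "clickable", "button", "Submit", {"type": "submit"}, ["disabled"])` yields `R6Y(clickable): <button> "Submit" · type=submit disabled`.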

Follow-up fixes and refinements

728eafa follow-ups:
- f915ae4 / 08206ca / 43f2643 / 77f9e01 — iterative hardening of label placement: always-above rule, horizontal shift of corner badges, tiny-element filtering before the collision planner, and per-tab scope on deferred highlight cleanup (so switching tabs doesn't nuke the other tab's highlights).
Eval / small-model stability (addressing regressions the descriptor rewrite surfaced):
- d492019 fix(agent): reduce eval regressions from highlight/descriptor branch
- 64c14e1 fix(extension): hash element ids on stable fingerprint, not volatile outerHTML — element IDs now hash on a fingerprint (type + tag + stable attrs + text), so focus / ephemeral-value / aria-expanded flips don't reassign IDs mid-conversation.
Per-test mock-server isolation in the eval harness (74dcddc fix(eval): per-test mock-site server isolation):
- Each test spawns its own eval/server.py on an OS-assigned port; localhost:16605 in start_url / instruction is rewritten to that port via _rewrite_eval_server_urls.
- New EvalServerProcess class owns spawn / handshake / drain / SIGTERM→SIGKILL teardown with an atexit hook and per-instance process-group.
- EvalServerClient is now port-addressed; the ?site= filter is removed.
- ServiceManager.start_eval_server is gone; ensure_services only checks OpenBrowser (8765).
- Fixes an entire bug class: under --parallel > 1, the old shared events_store let parallel conversations commingle, and within a single conversation a page-reload sessionId rotation could silently hide the early events from the scorer. Isolating the server per test makes the track file contain exactly one sessionId.
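
The port rewrite can be sketched as a single regex substitution. This is an illustrative stand-in for `_rewrite_eval_server_urls`, not its actual body: the public function name, the constant, and the word-boundary handling are assumptions.

```python
import re

DEFAULT_EVAL_PORT = 16605  # the port hard-coded in start_url / instruction
_EVAL_HOST = re.compile(rf"\blocalhost:{DEFAULT_EVAL_PORT}\b")


def rewrite_eval_server_urls(text: str, port: int) -> str:
    """Point every localhost:16605 reference at the per-test server's port."""
    return _EVAL_HOST.sub(f"localhost:{port}", text)
```

The trailing word boundary keeps a longer port such as `localhost:166050` from being rewritten by accident.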
New benchmark snapshot (78dd2ce eval: refresh benchmark to the 35-task × 4-model suite (2026-04-21)):
- 140 runs across qwen3.5-{flash,plus} + qwen3.6-{flash,plus}, 111 passed (79.3 %). See eval/evaluation_report.json and the updated README eval section.
- qwen3.5-plus 30/35 (85.7 %), qwen3.6-flash 29/35 (82.9 %), qwen3.6-plus 28/35 (80.0 %), qwen3.5-flash 24/35 (68.6 %). qwen3.6-flash is the fastest model of the four (252 s avg vs 309-338 s for the others).
Dependency pins: c374403 / d559ea7 bump litellm + openhands-sdk to rev c92a185a (Qwen prompt-cache enablement, <ACTION_PROTOCOL> rename).
Skill packaging: e29e65a moves Claude skills under ~/.claude/skills/ (user-scoped) instead of repo-scoped, so they survive checkout.
CI / lint: 4102276 + 316a4a2 + 61e3868 apply pre-commit prettier/black formatting flagged by CI; 71df427 repoints two lockfile entries (ws-8.20.0, @types/ws-8.18.1) from registry.anpm.alibaba-inc.com (Alibaba internal, unreachable from GitHub Actions runners) to registry.npmjs.org. Integrity hashes preserved because the mirror tarballs are bit-identical.

Test plan

cd extension && bun test — 195/195 pass.
uv run pytest server/tests/unit — 411/411 pass.
Full end-to-end eval, 4 models × 35 tasks, with per-test isolation — 111 / 140 passed (79.3 %). Report checked in at eval/evaluation_report.json.
Per-test isolation verified live: two concurrent EvalServerProcess instances get distinct OS-assigned ports, their events_store content is independent, and both ports are unreachable after stop().
Exercise a page with <select> (e.g. country picker) and confirm the indented options: block lists every option with group/selected/disabled flags.
Exercise a page with icon-only controls (Finviz filter tabs, BlueBook like button) and confirm the descriptor surfaces class=... + icon=... rather than a bare <span>.
Load the extension in Chrome, run /ob-routines execute value-stocks-monthly-drop, and confirm the replay completes without please_help_me, with task_tracker stepping through end-to-end.

🤖 Generated with Claude Code

… better coverage

Keep injected highlights on the page until the next user-visible command (read-only follow-ups like get_tabs are excluded) so the user can see what was highlighted.

Switch element borders from inset box-shadow to outline with negative offset so opaque child content (e.g. <a class="cover mask ld"> wrapping a full-bleed image) can no longer hide the border.

Move labels into document-coordinate space so they scroll with the outlined elements.

Single-highlight confirmation keeps the yellow "Is this the element you wanted to …" design via a dedicated DOM-injection script + canvas crop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…inding

Replace the 4-side (above/below/left/right) label placement algorithm with a strict top-or-bottom corner badge: every label is anchored at the top-left corner of its own element's bbox, sitting fully outside the box and touching the edge. Sideways placements are disabled entirely because they routinely produced labels between two adjacent elements that read as belonging to the wrong one (session 444122cb: `UHT` between Fundamental and Technical tabs looked like it labeled Fundamental).

When a label cannot fit above OR below its element (collision or viewport edge), the element defers to a later highlight page rather than being placed ambiguously. `total_pages` absorbs the overflow.

Label fill is now an opaque darker shade of the border color so the filled badge visually separates from the bright bbox outline even when they share a touching edge. Font size reduced 16 -> 11px (height 22 -> 15px) so the badge is no taller than page body text; the labels recede into the visual hierarchy instead of dominating it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…hared-border bbox overlap

Two related placement fixes so the corner-badge invariant is reliably readable across dense layouts:

1. **Always 'above' unless viewport-clipped.** `getFeasiblePositions` now returns only `['above']` when 'above' fits the viewport, or only `['below']` when 'above' would be clipped by the viewport's top edge. Collision with a same-page neighbor no longer triggers a side-flip; the element is deferred to a later highlight page instead. Result: any label the viewer sees is always directly above its element, with a single exception (viewport-top elements get their label directly below). This removes the "is this label for the element on its left or the one on its right" ambiguity entirely.

2. **Tolerate 1-2 px shared-border bbox overlaps.** On finviz's filter-tab row, Fundamental (x=754..852) and Technical (x=851..928) share a 1px border at x=851..852, which is a DOM rendering artifact, not a real occlusion. `bboxesPartiallyOverlap` now requires ≥ 3px on BOTH axes before treating an intersection as a real overlap, so adjacent tabs/buttons can coexist on the same highlight page. Label-vs-neighbor-bbox and bbox-vs-neighbor-label checks use strict `bboxesIntersect` (not clearance-inflated), so a label touching a horizontally-adjacent element's top edge at a shared row border is not treated as a collision; only actual pixel intrusion blocks placement.

End-to-end on https://finviz.com/screener.ashx?v=121:
- Before: 7 highlight pages, Fundamental deferred to page 3, Descriptive/News/All alternating above/below on page 1.
- After: 4 highlight pages, all 6 filter tabs (Descriptive, Fundamental, Technical, News, ETF, All) on page 1 with 'above' labels. Page 1 has 256 elements, 255 'above' + 1 'below' (the one 'below' is a viewport-top element).

Tests updated to reflect the new doctrine:
- `viewport-top element uses "below" while interior element uses "above"`
- `colliding "above" labels defer one element to a later page (no side-flip)`
- `"above" blocked by a neighbor defers the element to a later page`
- `center element surrounded above and below eventually gets placed`
- `two elements separated vertically beyond the corner-badge footprint do not collide`
- `tight label-to-element proximity under the corner-badge geometry is blocked`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the `outerHTML` dump in the "Highlighted Elements" LLM observation with a compact per-element descriptor built in the page world from the live DOM. Each element renders on a single line like:

    id(type): <tag> "text" · attr=val … flags

with an indented `options:` block for `<select>` that lists every `<option>` in full (value, label, optgroup, selected/disabled). The agent depends on that full inventory to pick a value for the `select` action.

For anonymous nodes (no text, no accessible name), the descriptor also harvests up to 3 semantic class tokens and an icon hint (`<use xlink:href>` / `<img alt>`) so the LLM still has a handle (e.g. `class="like-wrapper like-active" · icon=like`).

Wire-side `html` is dropped from the response payload; the field stays on the in-extension `InteractiveElement` because element-id fingerprinting and the element cache still hash/search over it.

Also refreshes the big-model highlight/element-interaction prompts and the `select_element` tool description to match the new format, and pins openhands-sdk/openhands-tools back to the git rev 7e7766fa203be8ce29eb2ed3adf2fec0262f5fb3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

No logic changes. Runs the repo's pre-commit hooks over:
- server/agent/tools/base.py (black)
- server/tests/unit/test_base_classes.py (black)
- extension descriptor test/source + a few files that had pre-existing prettier drift from earlier branch commits (collapse-on-single-line function signatures, list/array layout, etc.)

Pre-commit CI was failing on the initial push because black and prettier disagreed with the line-length choices in the just-landed descriptor change; this brings everything back to the canonical style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Loosen the "top-left corner" invariant to "top edge, with planner-chosen horizontal offset." A label may now slide along its element's top (or bottom) edge to clear collisions with neighbors, but its x-range is clamped inside the element's x-range whenever the element is wide enough (labelWidth <= bbox.width). Narrow elements keep the existing left-aligned behavior: the label is allowed to extend past the element edges only when no offset could avoid it.

The clamp preserves the "directly above/below me" binding cue that previously motivated the strict left-corner rule: a shifted label never drifts over a neighbor's territory to avoid a collision; instead the element is deferred to a later page, same as before.

Changes:
- types.ts: LabelPosition narrowed to 'above' | 'below' (left/right were already dead after the corner-badge commit); add optional labelXOffset?: number.
- collision-detection.ts: getLabelBBox / expandBBoxWithLabel / isLabelWithinViewport / elementsCollide accept xOffset. clampLabelXOffset enforces the x-range invariant. New getCandidateXOffsets returns [0, slack] when slack > 0, else [0]. getFeasiblePositions → getFeasiblePlacements returning {position, xOffset}[]. Tie-break: prefer above, then xOffset=0.
- background/index.ts: drop left/right from the fallback order, pass labelXOffset through the in-page render payload, apply the clamp in the DOM renderer too.
- Tests updated: drop left/right cases; add tests covering the x-range clamp, narrow-element fallback, and an adjacent-element scenario where tight label clearance is resolved by shifting the right-hand label.

All 189 extension tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
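
The offset rules in this commit can be sketched in a few lines. The real helpers are TypeScript in collision-detection.ts; this Python transliteration assumes that slack is the element width minus the label width and that the clamp range is [0, slack], neither of which is spelled out in the commit message.

```python
from typing import List


def get_candidate_x_offsets(bbox_width: float, label_width: float) -> List[float]:
    """Offsets the planner may try: left-aligned, plus right-aligned if there is room."""
    slack = bbox_width - label_width  # assumed definition of slack
    return [0, slack] if slack > 0 else [0]


def clamp_label_x_offset(x_offset: float, bbox_width: float, label_width: float) -> float:
    """Keep the label's x-range inside the element's x-range when it fits."""
    slack = bbox_width - label_width
    if slack <= 0:
        # Narrow element: fall back to the left-aligned behavior.
        return 0
    return max(0, min(x_offset, slack))
```

For a 98 px wide element and a 30 px label, the candidates are `[0, 68]`, and any requested offset outside that range is pulled back to the nearest end.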

…tion labels on resize

- Filter elements with both dimensions <8px before they enter the collision planner. Previously, tiny decorative dots (e.g. bullet indicators) were detected, placed on page 1, and blocked adjacent meaningful elements (links) from fitting, deferring them to later pages unnecessarily.
- Reposition highlight labels on window resize so they stay attached to their outlined elements instead of drifting when page layout changes.
- Clean up resize listener in both highlight cleanup and re-highlight paths.

Update ob-routines SKILL.md paths from project-relative skill/claude/ to ~/.claude/skills/ (matching open-browser). Add Claude skill install instructions to README.

Three targeted fixes landed from the 20260420 full-eval regression analysis (flash 80→57% pass, plus 88→68%).
- highlight_tool prompts (big + small): add a canonical pagination rule with a generic positive example, and remove redundant pagination bullets so the rule has one source of truth. Flash had been picking an approximate id from page 1 instead of paginating when the exact target wasn't present yet.
- _format_highlighted_element_lines: cap rendered <select> options at 20 with a trailer directing the agent to re-highlight with element_type="selectable" for the full list; always include the currently-selected option even when past the cap.
- OpenBrowserAction: accept `summary` as SkipJsonSchema[Optional[str]] (same pattern as conversation_id). The SDK's Schema(extra="forbid") was rejecting tool calls where the LLM emitted `summary` as a tool arg; now accepted-and-ignored with exclude=True so it stays out of both the JSON schema and serialization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
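
The option cap from the second fix can be illustrated as follows. This is a sketch under assumptions, not the formatter's code: the option dict shape (`value`, `selected`), the function name, and the trailer wording are all hypothetical; the cap of 20, the always-show-selected rule, and the `element_type="selectable"` hint are from the commit message.

```python
from typing import Dict, List, Optional, Tuple

MAX_RENDERED_OPTIONS = 20


def cap_select_options(
    options: List[Dict], cap: int = MAX_RENDERED_OPTIONS
) -> Tuple[List[Dict], Optional[str]]:
    """Return the options to render plus an optional trailer line."""
    shown = list(options[:cap])
    # Always surface the currently selected option, even past the cap.
    shown.extend(opt for opt in options[cap:] if opt.get("selected"))
    hidden = len(options) - len(shown)
    if hidden <= 0:
        return shown, None
    trailer = (
        f"({hidden} more options hidden; re-highlight with "
        'element_type="selectable" for the full list)'
    )
    return shown, trailer
```

A 25-option list whose 23rd entry is selected renders 21 options and a trailer reporting 4 hidden ones; a 10-option list renders unchanged with no trailer.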

…outerHTML

Previously `generateShortHash` and `buildElementIdentityKey` hashed on `cssPath + outerHTML`. outerHTML changes on common interactions (class flips on :focus, `value` attr update per keystroke, `aria-expanded` toggle on <select>/<details>), causing the element's short id (R6Y, QX6, etc.) to churn between highlight refreshes. The agent then loses the label it was about to act on.

Introduce `getStableIdentityInput(element)` that prefers the existing detection-time `fingerprint` (tag + semantic attrs + text, built by `getElementFingerprint` in highlight-detection.injected.js) over raw outerHTML. Falls back to outerHTML for legacy producers/tests that haven't populated fingerprint yet.

Covered by 5 new tests in element-id-stability.test.ts:
- <input> gains class="focused" on click
- typing updates <input> value attribute
- <select> aria-expanded flips true/false
- genuinely distinct fingerprints still get distinct ids
- fallback path works when fingerprint is absent

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
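
The fingerprint-first idea can be shown with a short Python sketch. The extension's implementation is TypeScript and uses its own hash and id alphabet; here SHA-256, the hex id format, the dict-shaped element, and the assumption that `cssPath` stays part of the key are all illustrative choices, not the project's.

```python
import hashlib
from typing import Dict


def get_stable_identity_input(element: Dict[str, str]) -> str:
    """Prefer the detection-time fingerprint; fall back to raw outerHTML."""
    return element.get("fingerprint") or element.get("outerHTML", "")


def short_id(css_path: str, element: Dict[str, str], length: int = 3) -> str:
    """Derive a short id from the CSS path plus the stable identity input."""
    key = f"{css_path}|{get_stable_identity_input(element)}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:length].upper()
```

Because the fingerprint ignores volatile attributes, an `<input>` that gains `class="focused"` keeps its id as long as its fingerprint is unchanged, while elements without a fingerprint behave as before.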

- litellm: 2eb7db59 -> 363075400d (qwen3.6-flash pricing + DashScope cache_control passthrough for explicit context cache hits).
- openhands-sdk / openhands-tools: 7e7766fa -> bf7ffb96 (picks up the new litellm pin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Picks up agent-sdk@df0056f1 which adds qwen3.5/3.6 families to PROMPT_CACHE_MODELS so `LLM(model="dashscope/qwen3.6-flash", ...)` emits cache_control and actually hits DashScope's context cache. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously any non-get_tabs command flushed every tab's pending highlight cleanup, wiping overlays on tabs the command never touched. Scope the flush to command.tab_id so each tab's debug overlay only clears when that tab receives its next operation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CLAassistant · 2026-04-20T14:07:04Z

All committers have signed the CLA.

Each test now spawns its own `eval/server.py` on an OS-assigned port and only the conversation under test talks to it. Rewriting `localhost:16605` in `start_url` / `instruction` points the agent at the right port, and the tracker's relative `/api/track` POST naturally lands on the same server.

This eliminates two bug classes the prior shared-server setup exposed in multi-model parallel runs:
- Cross-conversation commingling in `events_store["sites"][bucket]`
- Mid-conversation `sessionId` rotation making scoring miss early events

Changes:
- `eval/server.py`: accept `--port=N` (0 = OS-assigned); print `EVAL_SERVER_LISTENING_PORT=<port>` as a stdout handshake line so the parent can parse the bound port.
- `eval/evaluate_browser_agent.py`: add `EvalServerProcess` that spawns, reads the handshake, drains stdout, and tears down (SIGTERM → SIGKILL) with `atexit` + per-instance process group. `EvalServerClient` takes `port=`; `?site=` filter is dropped. `run_test` and `run_manual_test` own the server's lifecycle in a `finally`. `ServiceManager` no longer health-checks port 16605.
- `server/tests/unit/test_eval_client.py`: the teardown test now monkeypatches `EvalServerProcess` / `EvalServerClient` and asserts the per-test server is stopped.

All 411 unit tests pass. Verified: two processes spawned concurrently get distinct ports, each client only sees its own server's events, both ports unreachable after stop().

Full 4-model eval run (140 runs) confirms scoring now tracks the agent's actual behaviour: previously cross-talked tests moved up by up to +9 points.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
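
The port-0 handshake can be sketched with the standard library alone. This is an illustration of the mechanism, not the code in eval/server.py or `EvalServerProcess`: the helper names are hypothetical, and only the `EVAL_SERVER_LISTENING_PORT=<port>` line format comes from the commit message.

```python
import socket
from typing import Optional, Tuple

HANDSHAKE_PREFIX = "EVAL_SERVER_LISTENING_PORT="


def bind_os_assigned_port() -> Tuple[socket.socket, int]:
    """Server side: bind to port 0 and let the OS choose a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 0))
    sock.listen()
    return sock, sock.getsockname()[1]


def format_handshake(port: int) -> str:
    """Server side: the line printed to stdout once the port is known."""
    return f"{HANDSHAKE_PREFIX}{port}"


def parse_handshake(line: str) -> Optional[int]:
    """Parent side: extract the bound port, or None for unrelated output."""
    line = line.strip()
    if not line.startswith(HANDSHAKE_PREFIX):
        return None
    return int(line[len(HANDSHAKE_PREFIX):])
```

In a harness along these lines, the parent would read the child's stdout until `parse_handshake` returns a port, then keep draining the pipe so the child never blocks on a full buffer.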

Line-wrap fixes from prettier that the pre-commit hook flagged in CI on the last push. Pure formatting — no behaviour change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Check in the latest `eval/evaluation_report.json` and update both README files to reflect it. The current suite is 35 tasks × 4 models = 140 runs (up from the old 12-task × 2-model snapshot) and includes the Qwen3.6 family alongside the 3.5 family.

Snapshot (2026-04-21 02:09:48), overall 111/140 passed (79.3%):
- qwen3.5-plus : 30/35 passed, 276.2/304.8 task, 309.5s avg, ¥0.598
- qwen3.6-flash: 29/35 passed, 273.0/304.8 task, 252.3s avg, ¥0.804
- qwen3.6-plus : 28/35 passed, 262.4/304.8 task, 337.6s avg, ¥1.605
- qwen3.5-flash: 24/35 passed, 243.1/304.8 task, 308.8s avg, ¥0.144

Run was the first under the new per-test mock-server isolation, so scores reflect actual agent behaviour rather than cross-talk artefacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…registry Two entries (@types/ws-8.18.1, ws-8.20.0) resolved to `registry.anpm.alibaba-inc.com`, Alibaba's internal mirror. That host is unreachable from the GitHub Actions runners, so CI's `npm ci` timed out on both Pre-commit and Extension Tests. Integrity hashes are unchanged because the mirrored tarballs are bit-identical; only the resolved URLs move to `registry.npmjs.org`. Verified by a clean `npm ci` + `npm test` locally (195 tests pass). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Black-only line-wrap changes the pre-commit hook flagged in CI for the three files added / modified under `server/agent/tools/` and the related unit tests. Pure formatting — no behaviour change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

softpudding and others added 14 commits April 18, 2026 21:58

fix github eval-mock long line issue

0f16c71

chore(skills): make Claude skills user-scoped (~/.claude/skills/)

e29e65a

softpudding and others added 5 commits April 21, 2026 08:27

chore(extension): apply pre-commit prettier formatting

316a4a2

softpudding force-pushed the fix/routine-replay-attention-decay branch from 66ccdc8 to 61e3868 Compare April 21, 2026 01:08

softpudding merged commit 77b8239 into main Apr 21, 2026
4 checks passed

softpudding deleted the fix/routine-replay-attention-decay branch April 21, 2026 01:10

softpudding merged 19 commits into main from fix/routine-replay-attention-decay
