
Structured highlight descriptors + routine-replay attention decay fixes + label placement hardening #65

Merged
softpudding merged 19 commits into main from fix/routine-replay-attention-decay
Apr 21, 2026

Conversation

@softpudding

@softpudding softpudding commented Apr 19, 2026

Owner

Summary

Three themes in this branch:

  1. Highlight / element-descriptor pipeline: the LLM now sees a compact structured descriptor per highlighted element instead of raw outerHTML, with a corner-badge label overlay that is always "above" (viewport-top exception), defers on collision, and clamps its x-offset inside the element's width.
  2. Routine-replay attention hardening: the replay prompt no longer duplicates the routine identity into the user turn, and the condenser now fires for small models that advertise a 1M context, via an explicit per-model token cap. The SDK is pinned to versions that carry the matching agent-side fixes.
  3. Eval harness isolation + new 35-task × 4-model benchmark: each test now spawns its own eval/server.py on an OS-assigned port, eliminating cross-conversation event commingling and mid-conversation sessionId rotation bugs that were silently suppressing scores on --parallel > 1 runs. Fresh benchmark snapshot checked in.

Headline commits

1. 3b052cf — Harden routine replay against long-context attention decay

  • frontend/index.html::buildRoutinePrompt sends only the SOP markdown; routine name/goal/framing lives in the ROUTINE_REPLAY system-prompt block.
  • server/agent/browser_condenser.py gains SMALL_MODEL_TOKEN_OVERRIDES; seeded with qwen3.5-flash → 100k so condensation actually fires.
  • Pinned agent-sdk to 66ed257b (task_tracker plan pinning in routine_replay mode, small-model confirmation-reasoning gate, refocused condenser prompt).
  • Tests: new test_browser_condenser.py covers override/substring/fallback/configure paths.
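
A minimal sketch of how an override table like `SMALL_MODEL_TOKEN_OVERRIDES` could resolve a token cap, covering the override, substring, and fallback paths the tests mention. Only the `qwen3.5-flash` → 100k entry comes from this PR; the function name `resolve_condenser_token_cap` and its parameters are hypothetical, not the actual `browser_condenser.py` API.

```python
from typing import Dict, Optional

# Only this entry is stated in the PR; the lookup logic below is an assumption.
SMALL_MODEL_TOKEN_OVERRIDES: Dict[str, int] = {"qwen3.5-flash": 100_000}


def resolve_condenser_token_cap(
    model: str,
    advertised_context: int,
    overrides: Optional[Dict[str, int]] = None,
) -> int:
    """Return the token budget at which condensation should trigger."""
    table = SMALL_MODEL_TOKEN_OVERRIDES if overrides is None else overrides
    name = model.lower()
    # Substring match, so provider-prefixed names such as
    # "dashscope/qwen3.5-flash" still hit the override.
    for key, cap in table.items():
        if key in name:
            return min(cap, advertised_context)
    # Fallback: no override known, trust the advertised window.
    return advertised_context
```

With this shape, a model advertising a 1M window but listed in the table condenses at 100k, while unlisted models keep their advertised limit.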

2. e445430 — Bump agent-sdk to c92a185a

Picks up the SMALL_MODEL_GUIDANCE / LARGE_MODEL_GUIDANCE → <ACTION_PROTOCOL> tag rename so the model-tier identity no longer leaks into rendered prompts.

3. 0bc4c16 — Persist highlights across commands + outline for better coverage

  • Keep injected highlights until the next user-visible command so users can see what was highlighted.
  • Element borders switch from inset box-shadow to outline with negative offset — opaque full-bleed children no longer hide the border.
  • Labels move into document-coordinate space so they scroll with the element.

4. 728eafa — Corner-badge label overlay for unambiguous element binding

  • 4-side placement replaced by a strict top-or-bottom corner badge anchored at the element's top-left, touching the bbox edge. Sideways placements are disabled.
  • When a label fits neither above nor below, the element defers to a later page rather than being placed ambiguously; total_pages absorbs overflow.
  • Opaque darker-shade fill + 11 px font so badges recede into the visual hierarchy.

5. f915ae4 — Labels always "above", defer on collision; tolerate shared-border bbox overlap

  • getFeasiblePositions returns ['above'] (or ['below'] at the viewport top). Collisions defer to a later page instead of flipping sides.
  • bboxesPartiallyOverlap requires ≥ 3 px on both axes before treating an intersection as an overlap → 1–2 px shared-border artefacts no longer force pagination.
  • Label-vs-neighbor uses strict bboxesIntersect so labels touching an adjacent element's top edge at a shared row border are no longer collisions.
  • On https://finviz.com/screener.ashx?v=121: 7 pages → 4 pages; all 6 filter tabs land on page 1.
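
To illustrate the difference between the tolerant element-vs-element check and the strict label check, here is a rough Python sketch. The real code is TypeScript in the extension; the `BBox` class and the snake_case function names are stand-ins, and the y-coordinates in the usage note are made up.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class BBox:
    x1: float
    y1: float
    x2: float
    y2: float


def intersection_size(a: BBox, b: BBox) -> Tuple[float, float]:
    """Width and height of the intersection; non-positive means disjoint."""
    w = min(a.x2, b.x2) - max(a.x1, b.x1)
    h = min(a.y2, b.y2) - max(a.y1, b.y1)
    return w, h


def bboxes_intersect(a: BBox, b: BBox) -> bool:
    """Strict check: any real pixel intrusion counts, edge contact does not."""
    w, h = intersection_size(a, b)
    return w > 0 and h > 0


def bboxes_partially_overlap(a: BBox, b: BBox, min_px: float = 3) -> bool:
    """Tolerant check: the intersection must span min_px on both axes."""
    w, h = intersection_size(a, b)
    return w >= min_px and h >= min_px
```

Using the Finviz x-ranges quoted in the commit message, `BBox(754, 100, 852, 130)` and `BBox(851, 100, 928, 130)` intersect by 1 px horizontally, so the tolerant check returns `False` and both tabs can share a page. A label whose bottom edge only touches a neighbor's top edge has zero intersection height, so the strict check also returns `False`.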

6. ea8cf32 — Emit structured element descriptors instead of raw HTML

  • The "Highlighted Elements" LLM observation now emits one line per element built in the page world from the live DOM:

    id(type): <tag> "text" · attr=val … flags
    
  • <select> appends an indented options: block listing every <option> in full.

  • Anonymous nodes harvest up to 3 filtered semantic class tokens and an icon hint (<use xlink:href>, <img alt>, [aria-label]).

  • InteractiveElement.html stays internal; the descriptor is the single source of truth on the wire.

  • Shared extractor (extension/src/commands/element-descriptor.injected.js) is inlined into both highlight-detection and drag-drop scripts.

  • Server-side formatter (server/agent/tools/base.py::_format_highlighted_element_lines) is shared between the main list and drag-and-drop inner-elements.

  • Prompts updated: big_model/highlight_tool.j2, big_model/element_interaction_tool.j2, and select_element description in server/models/commands.py.
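
As a rough picture of the one-line descriptor shape `id(type): <tag> "text" · attr=val … flags`, the sketch below assembles such a line in Python. The function name, the argument names, and the choice to join flags with the same `·` separator are assumptions for illustration; the real formatter is `_format_highlighted_element_lines` and may differ in detail.

```python
from typing import Dict, Iterable, Optional


def format_descriptor_line(
    element_id: str,
    element_type: str,
    tag: str,
    text: str = "",
    attrs: Optional[Dict[str, str]] = None,
    flags: Iterable[str] = (),
) -> str:
    """Build one compact descriptor line for a highlighted element."""
    head = f"<{tag}>"
    if text:
        head += f' "{text}"'
    parts = [head]
    # Attributes render as attr=val, in insertion order.
    for key, value in (attrs or {}).items():
        parts.append(f"{key}={value}")
    # Flags (e.g. disabled, checked) go last; the separator is a guess.
    flag_text = " ".join(flags)
    if flag_text:
        parts.append(flag_text)
    return f"{element_id}({element_type}): " + " · ".join(parts)
```

For example, `format_descriptor_line("R6Y", "clickable", "button", "Submit", {"type": "submit"}, ["disabled"])` yields `R6Y(clickable): <button> "Submit" · type=submit · disabled`.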

Follow-up fixes and refinements

  • 728eafa follow-ups:
    • f915ae4 / 08206ca / 43f2643 / 77f9e01 — iterative hardening of label placement: always-above rule, horizontal shift of corner badges, tiny-element filtering before the collision planner, and per-tab scope on deferred highlight cleanup (so switching tabs doesn't nuke the other tab's highlights).
  • Eval / small-model stability (addressing regressions the descriptor rewrite surfaced):
    • d492019 fix(agent): reduce eval regressions from highlight/descriptor branch
    • 64c14e1 fix(extension): hash element ids on stable fingerprint, not volatile outerHTML — element IDs now hash on a fingerprint (type + tag + stable attrs + text), so focus / ephemeral-value / aria-expanded flips don't reassign IDs mid-conversation.
  • Per-test mock-server isolation in the eval harness (74dcddc fix(eval): per-test mock-site server isolation):
    • Each test spawns its own eval/server.py on an OS-assigned port; localhost:16605 in start_url / instruction is rewritten to that port via _rewrite_eval_server_urls.
    • New EvalServerProcess class owns spawn / handshake / drain / SIGTERM→SIGKILL teardown with an atexit hook and per-instance process-group.
    • EvalServerClient is now port-addressed; the ?site= filter is removed.
    • ServiceManager.start_eval_server is gone; ensure_services only checks OpenBrowser (8765).
    • Fixes an entire bug class: under --parallel > 1, the old shared events_store let parallel conversations commingle, and within a single conversation a page-reload sessionId rotation could silently hide the early events from the scorer. Isolating the server per test makes the track file contain exactly one sessionId.
  • New benchmark snapshot (78dd2ce eval: refresh benchmark to the 35-task × 4-model suite (2026-04-21)):
    • 140 runs across qwen3.5-{flash,plus} + qwen3.6-{flash,plus}, 111 passed (79.3 %). See eval/evaluation_report.json and the updated README eval section.
    • qwen3.5-plus 30/35 (85.7 %), qwen3.6-flash 29/35 (82.9 %), qwen3.6-plus 28/35 (80.0 %), qwen3.5-flash 24/35 (68.6 %). qwen3.6-flash is the fastest model of the four (252 s avg vs 309-338 s for the others).
  • Dependency pins: c374403 / d559ea7 bump litellm + openhands-sdk to rev c92a185a (Qwen prompt-cache enablement, <ACTION_PROTOCOL> rename).
  • Skill packaging: e29e65a moves Claude skills under ~/.claude/skills/ (user-scoped) instead of repo-scoped, so they survive checkout.
  • CI / lint: 4102276 + 316a4a2 + 61e3868 apply pre-commit prettier/black formatting flagged by CI; 71df427 repoints two lockfile entries (ws-8.20.0, @types/ws-8.18.1) from registry.anpm.alibaba-inc.com (Alibaba internal, unreachable from GitHub Actions runners) to registry.npmjs.org. Integrity hashes preserved because the mirror tarballs are bit-identical.
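
For the per-test isolation item above, the URL rewrite can be pictured as a simple substitution of the hard-coded port. This is a sketch only: the PR names the helper `_rewrite_eval_server_urls` but does not show its body, so the regex and signature here are assumptions.

```python
import re

DEFAULT_EVAL_PORT = 16605  # the hard-coded port mentioned in the PR

# Word boundaries keep a longer port such as 166050 from matching.
_EVAL_URL_RE = re.compile(rf"\blocalhost:{DEFAULT_EVAL_PORT}\b")


def rewrite_eval_server_urls(text: str, port: int) -> str:
    """Point every localhost:16605 reference at the per-test server port."""
    return _EVAL_URL_RE.sub(f"localhost:{port}", text)
```

Applied to both `start_url` and `instruction`, this keeps the agent and the tracker's relative `/api/track` POST on the same per-test server.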

Test plan

  • cd extension && bun test — 195/195 pass.
  • uv run pytest server/tests/unit — 411/411 pass.
  • Full end-to-end eval, 4 models × 35 tasks, with per-test isolation — 111 / 140 passed (79.3 %). Report checked in at eval/evaluation_report.json.
  • Per-test isolation verified live: two concurrent EvalServerProcess instances get distinct OS-assigned ports, their events_store content is independent, and both ports are unreachable after stop().
  • Exercise a page with <select> (e.g. country picker) and confirm the indented options: block lists every option with group/selected/disabled flags.
  • Exercise a page with icon-only controls (Finviz filter tabs, BlueBook like button) and confirm the descriptor surfaces class=... + icon=... rather than a bare <span>.
  • Load the extension in Chrome, run /ob-routines execute value-stocks-monthly-drop, and confirm the replay completes without please_help_me, with task_tracker stepping through end-to-end.

🤖 Generated with Claude Code

softpudding and others added 14 commits April 18, 2026 21:58
… better coverage

Keep injected highlights on the page until the next user-visible command
(read-only follow-ups like get_tabs are excluded) so the user can see what
was highlighted. Switch element borders from inset box-shadow to outline
with negative offset so opaque child content (e.g. <a class="cover mask ld">
wrapping a full-bleed image) can no longer hide the border. Move labels
into document-coordinate space so they scroll with the outlined elements.

Single-highlight confirmation keeps the yellow "Is this the element you
wanted to …" design via a dedicated DOM-injection script + canvas crop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…inding

Replace the 4-side (above/below/left/right) label placement algorithm with
a strict top-or-bottom corner badge: every label is anchored at the
top-left corner of its own element's bbox, sitting fully outside the box
and touching the edge. Sideways placements are disabled entirely because
they routinely produced labels between two adjacent elements that read as
belonging to the wrong one (session 444122cb: `UHT` between Fundamental
and Technical tabs looked like it labeled Fundamental).

When a label cannot fit above OR below its element (collision or viewport
edge), the element defers to a later highlight page rather than being
placed ambiguously. `total_pages` absorbs the overflow.

Label fill is now an opaque darker shade of the border color so the
filled badge visually separates from the bright bbox outline even when
they share a touching edge. Font size reduced 16 -> 11px (height 22 ->
15px) so the badge is no taller than page body text — the labels recede
into the visual hierarchy instead of dominating it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hared-border bbox overlap

Two related placement fixes so the corner-badge invariant is reliably
readable across dense layouts:

1. **Always 'above' unless viewport-clipped.** `getFeasiblePositions`
   now returns only `['above']` when 'above' fits the viewport, or
   only `['below']` when 'above' would be clipped by the viewport's
   top edge. Collision with a same-page neighbor no longer triggers
   a side-flip — the element is deferred to a later highlight page
   instead. Result: any label the viewer sees is always directly
   above its element, with a single exception (viewport-top
   elements get their label directly below). This removes the
   "is this label for the element on its left or the one on its
   right" ambiguity entirely.

2. **Tolerate 1-2 px shared-border bbox overlaps.** On finviz's
   filter-tab row, Fundamental (x=754..852) and Technical (x=851..928)
   share a 1px border at x=851..852 — a DOM rendering artifact, not
   a real occlusion. `bboxesPartiallyOverlap` now requires ≥ 3px on
   BOTH axes before treating an intersection as a real overlap, so
   adjacent tabs/buttons can coexist on the same highlight page.
   Label-vs-neighbor-bbox and bbox-vs-neighbor-label checks use
   strict `bboxesIntersect` (not clearance-inflated), so a label
   touching a horizontally-adjacent element's top edge at a shared
   row border is not treated as a collision — only actual pixel
   intrusion blocks placement.

End-to-end on https://finviz.com/screener.ashx?v=121:
- Before: 7 highlight pages, Fundamental deferred to page 3,
  Descriptive/News/All alternating above/below on page 1.
- After: 4 highlight pages, all 6 filter tabs (Descriptive,
  Fundamental, Technical, News, ETF, All) on page 1 with 'above'
  labels. Page 1 has 256 elements, 255 'above' + 1 'below' (the
  one 'below' is a viewport-top element).

Tests updated to reflect the new doctrine:
- `viewport-top element uses "below" while interior element uses "above"`
- `colliding "above" labels defer one element to a later page (no side-flip)`
- `"above" blocked by a neighbor defers the element to a later page`
- `center element surrounded above and below eventually gets placed`
- `two elements separated vertically beyond the corner-badge footprint do not collide`
- `tight label-to-element proximity under the corner-badge geometry is blocked`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
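
The "always above, defer on collision" rule can be sketched as below. This is an illustrative Python model, not the extension's TypeScript: the 15 px label height comes from the corner-badge commit message, while the greedy first-fit paging in `assign_pages` and all function names are assumptions.

```python
from typing import Callable, List

LABEL_HEIGHT = 15  # px, per the corner-badge commit message


def feasible_position(element_top: float, viewport_top: float = 0.0,
                      label_height: float = LABEL_HEIGHT) -> str:
    """'above' unless the label would be clipped by the viewport's top edge."""
    return "above" if element_top - label_height >= viewport_top else "below"


def assign_pages(elements: List[str],
                 collides: Callable[[str, str], bool]) -> List[List[str]]:
    """Place each element on the first page where it collides with nothing.

    A collision never flips the label to another side; the element is
    simply deferred to a later page (assumed here to be first-fit).
    """
    pages: List[List[str]] = []
    for element in elements:
        for page in pages:
            if not any(collides(element, other) for other in page):
                page.append(element)
                break
        else:
            pages.append([element])  # nothing fits: open a new page
    return pages
```

In this model `len(assign_pages(...))` plays the role of `total_pages`: overflow grows the page count rather than producing an ambiguous label.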
Replace the `outerHTML` dump in the "Highlighted Elements" LLM observation
with a compact per-element descriptor built in the page world from the live
DOM. Each element renders on a single line like:

  id(type): <tag> "text" · attr=val … flags

with an indented `options:` block for `<select>` that lists every `<option>`
in full (value, label, optgroup, selected/disabled) — the agent depends on
that full inventory to pick a value for the `select` action. For anonymous
nodes (no text, no accessible name), the descriptor also harvests up to 3
semantic class tokens and an icon hint (`<use xlink:href>` / `<img alt>`) so
the LLM still has a handle (e.g. `class="like-wrapper like-active" · icon=like`).

Wire-side `html` is dropped from the response payload; the field stays on
the in-extension `InteractiveElement` because element-id fingerprinting and
the element cache still hash/search over it. Also refreshes the big-model
highlight/element-interaction prompts and the `select_element` tool
description to match the new format, and pins openhands-sdk/openhands-tools
back to the git rev 7e7766fa203be8ce29eb2ed3adf2fec0262f5fb3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
No logic changes. Runs the repo's pre-commit hooks over:
- server/agent/tools/base.py (black)
- server/tests/unit/test_base_classes.py (black)
- extension descriptor test/source + a few files that had pre-existing
  prettier drift from earlier branch commits (collapse-on-single-line
  function signatures, list/array layout, etc.)

Pre-commit CI was failing on the initial push because black and
prettier disagreed with the line-length choices in the just-landed
descriptor change; this brings everything back to the canonical style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Loosen the "top-left corner" invariant to "top edge, with planner-
chosen horizontal offset." A label may now slide along its element's
top (or bottom) edge to clear collisions with neighbors, but its
x-range is clamped inside the element's x-range whenever the element
is wide enough (labelWidth <= bbox.width). Narrow elements keep the
existing left-aligned behavior — the label is allowed to extend past
the element edges only when no offset could avoid it.

The clamp preserves the "directly above/below me" binding cue that
previously motivated the strict left-corner rule: a shifted label
never drifts over a neighbor's territory to avoid a collision;
instead the element is deferred to a later page, same as before.

Changes:
- types.ts: LabelPosition narrowed to 'above' | 'below' (left/right
  were already dead after the corner-badge commit); add optional
  labelXOffset?: number.
- collision-detection.ts: getLabelBBox / expandBBoxWithLabel /
  isLabelWithinViewport / elementsCollide accept xOffset.
  clampLabelXOffset enforces the x-range invariant. New
  getCandidateXOffsets returns [0, slack] when slack > 0, else [0].
  getFeasiblePositions → getFeasiblePlacements returning
  {position, xOffset}[]. Tie-break: prefer above, then xOffset=0.
- background/index.ts: drop left/right from the fallback order,
  pass labelXOffset through the in-page render payload, apply the
  clamp in the DOM renderer too.
- Tests updated: drop left/right cases; add tests covering the
  x-range clamp, narrow-element fallback, and an adjacent-element
  scenario where tight label clearance is resolved by shifting the
  right-hand label.

All 189 extension tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
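
The x-range clamp described in this commit reduces to a small amount of arithmetic. The sketch below mirrors the described behavior of `clampLabelXOffset` and `getCandidateXOffsets` in Python; the signatures are assumptions, since the TypeScript originals are not shown here.

```python
from typing import List


def clamp_label_x_offset(x_offset: float, bbox_width: float,
                         label_width: float) -> float:
    """Keep the label's x-range inside the element's x-range when it fits."""
    slack = bbox_width - label_width
    if slack <= 0:
        # Narrow element: the label cannot fit, so it stays left-aligned.
        return 0.0
    return max(0.0, min(x_offset, slack))


def get_candidate_x_offsets(bbox_width: float, label_width: float) -> List[float]:
    """Offsets the planner may try: left-aligned, then right-aligned."""
    slack = bbox_width - label_width
    return [0.0, slack] if slack > 0 else [0.0]
```

For a 98 px wide tab and a 30 px label the candidates are `[0.0, 68]`; any requested offset beyond 68 is clamped back so the label never drifts over a neighbor.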
…tion labels on resize

- Filter elements with both dimensions <8px before they enter the collision
  planner. Previously, tiny decorative dots (e.g. bullet indicators) were
  detected, placed on page 1, and blocked adjacent meaningful elements
  (links) from fitting — deferring them to later pages unnecessarily.

- Reposition highlight labels on window resize so they stay attached to
  their outlined elements instead of drifting when page layout changes.

- Clean up resize listener in both highlight cleanup and re-highlight paths.

Update ob-routines SKILL.md paths from project-relative skill/claude/
to ~/.claude/skills/ (matching open-browser). Add Claude skill install
instructions to README.

Three targeted fixes landed from the 20260420 full-eval regression
analysis (flash 80→57% pass, plus 88→68%).

- highlight_tool prompts (big + small): add a canonical pagination rule
  with a generic positive example, and remove redundant pagination
  bullets so the rule has one source of truth. Flash had been picking
  an approximate id from page 1 instead of paginating when the exact
  target wasn't present yet.

- _format_highlighted_element_lines: cap rendered <select> options at
  20 with a trailer directing the agent to re-highlight with
  element_type="selectable" for the full list; always include the
  currently-selected option even when past the cap.

- OpenBrowserAction: accept `summary` as SkipJsonSchema[Optional[str]]
  (same pattern as conversation_id). The SDK's Schema(extra="forbid")
  was rejecting tool calls where the LLM emitted `summary` as a tool
  arg; now accepted-and-ignored with exclude=True so it stays out of
  both the JSON schema and serialization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
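
The `<select>` cap from the second fix can be sketched like this. The 20-option cap, the trailer's purpose, and the "always show the selected option" rule come from the commit message; the option dict shape, the line layout, and the trailer wording are hypothetical.

```python
from typing import Dict, List

MAX_SELECT_OPTIONS = 20  # cap stated in the commit message


def _format_option(option: Dict) -> str:
    flag = " selected" if option.get("selected") else ""
    return f'    value={option["value"]} "{option["label"]}"{flag}'


def render_select_options(options: List[Dict],
                          cap: int = MAX_SELECT_OPTIONS) -> List[str]:
    """Render at most `cap` options, plus the selected one if it is past the cap."""
    lines = [_format_option(o) for o in options[:cap]]
    # The currently selected option is always shown, even beyond the cap.
    lines += [_format_option(o) for o in options[cap:] if o.get("selected")]
    hidden = len(options) - len(lines)
    if hidden > 0:
        # Trailer wording is illustrative, not the real prompt text.
        lines.append(
            f'    ... {hidden} more options; re-highlight with '
            'element_type="selectable" for the full list'
        )
    return lines
```

With 25 options and option 22 selected, the result has 20 capped lines, one line for the selected option, and one trailer line reporting the 4 that remain hidden.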
…outerHTML

Previously `generateShortHash` and `buildElementIdentityKey` hashed on
`cssPath + outerHTML`. outerHTML changes on common interactions — class
flips on :focus, `value` attr update per keystroke, `aria-expanded`
toggle on <select>/<details> — causing the element's short id (R6Y,
QX6, etc.) to churn between highlight refreshes. The agent then loses
the label it was about to act on.

Introduce `getStableIdentityInput(element)` that prefers the existing
detection-time `fingerprint` (tag + semantic attrs + text, built by
`getElementFingerprint` in highlight-detection.injected.js) over raw
outerHTML. Falls back to outerHTML for legacy producers/tests that
haven't populated fingerprint yet.

Covered by 5 new tests in element-id-stability.test.ts:
- <input> gains class="focused" on click
- typing updates <input> value attribute
- <select> aria-expanded flips true/false
- genuinely distinct fingerprints still get distinct ids
- fallback path works when fingerprint is absent

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
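
A small Python model of the idea: hash the stable fingerprint when it exists and fall back to `outerHTML` otherwise. The SHA-256 digest, the base-36 alphabet, and the 3-character length are assumptions chosen to resemble ids like `R6Y`; the extension's actual `generateShortHash` may use a different hash entirely.

```python
import hashlib
from typing import Optional

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def get_stable_identity_input(fingerprint: Optional[str], outer_html: str) -> str:
    """Prefer the detection-time fingerprint; fall back to raw outerHTML."""
    return fingerprint if fingerprint else outer_html


def generate_short_hash(css_path: str, identity_input: str, length: int = 3) -> str:
    """Derive a short uppercase id from the CSS path plus the identity input."""
    digest = hashlib.sha256(f"{css_path}\n{identity_input}".encode("utf-8")).digest()
    number = int.from_bytes(digest[:8], "big")
    chars = []
    for _ in range(length):
        number, remainder = divmod(number, len(ALPHABET))
        chars.append(ALPHABET[remainder])
    return "".join(chars)
```

Because the fingerprint excludes volatile state, an `<input>` that gains `class="focused"` keeps the same identity input and therefore the same short id across highlight refreshes.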
- litellm: 2eb7db59 -> 363075400d (qwen3.6-flash pricing + DashScope
  cache_control passthrough for explicit context cache hits).
- openhands-sdk / openhands-tools: 7e7766fa -> bf7ffb96 (picks up the
  new litellm pin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up agent-sdk@df0056f1 which adds qwen3.5/3.6 families to
PROMPT_CACHE_MODELS so `LLM(model="dashscope/qwen3.6-flash", ...)`
emits cache_control and actually hits DashScope's context cache.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously any non-get_tabs command flushed every tab's pending highlight
cleanup, wiping overlays on tabs the command never touched. Scope the
flush to command.tab_id so each tab's debug overlay only clears when
that tab receives its next operation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
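
The per-tab scoping can be modeled as a registry keyed by tab id. This is a sketch in Python rather than the extension's TypeScript, and the class and method names are hypothetical; only the `get_tabs` exclusion and the `command.tab_id` scoping come from the commit messages.

```python
from typing import Callable, Dict, List

READ_ONLY_COMMANDS = {"get_tabs"}  # read-only follow-ups never flush


class HighlightCleanupRegistry:
    """Holds deferred highlight cleanups, grouped by the tab that owns them."""

    def __init__(self) -> None:
        self._pending: Dict[int, List[Callable[[], None]]] = {}

    def defer(self, tab_id: int, cleanup: Callable[[], None]) -> None:
        self._pending.setdefault(tab_id, []).append(cleanup)

    def on_command(self, command_type: str, tab_id: int) -> int:
        """Flush only the target tab's cleanups; return how many ran."""
        if command_type in READ_ONLY_COMMANDS:
            return 0
        cleanups = self._pending.pop(tab_id, [])
        for cleanup in cleanups:
            cleanup()
        return len(cleanups)

    def pending_tabs(self) -> List[int]:
        return sorted(self._pending)
```

A command addressed to tab 1 clears tab 1's overlay and leaves tab 2's pending cleanup untouched until tab 2 receives its own next operation.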
@CLAassistant

CLAassistant commented Apr 20, 2026


CLA assistant check
All committers have signed the CLA.

softpudding and others added 5 commits April 21, 2026 08:27
Each test now spawns its own `eval/server.py` on an OS-assigned port and
only the conversation under test talks to it. Rewriting `localhost:16605`
in `start_url` / `instruction` points the agent at the right port, and
the tracker's relative `/api/track` POST naturally lands on the same
server. This eliminates two bug classes the prior shared-server setup
exposed in multi-model parallel runs:

- Cross-conversation commingling in `events_store["sites"][bucket]`
- Mid-conversation `sessionId` rotation making scoring miss early events

Changes:
- `eval/server.py`: accept `--port=N` (0 = OS-assigned); print
  `EVAL_SERVER_LISTENING_PORT=<port>` as a stdout handshake line so the
  parent can parse the bound port.
- `eval/evaluate_browser_agent.py`: add `EvalServerProcess` that spawns,
  reads the handshake, drains stdout, and tears down (SIGTERM → SIGKILL)
  with `atexit` + per-instance process group. `EvalServerClient` takes
  `port=`; `?site=` filter is dropped. `run_test` and `run_manual_test`
  own the server's lifecycle in a `finally`. `ServiceManager` no longer
  health-checks port 16605.
- `server/tests/unit/test_eval_client.py`: the teardown test now
  monkeypatches `EvalServerProcess` / `EvalServerClient` and asserts the
  per-test server is stopped. All 411 unit tests pass.

Verified: two processes spawned concurrently get distinct ports, each
client only sees its own server's events, both ports unreachable after
stop(). Full 4-model eval run (140 runs) confirms scoring now tracks the
agent's actual behaviour — previously cross-talked tests moved up by up
to +9 points.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
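
The stdout handshake and the SIGTERM → SIGKILL teardown can be sketched as follows. The child script is a stand-in that binds to port 0 and prints the handshake line, not the real `eval/server.py`, and the helper names are hypothetical. The real `EvalServerProcess` also drains stdout and uses a per-instance process group with an `atexit` hook, which this sketch omits.

```python
import subprocess
import sys
from typing import Optional, Tuple

HANDSHAKE_PREFIX = "EVAL_SERVER_LISTENING_PORT="

# Stand-in child process (assumption): binds an OS-assigned port, announces it.
CHILD_SCRIPT = r"""
import socket, time
s = socket.socket()
s.bind(("127.0.0.1", 0))
s.listen()
print("EVAL_SERVER_LISTENING_PORT=%d" % s.getsockname()[1], flush=True)
time.sleep(60)
"""


def parse_handshake(line: str) -> Optional[int]:
    """Return the bound port if the line is the handshake, else None."""
    line = line.strip()
    if not line.startswith(HANDSHAKE_PREFIX):
        return None
    return int(line[len(HANDSHAKE_PREFIX):])


def spawn_and_read_port() -> Tuple[subprocess.Popen, int]:
    """Start the child and block until it prints its handshake line."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_SCRIPT], stdout=subprocess.PIPE, text=True
    )
    for line in proc.stdout:
        port = parse_handshake(line)
        if port is not None:
            return proc, port
    proc.wait()
    raise RuntimeError("child exited before printing the handshake line")


def stop(proc: subprocess.Popen, grace_seconds: float = 5.0) -> None:
    """Ask the child to terminate, then force-kill it if it lingers."""
    proc.terminate()
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
    if proc.stdout:
        proc.stdout.close()
```

Calling `spawn_and_read_port()` twice gives two children on distinct ports, and `stop()` on each releases them, which is the property the verification note above describes.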
Line-wrap fixes from prettier that the pre-commit hook flagged in CI on
the last push. Pure formatting — no behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Check in the latest `eval/evaluation_report.json` and update both README
files to reflect it. The current suite is 35 tasks × 4 models = 140 runs
(up from the old 12-task × 2-model snapshot) and includes the Qwen3.6
family alongside the 3.5 family.

Snapshot (2026-04-21 02:09:48), overall 111/140 passed (79.3%):

- qwen3.5-plus : 30/35 passed, 276.2/304.8 task, 309.5s avg, ¥0.598
- qwen3.6-flash: 29/35 passed, 273.0/304.8 task, 252.3s avg, ¥0.804
- qwen3.6-plus : 28/35 passed, 262.4/304.8 task, 337.6s avg, ¥1.605
- qwen3.5-flash: 24/35 passed, 243.1/304.8 task, 308.8s avg, ¥0.144

Run was the first under the new per-test mock-server isolation, so
scores reflect actual agent behaviour rather than cross-talk artefacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
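
As a cross-check of the snapshot arithmetic, the pass counts above reproduce the reported totals and percentages. The dictionary simply restates the numbers from this commit message; the helper name is illustrative.

```python
TASKS_PER_MODEL = 35

# Pass counts as reported in the snapshot above.
passed_by_model = {
    "qwen3.5-plus": 30,
    "qwen3.6-flash": 29,
    "qwen3.6-plus": 28,
    "qwen3.5-flash": 24,
}


def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place."""
    return round(100 * passed / total, 1)


total_passed = sum(passed_by_model.values())
total_runs = TASKS_PER_MODEL * len(passed_by_model)
overall = pass_rate(total_passed, total_runs)
per_model = {m: pass_rate(p, TASKS_PER_MODEL) for m, p in passed_by_model.items()}
```

This gives 111 of 140 runs overall (79.3 %), with per-model rates of 85.7 %, 82.9 %, 80.0 %, and 68.6 %.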
…registry

Two entries (@types/ws-8.18.1, ws-8.20.0) resolved to
`registry.anpm.alibaba-inc.com`, Alibaba's internal mirror. That host is
unreachable from the GitHub Actions runners, so CI's `npm ci` timed out
on both Pre-commit and Extension Tests. Integrity hashes are unchanged
because the mirrored tarballs are bit-identical; only the resolved URLs
move to `registry.npmjs.org`. Verified by a clean `npm ci` + `npm test`
locally (195 tests pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Black-only line-wrap changes the pre-commit hook flagged in CI for the
three files added / modified under `server/agent/tools/` and the related
unit tests. Pure formatting — no behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@softpudding softpudding force-pushed the fix/routine-replay-attention-decay branch from 66ccdc8 to 61e3868 Compare April 21, 2026 01:08
@softpudding softpudding merged commit 77b8239 into main Apr 21, 2026
4 checks passed
@softpudding softpudding deleted the fix/routine-replay-attention-decay branch April 21, 2026 01:10

2 participants