Structured highlight descriptors + routine-replay attention decay fixes + label placement hardening #65
Merged
… better coverage

Keep injected highlights on the page until the next user-visible command (read-only follow-ups like get_tabs are excluded) so the user can see what was highlighted.

Switch element borders from inset box-shadow to outline with negative offset so opaque child content (e.g. <a class="cover mask ld"> wrapping a full-bleed image) can no longer hide the border.

Move labels into document-coordinate space so they scroll with the outlined elements.

Single-highlight confirmation keeps the yellow "Is this the element you wanted to …" design via a dedicated DOM-injection script + canvas crop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…inding

Replace the 4-side (above/below/left/right) label placement algorithm with a strict top-or-bottom corner badge: every label is anchored at the top-left corner of its own element's bbox, sitting fully outside the box and touching the edge.

Sideways placements are disabled entirely because they routinely produced labels between two adjacent elements that read as belonging to the wrong one (session 444122cb: `UHT` between Fundamental and Technical tabs looked like it labeled Fundamental).

When a label cannot fit above OR below its element (collision or viewport edge), the element defers to a later highlight page rather than being placed ambiguously. `total_pages` absorbs the overflow.

Label fill is now an opaque darker shade of the border color so the filled badge visually separates from the bright bbox outline even when they share a touching edge. Font size reduced 16 -> 11px (height 22 -> 15px) so the badge is no taller than page body text: the labels recede into the visual hierarchy instead of dominating it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
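The corner-badge geometry described above can be sketched as follows. This is a minimal illustration, not the extension's code: the `BBox` shape and the `getLabelBBox` signature are assumptions, and only the 15px label height comes from the commit message.

```typescript
// Hypothetical bbox shape in document coordinates (px); the real type may differ.
interface BBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

type LabelPosition = "above" | "below";

// Label height quoted in the commit message (22 -> 15px).
const LABEL_HEIGHT_PX = 15;

// Anchor the badge at the element's left edge, fully outside the box and
// touching its top edge ("above") or bottom edge ("below").
function getLabelBBox(element: BBox, labelWidth: number, position: LabelPosition): BBox {
  const y =
    position === "above"
      ? element.y - LABEL_HEIGHT_PX // badge bottom touches element top
      : element.y + element.height; // badge top touches element bottom
  return { x: element.x, y: y, width: labelWidth, height: LABEL_HEIGHT_PX };
}

// Illustrative element: an 80x20 tab whose top-left corner is at (100, 50).
const tab: BBox = { x: 100, y: 50, width: 80, height: 20 };
const aboveBadge = getLabelBBox(tab, 30, "above");
const belowBadge = getLabelBBox(tab, 30, "below");
```

With these inputs the "above" badge spans y = 35..50 and the "below" badge spans y = 70..85, so each shares exactly one edge with the element and never overlaps it.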
…hared-border bbox overlap

Two related placement fixes so the corner-badge invariant is reliably readable across dense layouts:

1. **Always 'above' unless viewport-clipped.** `getFeasiblePositions` now returns only `['above']` when 'above' fits the viewport, or only `['below']` when 'above' would be clipped by the viewport's top edge. Collision with a same-page neighbor no longer triggers a side-flip; the element is deferred to a later highlight page instead. Result: any label the viewer sees is always directly above its element, with a single exception (viewport-top elements get their label directly below). This removes the "is this label for the element on its left or the one on its right" ambiguity entirely.

2. **Tolerate 1-2 px shared-border bbox overlaps.** On finviz's filter-tab row, Fundamental (x=754..852) and Technical (x=851..928) share a 1px border at x=851..852, a DOM rendering artifact rather than a real occlusion. `bboxesPartiallyOverlap` now requires ≥ 3px on BOTH axes before treating an intersection as a real overlap, so adjacent tabs/buttons can coexist on the same highlight page. Label-vs-neighbor-bbox and bbox-vs-neighbor-label checks use strict `bboxesIntersect` (not clearance-inflated), so a label touching a horizontally-adjacent element's top edge at a shared row border is not treated as a collision; only actual pixel intrusion blocks placement.

End-to-end on https://finviz.com/screener.ashx?v=121:
- Before: 7 highlight pages, Fundamental deferred to page 3, Descriptive/News/All alternating above/below on page 1.
- After: 4 highlight pages, all 6 filter tabs (Descriptive, Fundamental, Technical, News, ETF, All) on page 1 with 'above' labels. Page 1 has 256 elements, 255 'above' + 1 'below' (the one 'below' is a viewport-top element).

Tests updated to reflect the new doctrine:
- `viewport-top element uses "below" while interior element uses "above"`
- `colliding "above" labels defer one element to a later page (no side-flip)`
- `"above" blocked by a neighbor defers the element to a later page`
- `center element surrounded above and below eventually gets placed`
- `two elements separated vertically beyond the corner-badge footprint do not collide`
- `tight label-to-element proximity under the corner-badge geometry is blocked`

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
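The two placement fixes in this commit can be sketched roughly as below. The function names come from the commit message, but the signatures, the `BBox` shape, and the assumption that bboxes are in viewport coordinates (top edge at y = 0) are illustrative; any containment handling the real `bboxesPartiallyOverlap` performs is omitted. The y/height values in the example are made up; only the x-ranges come from the finviz case.

```typescript
// Hypothetical bbox shape in viewport coordinates (px).
interface BBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Threshold from the commit message: ">= 3px on BOTH axes".
const MIN_REAL_OVERLAP_PX = 3;

// Length of the overlap between two 1-D intervals, or 0 when disjoint.
function overlapLength(aStart: number, aEnd: number, bStart: number, bEnd: number): number {
  return Math.max(0, Math.min(aEnd, bEnd) - Math.max(aStart, bStart));
}

// Strict check: any positive-area intersection counts.
function bboxesIntersect(a: BBox, b: BBox): boolean {
  return (
    overlapLength(a.x, a.x + a.width, b.x, b.x + b.width) > 0 &&
    overlapLength(a.y, a.y + a.height, b.y, b.y + b.height) > 0
  );
}

// Tolerant check: 1-2 px shared borders are not treated as a real overlap.
function bboxesPartiallyOverlap(a: BBox, b: BBox): boolean {
  const dx = overlapLength(a.x, a.x + a.width, b.x, b.x + b.width);
  const dy = overlapLength(a.y, a.y + a.height, b.y, b.y + b.height);
  return dx >= MIN_REAL_OVERLAP_PX && dy >= MIN_REAL_OVERLAP_PX;
}

// 'above' when the label fits under the viewport's top edge, else 'below'.
function getFeasiblePositions(element: BBox, labelHeight: number): Array<"above" | "below"> {
  const aboveFits = element.y - labelHeight >= 0;
  return aboveFits ? ["above"] : ["below"];
}

// finviz tabs from the commit message: x = 754..852 and x = 851..928.
const fundamental: BBox = { x: 754, y: 100, width: 98, height: 30 };
const technical: BBox = { x: 851, y: 100, width: 77, height: 30 };
const sharedBorderIsOverlap = bboxesPartiallyOverlap(fundamental, technical);
```

Here the two tabs intersect under the strict check (they share one pixel column) but not under the tolerant one, which is what lets them share a highlight page.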
Replace the `outerHTML` dump in the "Highlighted Elements" LLM observation with a compact per-element descriptor built in the page world from the live DOM. Each element renders on a single line like:

    id(type): <tag> "text" · attr=val … flags

with an indented `options:` block for `<select>` that lists every `<option>` in full (value, label, optgroup, selected/disabled). The agent depends on that full inventory to pick a value for the `select` action.

For anonymous nodes (no text, no accessible name), the descriptor also harvests up to 3 semantic class tokens and an icon hint (`<use xlink:href>` / `<img alt>`) so the LLM still has a handle (e.g. `class="like-wrapper like-active" · icon=like`).

Wire-side `html` is dropped from the response payload; the field stays on the in-extension `InteractiveElement` because element-id fingerprinting and the element cache still hash/search over it.

Also refreshes the big-model highlight/element-interaction prompts and the `select_element` tool description to match the new format, and pins openhands-sdk/openhands-tools back to the git rev 7e7766fa203be8ce29eb2ed3adf2fec0262f5fb3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
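To make the single-line format concrete, here is a rough sketch of a formatter. Everything in it is an assumption for illustration: the `ElementDescriptor` shape, the `formatDescriptorLine` name, the type value `clickable`, the quoting rule for attribute values, and the exact separator placement are guesses based only on the format string and the `class="like-wrapper like-active" · icon=like` example above.

```typescript
// Hypothetical descriptor shape; the extension's real payload may differ.
interface ElementDescriptor {
  id: string;
  type: string;
  tag: string;
  text?: string;
  attrs?: Array<[string, string]>; // ordered name/value pairs
  flags?: string[];
}

// Assumed rule: quote a value only when it contains whitespace.
function formatAttr(name: string, value: string): string {
  return /\s/.test(value) ? name + '="' + value + '"' : name + "=" + value;
}

// Render `id(type): <tag> "text" · attr=val … flags` on one line.
function formatDescriptorLine(d: ElementDescriptor): string {
  let head = d.id + "(" + d.type + "): <" + d.tag + ">";
  if (d.text) head += ' "' + d.text + '"';
  const segments: string[] = [head];
  for (const pair of d.attrs || []) {
    segments.push(formatAttr(pair[0], pair[1]));
  }
  let line = segments.join(" · ");
  if (d.flags && d.flags.length > 0) line += " " + d.flags.join(" ");
  return line;
}

// An anonymous node: no text, so class tokens and the icon hint carry the meaning.
const likeButtonLine = formatDescriptorLine({
  id: "R6Y",
  type: "clickable",
  tag: "span",
  attrs: [
    ["class", "like-wrapper like-active"],
    ["icon", "like"],
  ],
});
```

The point of the example is the anonymous-node case: with no text to show, the class tokens and icon hint are the only handle the model gets.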
No logic changes. Runs the repo's pre-commit hooks over:
- server/agent/tools/base.py (black)
- server/tests/unit/test_base_classes.py (black)
- extension descriptor test/source + a few files that had pre-existing prettier drift from earlier branch commits (collapse-on-single-line function signatures, list/array layout, etc.)

Pre-commit CI was failing on the initial push because black and prettier disagreed with the line-length choices in the just-landed descriptor change; this brings everything back to the canonical style.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Loosen the "top-left corner" invariant to "top edge, with planner-
chosen horizontal offset." A label may now slide along its element's
top (or bottom) edge to clear collisions with neighbors, but its
x-range is clamped inside the element's x-range whenever the element
is wide enough (labelWidth <= bbox.width). Narrow elements keep the
existing left-aligned behavior — the label is allowed to extend past
the element edges only when no offset could avoid it.
The clamp preserves the "directly above/below me" binding cue that
previously motivated the strict left-corner rule: a shifted label
never drifts over a neighbor's territory to avoid a collision;
instead the element is deferred to a later page, same as before.
Changes:
- types.ts: LabelPosition narrowed to 'above' | 'below' (left/right
were already dead after the corner-badge commit); add optional
labelXOffset?: number.
- collision-detection.ts: getLabelBBox / expandBBoxWithLabel /
isLabelWithinViewport / elementsCollide accept xOffset.
clampLabelXOffset enforces the x-range invariant. New
getCandidateXOffsets returns [0, slack] when slack > 0, else [0].
getFeasiblePositions → getFeasiblePlacements returning
{position, xOffset}[]. Tie-break: prefer above, then xOffset=0.
- background/index.ts: drop left/right from the fallback order,
pass labelXOffset through the in-page render payload, apply the
clamp in the DOM renderer too.
- Tests updated: drop left/right cases; add tests covering the
x-range clamp, narrow-element fallback, and an adjacent-element
scenario where tight label clearance is resolved by shifting the
right-hand label.
All 189 extension tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
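The x-range clamp and the candidate offsets from this commit can be sketched as follows. The function names match the commit message, but the signatures are assumptions, `slack` is taken to mean `bbox.width - labelWidth`, and `orderPlacements` is a hypothetical helper that only illustrates the stated tie-break (prefer above, then xOffset = 0).

```typescript
type LabelPosition = "above" | "below";

interface Placement {
  position: LabelPosition;
  xOffset: number;
}

// Keep the label's x-range inside the element's x-range when the element is
// wide enough; narrow elements fall back to left alignment (offset 0).
function clampLabelXOffset(xOffset: number, bboxWidth: number, labelWidth: number): number {
  const slack = bboxWidth - labelWidth;
  if (slack <= 0) return 0;
  return Math.min(Math.max(xOffset, 0), slack);
}

// Candidate offsets: left-aligned, plus right-aligned when there is room to slide.
function getCandidateXOffsets(bboxWidth: number, labelWidth: number): number[] {
  const slack = bboxWidth - labelWidth;
  return slack > 0 ? [0, slack] : [0];
}

// Tie-break: 'above' before 'below', then the smaller offset (0 first).
function orderPlacements(placements: Placement[]): Placement[] {
  return placements.slice().sort(function (a, b) {
    if (a.position !== b.position) return a.position === "above" ? -1 : 1;
    return a.xOffset - b.xOffset;
  });
}

// A 100px-wide element with a 30px label has 70px of slack.
const wideCandidates = getCandidateXOffsets(100, 30);
// A 20px-wide element cannot contain a 30px label, so only offset 0 is tried.
const narrowCandidates = getCandidateXOffsets(20, 30);
```

Because offsets are never negative after clamping, sorting by `xOffset` ascending is enough to put the unshifted placement first within each position.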
…tion labels on resize

- Filter elements with both dimensions <8px before they enter the collision planner. Previously, tiny decorative dots (e.g. bullet indicators) were detected, placed on page 1, and blocked adjacent meaningful elements (links) from fitting, deferring them to later pages unnecessarily.
- Reposition highlight labels on window resize so they stay attached to their outlined elements instead of drifting when page layout changes.
- Clean up resize listener in both highlight cleanup and re-highlight paths.
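The tiny-element filter in the first bullet amounts to a size predicate applied before planning. A minimal sketch, assuming a hypothetical `Sized` shape and helper names; only the 8px threshold and the "both dimensions" condition come from the commit message.

```typescript
// Hypothetical minimal shape: anything that carries a bbox with a size.
interface Sized {
  bbox: { width: number; height: number };
}

// Threshold from the commit message.
const MIN_DIMENSION_PX = 8;

// An element is dropped only when BOTH dimensions are below the threshold,
// so thin-but-long elements (e.g. a 6x200 divider link) are kept.
function isTinyDecorative(el: Sized): boolean {
  return el.bbox.width < MIN_DIMENSION_PX && el.bbox.height < MIN_DIMENSION_PX;
}

// Run this before handing elements to the collision planner.
function filterPlaceable<T extends Sized>(elements: T[]): T[] {
  return elements.filter(function (el) {
    return !isTinyDecorative(el);
  });
}

const candidates = [
  { name: "bullet-dot", bbox: { width: 6, height: 6 } },
  { name: "thin-link", bbox: { width: 6, height: 20 } },
  { name: "icon-button", bbox: { width: 8, height: 8 } },
];
const placeable = filterPlaceable(candidates);
```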
Update ob-routines SKILL.md paths from project-relative skill/claude/ to ~/.claude/skills/ (matching open-browser). Add Claude skill install instructions to README.
Three targeted fixes landed from the 20260420 full-eval regression analysis (flash 80→57% pass, plus 88→68%).

- highlight_tool prompts (big + small): add a canonical pagination rule with a generic positive example, and remove redundant pagination bullets so the rule has one source of truth. Flash had been picking an approximate id from page 1 instead of paginating when the exact target wasn't present yet.
- _format_highlighted_element_lines: cap rendered <select> options at 20 with a trailer directing the agent to re-highlight with element_type="selectable" for the full list; always include the currently-selected option even when past the cap.
- OpenBrowserAction: accept `summary` as SkipJsonSchema[Optional[str]] (same pattern as conversation_id). The SDK's Schema(extra="forbid") was rejecting tool calls where the LLM emitted `summary` as a tool arg; now accepted-and-ignored with exclude=True so it stays out of both the JSON schema and serialization.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
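The option cap in the second bullet can be illustrated as below. The real formatter is the Python function `_format_highlighted_element_lines`; this is a TypeScript sketch of the same idea, with hypothetical names (`SelectOption`, `capSelectOptions`, `renderOptionLines`), an invented trailer wording, and the assumption that a selected option past the cap is simply appended after the first 20.

```typescript
// Hypothetical option shape for illustration.
interface SelectOption {
  value: string;
  label: string;
  selected?: boolean;
}

// Cap from the commit message.
const MAX_RENDERED_OPTIONS = 20;

// Keep the first `cap` options, plus any selected option that falls past the cap.
function capSelectOptions(
  options: SelectOption[],
  cap: number = MAX_RENDERED_OPTIONS
): { shown: SelectOption[]; omitted: number } {
  const shown = options.slice(0, cap);
  for (const opt of options.slice(cap)) {
    if (opt.selected) shown.push(opt);
  }
  return { shown: shown, omitted: options.length - shown.length };
}

// Render indented option lines, with a trailer when anything was omitted.
function renderOptionLines(options: SelectOption[]): string[] {
  const capped = capSelectOptions(options);
  const lines = capped.shown.map(function (o) {
    return "  " + o.value + ": " + o.label + (o.selected ? " (selected)" : "");
  });
  if (capped.omitted > 0) {
    // Trailer wording is invented; the source only says it points at element_type="selectable".
    lines.push("  ... " + capped.omitted + ' more; re-highlight with element_type="selectable" for the full list');
  }
  return lines;
}

// 25 options with the 23rd (index 22) selected, i.e. past the cap.
const manyOptions: SelectOption[] = [];
for (let i = 0; i < 25; i++) {
  manyOptions.push({ value: "v" + i, label: "Option " + i, selected: i === 22 });
}
const cappedResult = capSelectOptions(manyOptions);
```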
…outerHTML

Previously `generateShortHash` and `buildElementIdentityKey` hashed on `cssPath + outerHTML`. outerHTML changes on common interactions (class flips on :focus, `value` attr update per keystroke, `aria-expanded` toggle on <select>/<details>), causing the element's short id (R6Y, QX6, etc.) to churn between highlight refreshes. The agent then loses the label it was about to act on.

Introduce `getStableIdentityInput(element)` that prefers the existing detection-time `fingerprint` (tag + semantic attrs + text, built by `getElementFingerprint` in highlight-detection.injected.js) over raw outerHTML. Falls back to outerHTML for legacy producers/tests that haven't populated fingerprint yet.

Covered by 5 new tests in element-id-stability.test.ts:
- <input> gains class="focused" on click
- typing updates <input> value attribute
- <select> aria-expanded flips true/false
- genuinely distinct fingerprints still get distinct ids
- fallback path works when fingerprint is absent

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
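A rough sketch of the fingerprint-first identity input follows. `getStableIdentityInput` and `buildElementIdentityKey` are named in the commit message, but their signatures here are assumptions, the `ElementLike` shape and the fingerprint string format are hypothetical, and `shortHash` is a stand-in (a djb2-style hash), not the extension's `generateShortHash`.

```typescript
// Hypothetical subset of the in-extension element record.
interface ElementLike {
  cssPath: string;
  html: string; // raw outerHTML, volatile across interactions
  fingerprint?: string; // detection-time tag + semantic attrs + text
}

// Prefer the stable fingerprint; fall back to outerHTML for legacy producers.
function getStableIdentityInput(el: ElementLike): string {
  return el.fingerprint !== undefined && el.fingerprint.length > 0 ? el.fingerprint : el.html;
}

function buildElementIdentityKey(el: ElementLike): string {
  return el.cssPath + "|" + getStableIdentityInput(el);
}

// Stand-in hash (djb2 variant) rendered as a 3-character base-36 id.
function shortHash(input: string): string {
  let h = 5381;
  for (let i = 0; i < input.length; i++) {
    h = ((h << 5) + h + input.charCodeAt(i)) | 0; // h * 33 + c, kept in 32 bits
  }
  const base36 = (h >>> 0).toString(36).toUpperCase();
  return ("000" + base36).slice(-3);
}

// The same <input> before and after a focus-driven class flip.
const beforeFocus: ElementLike = {
  cssPath: "form > input:nth-of-type(1)",
  html: '<input name="q">',
  fingerprint: "input|name=q|",
};
const afterFocus: ElementLike = {
  cssPath: "form > input:nth-of-type(1)",
  html: '<input name="q" class="focused">',
  fingerprint: "input|name=q|",
};
const idBefore = shortHash(buildElementIdentityKey(beforeFocus));
const idAfter = shortHash(buildElementIdentityKey(afterFocus));
```

Since the fingerprint ignores the volatile `class` attribute, both snapshots produce the same key and therefore the same short id.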
- litellm: 2eb7db59 -> 363075400d (qwen3.6-flash pricing + DashScope cache_control passthrough for explicit context cache hits).
- openhands-sdk / openhands-tools: 7e7766fa -> bf7ffb96 (picks up the new litellm pin).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Picks up agent-sdk@df0056f1 which adds qwen3.5/3.6 families to PROMPT_CACHE_MODELS so `LLM(model="dashscope/qwen3.6-flash", ...)` emits cache_control and actually hits DashScope's context cache. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously any non-get_tabs command flushed every tab's pending highlight cleanup, wiping overlays on tabs the command never touched. Scope the flush to command.tab_id so each tab's debug overlay only clears when that tab receives its next operation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
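The tab-scoped flush can be sketched as a small registry keyed by tab id. The class and method names, the `Command` shape, and the `click` command type are hypothetical; only the `get_tabs` exclusion and the scoping to `command.tab_id` come from the commit messages.

```typescript
// Hypothetical command shape; the real one carries more fields.
interface Command {
  type: string;
  tab_id?: number;
}

class HighlightCleanupRegistry {
  // Pending cleanup callbacks, grouped by the tab that owns the overlay.
  private pending: { [tabId: number]: Array<() => void> } = {};

  defer(tabId: number, cleanup: () => void): void {
    const list = this.pending[tabId] || [];
    list.push(cleanup);
    this.pending[tabId] = list;
  }

  // Flush only the tab the command targets; read-only get_tabs never flushes.
  onCommand(command: Command): void {
    if (command.type === "get_tabs" || command.tab_id === undefined) return;
    const list = this.pending[command.tab_id] || [];
    delete this.pending[command.tab_id];
    for (const cleanup of list) cleanup();
  }
}

const registry = new HighlightCleanupRegistry();
const clearedTabs: number[] = [];
registry.defer(1, function () { clearedTabs.push(1); });
registry.defer(2, function () { clearedTabs.push(2); });

registry.onCommand({ type: "get_tabs" }); // read-only: nothing cleared
registry.onCommand({ type: "click", tab_id: 1 }); // clears tab 1 only
```

After these two commands, tab 2's overlay is still pending, which is the behaviour the commit is after.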
Each test now spawns its own `eval/server.py` on an OS-assigned port and only the conversation under test talks to it. Rewriting `localhost:16605` in `start_url` / `instruction` points the agent at the right port, and the tracker's relative `/api/track` POST naturally lands on the same server.

This eliminates two bug classes the prior shared-server setup exposed in multi-model parallel runs:
- Cross-conversation commingling in `events_store["sites"][bucket]`
- Mid-conversation `sessionId` rotation making scoring miss early events

Changes:
- `eval/server.py`: accept `--port=N` (0 = OS-assigned); print `EVAL_SERVER_LISTENING_PORT=<port>` as a stdout handshake line so the parent can parse the bound port.
- `eval/evaluate_browser_agent.py`: add `EvalServerProcess` that spawns, reads the handshake, drains stdout, and tears down (SIGTERM → SIGKILL) with `atexit` + per-instance process group. `EvalServerClient` takes `port=`; `?site=` filter is dropped. `run_test` and `run_manual_test` own the server's lifecycle in a `finally`. `ServiceManager` no longer health-checks port 16605.
- `server/tests/unit/test_eval_client.py`: the teardown test now monkeypatches `EvalServerProcess` / `EvalServerClient` and asserts the per-test server is stopped.

All 411 unit tests pass. Verified: two processes spawned concurrently get distinct ports, each client only sees its own server's events, both ports unreachable after stop(). Full 4-model eval run (140 runs) confirms scoring now tracks the agent's actual behaviour; previously cross-talked tests moved up by up to +9 points.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
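The handshake parsing and URL rewriting steps are simple string operations. The real code is Python (`eval/evaluate_browser_agent.py`); the sketch below restates the idea in TypeScript, and both function names and the sample URL are hypothetical. Only the `EVAL_SERVER_LISTENING_PORT=<port>` line format and the `localhost:16605` default come from the commit message.

```typescript
// Handshake line printed by the child server on stdout.
const HANDSHAKE_PATTERN = /^EVAL_SERVER_LISTENING_PORT=(\d+)\s*$/;

// Return the bound port if the line is a handshake line, otherwise null.
function parseHandshakePort(line: string): number | null {
  const match = HANDSHAKE_PATTERN.exec(line);
  if (match === null) return null;
  const digits: string | undefined = match[1];
  return digits === undefined ? null : parseInt(digits, 10);
}

// Point every reference to the old shared port at the per-test port.
// The \b keeps longer port numbers such as 166050 untouched.
function rewriteEvalServerUrls(text: string, port: number): string {
  return text.replace(/localhost:16605\b/g, "localhost:" + port);
}

const boundPort = parseHandshakePort("EVAL_SERVER_LISTENING_PORT=54321");
const rewrittenInstruction =
  boundPort === null
    ? null
    : rewriteEvalServerUrls("Open http://localhost:16605/shop and add an item.", boundPort);
```

A parent process would read child stdout line by line until `parseHandshakePort` returns a number, then apply the rewrite to `start_url` and `instruction` before starting the conversation.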
Line-wrap fixes from prettier that the pre-commit hook flagged in CI on the last push. Pure formatting — no behaviour change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Check in the latest `eval/evaluation_report.json` and update both README files to reflect it. The current suite is 35 tasks × 4 models = 140 runs (up from the old 12-task × 2-model snapshot) and includes the Qwen3.6 family alongside the 3.5 family.

Snapshot (2026-04-21 02:09:48), overall 111/140 passed (79.3%):
- qwen3.5-plus : 30/35 passed, 276.2/304.8 task, 309.5s avg, ¥0.598
- qwen3.6-flash: 29/35 passed, 273.0/304.8 task, 252.3s avg, ¥0.804
- qwen3.6-plus : 28/35 passed, 262.4/304.8 task, 337.6s avg, ¥1.605
- qwen3.5-flash: 24/35 passed, 243.1/304.8 task, 308.8s avg, ¥0.144

Run was the first under the new per-test mock-server isolation, so scores reflect actual agent behaviour rather than cross-talk artefacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…registry Two entries (@types/ws-8.18.1, ws-8.20.0) resolved to `registry.anpm.alibaba-inc.com`, Alibaba's internal mirror. That host is unreachable from the GitHub Actions runners, so CI's `npm ci` timed out on both Pre-commit and Extension Tests. Integrity hashes are unchanged because the mirrored tarballs are bit-identical; only the resolved URLs move to `registry.npmjs.org`. Verified by a clean `npm ci` + `npm test` locally (195 tests pass). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Black-only line-wrap changes the pre-commit hook flagged in CI for the three files added / modified under `server/agent/tools/` and the related unit tests. Pure formatting — no behaviour change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
66ccdc8 to
61e3868
Summary
Three themes in this branch:
- … `outerHTML`, with a corner-badge label overlay that is always "above" (viewport-top exception), defers on collision, and clamps its x-offset inside the element's width.
- … `eval/server.py` on an OS-assigned port, eliminating cross-conversation event commingling and mid-conversation sessionId rotation bugs that were silently suppressing scores on `--parallel > 1` runs. Fresh benchmark snapshot checked in.

Headline commits
1. 3b052cf: Harden routine replay against long-context attention decay
   - `frontend/index.html::buildRoutinePrompt` sends only the SOP markdown; routine name/goal/framing lives in the `ROUTINE_REPLAY` system-prompt block.
   - `server/agent/browser_condenser.py` gains `SMALL_MODEL_TOKEN_OVERRIDES`; seeded with `qwen3.5-flash → 100k` so condensation actually fires.
   - … `66ed257b` (task_tracker plan pinning in routine_replay mode, small-model confirmation-reasoning gate, refocused condenser prompt).
   - `test_browser_condenser.py` covers override/substring/fallback/configure paths.
2. e445430: Bump agent-sdk to `c92a185a`
   Picks up the `SMALL_MODEL_GUIDANCE` / `LARGE_MODEL_GUIDANCE` → `<ACTION_PROTOCOL>` tag rename so the model-tier identity no longer leaks into rendered prompts.
3. 0bc4c16: Persist highlights across commands + outline for better coverage
   - … `box-shadow` to `outline` with negative offset: opaque full-bleed children no longer hide the border.
4. 728eafa: Corner-badge label overlay for unambiguous element binding
   - … `total_pages` absorbs overflow.
5. f915ae4: Labels always "above", defer on collision; tolerate shared-border bbox overlap
   - `getFeasiblePositions` returns `['above']` (or `['below']` at the viewport top). Collisions defer to a later page instead of flipping sides.
   - `bboxesPartiallyOverlap` requires ≥ 3 px on both axes before treating an intersection as an overlap → 1–2 px shared-border artefacts no longer force pagination.
   - … `bboxesIntersect` so labels touching an adjacent element's top edge at a shared row border are no longer collisions.
   - … https://finviz.com/screener.ashx?v=121: 7 pages → 4 pages; all 6 filter tabs land on page 1.
6. ea8cf32: Emit structured element descriptors instead of raw HTML
   The "Highlighted Elements" LLM observation now emits one line per element built in the page world from the live DOM:
       id(type): <tag> "text" · attr=val … flags
   - `<select>` appends an indented `options:` block listing every `<option>` in full.
   - Anonymous nodes harvest up to 3 filtered semantic class tokens and an icon hint (`<use xlink:href>`, `<img alt>`, `[aria-label]`).
   - `InteractiveElement.html` stays internal; the descriptor is the single source of truth on the wire.
   - Shared extractor (`extension/src/commands/element-descriptor.injected.js`) is inlined into both highlight-detection and drag-drop scripts.
   - Server-side formatter (`server/agent/tools/base.py::_format_highlighted_element_lines`) is shared between the main list and drag-and-drop inner-elements.
   - Prompts updated: `big_model/highlight_tool.j2`, `big_model/element_interaction_tool.j2`, and `select_element` description in `server/models/commands.py`.

Follow-up fixes and refinements
- 728eafa follow-ups: f915ae4 / 08206ca / 43f2643 / 77f9e01. Iterative hardening of label placement: always-above rule, horizontal shift of corner badges, tiny-element filtering before the collision planner, and per-tab scope on deferred highlight cleanup (so switching tabs doesn't nuke the other tab's highlights).
- d492019 fix(agent): reduce eval regressions from highlight/descriptor branch
- 64c14e1 fix(extension): hash element ids on stable fingerprint, not volatile outerHTML. Element IDs now hash on a fingerprint (type + tag + stable attrs + text), so focus / ephemeral-value / `aria-expanded` flips don't reassign IDs mid-conversation.
- … (74dcddc fix(eval): per-test mock-site server isolation):
  - … `eval/server.py` on an OS-assigned port; `localhost:16605` in `start_url` / `instruction` is rewritten to that port via `_rewrite_eval_server_urls`.
  - `EvalServerProcess` class owns spawn / handshake / drain / SIGTERM→SIGKILL teardown with an `atexit` hook and per-instance process group.
  - `EvalServerClient` is now port-addressed; the `?site=` filter is removed.
  - `ServiceManager.start_eval_server` is gone; `ensure_services` only checks OpenBrowser (8765).
  - … `--parallel > 1`, the old shared `events_store` let parallel conversations commingle, and within a single conversation a page-reload sessionId rotation could silently hide the early events from the scorer. Isolating the server per test makes the track file contain exactly one sessionId.
- … (78dd2ce eval: refresh benchmark to the 35-task × 4-model suite (2026-04-21)):
  - … `qwen3.5-{flash,plus}` + `qwen3.6-{flash,plus}`, 111 passed (79.3 %). See `eval/evaluation_report.json` and the updated README eval section.
  - `qwen3.5-plus` 30/35 (85.7 %), `qwen3.6-flash` 29/35 (82.9 %), `qwen3.6-plus` 28/35 (80.0 %), `qwen3.5-flash` 24/35 (68.6 %).
  - `qwen3.6-flash` is the fastest model of the four (252 s avg vs 309-338 s for the others).
- c374403 / d559ea7 bump litellm + openhands-sdk to rev `c92a185a` (Qwen prompt-cache enablement, `<ACTION_PROTOCOL>` rename).
- e29e65a moves Claude skills under `~/.claude/skills/` (user-scoped) instead of repo-scoped, so they survive checkout.
- 4102276 + 316a4a2 + 61e3868 apply pre-commit prettier/black formatting flagged by CI; 71df427 repoints two lockfile entries (`ws-8.20.0`, `@types/ws-8.18.1`) from `registry.anpm.alibaba-inc.com` (Alibaba internal, unreachable from GitHub Actions runners) to `registry.npmjs.org`. Integrity hashes preserved because the mirror tarballs are bit-identical.

Test plan
- `cd extension && bun test`: 195/195 pass.
- `uv run pytest server/tests/unit`: 411/411 pass.
- … `eval/evaluation_report.json`.
- … `EvalServerProcess` instances get distinct OS-assigned ports, their `events_store` content is independent, and both ports are unreachable after `stop()`.
- … `<select>` (e.g. country picker) and confirm the indented `options:` block lists every option with group/selected/disabled flags.
- … `class=...` + `icon=...` rather than a bare `<span>`.
- … `/ob-routines execute value-stocks-monthly-drop`, and confirm the replay completes without `please_help_me`, with task_tracker stepping through end-to-end.

🤖 Generated with Claude Code