
skill(ob-routines): record → compile → replay browser routines #62

Merged
softpudding merged 10 commits into `feat/image-input-and-file-upload` from `feat/ob-routines-skill`
Apr 18, 2026

@softpudding

Owner

Summary

New `ob-routines` skill: record a browser session, compile it into a routine SOP, and replay it efficiently. Also bumps the openhands-sdk dependency required by the replay path and includes a small open-browser skill doc fix.

Commits (oldest → newest):

  • `a5a68de` skill(open-browser): document portable `~/.claude/skills` invocation path
  • `fffaf65` chore: update openhands-sdk and openhands-tools to bd4cb29
  • `90992e5` skill(ob-routines): add Browser Routines skill (record → compile → replay)
  • `45a103e` perf(replay): auto-confirm unique clicks and trim image window to 1

What's in the skill

  • `start_recording.py` / `stop_recording.py` — drive the extension's recording mode
  • `compile.py` — turn a recording into a deterministic routine SOP
  • `replay.py` — replay with the agent in the loop, with auto-confirm + image-window trim for speed
  • `list_routines.py` — inventory of saved routines
  • `SKILL.md` — invocation + workflow docs

Test plan

  • Record a short session, compile, replay end-to-end; confirm step-by-step parity
  • `uv sync` resolves the new sdk pin cleanly
  • Existing server unit tests pass

Stack

This is PR 2 of 3:

  1. `feat/image-input-and-file-upload` → `main` (feat: image input for prompts + upload_file browser action #61)
  2. This PR → `feat/image-input-and-file-upload`
  3. `perf/tab-init-highlight-cache` → this branch

After #61 merges to main, retarget this PR to `main`.

🤖 Generated with Claude Code

softpudding and others added 9 commits April 18, 2026 10:21
The skill is now symlinked into ~/.claude/skills/open-browser/ for
global use. Update every `python3 skill/claude/open-browser/...`
reference to `python3 ~/.claude/skills/open-browser/...` so the same
command works from any project's CWD (including inside the OpenBrowser
repo, where the symlink still resolves back here).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…411883e78527b1915fa8c4

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
…play)

Introduces the ob-routines skill (alias for openbrowser-routines) for
capturing, compiling, and replaying named Chrome workflows. Previously lived
only in ~/.claude/skills/routines/; now versioned under skill/claude/ob-routines/
so it can be installed via symlink alongside open-browser.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In routine-replay mode, where the compiled SOP gives the agent precise
element keywords, the 2-phase click/select/keyboard_input confirmation
round-trip and the 3-frame screenshot history both pay for ambiguity
that does not exist.

- BrowserExecutor now tracks the most recent highlight result per
  conversation. When the agent targets the unique element that highlight
  just returned, click/select/keyboard_input skip the pending-confirmation
  round-trip and execute directly. In every other case they fall back to
  the 2-phase confirmation (2PC).
- get_context_image_window(routine_replay=True) returns 1, overriding
  the default of 3 for replay conversations only.
- ob-routines SKILL.md: tighten /ob-routines new to ask only for the
  one-line goal and defer URL/site/parameter questions to the compiler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
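
The two replay optimizations described in this commit can be sketched roughly as follows. This is a minimal illustration, not the actual server code: `ExecutorSketch`, `HighlightResult`, `record_highlight`, and `can_auto_confirm` are hypothetical names, and only `get_context_image_window(routine_replay=True)` and the window sizes 3 and 1 come from the commit message.

```python
from dataclasses import dataclass, field
from typing import Dict, List

DEFAULT_IMAGE_WINDOW = 3  # default screenshot history, per the commit message
REPLAY_IMAGE_WINDOW = 1   # replay conversations keep only the latest frame


def get_context_image_window(routine_replay: bool = False) -> int:
    """Return how many screenshots stay in the agent's context."""
    return REPLAY_IMAGE_WINDOW if routine_replay else DEFAULT_IMAGE_WINDOW


@dataclass
class HighlightResult:
    """Hypothetical shape: ids of the elements the last highlight matched."""
    element_ids: List[str]


@dataclass
class ExecutorSketch:
    """Hypothetical stand-in for the confirmation logic in BrowserExecutor."""
    last_highlight: Dict[str, HighlightResult] = field(default_factory=dict)

    def record_highlight(self, conversation_id: str, element_ids: List[str]) -> None:
        # Remember only the most recent highlight per conversation.
        self.last_highlight[conversation_id] = HighlightResult(list(element_ids))

    def can_auto_confirm(self, conversation_id: str, target_id: str) -> bool:
        # Auto-confirm only when the last highlight matched exactly one
        # element and the agent is targeting that same element.
        result = self.last_highlight.get(conversation_id)
        return result is not None and result.element_ids == [target_id]

    def click(self, conversation_id: str, target_id: str) -> str:
        if self.can_auto_confirm(conversation_id, target_id):
            return "executed"
        return "pending_confirmation"  # fall back to 2-phase confirmation
```

In this sketch an ambiguous highlight (zero or several matches) or a missing highlight always takes the slower confirmation path, which mirrors the fallback the commit describes.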
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tab init's two heaviest phases share the same shape: per-element loops that
re-do work the previous step already paid for. Cut both.

Scanner (highlight-detection.injected.js): wrap collectHighlightCandidates in
withScanLayoutCache, which monkey-patches Element.prototype.getBoundingClientRect,
SVGGraphicsElement.prototype.getBoundingClientRect, window.getComputedStyle, and
Document.prototype.elementsFromPoint with per-scan WeakMap/Map caches. The scan
runs in one synchronous Runtime.evaluate, so layout cannot change mid-task and
caching is safe; originals are restored in finally. Also skip inert tags
(script/style/meta/...) before the first layout read.
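
The patch-cache-restore pattern behind `withScanLayoutCache` can be illustrated outside the browser. The real code is JavaScript and patches DOM prototypes; the Python sketch below only mirrors the idea, and `Element`, `get_bounding_rect`, and `with_scan_layout_cache` are hypothetical stand-ins.

```python
import weakref
from contextlib import contextmanager


class Element:
    """Hypothetical stand-in for a DOM element with an expensive layout read."""
    layout_reads = 0  # counts how often the uncached path runs

    def __init__(self, rect):
        self._rect = rect

    def get_bounding_rect(self):
        Element.layout_reads += 1
        return self._rect


@contextmanager
def with_scan_layout_cache():
    """Patch the layout read with a per-scan cache, then restore it."""
    original = Element.get_bounding_rect
    cache = weakref.WeakKeyDictionary()  # rough analogue of a per-scan WeakMap

    def cached(self):
        if self not in cache:
            cache[self] = original(self)
        return cache[self]

    Element.get_bounding_rect = cached
    try:
        yield
    finally:
        # Always restore the original, even if the scan raises.
        Element.get_bounding_rect = original
```

The cache is only valid while nothing can change layout, which is the condition the commit message states for the single synchronous scan.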

Pagination (collision-detection.ts): SelectedSpatialIndex (96px grid) keyed on
union(bbox, labelBBox) of placed elements. isPlacementFeasible now iterates
only nearby placed elements via nearbySelectedFor, which queries by
inflate(union(candidate.bbox, candidate.labelBBox), CLEARANCE), covering all
four collision tests. chooseLeastBlockingPlacement also uses an "influence rect"
to skip re-evaluating spatially-far future candidates when a hypothetical
placement cannot affect them.
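
A uniform-grid index of the kind described above might look like this. It is a sketch under assumptions, written in Python rather than the extension's TypeScript: only the 96px cell size and the `inflate(union(bbox, labelBBox), CLEARANCE)` query shape come from the commit message, while `SpatialIndexSketch`, `add`, and `nearby` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Set, Tuple

Rect = Tuple[float, float, float, float]  # (left, top, right, bottom)
CELL = 96  # grid cell size in px, per the commit message


def union(a: Rect, b: Rect) -> Rect:
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def inflate(r: Rect, margin: float) -> Rect:
    return (r[0] - margin, r[1] - margin, r[2] + margin, r[3] + margin)


class SpatialIndexSketch:
    """Maps grid cells to the indices of placed elements overlapping them."""

    def __init__(self) -> None:
        self.cells: Dict[Tuple[int, int], Set[int]] = defaultdict(set)
        self.rects: List[Rect] = []

    def _cells_for(self, r: Rect) -> Iterator[Tuple[int, int]]:
        for cx in range(int(r[0] // CELL), int(r[2] // CELL) + 1):
            for cy in range(int(r[1] // CELL), int(r[3] // CELL) + 1):
                yield (cx, cy)

    def add(self, bbox: Rect, label_bbox: Rect) -> int:
        # Key each placed element on the union of its box and its label box.
        idx = len(self.rects)
        self.rects.append(union(bbox, label_bbox))
        for cell in self._cells_for(self.rects[idx]):
            self.cells[cell].add(idx)
        return idx

    def nearby(self, bbox: Rect, label_bbox: Rect, clearance: float) -> List[int]:
        # Candidates only: callers still run the exact collision tests.
        query = inflate(union(bbox, label_bbox), clearance)
        hits: Set[int] = set()
        for cell in self._cells_for(query):
            hits |= self.cells.get(cell, set())
        return sorted(hits)
```

The grid returns a superset of the true colliders, so correctness rests on the exact tests that follow; the index only shrinks the set those tests iterate over.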

Measured (best run, fresh tab init):
- finviz.com (349 elements):   17.8s -> 13.7s  (-23%)
- bluebook mock (50):           6.3s -> 5.4s   (-14%)
- techforum mock (34):          4.3s -> 3.9s   (-11%)
- 16 mock sites aggregate:    -4% to -14%

Correctness:
- 181/181 extension unit tests pass.
- Strict integration check (selector + type + labelPosition + bbox + element
  ORDER) passes on all 16 deterministic mock sites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pagination win revealed that scan-phase resolve was the new bottleneck:
on finviz the resolve phase alone was 6.1s of a 6.5s scan. Per candidate,
resolveClickableCandidate walks up to 5 ancestors, each calling
isClickableCandidate, which calls hasExplicitClickableAncestor that walks
ALL ancestors back to body, calling getSemanticClickableSignal at each.
For deep DOM (finviz tables) the same elements were classified dozens of
times per scan.

Add per-scan WeakMap memoization (cleared by withScanLayoutCache) for the
classifiers that are pure functions of element + DOM state:
- getSemanticClickableSignal
- isClickableCandidate
- getBaseClickableSignal
- hasExplicitClickableAncestor
- getElementTextForDetection (textContent walk)
- getElementSearchText
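
To see why memoization removes the repeated ancestor walks, consider the simplified Python sketch below. It is not the extension's code: `Node`, `ScanClassifier`, and `semantic_signal` are hypothetical, and the cache stands in for the per-scan WeakMap described above.

```python
import weakref
from typing import List, Optional


class Node:
    """Hypothetical DOM-like node; `clickable` stands in for a semantic signal."""

    def __init__(self, parent: Optional["Node"] = None, clickable: bool = False):
        self.parent = parent
        self.clickable = clickable


class ScanClassifier:
    """Create one instance per scan so the cache never outlives the scan."""

    def __init__(self) -> None:
        self._cache = weakref.WeakKeyDictionary()
        self.signal_calls = 0  # counts the expensive classifier calls

    def semantic_signal(self, node: Node) -> bool:
        self.signal_calls += 1
        return node.clickable

    def has_clickable_ancestor(self, node: Node) -> bool:
        path: List[Node] = []
        current = node
        result = False
        while True:
            if current in self._cache:
                result = self._cache[current]  # reuse an earlier walk
                break
            path.append(current)
            parent = current.parent
            if parent is None:
                break  # reached the root without a clickable ancestor
            if self.semantic_signal(parent):
                result = True
                break
            current = parent
        # Every node on the path saw only non-clickable parents before the
        # stopping point, so they all share the same answer.
        for visited in path:
            self._cache[visited] = result
        return result
```

After the first walk from a deep leaf, every node on that path answers from the cache, so sibling candidates in the same subtree stop re-classifying shared ancestors.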

Also add scan_stats / scan_times to the response payload so harness/tooling
can attribute time per phase without parsing console output.
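
Per-phase timing of this kind could be collected as sketched below. The field names `scan_stats` and `scan_times` come from the commit message; their actual contents are not shown in this PR, so the phase names, the `ScanTimer` class, and the payload layout here are assumptions.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, List


class ScanTimer:
    """Hypothetical collector for per-phase timings and counters."""

    def __init__(self) -> None:
        self.scan_times: Dict[str, float] = {}  # phase name -> milliseconds
        self.scan_stats: Dict[str, int] = {}    # counter name -> count

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # Accumulate so a phase entered several times is summed.
            self.scan_times[name] = self.scan_times.get(name, 0.0) + elapsed_ms

    def payload(self, elements: List[Any]) -> Dict[str, Any]:
        # Attach timings to the response instead of logging them.
        return {
            "elements": elements,
            "scan_stats": dict(self.scan_stats),
            "scan_times": dict(self.scan_times),
        }
```

Returning the numbers in the payload lets a harness compare phases across runs without scraping console output, which is the motivation the commit gives.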

Measured (best run, finviz.com/screener.ashx, ~349 candidates):
- in-page scan:    6537ms -> 585ms     (-91%, ~11x)
- pagination:       397ms -> 300ms     (already optimized in prior commit)
- end-to-end:    17787ms -> 4975ms     (-72%, ~3.6x)

Resolve-phase breakdown after caching: 6121ms -> 51ms.
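
The percentages and speedup factors above follow directly from the raw timings. As a quick check (the helper names are illustrative):

```python
def reduction_pct(before_ms: float, after_ms: float) -> int:
    """Percentage of time removed, rounded to the nearest whole percent."""
    return round(100 * (before_ms - after_ms) / before_ms)


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the new timing is, to one decimal place."""
    return round(before_ms / after_ms, 1)


# Timings taken from the measurements in this commit message.
in_page_scan = (6537, 585)
end_to_end = (17787, 4975)

print(reduction_pct(*in_page_scan), speedup(*in_page_scan))  # 91 11.2
print(reduction_pct(*end_to_end), speedup(*end_to_end))      # 72 3.6
```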

Correctness: 181/181 unit tests pass. Strict integration check (selector,
type, labelPosition, bbox, element ORDER) passes on all 16 deterministic
mock sites — same elements, same labels, same order. finviz_real returns
identical 336/6/138 element/page/page1 counts.

Caching is safe because the scan runs in one synchronous Runtime.evaluate
call and these classifiers depend only on DOM state that cannot mutate
during the scan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@softpudding softpudding force-pushed the feat/ob-routines-skill branch from 45a103e to a9c0c7c Compare April 18, 2026 02:22
perf(highlight): cut tab init from ~20s to ~5s on heavy pages
@softpudding softpudding merged commit 1a8aa5c into feat/image-input-and-file-upload Apr 18, 2026
4 checks passed
