feat: image input for prompts + upload_file browser action by softpudding · Pull Request #61 · softpudding/OpenBrowser

softpudding · 2026-04-18T02:19:32Z

Summary

Splits off the original two commits from the long-lived branch into a focused, reviewable change.

Add image input for user prompts and upload_file browser action (9b9eedb): server-side ingestion of images attached to prompts; new upload_file browser action wired through the extension command surface.
skill(open-browser): document and support image attachments (f34f60b): adds image-attachment docs and send_task.py plumbing for the open-browser skill.

Test plan

Send a task to /agent with image attachments and confirm the model receives them
Trigger `upload_file` against a real `<input type="file">` and confirm the file is set on the input via CDP
Existing extension test suite (`bun test`) passes
Existing server unit tests pass
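
For reviewers who want to exercise the `upload_file` step by hand, the CDP sequence it relies on can be sketched roughly as below. This is only an illustration: the `send` callable is a hypothetical transport (method name plus params in, result dict out), and the real action lives in the extension, not in Python. `DOM.getDocument`, `DOM.querySelector`, and `DOM.setFileInputFiles` are the actual CDP method names.

```python
from typing import Any, Callable, Dict, List

# Hypothetical transport: sends one CDP command and returns its result dict.
CdpSend = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def upload_file(send: CdpSend, selector: str, paths: List[str]) -> int:
    """Resolve `selector` to a CDP nodeId and set `paths` on that file input."""
    root = send("DOM.getDocument", {"depth": 0})["root"]
    node_id = send(
        "DOM.querySelector", {"nodeId": root["nodeId"], "selector": selector}
    )["nodeId"]
    if node_id == 0:
        # Assumption: a nodeId of 0 is treated here as "nothing matched".
        raise LookupError(f"no element matches {selector!r}")
    # Absolute local paths are handed straight to the input, so no OS picker opens.
    send("DOM.setFileInputFiles", {"nodeId": node_id, "files": paths})
    return node_id
```

Passing a recording stub as `send` is enough to check that the last command issued is `DOM.setFileInputFiles` with the expected `nodeId` and `files`.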

Stack

This is the first PR in a 3-PR stack:

This PR → `main`
`feat/ob-routines-skill` → this branch
`perf/tab-init-highlight-cache` → `feat/ob-routines-skill`

🤖 Generated with Claude Code

Users can now attach images to a task (paste, drag-drop, or paperclip) and the agent can attach local files to <input type=file> controls via CDP DOM.setFileInputFiles, bypassing the native OS picker.

- Frontend: paste/drop/picker → data URIs in POST body, up to 8 images at 10MB each; thumbnails with × remove.
- Server: validates data URIs and size, builds multimodal Message (TextContent + ImageContent) in both in-process and multi-process dispatch paths.
- Extension: new `uploadable` element type with a dedicated detection pass that surfaces display:none file inputs and anchors the overlay on the nearest visible label/button. New `upload_file` action on ElementInteractionTool resolves selector → CDP nodeId and calls DOM.setFileInputFiles.
- Prompts + highlight tool updated for big-model and small-model to advertise the new element type and action.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
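
The server-side check described above (data URI shape, count, and size) might look something like the following. This is a minimal sketch, not the code in this PR: the function name, the regular expression, the choice to measure the decoded payload, and reading "10MB" as 10 × 1024 × 1024 bytes are all assumptions.

```python
import base64
import binascii
import re
from typing import List

MAX_IMAGES = 8                      # limit stated in the commit message
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # "10MB"; binary megabytes are an assumption

_DATA_URI = re.compile(r"^data:(image/[A-Za-z0-9.+-]+);base64,(.*)$", re.DOTALL)


def validate_images(
    images: List[str],
    max_images: int = MAX_IMAGES,
    max_bytes: int = MAX_IMAGE_BYTES,
) -> List[bytes]:
    """Return decoded payloads, or raise ValueError (a server could map this to a 400)."""
    if len(images) > max_images:
        raise ValueError(f"too many images: {len(images)} > {max_images}")
    decoded = []
    for i, uri in enumerate(images):
        match = _DATA_URI.match(uri)
        if match is None:
            raise ValueError(f"image {i}: not a base64 image data URI")
        try:
            payload = base64.b64decode(match.group(2), validate=True)
        except binascii.Error as exc:
            raise ValueError(f"image {i}: invalid base64") from exc
        # Assumption: the limit applies to the decoded bytes, not the encoded string.
        if len(payload) > max_bytes:
            raise ValueError(f"image {i}: exceeds {max_bytes} bytes")
        decoded.append(payload)
    return decoded
```

Keeping the limits as parameters makes the rejection paths cheap to unit-test without building multi-megabyte fixtures.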

Adds a --image PATH flag to send_task.py that reads a local image, base64-encodes it, and sends it alongside the user message as a data URI. Matches the server's 10MB / 8-image limits so the agent sees the same 400s the UI would.

SKILL.md gains an "Attaching images" section covering the typical multimodal use cases (visual regression, reproducing a bug from a screenshot, asset matching).

api_reference.md documents the `images` field on POST /agent/conversations/{id}/messages with field descriptions, limits, and a curl example.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
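
The read, base64-encode, and wrap-as-data-URI step behind `--image` can be illustrated with a few lines of standard-library Python. The helper name, the MIME detection via `mimetypes`, and the byte interpretation of "10MB" are assumptions made for this sketch; send_task.py itself may do this differently.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_uri(path: str, max_bytes: int = 10 * 1024 * 1024) -> str:
    """Read a local image and return it as a base64 data URI."""
    raw = Path(path).read_bytes()
    if len(raw) > max_bytes:
        # Failing early on the client mirrors the server-side size limit.
        raise ValueError(f"{path}: {len(raw)} bytes exceeds {max_bytes}")
    mime, _ = mimetypes.guess_type(path)  # guessed from the file extension only
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"{path}: not a recognised image type")
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The resulting strings would then go into the `images` field of the request body.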

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The skill is now symlinked into ~/.claude/skills/open-browser/ for global use. Update every `python3 skill/claude/open-browser/...` reference to `python3 ~/.claude/skills/open-browser/...` so the same command works from any project's CWD (including inside the OpenBrowser repo, where the symlink still resolves back here).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…411883e78527b1915fa8c4 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

…play) Introduces the ob-routines skill (alias for openbrowser-routines) for capturing, compiling, and replaying named Chrome workflows. Previously lived only in ~/.claude/skills/routines/; now versioned under skill/claude/ob-routines/ so it can be installed via symlink alongside open-browser. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

In routine-replay mode, where the compiled SOP gives the agent precise element keywords, the 2-phase click/select/keyboard_input confirmation round-trip and the 3-frame screenshot history both pay for ambiguity that does not exist.

- BrowserExecutor now tracks the most recent highlight result per conversation. When the agent targets the unique element that highlight just returned, click/select/keyboard_input skip the pending-confirmation round-trip and execute directly. Falls back to 2PC in any other case.
- get_context_image_window(routine_replay=True) returns 1, overriding the default of 3 for replay conversations only.
- ob-routines SKILL.md: tighten /ob-routines new to ask only for the one-line goal and defer URL/site/parameter questions to the compiler.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
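
The two replay shortcuts can be pictured with the sketch below. Only `get_context_image_window` and its 1 / 3 return values come from the commit message; the `HighlightTracker` class, its method names, and the use of plain string element ids are hypothetical stand-ins for whatever BrowserExecutor actually stores.

```python
from dataclasses import dataclass, field
from typing import Dict, List

DEFAULT_IMAGE_WINDOW = 3  # screenshot frames kept in context by default
REPLAY_IMAGE_WINDOW = 1   # replay conversations keep only the latest frame


def get_context_image_window(routine_replay: bool = False) -> int:
    return REPLAY_IMAGE_WINDOW if routine_replay else DEFAULT_IMAGE_WINDOW


@dataclass
class HighlightTracker:
    """Remembers the most recent highlight result per conversation (hypothetical)."""

    _last: Dict[str, List[str]] = field(default_factory=dict)

    def record(self, conversation_id: str, element_ids: List[str]) -> None:
        self._last[conversation_id] = list(element_ids)

    def can_skip_confirmation(self, conversation_id: str, target_id: str) -> bool:
        # Skip the two-phase confirmation only when the last highlight returned
        # exactly one element and the action targets that same element.
        last = self._last.get(conversation_id)
        return last is not None and len(last) == 1 and last[0] == target_id
```

Any other situation (no prior highlight, several matches, or a different target) returns `False`, which corresponds to falling back to the 2PC path.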

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tab init's two heaviest phases share the same shape: per-element loops that re-do work the previous step already paid for. Cut both.

Scanner (highlight-detection.injected.js): wrap collectHighlightCandidates in withScanLayoutCache, which monkey-patches Element.prototype.getBoundingClientRect, SVGGraphicsElement.prototype.getBoundingClientRect, window.getComputedStyle, and Document.prototype.elementsFromPoint with per-scan WeakMap/Map caches. The scan runs in one synchronous Runtime.evaluate, so layout cannot change mid-task and caching is safe; originals are restored in finally. Also skip inert tags (script/style/meta/...) before the first layout read.

Pagination (collision-detection.ts): SelectedSpatialIndex (96px grid) keyed on union(bbox, labelBBox) of placed elements. isPlacementFeasible now iterates only nearby placed elements via nearbySelectedFor, which queries by inflate(union(candidate.bbox, candidate.labelBBox), CLEARANCE), covering all four collision tests. chooseLeastBlockingPlacement also uses an "influence rect" to skip re-evaluating spatially-far future candidates when a hypothetical placement cannot affect them.

Measured (best run, fresh tab init):
- finviz.com (349 elements): 17.8s -> 13.7s (-23%)
- bluebook mock (50): 6.3s -> 5.4s (-14%)
- techforum mock (34): 4.3s -> 3.9s (-11%)
- 16 mock sites aggregate: -4% to -14%

Correctness:
- 181/181 extension unit tests pass.
- Strict integration check (selector + type + labelPosition + bbox + element ORDER) passes on all 16 deterministic mock sites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
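
The grid idea behind the pagination change can be shown in a few lines. This is a Python illustration of a uniform-grid spatial index, not the TypeScript in collision-detection.ts: the class and helper names are made up for the sketch, and only the 96px cell size and the inflate(union(...), CLEARANCE) query shape are taken from the commit message.

```python
import math
from collections import defaultdict
from typing import Dict, Iterator, List, NamedTuple, Set, Tuple


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def union(a: Rect, b: Rect) -> Rect:
    return Rect(min(a.left, b.left), min(a.top, b.top),
                max(a.right, b.right), max(a.bottom, b.bottom))


def inflate(r: Rect, margin: float) -> Rect:
    return Rect(r.left - margin, r.top - margin, r.right + margin, r.bottom + margin)


class GridIndex:
    """Buckets rectangles by every grid cell they touch."""

    def __init__(self, cell: int = 96) -> None:
        self.cell = cell
        self.rects: List[Rect] = []
        self.cells: Dict[Tuple[int, int], Set[int]] = defaultdict(set)

    def _cells_for(self, r: Rect) -> Iterator[Tuple[int, int]]:
        x0, x1 = math.floor(r.left / self.cell), math.floor(r.right / self.cell)
        y0, y1 = math.floor(r.top / self.cell), math.floor(r.bottom / self.cell)
        for cx in range(x0, x1 + 1):
            for cy in range(y0, y1 + 1):
                yield (cx, cy)

    def add(self, r: Rect) -> int:
        idx = len(self.rects)
        self.rects.append(r)
        for key in self._cells_for(r):
            self.cells[key].add(idx)
        return idx

    def nearby(self, query: Rect) -> List[int]:
        # Superset of the rectangles intersecting `query`; exact collision
        # tests still run afterwards, but only on these candidates.
        found: Set[int] = set()
        for key in self._cells_for(query):
            found |= self.cells.get(key, set())
        return sorted(found)
```

Because two overlapping rectangles always share at least one cell, the lookup cannot miss a real collision; it only trims the far-away elements that the exact tests would have rejected anyway.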

The pagination win revealed that scan-phase resolve was the new bottleneck: on finviz the resolve phase alone was 6.1s of a 6.5s scan. Per candidate, resolveClickableCandidate walks up to 5 ancestors, each calling isClickableCandidate, which calls hasExplicitClickableAncestor that walks ALL ancestors back to body, calling getSemanticClickableSignal at each. For deep DOM (finviz tables) the same elements were classified dozens of times per scan.

Add per-scan WeakMap memoization (cleared by withScanLayoutCache) for the classifiers that are pure functions of element + DOM state:
- getSemanticClickableSignal
- isClickableCandidate
- getBaseClickableSignal
- hasExplicitClickableAncestor
- getElementTextForDetection (textContent walk)
- getElementSearchText

Also add scan_stats / scan_times to the response payload so harness/tooling can attribute time per phase without parsing console output.

Measured (best run, finviz.com/screener.ashx, ~349 candidates):
- in-page scan: 6537ms -> 585ms (-91%, ~11x)
- pagination: 397ms -> 300ms (already optimized in prior commit)
- end-to-end: 17787ms -> 4975ms (-72%, ~3.6x)

Resolve-phase breakdown after caching: 6121ms -> 51ms.

Correctness: 181/181 unit tests pass. Strict integration check (selector, type, labelPosition, bbox, element ORDER) passes on all 16 deterministic mock sites: same elements, same labels, same order. finviz_real returns identical 336/6/138 element/page/page1 counts.

Caching is safe because the scan runs in one synchronous Runtime.evaluate call and these classifiers depend only on DOM state that cannot mutate during the scan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
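
Why memoizing the ancestor walk helps so much on deep DOMs can be seen in a toy model. The sketch below is Python rather than the injected JavaScript, uses a plain dict where the commit uses a WeakMap, and every name in it (`Node`, `SIGNAL_CALLS`, `has_clickable_ancestor`) is invented for illustration.

```python
from typing import Dict, Optional

SIGNAL_CALLS = {"n": 0}  # instrumentation for the demo only


class Node:
    def __init__(self, parent: Optional["Node"] = None, clickable: bool = False):
        self.parent = parent
        self.clickable = clickable


def semantic_clickable_signal(node: Node) -> bool:
    """Stand-in for a per-element classifier that is expensive to repeat."""
    SIGNAL_CALLS["n"] += 1
    return node.clickable


def has_clickable_ancestor(node: Node, cache: Dict[Node, bool]) -> bool:
    """True if any ancestor of `node` is clickable; `cache` lives for one scan."""
    path = []
    current = node
    # Walk up only until we reach the root or a node whose answer is known.
    while current.parent is not None and current not in cache:
        path.append(current)
        current = current.parent
    answer = cache.get(current, False)  # the root has no ancestors
    # Fill answers back down so later queries on descendants stop early.
    for n in reversed(path):
        answer = semantic_clickable_signal(n.parent) or answer
        cache[n] = answer
    return answer
```

On a 100-node chain queried node by node, a shared cache classifies each parent once (99 calls), while a fresh cache per query repeats the full walk every time (4950 calls). Creating the dict at the start of a scan and discarding it afterwards mirrors the per-scan lifetime described above.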

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

perf(highlight): cut tab init from ~20s to ~5s on heavy pages

skill(ob-routines): record → compile → replay browser routines

softpudding and others added 2 commits April 16, 2026 21:38

This was referenced Apr 18, 2026

skill(ob-routines): record → compile → replay browser routines #62

Merged

perf(highlight): cut tab init from ~20s to ~5s on heavy pages #63

Merged

softpudding and others added 12 commits April 18, 2026 10:21

chore: apply pre-commit formatting (black + prettier)

f239cf1

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: update openhands-sdk and openhands-tools to bd4cb296355c3d03dd…

ac7fa57

…411883e78527b1915fa8c4 Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

chore: apply pre-commit formatting (black) on ob-routines scripts

a9c0c7c

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

set extension auto reload timeout to 40s

e8268a8

chore: apply pre-commit formatting (prettier) on highlight perf changes

6aac696

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge pull request #63 from softpudding/perf/tab-init-highlight-cache

2d2876c

perf(highlight): cut tab init from ~20s to ~5s on heavy pages

Merge pull request #62 from softpudding/feat/ob-routines-skill

1a8aa5c

skill(ob-routines): record → compile → replay browser routines

softpudding merged commit b7b4f67 into main Apr 18, 2026
4 checks passed

softpudding merged 14 commits into main from feat/image-input-and-file-upload
