Merged
19 commits
0bc4c16
feat(highlight): persist highlights across commands + use outline for…
softpudding Apr 18, 2026
728eafa
feat(highlight): corner-badge label overlay for unambiguous element b…
softpudding Apr 19, 2026
f915ae4
fix(highlight): labels always "above", defer on collision; tolerate s…
softpudding Apr 19, 2026
ea8cf32
feat(highlight): emit structured element descriptors instead of raw HTML
softpudding Apr 19, 2026
4102276
chore: apply pre-commit formatting (black + prettier)
softpudding Apr 19, 2026
08206ca
feat(highlight): allow horizontal shift of corner-badge labels
softpudding Apr 20, 2026
0f16c71
fix github eval-mock long line issue
softpudding Apr 20, 2026
43f2643
fix(highlight): filter tiny elements before collision planner; reposi…
softpudding Apr 20, 2026
e29e65a
chore(skills): make Claude skills user-scoped (~/.claude/skills/)
softpudding Apr 20, 2026
20d692f
fix(agent): reduce eval regressions from highlight/descriptor branch
softpudding Apr 20, 2026
09382ae
fix(extension): hash element ids on stable fingerprint, not volatile …
softpudding Apr 20, 2026
c374403
chore: bump litellm and openhands-sdk pins
softpudding Apr 20, 2026
d559ea7
chore: bump openhands-sdk to include qwen prompt-cache enablement
softpudding Apr 20, 2026
4b3e4db
fix(extension): scope deferred highlight cleanup to the command's tab_id
softpudding Apr 20, 2026
74dcddc
fix(eval): per-test mock-site server isolation
softpudding Apr 21, 2026
316a4a2
chore(extension): apply pre-commit prettier formatting
softpudding Apr 21, 2026
78dd2ce
eval: refresh benchmark to the 35-task × 4-model suite (2026-04-21)
softpudding Apr 21, 2026
71df427
chore(extension): repoint lockfile from Alibaba mirror to public npm …
softpudding Apr 21, 2026
61e3868
chore(server): apply pre-commit black formatting
softpudding Apr 21, 2026
33 changes: 23 additions & 10 deletions README.md
@@ -88,24 +88,28 @@ The primary evaluation signal in this repo is the latest checked-in report:

The test set is a series of local mock websites in [`eval/`](eval/) that simulate realistic browser tasks and record structured interaction events.

-That snapshot was generated on `2026-03-30 11:17:06` and evaluates OpenBrowser on `12` tracked browser tasks across two models. We care about three things first:
+That snapshot was generated on `2026-04-21 02:09:48` and evaluates OpenBrowser on `35` tracked browser tasks across four models from both the Qwen3.5 and Qwen3.6 families. We care about three things first:

- Correctness: pass/fail plus task-score coverage
- Efficiency: average execution time
- Cost: average RMB cost per task

Current snapshot:

-- Overall: `24/24` runs passed, `100%` pass rate
-- `dashscope/qwen3.5-flash`: `12/12` passed, `68.5/68.5` task score, `114.89s` average duration, `0.075442 RMB` average cost
-- `dashscope/qwen3.5-plus`: `12/12` passed, `67.5/68.5` task score, `149.63s` average duration, `0.291952 RMB` average cost
+- Overall: `111/140` runs passed, `79.3%` pass rate
+- `dashscope/qwen3.5-plus`: `30/35` passed, `276.2/304.8` task score, `309.51s` average duration, `0.598152 RMB` average cost
+- `dashscope/qwen3.6-flash`: `29/35` passed, `273.0/304.8` task score, `252.27s` average duration, `0.804474 RMB` average cost
+- `dashscope/qwen3.6-plus`: `28/35` passed, `262.4/304.8` task score, `337.59s` average duration, `1.605398 RMB` average cost
+- `dashscope/qwen3.5-flash`: `24/35` passed, `243.1/304.8` task score, `308.84s` average duration, `0.144029 RMB` average cost

| Model | Correctness | Avg. Time | Avg. Cost (RMB) | Composite Score |
|-------|-------------|-----------|------------------|-----------------|
-| `dashscope/qwen3.5-flash` | `12/12` passed, `68.5/68.5` | `114.89s` | `0.075442` | `0.9358` |
-| `dashscope/qwen3.5-plus` | `12/12` passed, `67.5/68.5` | `149.63s` | `0.291952` | `0.8774` |
+| `dashscope/qwen3.5-plus` | `30/35` passed, `276.2/304.8` | `309.51s` | `0.598152` | `0.7425` |
+| `dashscope/qwen3.6-flash` | `29/35` passed, `273.0/304.8` | `252.27s` | `0.804474` | `0.7191` |
+| `dashscope/qwen3.6-plus` | `28/35` passed, `262.4/304.8` | `337.59s` | `1.605398` | `0.6040` |
+| `dashscope/qwen3.5-flash` | `24/35` passed, `243.1/304.8` | `308.84s` | `0.144029` | `0.6938` |

-On the current suite, `qwen3.5-flash` is the better efficiency-cost point: it keeps the same `100%` pass rate, while being about `23.2%` faster and `74.2%` cheaper than `qwen3.5-plus`. `qwen3.5-plus` still remains useful as a stronger fallback profile for harder visual workflows, but the repo's current default evaluation story is no longer "benchmark comparison against OpenClaw"; it is "how well our latest stack scores on correctness, speed, and cost."
+The current 35-task suite is substantially harder than the earlier 12-task snapshot: it includes multi-step bookings, inbox triage with label dialogs, auto-hiding video controls, drag-and-drop boards, and noisy retail flows. On this suite `qwen3.5-plus` is the strongest overall, while `qwen3.6-flash` is the best correctness-per-second point (fastest model of the four and a close second on pass rate). `qwen3.5-flash` stays useful as the cheapest tier for simpler flows; `qwen3.6-plus` is the most expensive and does not dominate either speed or correctness on this test set. The repo's current default evaluation story is no longer "benchmark comparison against OpenClaw"; it is "how well our latest stack scores on correctness, speed, and cost across both Qwen generations."
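
As a quick sanity check on the snapshot figures, the aggregate pass rate and the "fastest", "cheapest", and "most expensive" claims can be re-derived from the per-model numbers. The sketch below is illustrative only: `SNAPSHOT`, `overall_pass_rate`, and `best` are hypothetical names that are not part of the repo, the values are copied by hand from the table, and the composite score is not recomputed because its formula is not shown here.

```python
# Per-model figures copied by hand from the 2026-04-21 snapshot table:
# (runs passed, tasks attempted, average duration in seconds, average cost in RMB)
SNAPSHOT = {
    "dashscope/qwen3.5-plus": (30, 35, 309.51, 0.598152),
    "dashscope/qwen3.6-flash": (29, 35, 252.27, 0.804474),
    "dashscope/qwen3.6-plus": (28, 35, 337.59, 1.605398),
    "dashscope/qwen3.5-flash": (24, 35, 308.84, 0.144029),
}


def overall_pass_rate(snapshot):
    """Return (passed, total, pass rate in percent rounded to one decimal)."""
    passed = sum(row[0] for row in snapshot.values())
    total = sum(row[1] for row in snapshot.values())
    return passed, total, round(100 * passed / total, 1)


def best(snapshot, index, lowest=True):
    """Return the model name with the lowest (or highest) value in one column."""
    pick = min if lowest else max
    return pick(snapshot, key=lambda model: snapshot[model][index])


passed, total, rate = overall_pass_rate(SNAPSHOT)
print(f"Overall: {passed}/{total} runs passed, {rate}% pass rate")
print("Most passes:", best(SNAPSHOT, 0, lowest=False))
print("Fastest:", best(SNAPSHOT, 2))
print("Cheapest:", best(SNAPSHOT, 3))
print("Most expensive:", best(SNAPSHOT, 3, lowest=False))
```

Under these inputs the totals come out to `111/140` and `79.3%`, matching the "Overall" bullet, and the rankings agree with the paragraph above.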

Older side-by-side comparisons with OpenClaw are kept only as archived context:

@@ -268,17 +272,26 @@ Routine runs always start a fresh conversation in `routine_replay` mode so repla

### Try OpenBrowser with SKILL - install to your local agents

-OpenBrowser ships with skills for both `Codex` and `OpenClaw`:
+OpenBrowser ships with skills for `Claude Code`, `Codex`, and `OpenClaw`:

+- `skill/claude/open-browser`: browser control for Claude Code
+- `skill/claude/ob-routines`: record/compile/replay Browser Routines
- `skill/codex/open-browser`
- `skill/openclaw/open-browser`

-They are similar in purpose, but slightly different in workflow:
+**Claude Code** skills install to user scope (`~/.claude/skills/`) so they're available across all projects:
+
+```bash
+cp -r skill/claude/open-browser ~/.claude/skills/
+cp -r skill/claude/ob-routines ~/.claude/skills/
+```
+
+The `Codex` and `OpenClaw` skills are tuned for their respective agent environments:

- The `Codex` skill is tuned for Codex-style repo workflows and supports either foreground or background task execution.
- The `OpenClaw` skill is tuned for OpenClaw usage, emphasizes background execution, and frames OpenBrowser as the stronger option for rendered-page and multi-step browser tasks.

-Install the one that matches your local agent environment.
+Install the one(s) that match your local agent environment.

## Why Qwen3.5 Family Right Now?

18 changes: 11 additions & 7 deletions README.zh-CN.md
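@@ -68,24 +68,28 @@ OpenBrowser is not iterated on by "feels good". The repo includes event

The test set itself is a series of local mock websites under [`eval/`](eval/) that simulate real browser tasks and record structured interaction events.

-This snapshot was generated on `2026-03-30 11:17:06` and evaluates two models on `12` event-tracked browser tasks. We currently prioritize three things:
+This snapshot was generated on `2026-04-21 02:09:48` and evaluates 4 models from the Qwen3.5 and Qwen3.6 generations on `35` event-tracked browser tasks. We currently prioritize three things:

- Correctness: whether the run passes, plus task-score coverage
- Efficiency: average execution time
- Cost: average RMB cost per task

Current snapshot results:

-- Overall: `24/24` runs passed, `100%` overall pass rate
-- `dashscope/qwen3.5-flash`: `12/12` passed, task score `68.5/68.5`, average duration `114.89s`, average cost `0.075442 RMB`
-- `dashscope/qwen3.5-plus`: `12/12` passed, task score `67.5/68.5`, average duration `149.63s`, average cost `0.291952 RMB`
+- Overall: `111/140` runs passed, `79.3%` overall pass rate
+- `dashscope/qwen3.5-plus`: `30/35` passed, task score `276.2/304.8`, average duration `309.51s`, average cost `0.598152 RMB`
+- `dashscope/qwen3.6-flash`: `29/35` passed, task score `273.0/304.8`, average duration `252.27s`, average cost `0.804474 RMB`
+- `dashscope/qwen3.6-plus`: `28/35` passed, task score `262.4/304.8`, average duration `337.59s`, average cost `1.605398 RMB`
+- `dashscope/qwen3.5-flash`: `24/35` passed, task score `243.1/304.8`, average duration `308.84s`, average cost `0.144029 RMB`

| Model | Correctness | Avg. Time | Avg. Cost (RMB) | Composite Score |
|------|--------|----------|------------------|--------|
-| `dashscope/qwen3.5-flash` | `12/12` passed, `68.5/68.5` | `114.89s` | `0.075442` | `0.9358` |
-| `dashscope/qwen3.5-plus` | `12/12` passed, `67.5/68.5` | `149.63s` | `0.291952` | `0.8774` |
+| `dashscope/qwen3.5-plus` | `30/35` passed, `276.2/304.8` | `309.51s` | `0.598152` | `0.7425` |
+| `dashscope/qwen3.6-flash` | `29/35` passed, `273.0/304.8` | `252.27s` | `0.804474` | `0.7191` |
+| `dashscope/qwen3.6-plus` | `28/35` passed, `262.4/304.8` | `337.59s` | `1.605398` | `0.6040` |
+| `dashscope/qwen3.5-flash` | `24/35` passed, `243.1/304.8` | `308.84s` | `0.144029` | `0.6938` |

-On the current evaluation, `qwen3.5-flash` is the better efficiency/cost operating point: while keeping the same `100%` pass rate, it is about `23.2%` faster than `qwen3.5-plus` and about `74.2%` cheaper on average. `qwen3.5-plus` remains the stronger fallback tier, suited to harder visual reasoning or more complex workflows; but this repo's main narrative is no longer "benchmark comparison against OpenClaw", it is "the latest results of our current stack on correctness, speed, and cost".
+The new 35-task test set is significantly harder than the previous 12-task snapshot: it includes multi-step bookings, inbox triage with label dialogs, players whose controls auto-hide, drag-and-drop boards, and retail flows with many distractors. `qwen3.5-plus` is the strongest overall on the current test set; `qwen3.6-flash` is the best "correctness per unit time" point, the fastest of the four models with a pass rate close behind. `qwen3.5-flash` remains useful as the lowest-cost tier for simpler flows; `qwen3.6-plus` is still the most expensive tier, but on this test set it leads on neither speed nor correctness. This repo's main narrative is no longer "benchmark comparison against OpenClaw", it is "the correctness, speed, and cost results of our current stack across both Qwen generations".

Earlier side-by-side comparisons with OpenClaw are now kept as archived material: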
