softpudding · softpudding · Apr 21, 2026 · Apr 18, 2026 · Apr 19, 2026 · Apr 19, 2026
diff --git a/README.md b/README.md
@@ -88,24 +88,28 @@ The primary evaluation signal in this repo is the latest checked-in report:
 
 The test set is a series of local mock websites in [`eval/`](eval/) that simulate realistic browser tasks and record structured interaction events.
 
-That snapshot was generated on `2026-03-30 11:17:06` and evaluates OpenBrowser on `12` tracked browser tasks across two models. We care about three things first:
+That snapshot was generated on `2026-04-21 02:09:48` and evaluates OpenBrowser on `35` tracked browser tasks across four models from both the Qwen3.5 and Qwen3.6 families. We care about three things first:
 
 - Correctness: pass/fail plus task-score coverage
 - Efficiency: average execution time
 - Cost: average RMB cost per task
 
 Current snapshot:
 
-- Overall: `24/24` runs passed, `100%` pass rate
-- `dashscope/qwen3.5-flash`: `12/12` passed, `68.5/68.5` task score, `114.89s` average duration, `0.075442 RMB` average cost
-- `dashscope/qwen3.5-plus`: `12/12` passed, `67.5/68.5` task score, `149.63s` average duration, `0.291952 RMB` average cost
+- Overall: `111/140` runs passed, `79.3%` pass rate
+- `dashscope/qwen3.5-plus`: `30/35` passed, `276.2/304.8` task score, `309.51s` average duration, `0.598152 RMB` average cost
+- `dashscope/qwen3.6-flash`: `29/35` passed, `273.0/304.8` task score, `252.27s` average duration, `0.804474 RMB` average cost
+- `dashscope/qwen3.6-plus`: `28/35` passed, `262.4/304.8` task score, `337.59s` average duration, `1.605398 RMB` average cost
+- `dashscope/qwen3.5-flash`: `24/35` passed, `243.1/304.8` task score, `308.84s` average duration, `0.144029 RMB` average cost
 
 | Model | Correctness | Avg. Time | Avg. Cost (RMB) | Composite Score |
 |-------|-------------|-----------|------------------|-----------------|
-| `dashscope/qwen3.5-flash` | `12/12` passed, `68.5/68.5` | `114.89s` | `0.075442` | `0.9358` |
-| `dashscope/qwen3.5-plus` | `12/12` passed, `67.5/68.5` | `149.63s` | `0.291952` | `0.8774` |
+| `dashscope/qwen3.5-plus` | `30/35` passed, `276.2/304.8` | `309.51s` | `0.598152` | `0.7425` |
+| `dashscope/qwen3.6-flash` | `29/35` passed, `273.0/304.8` | `252.27s` | `0.804474` | `0.7191` |
+| `dashscope/qwen3.6-plus` | `28/35` passed, `262.4/304.8` | `337.59s` | `1.605398` | `0.6040` |
+| `dashscope/qwen3.5-flash` | `24/35` passed, `243.1/304.8` | `308.84s` | `0.144029` | `0.6938` |
 
-On the current suite, `qwen3.5-flash` is the better efficiency-cost point: it keeps the same `100%` pass rate, while being about `23.2%` faster and `74.2%` cheaper than `qwen3.5-plus`. `qwen3.5-plus` still remains useful as a stronger fallback profile for harder visual workflows, but the repo's current default evaluation story is no longer "benchmark comparison against OpenClaw"; it is "how well our latest stack scores on correctness, speed, and cost."
+The current 35-task suite is substantially harder than the earlier 12-task snapshot: it includes multi-step bookings, inbox triage with label dialogs, auto-hiding video controls, drag-and-drop boards, and noisy retail flows. On this suite `qwen3.5-plus` is the strongest overall (highest pass count, task score, and composite score), while `qwen3.6-flash` is the best correctness-per-second point (the fastest of the four models and a close second on pass rate). `qwen3.5-flash` remains useful as the cheapest tier for simpler flows; `qwen3.6-plus` is the most expensive model and leads on neither speed nor correctness on this test set. The repo's current default evaluation story is no longer "benchmark comparison against OpenClaw"; it is "how well our latest stack scores on correctness, speed, and cost across both Qwen generations."
 
 Older side-by-side comparisons with OpenClaw are kept only as archived context:
 
@@ -268,17 +272,26 @@ Routine runs always start a fresh conversation in `routine_replay` mode so repla
 
 ### Try OpenBrowser with SKILL - install to your local agents
 
-OpenBrowser ships with skills for both `Codex` and `OpenClaw`:
+OpenBrowser ships with skills for `Claude Code`, `Codex`, and `OpenClaw`:
 
+- `skill/claude/open-browser`: browser control for Claude Code
+- `skill/claude/ob-routines`: record/compile/replay Browser Routines
 - `skill/codex/open-browser`
 - `skill/openclaw/open-browser`
 
-They are similar in purpose, but slightly different in workflow:
+**Claude Code** skills install to user scope (`~/.claude/skills/`) so they're available across all projects:
+
+```bash
+mkdir -p ~/.claude/skills
+cp -r skill/claude/open-browser skill/claude/ob-routines ~/.claude/skills/
+```
+
+The `Codex` and `OpenClaw` skills are tuned for their respective agent environments:
 
 - The `Codex` skill is tuned for Codex-style repo workflows and supports either foreground or background task execution.
 - The `OpenClaw` skill is tuned for OpenClaw usage, emphasizes background execution, and frames OpenBrowser as the stronger option for rendered-page and multi-step browser tasks.
 
-Install the one that matches your local agent environment.
+Install the one(s) that match your local agent environment.
 
 ## Why Qwen3.5 Family Right Now?
 

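Review note: the aggregate figures and the comparative claims in the README.md hunk above can be cross-checked against the per-model rows. The following is a minimal sketch, not code from this repo; the `results` dict layout and the variable names are illustrative assumptions, and the numbers are copied from the updated snapshot lines.

```python
# Per-model figures copied from the updated README snapshot.
# The dict layout is illustrative; it is not the repo's report format.
results = {
    "dashscope/qwen3.5-plus": {"passed": 30, "total": 35, "avg_s": 309.51, "avg_rmb": 0.598152},
    "dashscope/qwen3.6-flash": {"passed": 29, "total": 35, "avg_s": 252.27, "avg_rmb": 0.804474},
    "dashscope/qwen3.6-plus": {"passed": 28, "total": 35, "avg_s": 337.59, "avg_rmb": 1.605398},
    "dashscope/qwen3.5-flash": {"passed": 24, "total": 35, "avg_s": 308.84, "avg_rmb": 0.144029},
}

# Aggregate pass rate across all model runs.
total_passed = sum(r["passed"] for r in results.values())
total_runs = sum(r["total"] for r in results.values())
pass_rate = round(100 * total_passed / total_runs, 1)

# Extremes referenced in the summary paragraph.
most_passes = max(results, key=lambda m: results[m]["passed"])
fastest = min(results, key=lambda m: results[m]["avg_s"])
cheapest = min(results, key=lambda m: results[m]["avg_rmb"])
priciest = max(results, key=lambda m: results[m]["avg_rmb"])

print(f"{total_passed}/{total_runs} passed ({pass_rate}%)")
print("most passes:", most_passes)
print("fastest:", fastest)
print("cheapest:", cheapest)
print("most expensive:", priciest)
```

With the numbers as written, the per-model rows are consistent with the `111/140` and `79.3%` overall line and with the fastest, cheapest, and most-expensive claims in the summary paragraph.
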
diff --git a/README.zh-CN.md b/README.zh-CN.md
@@ -68,24 +68,28 @@ OpenBrowser 不是靠“感觉不错”来迭代的。仓库里包含带事件
 
 这套测试集本身是一系列位于 [`eval/`](eval/) 下的本地 mock 仿真网站，用来模拟真实浏览器任务，并记录结构化交互事件。
 
-这个快照生成于 `2026-03-30 11:17:06`，基于其中 `12` 个带事件跟踪的浏览器任务，对两个模型做评测。我们现在优先看三件事：
+这个快照生成于 `2026-04-21 02:09:48`，基于其中 `35` 个带事件跟踪的浏览器任务，对来自 Qwen3.5 和 Qwen3.6 两代的共 4 个模型做评测。我们现在优先看三件事：
 
 - 正确性：是否通过，以及任务分覆盖情况
 - 效率：平均执行时间
 - 成本：单任务平均 RMB 成本
 
 当前快照结果：
 
-- 总体：`24/24` 次运行通过，整体通过率 `100%`
-- `dashscope/qwen3.5-flash`：`12/12` 通过，任务分 `68.5/68.5`，平均耗时 `114.89s`，平均成本 `0.075442 RMB`
-- `dashscope/qwen3.5-plus`：`12/12` 通过，任务分 `67.5/68.5`，平均耗时 `149.63s`，平均成本 `0.291952 RMB`
+- 总体：`111/140` 次运行通过，整体通过率 `79.3%`
+- `dashscope/qwen3.5-plus`：`30/35` 通过，任务分 `276.2/304.8`，平均耗时 `309.51s`，平均成本 `0.598152 RMB`
+- `dashscope/qwen3.6-flash`：`29/35` 通过，任务分 `273.0/304.8`，平均耗时 `252.27s`，平均成本 `0.804474 RMB`
+- `dashscope/qwen3.6-plus`：`28/35` 通过，任务分 `262.4/304.8`，平均耗时 `337.59s`，平均成本 `1.605398 RMB`
+- `dashscope/qwen3.5-flash`：`24/35` 通过，任务分 `243.1/304.8`，平均耗时 `308.84s`，平均成本 `0.144029 RMB`
 
 | 模型 | 正确性 | 平均耗时 | 平均成本（RMB） | 综合分 |
 |------|--------|----------|------------------|--------|
-| `dashscope/qwen3.5-flash` | `12/12` 通过，`68.5/68.5` | `114.89s` | `0.075442` | `0.9358` |
-| `dashscope/qwen3.5-plus` | `12/12` 通过，`67.5/68.5` | `149.63s` | `0.291952` | `0.8774` |
+| `dashscope/qwen3.5-plus` | `30/35` 通过，`276.2/304.8` | `309.51s` | `0.598152` | `0.7425` |
+| `dashscope/qwen3.6-flash` | `29/35` 通过，`273.0/304.8` | `252.27s` | `0.804474` | `0.7191` |
+| `dashscope/qwen3.6-plus` | `28/35` 通过，`262.4/304.8` | `337.59s` | `1.605398` | `0.6040` |
+| `dashscope/qwen3.5-flash` | `24/35` 通过，`243.1/304.8` | `308.84s` | `0.144029` | `0.6938` |
 
-在当前这套评测里，`qwen3.5-flash` 是更好的效率/成本工作点：在同样保持 `100%` 通过率的前提下，它比 `qwen3.5-plus` 约快 `23.2%`，平均成本约低 `74.2%`。`qwen3.5-plus` 仍然是更强 fallback 档位，适合更难的视觉推理或更复杂的工作流；但这个仓库现在的主叙事已经不再是“和 OpenClaw 做 benchmark 对比”，而是“看我们当前栈在正确性、速度和成本上的最新结果”。
+新的 35 任务测试集比之前的 12 任务快照显著更难——包含多步预订、带标签弹窗的收件箱整理、会自动隐藏控件的播放器、拖拽看板、以及干扰项很多的电商流程等。`qwen3.5-plus` 在当前测试集上综合表现最强；`qwen3.6-flash` 则是“单位耗时正确率”的最佳点——四个模型里最快，且通过率紧随其后。`qwen3.5-flash` 适合更简单流程、作为成本最低档位仍然有用；`qwen3.6-plus` 仍是最贵的档位，但在这套测试集上并没有在速度或正确性上占优。这个仓库现在的主叙事已经不再是“和 OpenClaw 做 benchmark 对比”，而是“看我们当前栈在 Qwen 两代模型上的正确性、速度和成本结果”。
 
 之前与 OpenClaw 的并排对比现在作为 archived 资料保留：