diff --git a/README.md b/README.md
index 8c38d87..1de0b21 100644
--- a/README.md
+++ b/README.md
@@ -88,7 +88,7 @@ The primary evaluation signal in this repo is the latest checked-in report:
 
 The test set is a series of local mock websites in [`eval/`](eval/) that simulate realistic browser tasks and record structured interaction events.
 
-That snapshot was generated on `2026-03-30 11:17:06` and evaluates OpenBrowser on `12` tracked browser tasks across two models. We care about three things first:
+That snapshot was generated on `2026-04-21 02:09:48` and evaluates OpenBrowser on `35` tracked browser tasks across four models from both the Qwen3.5 and Qwen3.6 families. We care about three things first:
 
 - Correctness: pass/fail plus task-score coverage
 - Efficiency: average execution time
@@ -96,16 +96,20 @@ That snapshot was generated on `2026-03-30 11:17:06` and evaluates OpenBrowser o
 
 Current snapshot:
 
-- Overall: `24/24` runs passed, `100%` pass rate
-- `dashscope/qwen3.5-flash`: `12/12` passed, `68.5/68.5` task score, `114.89s` average duration, `0.075442 RMB` average cost
-- `dashscope/qwen3.5-plus`: `12/12` passed, `67.5/68.5` task score, `149.63s` average duration, `0.291952 RMB` average cost
+- Overall: `111/140` runs passed, `79.3%` pass rate
+- `dashscope/qwen3.5-plus`: `30/35` passed, `276.2/304.8` task score, `309.51s` average duration, `0.598152 RMB` average cost
+- `dashscope/qwen3.6-flash`: `29/35` passed, `273.0/304.8` task score, `252.27s` average duration, `0.804474 RMB` average cost
+- `dashscope/qwen3.6-plus`: `28/35` passed, `262.4/304.8` task score, `337.59s` average duration, `1.605398 RMB` average cost
+- `dashscope/qwen3.5-flash`: `24/35` passed, `243.1/304.8` task score, `308.84s` average duration, `0.144029 RMB` average cost
 
 | Model | Correctness | Avg. Time | Avg. Cost (RMB) | Composite Score |
 |-------|-------------|-----------|------------------|-----------------|
-| `dashscope/qwen3.5-flash` | `12/12` passed, `68.5/68.5` | `114.89s` | `0.075442` | `0.9358` |
-| `dashscope/qwen3.5-plus` | `12/12` passed, `67.5/68.5` | `149.63s` | `0.291952` | `0.8774` |
+| `dashscope/qwen3.5-plus` | `30/35` passed, `276.2/304.8` | `309.51s` | `0.598152` | `0.7425` |
+| `dashscope/qwen3.6-flash` | `29/35` passed, `273.0/304.8` | `252.27s` | `0.804474` | `0.7191` |
+| `dashscope/qwen3.6-plus` | `28/35` passed, `262.4/304.8` | `337.59s` | `1.605398` | `0.6040` |
+| `dashscope/qwen3.5-flash` | `24/35` passed, `243.1/304.8` | `308.84s` | `0.144029` | `0.6938` |
 
-On the current suite, `qwen3.5-flash` is the better efficiency-cost point: it keeps the same `100%` pass rate, while being about `23.2%` faster and `74.2%` cheaper than `qwen3.5-plus`. `qwen3.5-plus` still remains useful as a stronger fallback profile for harder visual workflows, but the repo's current default evaluation story is no longer "benchmark comparison against OpenClaw"; it is "how well our latest stack scores on correctness, speed, and cost."
+The current 35-task suite is substantially harder than the earlier 12-task snapshot — it includes multi-step bookings, inbox triage with label dialogs, auto-hiding video controls, drag-and-drop boards, and noisy retail flows. On this suite `qwen3.5-plus` is the strongest overall, while `qwen3.6-flash` is the best correctness-per-second point (fastest model of the four and a close second on pass rate). `qwen3.5-flash` stays useful as the cheapest tier for simpler flows; `qwen3.6-plus` is still the most expensive and does not dominate either speed or correctness on this test set. The repo's current default evaluation story is no longer "benchmark comparison against OpenClaw"; it is "how well our latest stack scores on correctness, speed, and cost across both Qwen generations."
 
 Older side-by-side comparisons with OpenClaw are kept only as archived context:
 
@@ -268,17 +272,26 @@ Routine runs always start a fresh conversation in `routine_replay` mode so repla
 
 ### Try OpenBrowser with SKILL - install to your local agents
 
-OpenBrowser ships with skills for both `Codex` and `OpenClaw`:
+OpenBrowser ships with skills for `Claude Code`, `Codex`, and `OpenClaw`:
 
+- `skill/claude/open-browser` — browser control for Claude Code
+- `skill/claude/ob-routines` — record/compile/replay Browser Routines
 - `skill/codex/open-browser`
 - `skill/openclaw/open-browser`
 
-They are similar in purpose, but slightly different in workflow:
+**Claude Code** skills install to user scope (`~/.claude/skills/`) so they're available across all projects:
+
+```bash
+cp -r skill/claude/open-browser ~/.claude/skills/
+cp -r skill/claude/ob-routines ~/.claude/skills/
+```
+
+The `Codex` and `OpenClaw` skills are tuned for their respective agent environments:
 
 - The `Codex` skill is tuned for Codex-style repo workflows and supports either foreground or background task execution.
 - The `OpenClaw` skill is tuned for OpenClaw usage, emphasizes background execution, and frames OpenBrowser as the stronger option for rendered-page and multi-step browser tasks.
 
-Install the one that matches your local agent environment.
+Install the one(s) that match your local agent environment.
 
 ## Why Qwen3.5 Family Right Now?
 
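Review note on the README.md hunk above: the headline figures in the new "Current snapshot" bullets (`111/140`, `79.3%`, "fastest model of the four", "cheapest tier") can be cross-checked from the per-model numbers. The following is a minimal sketch, not code from the repository; the `snapshot` dict layout and the `summarize` helper are hypothetical names introduced only for this check, and the values are copied from the bullets.

```python
# Per-model values copied from the "Current snapshot" bullets in the README hunk.
# The dict layout and helper name are assumptions made for this illustration.
snapshot = {
    "dashscope/qwen3.5-plus": {"passed": 30, "total": 35, "avg_duration_s": 309.51, "avg_cost_rmb": 0.598152},
    "dashscope/qwen3.6-flash": {"passed": 29, "total": 35, "avg_duration_s": 252.27, "avg_cost_rmb": 0.804474},
    "dashscope/qwen3.6-plus": {"passed": 28, "total": 35, "avg_duration_s": 337.59, "avg_cost_rmb": 1.605398},
    "dashscope/qwen3.5-flash": {"passed": 24, "total": 35, "avg_duration_s": 308.84, "avg_cost_rmb": 0.144029},
}


def summarize(results):
    """Aggregate pass counts and pick the fastest and cheapest models."""
    passed = sum(r["passed"] for r in results.values())
    total = sum(r["total"] for r in results.values())
    return {
        "passed": passed,
        "total": total,
        # Percentage rounded to one decimal, matching the README's precision.
        "pass_rate_pct": round(100 * passed / total, 1),
        "fastest": min(results, key=lambda m: results[m]["avg_duration_s"]),
        "cheapest": min(results, key=lambda m: results[m]["avg_cost_rmb"]),
    }


summary = summarize(snapshot)
print(summary)
```

Under these inputs the totals agree with the "Overall" bullet, and the fastest and cheapest picks agree with the README's description of `qwen3.6-flash` and `qwen3.5-flash`.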
diff --git a/README.zh-CN.md b/README.zh-CN.md index d51026a..fe1b224 100644 --- a/README.zh-CN.md +++ b/README.zh-CN.md @@ -68,7 +68,7 @@ OpenBrowser 不是靠“感觉不错”来迭代的。仓库里包含带事件 这套测试集本身是一系列位于 [`eval/`](eval/) 下的本地 mock 仿真网站,用来模拟真实浏览器任务,并记录结构化交互事件。 -这个快照生成于 `2026-03-30 11:17:06`,基于其中 `12` 个带事件跟踪的浏览器任务,对两个模型做评测。我们现在优先看三件事: +这个快照生成于 `2026-04-21 02:09:48`,基于其中 `35` 个带事件跟踪的浏览器任务,对来自 Qwen3.5 和 Qwen3.6 两代的共 4 个模型做评测。我们现在优先看三件事: - 正确性:是否通过,以及任务分覆盖情况 - 效率:平均执行时间 @@ -76,16 +76,20 @@ OpenBrowser 不是靠“感觉不错”来迭代的。仓库里包含带事件 当前快照结果: -- 总体:`24/24` 次运行通过,整体通过率 `100%` -- `dashscope/qwen3.5-flash`:`12/12` 通过,任务分 `68.5/68.5`,平均耗时 `114.89s`,平均成本 `0.075442 RMB` -- `dashscope/qwen3.5-plus`:`12/12` 通过,任务分 `67.5/68.5`,平均耗时 `149.63s`,平均成本 `0.291952 RMB` +- 总体:`111/140` 次运行通过,整体通过率 `79.3%` +- `dashscope/qwen3.5-plus`:`30/35` 通过,任务分 `276.2/304.8`,平均耗时 `309.51s`,平均成本 `0.598152 RMB` +- `dashscope/qwen3.6-flash`:`29/35` 通过,任务分 `273.0/304.8`,平均耗时 `252.27s`,平均成本 `0.804474 RMB` +- `dashscope/qwen3.6-plus`:`28/35` 通过,任务分 `262.4/304.8`,平均耗时 `337.59s`,平均成本 `1.605398 RMB` +- `dashscope/qwen3.5-flash`:`24/35` 通过,任务分 `243.1/304.8`,平均耗时 `308.84s`,平均成本 `0.144029 RMB` | 模型 | 正确性 | 平均耗时 | 平均成本(RMB) | 综合分 | |------|--------|----------|------------------|--------| -| `dashscope/qwen3.5-flash` | `12/12` 通过,`68.5/68.5` | `114.89s` | `0.075442` | `0.9358` | -| `dashscope/qwen3.5-plus` | `12/12` 通过,`67.5/68.5` | `149.63s` | `0.291952` | `0.8774` | +| `dashscope/qwen3.5-plus` | `30/35` 通过,`276.2/304.8` | `309.51s` | `0.598152` | `0.7425` | +| `dashscope/qwen3.6-flash` | `29/35` 通过,`273.0/304.8` | `252.27s` | `0.804474` | `0.7191` | +| `dashscope/qwen3.6-plus` | `28/35` 通过,`262.4/304.8` | `337.59s` | `1.605398` | `0.6040` | +| `dashscope/qwen3.5-flash` | `24/35` 通过,`243.1/304.8` | `308.84s` | `0.144029` | `0.6938` | -在当前这套评测里,`qwen3.5-flash` 是更好的效率/成本工作点:在同样保持 `100%` 通过率的前提下,它比 `qwen3.5-plus` 约快 `23.2%`,平均成本约低 `74.2%`。`qwen3.5-plus` 仍然是更强 fallback 档位,适合更难的视觉推理或更复杂的工作流;但这个仓库现在的主叙事已经不再是“和 OpenClaw 做 benchmark 
对比”,而是“看我们当前栈在正确性、速度和成本上的最新结果”。 +新的 35 任务测试集比之前的 12 任务快照显著更难——包含多步预订、带标签弹窗的收件箱整理、会自动隐藏控件的播放器、拖拽看板、以及干扰项很多的电商流程等。`qwen3.5-plus` 在当前测试集上综合表现最强;`qwen3.6-flash` 则是“单位耗时正确率”的最佳点——四个模型里最快,且通过率紧随其后。`qwen3.5-flash` 适合更简单流程、作为成本最低档位仍然有用;`qwen3.6-plus` 仍是最贵的档位,但在这套测试集上并没有在速度或正确性上占优。这个仓库现在的主叙事已经不再是“和 OpenClaw 做 benchmark 对比”,而是“看我们当前栈在 Qwen 两代模型上的正确性、速度和成本结果”。 之前与 OpenClaw 的并排对比现在作为 archived 资料保留: diff --git a/eval/evaluate_browser_agent.py b/eval/evaluate_browser_agent.py index 99410a1..2924a64 100644 --- a/eval/evaluate_browser_agent.py +++ b/eval/evaluate_browser_agent.py @@ -17,6 +17,7 @@ import shutil import signal import sqlite3 +import subprocess import sys import threading import time @@ -41,9 +42,16 @@ # Configuration OPENBROWSER_API_URL = "http://localhost:8765" OPENBROWSER_WS_URL = "ws://localhost:8766" -EVAL_SERVER_URL = "http://localhost:16605" +# Canonical port the YAML test cases reference. Each test now spawns its own +# eval server on an OS-assigned port; URLs in the test case are rewritten to +# point at that per-test port, so this constant is only used for substitution. EVAL_SERVER_PORT = 16605 +EVAL_SERVER_URL = f"http://localhost:{EVAL_SERVER_PORT}" OPENBROWSER_PORT = 8765 +EVAL_SERVER_SCRIPT = Path(__file__).resolve().parent / "server.py" +EVAL_SERVER_BOOT_TIMEOUT = float( + os.environ.get("OPENBROWSER_EVAL_SERVER_BOOT_TIMEOUT", "10") +) # SSE streaming timeouts for the agent channel at :8765. # (connect_timeout, read_timeout) in seconds. @@ -710,10 +718,23 @@ def cleanup_managed_tabs(self, conversation_id: str) -> bool: class EvalServerClient: - """Client for evaluation server tracking API""" + """Client for one eval server's tracking API. - def __init__(self, base_url: str = EVAL_SERVER_URL): - self.base_url = base_url + Each test spawns its own eval server on a unique port; instantiate one + client per server and pass the bound port (or full base URL). 
+ """ + + def __init__( + self, + port: Optional[int] = None, + base_url: Optional[str] = None, + ): + if base_url is not None: + self.base_url = base_url + elif port is not None: + self.base_url = f"http://localhost:{port}" + else: + self.base_url = EVAL_SERVER_URL self.session = requests.Session() self.session.trust_env = False @@ -725,24 +746,14 @@ def health_check(self) -> bool: except requests.exceptions.RequestException: return False - def clear_events(self, site: Optional[str] = None) -> bool: - """Clear tracked events, optionally scoped to one mock site.""" - try: - params = {"site": site} if site else None - response = self.session.get( - f"{self.base_url}/api/events/clear", params=params, timeout=2 - ) - return response.status_code == 200 - except Exception: - return False + def get_events(self) -> List[Dict[str, Any]]: + """Get tracked events from this server. - def get_events(self, site: Optional[str] = None) -> List[Dict[str, Any]]: - """Get tracked events, optionally scoped to one mock site.""" + Per-test isolation makes the previous ?site= filter unnecessary — + a dedicated server holds exactly one test's events. + """ try: - params = {"site": site} if site else None - response = self.session.get( - f"{self.base_url}/api/events", params=params, timeout=5 - ) + response = self.session.get(f"{self.base_url}/api/events", timeout=5) if response.status_code == 200: data = response.json() return data.get("events", []) @@ -761,12 +772,162 @@ def get_sites(self) -> List[str]: return [] +class EvalServerProcess: + """Spawn a single isolated eval server on an OS-assigned port. + + One server per test. The handshake on stdout + (``EVAL_SERVER_LISTENING_PORT=``) tells us which port the OS picked; + we then build a per-test client against it. The process group is killed + on stop() and at interpreter exit so no orphans survive a crash. 
+ """ + + _ALL_INSTANCES: "set[EvalServerProcess]" = set() + _ATEXIT_REGISTERED = False + _LOCK = threading.Lock() + + def __init__(self, boot_timeout: float = EVAL_SERVER_BOOT_TIMEOUT): + self.boot_timeout = boot_timeout + self.proc: Optional[subprocess.Popen] = None + self.port: Optional[int] = None + self._reader: Optional[threading.Thread] = None + self._stderr_tail: List[str] = [] + + @classmethod + def _ensure_atexit(cls) -> None: + with cls._LOCK: + if not cls._ATEXIT_REGISTERED: + atexit.register(cls._kill_all) + cls._ATEXIT_REGISTERED = True + + @classmethod + def _kill_all(cls) -> None: + with cls._LOCK: + instances = list(cls._ALL_INSTANCES) + for inst in instances: + try: + inst.stop() + except Exception: + pass + + def start(self) -> int: + """Spawn the server and block until the bound port is reported.""" + if self.proc is not None: + assert self.port is not None + return self.port + + env = dict(os.environ) + # Defensive: prevent inherited PORT env from overriding --port=0. + env.pop("PORT", None) + env.pop("MOCK_EVAL_PORT", None) + + cmd = [sys.executable, str(EVAL_SERVER_SCRIPT), "--port=0"] + # New session so we can SIGTERM the whole process group on stop. 
+ self.proc = subprocess.Popen( + cmd, + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + text=True, + bufsize=1, + start_new_session=True, + env=env, + ) + EvalServerProcess._ensure_atexit() + with EvalServerProcess._LOCK: + EvalServerProcess._ALL_INSTANCES.add(self) + + deadline = time.time() + self.boot_timeout + port: Optional[int] = None + assert self.proc.stdout is not None + while time.time() < deadline: + line = self.proc.stdout.readline() + if not line: + if self.proc.poll() is not None: + break + continue + line = line.strip() + if line.startswith("EVAL_SERVER_LISTENING_PORT="): + try: + port = int(line.split("=", 1)[1]) + except ValueError: + pass + break + + if port is None: + stderr_tail = "" + try: + if self.proc.stderr is not None: + stderr_tail = self.proc.stderr.read() or "" + except Exception: + pass + self.stop() + raise RuntimeError( + "eval server did not report a port within " + f"{self.boot_timeout:.1f}s. stderr: {stderr_tail[:500]}" + ) + + self.port = port + # Drain remaining stdout/stderr in the background to prevent the + # child from blocking on a full pipe over a long-running test. 
+ self._reader = threading.Thread(target=self._drain_streams, daemon=True) + self._reader.start() + return port + + def _drain_streams(self) -> None: + proc = self.proc + if proc is None: + return + try: + if proc.stdout is not None: + for _ in iter(proc.stdout.readline, ""): + pass + except Exception: + pass + + def stop(self) -> None: + """Terminate the server and its child group.""" + proc = self.proc + if proc is None: + return + try: + try: + pgid = os.getpgid(proc.pid) + os.killpg(pgid, signal.SIGTERM) + except (ProcessLookupError, PermissionError): + proc.terminate() + try: + proc.wait(timeout=5) + except subprocess.TimeoutExpired: + try: + pgid = os.getpgid(proc.pid) + os.killpg(pgid, signal.SIGKILL) + except (ProcessLookupError, PermissionError): + proc.kill() + proc.wait(timeout=2) + except Exception: + pass + finally: + self.proc = None + self.port = None + with EvalServerProcess._LOCK: + EvalServerProcess._ALL_INSTANCES.discard(self) + + def __enter__(self) -> "EvalServerProcess": + self.start() + return self + + def __exit__(self, exc_type, exc, tb) -> None: + self.stop() + + class ServiceManager: - """Manage OpenBrowser and eval server processes""" + """Manage the OpenBrowser server process. + + The eval mock-site server is now spawned per-test by EvalServerProcess, + so this class only owns OpenBrowser lifecycle. 
+ """ def __init__(self): self.openbrowser_proc = None - self.eval_server_proc = None def start_openbrowser(self) -> bool: """Check if OpenBrowser server is running, prompt user to start if not""" @@ -793,35 +954,6 @@ def start_openbrowser(self) -> bool: logger.error(f"Failed to check OpenBrowser server status: {e}") return False - def start_eval_server(self) -> bool: - """Check if eval server is running, prompt user to start if not""" - try: - client = EvalServerClient() - if client.health_check(): - logger.info("Eval server is running ✓") - return True - - eval_dir = EVAL_DIR - root_dir = EVAL_DIR.parent - logger.error(f""" -❌ Eval server is not running! - Please start the eval server manually with: - - cd {eval_dir} - python server.py - - Or in another terminal: - cd {root_dir} - uv run python eval/server.py - - The server should start on port 16605. -""") - return False - - except Exception as e: - logger.error(f"Failed to check eval server status: {e}") - return False - def stop_services(self): """Stop all services""" if self.openbrowser_proc: @@ -833,15 +965,6 @@ def stop_services(self): logger.error(f"Error stopping OpenBrowser server: {e}") self.openbrowser_proc = None - if self.eval_server_proc: - try: - os.killpg(os.getpgid(self.eval_server_proc.pid), signal.SIGTERM) - self.eval_server_proc.wait(timeout=5) - logger.info("Eval server stopped") - except Exception as e: - logger.error(f"Error stopping eval server: {e}") - self.eval_server_proc = None - class EvaluationRunLock(AbstractContextManager["EvaluationRunLock"]): """Prevent concurrent evaluation runs from reusing the same browser UUID.""" @@ -905,13 +1028,21 @@ class Evaluator: def __init__(self, chrome_uuid: Optional[str] = None): self.chrome_uuid = chrome_uuid self.openbrowser = OpenBrowserClient(chrome_uuid=chrome_uuid) - self.eval_server = EvalServerClient() self.service_manager = ServiceManager() self.results: List[TestResult] = [] self.output_dir: Optional[Path] = None # Will be set per run 
self.current_model: Optional[str] = None # Current model being tested self.current_target: Optional[LLMTarget] = None # Current CLI target + @staticmethod + def _rewrite_eval_server_urls(text: str, port: int) -> str: + """Rewrite localhost:16605 references to the per-test eval-server port.""" + if not text or port == EVAL_SERVER_PORT: + return text + return text.replace( + f"localhost:{EVAL_SERVER_PORT}", f"localhost:{port}" + ).replace(f"127.0.0.1:{EVAL_SERVER_PORT}", f"127.0.0.1:{port}") + @staticmethod def _sanitize_model_name(model_name: str) -> str: """Make a model name safe for filesystem paths.""" @@ -1094,11 +1225,15 @@ def resolve_targets(self, targets: List[LLMTarget]) -> List[LLMTarget]: def ensure_services( self, skip_services: bool = False, manual: bool = False ) -> bool: - """Ensure required services are running, or skip check if requested + """Ensure required services are running, or skip check if requested. + + The mock-site eval server is now spawned per test (see + EvalServerProcess), so we no longer health-check a global one. Only + OpenBrowser must be reachable up front. 
Args: skip_services: If True, skip all service checks - manual: If True, only check eval server (manual mode doesn't need OpenBrowser) + manual: If True, skip OpenBrowser check (manual mode doesn't drive it) """ if skip_services: logger.info("Skipping service checks (--no-services flag used)") @@ -1106,13 +1241,6 @@ def ensure_services( logger.info("Checking services...") - # Check eval server - if not self.eval_server.health_check(): - if not self.service_manager.start_eval_server(): - logger.error("Eval server check failed") - return False - - # Check OpenBrowser server (skip in manual mode) if not manual: if not self.openbrowser.health_check(): if not self.service_manager.start_openbrowser(): @@ -1120,7 +1248,7 @@ def ensure_services( return False logger.info("All services are running ✓") else: - logger.info("Eval server is running (manual mode) ✓") + logger.info("Manual mode: per-test eval servers will be spawned on demand") return True @@ -1213,8 +1341,28 @@ def run_test( f"Routine file is empty: {routine_path}", ) - # Clear only the current mock-site event bucket. - self.eval_server.clear_events(site=site_bucket) + # Per-test isolation: spawn a dedicated mock-site server on an + # OS-assigned port, then rewrite the YAML's localhost:16605 references + # to that port. Each conversation has its own events_store, so the + # ?site= filter and "clear before run" dance are no longer needed. 
+ try: + eval_server_proc = EvalServerProcess() + eval_port = eval_server_proc.start() + except Exception as exc: + logger.error("Failed to start per-test eval server: %s", exc) + return self._build_error_result( + test_case, + active_model_name, + f"Failed to start per-test eval server: {exc}", + ) + + eval_server = EvalServerClient(port=eval_port) + rewritten_start_url = self._rewrite_eval_server_urls( + test_case.start_url, eval_port + ) + rewritten_instruction = self._rewrite_eval_server_urls( + test_case.instruction, eval_port + ) # Create new conversation with current model. When replaying a # routine, tag the conversation with mode="routine_replay" so the @@ -1230,6 +1378,7 @@ def run_test( logger.warning( f"Failed to create conversation for model {active_model_name}" ) + eval_server_proc.stop() return self._build_error_result( test_case, active_model_name, @@ -1248,8 +1397,8 @@ def run_test( try: # Initialize with start URL if provided - if test_case.start_url: - init_message = f"Open {test_case.start_url}" + if rewritten_start_url: + init_message = f"Open {rewritten_start_url}" init_result = self.openbrowser.send_message( conversation_id, init_message, @@ -1280,11 +1429,12 @@ def run_test( # the agent treats it as ground truth per the ROUTINE_REPLAY # system-prompt block. if not timed_out: - message_text = ( - routine_markdown - if routine_markdown is not None - else test_case.instruction - ) + if routine_markdown is not None: + message_text = self._rewrite_eval_server_urls( + routine_markdown, eval_port + ) + else: + message_text = rewritten_instruction instruction_result = self.openbrowser.send_message( conversation_id, message_text, @@ -1312,8 +1462,8 @@ def run_test( pending_event_wait = 1.0 if timed_out else 3.0 time.sleep(min(pending_event_wait, max(0.0, deadline - time.time()))) - # Get tracking events - track_events = self.eval_server.get_events(site=site_bucket) + # Get tracking events from this conversation's dedicated server. 
+ track_events = eval_server.get_events() # Save track events to file track_events_file = self._save_track_events( @@ -1378,6 +1528,7 @@ def run_test( ) finally: self._cleanup_openbrowser_conversation(conversation_id) + eval_server_proc.stop() def _extract_images( self, @@ -1999,7 +2150,6 @@ def generate_report(self): def run_manual_test(self, test_case: TestCase) -> TestResult: """Run a test case in manual mode with human performing the same task as OpenBrowser""" logger.info(f"Running manual test: {test_case.name}") - site_bucket = self._get_test_site_bucket(test_case) # Ensure output directory exists if self.output_dir is None: @@ -2008,111 +2158,131 @@ def run_manual_test(self, test_case: TestCase) -> TestResult: self.output_dir.mkdir(parents=True, exist_ok=True) logger.info(f"Created output directory: {self.output_dir}") - # Clear previous events for the current mock site only. - self.eval_server.clear_events(site=site_bucket) - - # Print test information - print("\n" + "=" * 60) - print(f"MANUAL TEST: {test_case.name}") - print(f"Start URL: {test_case.start_url}") - print("=" * 60) - - if test_case.start_url: - print("\n📋 Please open your browser and navigate to:") - print(f" {test_case.start_url}") - print("Make sure the eval server is running (localhost:16605).") - print("The browser should load the test page.") - input("\nPress Enter when ready to continue...") - - # Show the SAME instruction that would be given to OpenBrowser - print("\n📝 Task Instruction (same as given to OpenBrowser):") - print(f" {test_case.instruction}") - print( - "\nPerform this task in the browser. Events will be tracked from this moment." - ) - print("Complete the task using the website's own controls.") - print("After you finish in the browser, return here and enter 'ok' below.") + # Spawn a dedicated eval server for this manual run. 
+ eval_server_proc = EvalServerProcess() + try: + eval_port = eval_server_proc.start() + except Exception as exc: + logger.error("Failed to start per-test eval server: %s", exc) + return self._build_error_result( + test_case, "manual", f"Failed to start eval server: {exc}" + ) - # Start timing when instruction is shown (same as automated test) - start_time = time.time() + eval_server = EvalServerClient(port=eval_port) + rewritten_start_url = self._rewrite_eval_server_urls( + test_case.start_url, eval_port + ) + rewritten_instruction = self._rewrite_eval_server_urls( + test_case.instruction, eval_port + ) - # Wait for user to complete the entire task - while True: - response = ( - input("\nAfter finishing in the browser, enter 'ok' here > ") - .strip() - .lower() + try: + # Print test information + print("\n" + "=" * 60) + print(f"MANUAL TEST: {test_case.name}") + print(f"Start URL: {rewritten_start_url}") + print("=" * 60) + + if rewritten_start_url: + print("\n📋 Please open your browser and navigate to:") + print(f" {rewritten_start_url}") + print(f"This run's eval server is on port {eval_port}.") + print("The browser should load the test page.") + input("\nPress Enter when ready to continue...") + + # Show the SAME instruction that would be given to OpenBrowser + print("\n📝 Task Instruction (same as given to OpenBrowser):") + print(f" {rewritten_instruction}") + print( + "\nPerform this task in the browser. Events will be tracked from this moment." 
) - if response == "ok": - break - else: - print("Please finish the browser task first, then enter 'ok' here.") + print("Complete the task using the website's own controls.") + print("After you finish in the browser, return here and enter 'ok' below.") - end_time = time.time() - duration = end_time - start_time + # Start timing when instruction is shown (same as automated test) + start_time = time.time() - # Wait a moment for any pending events to be tracked - time.sleep(2) + # Wait for user to complete the entire task + while True: + response = ( + input("\nAfter finishing in the browser, enter 'ok' here > ") + .strip() + .lower() + ) + if response == "ok": + break + else: + print("Please finish the browser task first, then enter 'ok' here.") - # Get tracking events - track_events = self.eval_server.get_events(site=site_bucket) + end_time = time.time() + duration = end_time - start_time - # Save track events to file (no conversation_id for manual mode, use "manual") - track_events_file = self._save_track_events( - track_events, test_case.id, "manual", self.output_dir - ) + # Wait a moment for any pending events to be tracked + time.sleep(2) - # Evaluate against criteria (no SSE events in manual mode) - passed, score, max_score = self._evaluate_criteria(test_case, track_events, []) + # Get tracking events from this manual run's dedicated server. 
+ track_events = eval_server.get_events() - # Calculate efficiency score (skip usage score for manual mode) - efficiency_score = self._calculate_efficiency_score( - duration, test_case.time_limit - ) - usage_score = 1.0 # Manual mode gets full usage score (no cost) - total_score = score + efficiency_score + usage_score + # Save track events to file (no conversation_id for manual mode, use "manual") + track_events_file = self._save_track_events( + track_events, test_case.id, "manual", self.output_dir + ) - # No images or SSE events in manual mode - images = [] - sse_events = [] - sse_events_file = None + # Evaluate against criteria (no SSE events in manual mode) + passed, score, max_score = self._evaluate_criteria( + test_case, track_events, [] + ) - result = TestResult( - test_case=test_case, - passed=passed, - score=score, - max_score=max_score, - events=[], - sse_events=sse_events, - track_events=track_events, - images=images, - conversation_id="manual", - start_time=start_time, - end_time=end_time, - duration=duration, - cost=None, # No cost in manual mode - efficiency_score=efficiency_score, - usage_score=usage_score, - total_score=total_score, - sse_events_file=sse_events_file, - track_events_file=track_events_file, - model="manual", - ) + # Calculate efficiency score (skip usage score for manual mode) + efficiency_score = self._calculate_efficiency_score( + duration, test_case.time_limit + ) + usage_score = 1.0 # Manual mode gets full usage score (no cost) + total_score = score + efficiency_score + usage_score - # Print completion message - print(f"\n{'=' * 60}") - print("Manual test completed!") - print(f"Duration: {duration:.1f}s") - print(f"Track events recorded: {len(track_events)}") - print(f"Task score: {score:.1f}/{max_score:.1f}") - print(f"Efficiency score: {efficiency_score:.2f}/1.0") - print(f"Usage score: {usage_score:.2f}/1.0 (manual)") - print(f"Total score: {total_score:.1f}") - print(f"Passed: {'YES' if passed else 'NO'}") - print(f"Track 
events saved to: {track_events_file}") - print("=" * 60) - - return result + # No images or SSE events in manual mode + images = [] + sse_events = [] + sse_events_file = None + + result = TestResult( + test_case=test_case, + passed=passed, + score=score, + max_score=max_score, + events=[], + sse_events=sse_events, + track_events=track_events, + images=images, + conversation_id="manual", + start_time=start_time, + end_time=end_time, + duration=duration, + cost=None, # No cost in manual mode + efficiency_score=efficiency_score, + usage_score=usage_score, + total_score=total_score, + sse_events_file=sse_events_file, + track_events_file=track_events_file, + model="manual", + ) + + # Print completion message + print(f"\n{'=' * 60}") + print("Manual test completed!") + print(f"Duration: {duration:.1f}s") + print(f"Track events recorded: {len(track_events)}") + print(f"Task score: {score:.1f}/{max_score:.1f}") + print(f"Efficiency score: {efficiency_score:.2f}/1.0") + print(f"Usage score: {usage_score:.2f}/1.0 (manual)") + print(f"Total score: {total_score:.1f}") + print(f"Passed: {'YES' if passed else 'NO'}") + print(f"Track events saved to: {track_events_file}") + print("=" * 60) + + return result + finally: + eval_server_proc.stop() def _build_scheduled_jobs( self, test_cases: List[TestCase], targets: List[LLMTarget] diff --git a/eval/evaluation_report.json b/eval/evaluation_report.json index 5536177..a5a0a67 100644 --- a/eval/evaluation_report.json +++ b/eval/evaluation_report.json @@ -1,39 +1,65 @@ { "evaluation": { - "timestamp": "2026-04-16 12:57:45", - "unix_timestamp": 1776315465.866534, + "timestamp": "2026-04-21 02:09:48", + "unix_timestamp": 1776708588.662792, "summary": { - "total_tests": 70, - "passed_tests": 59, - "pass_rate": 84.29, + "total_tests": 140, + "passed_tests": 111, + "pass_rate": 79.29, "models_tested": [ + "dashscope/qwen3.5-plus", + "dashscope/qwen3.6-plus", "dashscope/qwen3.5-flash", - "dashscope/qwen3.5-plus" + "dashscope/qwen3.6-flash" ] 
}, "model_performance": { - "dashscope/qwen3.5-flash": { + "dashscope/qwen3.5-plus": { + "pass_rate": 85.71, + "task_score": 276.2, + "task_max_score": 304.8, + "efficiency_score": 17.9266, + "usage_score": 22.0022, + "composite_score": 0.7425, + "avg_duration": 309.51, + "avg_cost": 0.598152, + "passed_count": 30, + "total_tests": 35 + }, + "dashscope/qwen3.6-plus": { "pass_rate": 80.0, - "task_score": 262.7, + "task_score": 262.4, "task_max_score": 304.8, - "efficiency_score": 17.5078, - "usage_score": 29.7044, - "composite_score": 0.7498, - "avg_duration": 308.83, - "avg_cost": 0.21822, + "efficiency_score": 16.214, + "usage_score": 5.4819, + "composite_score": 0.604, + "avg_duration": 337.59, + "avg_cost": 1.605398, "passed_count": 28, "total_tests": 35 }, - "dashscope/qwen3.5-plus": { - "pass_rate": 88.57, - "task_score": 276.8, + "dashscope/qwen3.5-flash": { + "pass_rate": 68.57, + "task_score": 243.1, "task_max_score": 304.8, - "efficiency_score": 16.1587, - "usage_score": 20.0958, - "composite_score": 0.7386, - "avg_duration": 335.62, - "avg_cost": 0.633391, - "passed_count": 31, + "efficiency_score": 17.9365, + "usage_score": 31.47, + "composite_score": 0.6938, + "avg_duration": 308.84, + "avg_cost": 0.144029, + "passed_count": 24, + "total_tests": 35 + }, + "dashscope/qwen3.6-flash": { + "pass_rate": 82.86, + "task_score": 273.0, + "task_max_score": 304.8, + "efficiency_score": 20.762, + "usage_score": 18.0751, + "composite_score": 0.7191, + "avg_duration": 252.27, + "avg_cost": 0.804474, + "passed_count": 29, "total_tests": 35 } }, @@ -41,945 +67,1715 @@ "bluebook_simple": { "name": "BlueBook Search And Like Test", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 6.0, + "task_max_score": 6.0, + "efficiency_score": 0.5594, + "usage_score": 0.6022, + "composite_score": 0.8323, + "total_score": 7.16, + "duration": 132.19, + "cost": 0.238706 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 6.0, + 
"task_max_score": 6.0, + "efficiency_score": 0.5934, + "usage_score": 0, + "composite_score": 0.7187, + "total_score": 6.59, + "duration": 121.97, + "cost": 0.641388 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 6.0, "task_max_score": 6.0, - "efficiency_score": 0.6199, - "usage_score": 0.8831, - "composite_score": 0.9006, - "total_score": 7.5, - "duration": 114.02, - "cost": 0.070145 + "efficiency_score": 0.6896, + "usage_score": 0.9469, + "composite_score": 0.9273, + "total_score": 7.64, + "duration": 93.13, + "cost": 0.031849 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 6.0, "task_max_score": 6.0, - "efficiency_score": 0.5293, - "usage_score": 0.552, - "composite_score": 0.8163, - "total_score": 7.08, - "duration": 141.2, - "cost": 0.268781 + "efficiency_score": 0.7033, + "usage_score": 0.601, + "composite_score": 0.8608, + "total_score": 7.3, + "duration": 89.02, + "cost": 0.239427 } } }, "staybnb_search": { "name": "StayBnB Search \u2014 Segmented Pill, Calendar & Guest Stepper", "results_by_model": { - "dashscope/qwen3.5-flash": { + "dashscope/qwen3.5-plus": { "passed": true, "task_score": 10.5, "task_max_score": 10.5, - "efficiency_score": 0.4515, - "usage_score": 0.8484, - "composite_score": 0.86, - "total_score": 11.8, - "duration": 296.2, - "cost": 0.227344 + "efficiency_score": 0.4977, + "usage_score": 0.631, + "composite_score": 0.8257, + "total_score": 11.63, + "duration": 271.24, + "cost": 0.553507 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 10.5, + "task_max_score": 10.5, + "efficiency_score": 0.5085, + "usage_score": 0.0436, + "composite_score": 0.7104, + "total_score": 11.05, + "duration": 265.43, + "cost": 1.434668 + }, + "dashscope/qwen3.5-flash": { + "passed": false, + "task_score": 6.0, + "task_max_score": 10.5, + "efficiency_score": 0, + "usage_score": 0.7994, + "composite_score": 0.1599, + "total_score": 6.8, + "duration": 
540.0, + "cost": 0.300937 + }, + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 10.5, "task_max_score": 10.5, - "efficiency_score": 0.3179, - "usage_score": 0.5094, - "composite_score": 0.7655, - "total_score": 11.33, - "duration": 368.33, - "cost": 0.73591 + "efficiency_score": 0.5867, + "usage_score": 0.6147, + "composite_score": 0.8403, + "total_score": 11.7, + "duration": 223.18, + "cost": 0.577881 } } }, "finviz_simple": { "name": "Finviz Simple Screener Test", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 3, + "task_max_score": 3, + "efficiency_score": 0.7377, + "usage_score": 0.8314, + "composite_score": 0.9138, + "total_score": 4.57, + "duration": 78.68, + "cost": 0.134851 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 3, + "task_max_score": 3, + "efficiency_score": 0.7348, + "usage_score": 0.5843, + "composite_score": 0.8638, + "total_score": 4.32, + "duration": 79.57, + "cost": 0.332596 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 3, "task_max_score": 3, - "efficiency_score": 0.7395, - "usage_score": 0.9366, - "composite_score": 0.9352, - "total_score": 4.68, - "duration": 78.16, - "cost": 0.050697 + "efficiency_score": 0.8202, + "usage_score": 0.9632, + "composite_score": 0.9567, + "total_score": 4.78, + "duration": 53.93, + "cost": 0.029475 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 3, "task_max_score": 3, - "efficiency_score": 0.6329, - "usage_score": 0.7633, - "composite_score": 0.8792, - "total_score": 4.4, - "duration": 110.13, - "cost": 0.189329 + "efficiency_score": 0.8576, + "usage_score": 0.8196, + "composite_score": 0.9355, + "total_score": 4.68, + "duration": 42.71, + "cost": 0.144301 } } }, "cloudstack_interactive": { "name": "CloudStack DAS Interactive Test", "results_by_model": { - "dashscope/qwen3.5-flash": { + "dashscope/qwen3.5-plus": { "passed": true, "task_score": 9.0, "task_max_score": 9.0, - 
"efficiency_score": 0.6279, - "usage_score": 0.8775, - "composite_score": 0.9011, - "total_score": 10.51, - "duration": 260.46, - "cost": 0.245038 + "efficiency_score": 0.6858, + "usage_score": 0.7779, + "composite_score": 0.8927, + "total_score": 10.46, + "duration": 219.93, + "cost": 0.44422 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 9.0, + "task_max_score": 9.0, + "efficiency_score": 0.31, + "usage_score": 0, + "composite_score": 0.662, + "total_score": 9.31, + "duration": 483.02, + "cost": 2.825362 + }, + "dashscope/qwen3.5-flash": { "passed": true, "task_score": 9.0, "task_max_score": 9.0, - "efficiency_score": 0.5607, - "usage_score": 0.6587, - "composite_score": 0.8439, - "total_score": 10.22, - "duration": 307.49, - "cost": 0.682633 + "efficiency_score": 0.4063, + "usage_score": 0.9426, + "composite_score": 0.8698, + "total_score": 10.35, + "duration": 415.62, + "cost": 0.114785 + }, + "dashscope/qwen3.6-flash": { + "passed": false, + "task_score": 6.0, + "task_max_score": 9.0, + "efficiency_score": 0.6337, + "usage_score": 0.5537, + "composite_score": 0.2375, + "total_score": 7.19, + "duration": 256.4, + "cost": 0.892651 } } }, "gbr": { "name": "GBR Search Test", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 2.5, + "task_max_score": 2.5, + "efficiency_score": 0.7938, + "usage_score": 0.8007, + "composite_score": 0.9189, + "total_score": 4.09, + "duration": 82.47, + "cost": 0.159417 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 2.5, + "task_max_score": 2.5, + "efficiency_score": 0.7871, + "usage_score": 0.4028, + "composite_score": 0.838, + "total_score": 3.69, + "duration": 85.16, + "cost": 0.477778 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 2.5, "task_max_score": 2.5, - "efficiency_score": 0.8267, - "usage_score": 0.9456, - "composite_score": 0.9545, - "total_score": 4.27, - "duration": 69.31, - "cost": 0.043503 + 
"efficiency_score": 0.8071, + "usage_score": 0.9737, + "composite_score": 0.9562, + "total_score": 4.28, + "duration": 77.15, + "cost": 0.021066 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 2.5, "task_max_score": 2.5, - "efficiency_score": 0.751, - "usage_score": 0.7607, - "composite_score": 0.9023, - "total_score": 4.01, - "duration": 99.6, - "cost": 0.191412 + "efficiency_score": 0.7598, + "usage_score": 0.8325, + "composite_score": 0.9185, + "total_score": 4.09, + "duration": 96.07, + "cost": 0.134005 } } }, "gmail_exec_followup": { "name": "Gmail Finance Follow-up", "results_by_model": { - "dashscope/qwen3.5-flash": { - "passed": true, - "task_score": 8.0, - "task_max_score": 8.0, - "efficiency_score": 0.2465, - "usage_score": 0.6405, - "composite_score": 0.7774, - "total_score": 8.89, - "duration": 497.29, - "cost": 0.503311 - }, "dashscope/qwen3.5-plus": { "passed": false, "task_score": 2.5, "task_max_score": 8.0, "efficiency_score": 0, - "usage_score": 0.9794, + "usage_score": 0.9793, "composite_score": 0.1959, "total_score": 3.48, "duration": 660.0, - "cost": 0.028871 + "cost": 0.028958 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 8.0, + "task_max_score": 8.0, + "efficiency_score": 0.437, + "usage_score": 0, + "composite_score": 0.6874, + "total_score": 8.44, + "duration": 371.61, + "cost": 2.039822 + }, + "dashscope/qwen3.5-flash": { + "passed": false, + "task_score": 4.5, + "task_max_score": 8.0, + "efficiency_score": 0.4675, + "usage_score": 0.8991, + "composite_score": 0.2733, + "total_score": 5.87, + "duration": 351.44, + "cost": 0.141273 + }, + "dashscope/qwen3.6-flash": { + "passed": true, + "task_score": 8.0, + "task_max_score": 8.0, + "efficiency_score": 0.5684, + "usage_score": 0.2989, + "composite_score": 0.7735, + "total_score": 8.87, + "duration": 284.86, + "cost": 0.981477 } } }, "booking_compare_and_book": { "name": "Booking Compare And Book", "results_by_model": { + 
"dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 10.0, + "task_max_score": 10.0, + "efficiency_score": 0.4587, + "usage_score": 0.5556, + "composite_score": 0.8029, + "total_score": 11.01, + "duration": 389.73, + "cost": 0.755515 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 10.0, + "task_max_score": 10.0, + "efficiency_score": 0.4328, + "usage_score": 0, + "composite_score": 0.6866, + "total_score": 10.43, + "duration": 408.36, + "cost": 2.230264 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 10.0, "task_max_score": 10.0, - "efficiency_score": 0.4376, - "usage_score": 0.8571, - "composite_score": 0.8589, - "total_score": 11.29, - "duration": 404.95, - "cost": 0.242864 + "efficiency_score": 0.5645, + "usage_score": 0.936, + "composite_score": 0.9001, + "total_score": 11.5, + "duration": 313.57, + "cost": 0.108752 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 10.0, "task_max_score": 10.0, - "efficiency_score": 0.3842, - "usage_score": 0.4168, - "composite_score": 0.7602, - "total_score": 10.8, - "duration": 443.36, - "cost": 0.991488 + "efficiency_score": 0.6659, + "usage_score": 0.6143, + "composite_score": 0.856, + "total_score": 11.28, + "duration": 240.57, + "cost": 0.655618 } } }, "github_pr_review": { "name": "GitHub PR Review", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 7.9, + "task_max_score": 9.0, + "efficiency_score": 0.4568, + "usage_score": 0.515, + "composite_score": 0.7944, + "total_score": 8.87, + "duration": 391.07, + "cost": 0.824436 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 9.0, + "task_max_score": 9.0, + "efficiency_score": 0.4807, + "usage_score": 0, + "composite_score": 0.6961, + "total_score": 9.48, + "duration": 373.87, + "cost": 1.88039 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 9.0, "task_max_score": 9.0, - "efficiency_score": 0.5984, - "usage_score": 
0.8812, - "composite_score": 0.8959, + "efficiency_score": 0.5542, + "usage_score": 0.9286, + "composite_score": 0.8966, "total_score": 10.48, - "duration": 289.14, - "cost": 0.201966 + "duration": 320.95, + "cost": 0.121409 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 9.0, "task_max_score": 9.0, - "efficiency_score": 0.4889, - "usage_score": 0.528, - "composite_score": 0.8034, - "total_score": 10.02, - "duration": 367.98, - "cost": 0.802473 + "efficiency_score": 0.6868, + "usage_score": 0.653, + "composite_score": 0.868, + "total_score": 10.34, + "duration": 225.49, + "cost": 0.589938 } } }, "vidhub_comment": { "name": "VidHub Comment \u2014 Description, Nested Replies & Volume Slider", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 15.0, + "task_max_score": 15.0, + "efficiency_score": 0.5312, + "usage_score": 0.7441, + "composite_score": 0.8551, + "total_score": 16.28, + "duration": 281.26, + "cost": 0.511721 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 15.0, + "task_max_score": 15.0, + "efficiency_score": 0.3915, + "usage_score": 0.0827, + "composite_score": 0.6948, + "total_score": 15.47, + "duration": 365.12, + "cost": 1.834634 + }, "dashscope/qwen3.5-flash": { "passed": true, - "task_score": 14.0, + "task_score": 15.0, "task_max_score": 15.0, - "efficiency_score": 0.4461, - "usage_score": 0.9121, - "composite_score": 0.8716, - "total_score": 15.36, - "duration": 332.34, - "cost": 0.175805 + "efficiency_score": 0.3142, + "usage_score": 0.8691, + "composite_score": 0.8367, + "total_score": 16.18, + "duration": 411.5, + "cost": 0.261757 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 15.0, "task_max_score": 15.0, - "efficiency_score": 0.5174, - "usage_score": 0.6548, - "composite_score": 0.8345, - "total_score": 16.17, - "duration": 289.56, - "cost": 0.690308 + "efficiency_score": 0.6577, + "usage_score": 
0.6984, + "composite_score": 0.8712, + "total_score": 16.36, + "duration": 205.39, + "cost": 0.603116 } } }, "techforum_reply": { "name": "TechForum Comment Reply Test", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 9.5, + "task_max_score": 9.5, + "efficiency_score": 0.7189, + "usage_score": 0.7062, + "composite_score": 0.885, + "total_score": 10.93, + "duration": 140.57, + "cost": 0.293767 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 9.5, + "task_max_score": 9.5, + "efficiency_score": 0.5932, + "usage_score": 0, + "composite_score": 0.7186, + "total_score": 10.09, + "duration": 203.41, + "cost": 1.025706 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 9.5, "task_max_score": 9.5, - "efficiency_score": 0.6244, - "usage_score": 0.8584, - "composite_score": 0.8966, - "total_score": 10.98, - "duration": 187.79, - "cost": 0.14158 + "efficiency_score": 0.7091, + "usage_score": 0.9454, + "composite_score": 0.9309, + "total_score": 11.15, + "duration": 145.44, + "cost": 0.054579 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 9.5, "task_max_score": 9.5, - "efficiency_score": 0.5832, - "usage_score": 0.5581, - "composite_score": 0.8283, - "total_score": 10.64, - "duration": 208.39, - "cost": 0.441892 + "efficiency_score": 0.5407, + "usage_score": 0.5427, + "composite_score": 0.8167, + "total_score": 10.58, + "duration": 229.66, + "cost": 0.457308 } } }, "replay_techforum_upvote": { "name": "Replay: TechForum search + upvote AI agent posts", "results_by_model": { - "dashscope/qwen3.5-flash": { + "dashscope/qwen3.5-plus": { "passed": false, "task_score": 4, "task_max_score": 10, - "efficiency_score": 0.6262, - "usage_score": 0.8152, - "composite_score": 0.2883, - "total_score": 5.44, - "duration": 224.29, - "cost": 0.184767 + "efficiency_score": 0.4925, + "usage_score": 0.3945, + "composite_score": 0.1774, + "total_score": 4.89, + "duration": 304.53, + 
"cost": 0.605525 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-plus": { + "passed": false, + "task_score": 4, + "task_max_score": 10, + "efficiency_score": 0.3579, + "usage_score": 0, + "composite_score": 0.0716, + "total_score": 4.36, + "duration": 385.24, + "cost": 2.098546 + }, + "dashscope/qwen3.5-flash": { "passed": false, "task_score": 4, "task_max_score": 10, - "efficiency_score": 0.4832, - "usage_score": 0.173, - "composite_score": 0.1312, - "total_score": 4.66, - "duration": 310.1, - "cost": 0.82697 + "efficiency_score": 0.538, + "usage_score": 0.8463, + "composite_score": 0.2769, + "total_score": 5.38, + "duration": 277.18, + "cost": 0.153673 + }, + "dashscope/qwen3.6-flash": { + "passed": true, + "task_score": 8, + "task_max_score": 10, + "efficiency_score": 0.4149, + "usage_score": 0, + "composite_score": 0.683, + "total_score": 8.41, + "duration": 351.04, + "cost": 1.737265 } } }, "replay_finviz_filter_simple": { "name": "Replay: Finviz multi-filter screening routine", "results_by_model": { - "dashscope/qwen3.5-flash": { + "dashscope/qwen3.5-plus": { "passed": true, "task_score": 12, "task_max_score": 12, "efficiency_score": 0, - "usage_score": 0.9926, - "composite_score": 0.7985, - "total_score": 12.99, + "usage_score": 0, + "composite_score": 0.6, + "total_score": 12.0, "duration": 600.0, - "cost": 0.007396 + "cost": 2.873768 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-plus": { "passed": true, "task_score": 12, "task_max_score": 12, "efficiency_score": 0, - "usage_score": 0.9677, - "composite_score": 0.7935, - "total_score": 12.97, + "usage_score": 0.9118, + "composite_score": 0.7824, + "total_score": 12.91, "duration": 600.0, - "cost": 0.032326 + "cost": 0.088226 + }, + "dashscope/qwen3.5-flash": { + "passed": true, + "task_score": 12, + "task_max_score": 12, + "efficiency_score": 0.1621, + "usage_score": 0.2823, + "composite_score": 0.6889, + "total_score": 12.44, + "duration": 502.76, + "cost": 0.71766 + }, + 
"dashscope/qwen3.6-flash": { + "passed": true, + "task_score": 12, + "task_max_score": 12, + "efficiency_score": 0.6688, + "usage_score": 0, + "composite_score": 0.7338, + "total_score": 12.67, + "duration": 198.7, + "cost": 1.402601 } } }, "taskflow_full_workflow": { "name": "TaskFlow Full Workflow \u2014 Create, Label, Drag & Filter", "results_by_model": { - "dashscope/qwen3.5-flash": { + "dashscope/qwen3.5-plus": { "passed": true, "task_score": 13.0, "task_max_score": 13.0, - "efficiency_score": 0.2511, - "usage_score": 0.7271, - "composite_score": 0.7956, - "total_score": 13.98, - "duration": 449.32, - "cost": 0.545825 + "efficiency_score": 0.2341, + "usage_score": 0.5225, + "composite_score": 0.7513, + "total_score": 13.76, + "duration": 459.54, + "cost": 0.954963 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 13.0, + "task_max_score": 13.0, + "efficiency_score": 0.1962, + "usage_score": 0, + "composite_score": 0.6392, + "total_score": 13.2, + "duration": 482.29, + "cost": 2.894566 + }, + "dashscope/qwen3.5-flash": { + "passed": false, + "task_score": 3.5, + "task_max_score": 13.0, + "efficiency_score": 0, + "usage_score": 0.9978, + "composite_score": 0.1996, + "total_score": 4.5, + "duration": 600.0, + "cost": 0.00435 + }, + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 13.0, "task_max_score": 13.0, - "efficiency_score": 0.1963, - "usage_score": 0.2888, - "composite_score": 0.697, - "total_score": 13.49, - "duration": 482.24, - "cost": 1.422402 + "efficiency_score": 0.4628, + "usage_score": 0.4912, + "composite_score": 0.7908, + "total_score": 13.95, + "duration": 322.35, + "cost": 1.017561 } } }, "bluebook_complex": { "name": "BlueBook Multi-Image Reply Test", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 12.0, + "task_max_score": 12.0, + "efficiency_score": 0.6479, + "usage_score": 0.7556, + "composite_score": 0.8807, + "total_score": 13.4, + "duration": 
176.05, + "cost": 0.293316 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 12.0, + "task_max_score": 12.0, + "efficiency_score": 0.5427, + "usage_score": 0, + "composite_score": 0.7085, + "total_score": 12.54, + "duration": 228.64, + "cost": 1.204512 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 12.0, "task_max_score": 12.0, - "efficiency_score": 0.6286, - "usage_score": 0.8847, - "composite_score": 0.9027, - "total_score": 13.51, - "duration": 185.72, - "cost": 0.138317 + "efficiency_score": 0.5102, + "usage_score": 0.9037, + "composite_score": 0.8828, + "total_score": 13.41, + "duration": 244.91, + "cost": 0.115607 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 12.0, "task_max_score": 12.0, - "efficiency_score": 0.5458, - "usage_score": 0.6418, - "composite_score": 0.8375, - "total_score": 13.19, - "duration": 227.08, - "cost": 0.42988 + "efficiency_score": 0.7614, + "usage_score": 0.7411, + "composite_score": 0.9005, + "total_score": 13.5, + "duration": 119.3, + "cost": 0.310701 } } }, "drive_bulk_release_assets": { "name": "Drive Bulk Release Assets", "results_by_model": { - "dashscope/qwen3.5-flash": { + "dashscope/qwen3.5-plus": { "passed": true, "task_score": 10.0, "task_max_score": 10.0, - "efficiency_score": 0.5403, - "usage_score": 0.7771, - "composite_score": 0.8635, - "total_score": 11.32, - "duration": 450.46, - "cost": 0.467991 + "efficiency_score": 0.3539, + "usage_score": 0.3845, + "composite_score": 0.7477, + "total_score": 10.74, + "duration": 633.13, + "cost": 1.292653 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-plus": { "passed": true, "task_score": 10.0, "task_max_score": 10.0, - "efficiency_score": 0.4682, - "usage_score": 0.4083, - "composite_score": 0.7753, - "total_score": 10.88, - "duration": 521.19, - "cost": 1.2426 + "efficiency_score": 0.1832, + "usage_score": 0, + "composite_score": 0.6366, + "total_score": 10.18, + "duration": 800.42, + 
"cost": 4.123654 + }, + "dashscope/qwen3.5-flash": { + "passed": false, + "task_score": 4.8, + "task_max_score": 10.0, + "efficiency_score": 0, + "usage_score": 0.6883, + "composite_score": 0.1377, + "total_score": 5.49, + "duration": 980.0, + "cost": 0.654507 + }, + "dashscope/qwen3.6-flash": { + "passed": true, + "task_score": 10.0, + "task_max_score": 10.0, + "efficiency_score": 0.6202, + "usage_score": 0.3571, + "composite_score": 0.7955, + "total_score": 10.98, + "duration": 372.18, + "cost": 1.349999 } } }, "booking_family_trip_edgecase": { "name": "Booking Family Trip Edge Case", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 11.0, + "task_max_score": 11.0, + "efficiency_score": 0.4581, + "usage_score": 0.5148, + "composite_score": 0.7946, + "total_score": 11.97, + "duration": 563.56, + "cost": 1.16445 + }, + "dashscope/qwen3.6-plus": { + "passed": false, + "task_score": 6.6, + "task_max_score": 11.0, + "efficiency_score": 0.49, + "usage_score": 0, + "composite_score": 0.098, + "total_score": 7.09, + "duration": 530.37, + "cost": 2.95212 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 11.0, "task_max_score": 11.0, - "efficiency_score": 0.5848, - "usage_score": 0.8508, - "composite_score": 0.8871, - "total_score": 12.44, - "duration": 431.85, - "cost": 0.358165 + "efficiency_score": 0.551, + "usage_score": 0.9229, + "composite_score": 0.8948, + "total_score": 12.47, + "duration": 466.92, + "cost": 0.185072 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 11.0, "task_max_score": 11.0, - "efficiency_score": 0.3089, - "usage_score": 0.3461, - "composite_score": 0.731, - "total_score": 11.66, - "duration": 718.75, - "cost": 1.569323 + "efficiency_score": 0.6381, + "usage_score": 0.5162, + "composite_score": 0.8309, + "total_score": 12.15, + "duration": 376.42, + "cost": 1.161117 } } }, "techforum": { "name": "TechForum Upvote Test", "results_by_model": { + 
"dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 2, + "task_max_score": 2, + "efficiency_score": 0.8547, + "usage_score": 0.88, + "composite_score": 0.9469, + "total_score": 3.73, + "duration": 43.59, + "cost": 0.060018 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 2, + "task_max_score": 2, + "efficiency_score": 0.8486, + "usage_score": 0.5989, + "composite_score": 0.8895, + "total_score": 3.45, + "duration": 45.41, + "cost": 0.20057 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 2, "task_max_score": 2, - "efficiency_score": 0.8991, - "usage_score": 0.9625, - "composite_score": 0.9723, - "total_score": 3.86, - "duration": 30.27, - "cost": 0.018734 + "efficiency_score": 0.8941, + "usage_score": 0.9811, + "composite_score": 0.975, + "total_score": 3.88, + "duration": 31.78, + "cost": 0.009458 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 2, "task_max_score": 2, - "efficiency_score": 0.8039, - "usage_score": 0.8354, - "composite_score": 0.9279, - "total_score": 3.64, - "duration": 58.82, - "cost": 0.082305 + "efficiency_score": 0.9213, + "usage_score": 0.8786, + "composite_score": 0.96, + "total_score": 3.8, + "duration": 23.61, + "cost": 0.060693 } } }, "gmail_inbox_cleanup": { "name": "Gmail Inbox Cleanup", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 7.0, + "task_max_score": 7.0, + "efficiency_score": 0.2853, + "usage_score": 0.2957, + "composite_score": 0.7162, + "total_score": 7.58, + "duration": 428.84, + "cost": 0.8452 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 7.0, + "task_max_score": 7.0, + "efficiency_score": 0.2697, + "usage_score": 0, + "composite_score": 0.6539, + "total_score": 7.27, + "duration": 438.17, + "cost": 2.390548 + }, "dashscope/qwen3.5-flash": { - "passed": false, - "task_score": 2.0, + "passed": true, + "task_score": 7.0, "task_max_score": 7.0, - "efficiency_score": 0, - 
"usage_score": 0.3895, - "composite_score": 0.0779, - "total_score": 2.39, - "duration": 600.0, - "cost": 0.73264 + "efficiency_score": 0.4339, + "usage_score": 0.8423, + "composite_score": 0.8552, + "total_score": 8.28, + "duration": 339.69, + "cost": 0.189242 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 7.0, "task_max_score": 7.0, - "efficiency_score": 0.3282, - "usage_score": 0.254, - "composite_score": 0.7164, - "total_score": 7.58, - "duration": 403.07, - "cost": 0.895225 + "efficiency_score": 0.1362, + "usage_score": 0, + "composite_score": 0.6272, + "total_score": 7.14, + "duration": 518.29, + "cost": 1.930638 } } }, "finviz_complex": { "name": "Finviz Multi-Filter Screener Test", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 5.0, + "task_max_score": 5.0, + "efficiency_score": 0.5856, + "usage_score": 0.6048, + "composite_score": 0.8381, + "total_score": 6.19, + "duration": 165.77, + "cost": 0.39524 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 5.0, + "task_max_score": 5.0, + "efficiency_score": 0.4123, + "usage_score": 0, + "composite_score": 0.6825, + "total_score": 5.41, + "duration": 235.07, + "cost": 1.010636 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 5.0, "task_max_score": 5.0, - "efficiency_score": 0.6676, - "usage_score": 0.9017, - "composite_score": 0.9139, - "total_score": 6.57, - "duration": 132.95, - "cost": 0.098264 + "efficiency_score": 0.7286, + "usage_score": 0.9312, + "composite_score": 0.932, + "total_score": 6.66, + "duration": 108.54, + "cost": 0.068788 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 5.0, "task_max_score": 5.0, - "efficiency_score": 0.3679, - "usage_score": 0.4354, - "composite_score": 0.7607, - "total_score": 5.8, - "duration": 252.84, - "cost": 0.564636 + "efficiency_score": 0.5856, + "usage_score": 0.1863, + "composite_score": 0.7544, + 
"total_score": 5.77, + "duration": 165.76, + "cost": 0.813659 } } }, "mapquest_nearby_pins": { "name": "MapQuest Nearby Pins \u2014 Scroll Chips, Ambiguous Pins & Directions", "results_by_model": { - "dashscope/qwen3.5-flash": { + "dashscope/qwen3.5-plus": { "passed": false, - "task_score": 6.5, + "task_score": 4.5, "task_max_score": 12.0, - "efficiency_score": 0.6854, - "usage_score": 0.9316, - "composite_score": 0.3234, - "total_score": 8.12, - "duration": 188.76, - "cost": 0.136896 + "efficiency_score": 0.6763, + "usage_score": 0.852, + "composite_score": 0.3057, + "total_score": 6.03, + "duration": 194.19, + "cost": 0.29601 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-plus": { "passed": false, "task_score": 4.5, "task_max_score": 12.0, - "efficiency_score": 0.2238, - "usage_score": 0.569, - "composite_score": 0.1585, - "total_score": 5.29, - "duration": 465.73, - "cost": 0.862091 + "efficiency_score": 0, + "usage_score": 0, + "composite_score": 0, + "total_score": 4.5, + "duration": 600.0, + "cost": 2.849986 + }, + "dashscope/qwen3.5-flash": { + "passed": false, + "task_score": 5.0, + "task_max_score": 12.0, + "efficiency_score": 0, + "usage_score": 0.8524, + "composite_score": 0.1705, + "total_score": 5.85, + "duration": 600.0, + "cost": 0.295195 + }, + "dashscope/qwen3.6-flash": { + "passed": false, + "task_score": 7.5, + "task_max_score": 12.0, + "efficiency_score": 0.3231, + "usage_score": 0.0182, + "composite_score": 0.0683, + "total_score": 7.84, + "duration": 406.13, + "cost": 1.963687 } } }, "cloudstack": { "name": "CloudStack DAS Agent Test", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 3.5, + "task_max_score": 3.5, + "efficiency_score": 0.7441, + "usage_score": 0.8197, + "composite_score": 0.9128, + "total_score": 5.06, + "duration": 127.94, + "cost": 0.21634 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 3.5, + "task_max_score": 3.5, + "efficiency_score": 0.3785, + "usage_score": 
0, + "composite_score": 0.6757, + "total_score": 3.88, + "duration": 310.75, + "cost": 1.686722 + }, "dashscope/qwen3.5-flash": { - "passed": false, - "task_score": 0.5, + "passed": true, + "task_score": 3.5, "task_max_score": 3.5, - "efficiency_score": 0, - "usage_score": 0.5224, - "composite_score": 0.1045, - "total_score": 1.02, - "duration": 500.0, - "cost": 0.57314 + "efficiency_score": 0.4979, + "usage_score": 0.9026, + "composite_score": 0.8801, + "total_score": 4.9, + "duration": 251.04, + "cost": 0.116935 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 3.5, "task_max_score": 3.5, - "efficiency_score": 0.4456, - "usage_score": 0.5544, - "composite_score": 0.8, - "total_score": 4.5, - "duration": 277.18, - "cost": 0.534674 + "efficiency_score": 0.8323, + "usage_score": 0.8156, + "composite_score": 0.9296, + "total_score": 5.15, + "duration": 83.85, + "cost": 0.221246 } } }, "staybnb_book": { "name": "StayBnB Book \u2014 Filters, Gallery & Two-Step Booking", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": false, + "task_score": 11.0, + "task_max_score": 15.0, + "efficiency_score": 0.1707, + "usage_score": 0.4928, + "composite_score": 0.1327, + "total_score": 11.66, + "duration": 497.55, + "cost": 1.014492 + }, + "dashscope/qwen3.6-plus": { + "passed": false, + "task_score": 3.0, + "task_max_score": 15.0, + "efficiency_score": 0, + "usage_score": 0.9615, + "composite_score": 0.1923, + "total_score": 3.96, + "duration": 600.0, + "cost": 0.077058 + }, "dashscope/qwen3.5-flash": { "passed": false, - "task_score": 3.5, + "task_score": 6.5, "task_max_score": 15.0, "efficiency_score": 0, - "usage_score": 0.9966, - "composite_score": 0.1993, - "total_score": 4.5, + "usage_score": 0.9978, + "composite_score": 0.1996, + "total_score": 7.5, "duration": 600.0, - "cost": 0.006756 + "cost": 0.0044 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": false, - "task_score": 6.0, + 
"task_score": 4.0, "task_max_score": 15.0, "efficiency_score": 0, - "usage_score": 0.9853, - "composite_score": 0.1971, - "total_score": 6.99, + "usage_score": 0.9851, + "composite_score": 0.197, + "total_score": 4.99, "duration": 600.0, - "cost": 0.029382 + "cost": 0.029767 } } }, "mapquest_navigate": { "name": "MapQuest Navigate \u2014 Autocomplete, Directions & Collapse", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 8.0, + "task_max_score": 9.5, + "efficiency_score": 0.5912, + "usage_score": 0.7827, + "composite_score": 0.8748, + "total_score": 9.37, + "duration": 220.78, + "cost": 0.326005 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 8.0, + "task_max_score": 9.5, + "efficiency_score": 0.5919, + "usage_score": 0.3148, + "composite_score": 0.7813, + "total_score": 8.91, + "duration": 220.37, + "cost": 1.027844 + }, "dashscope/qwen3.5-flash": { "passed": true, - "task_score": 9.5, + "task_score": 8.0, "task_max_score": 9.5, - "efficiency_score": 0.718, - "usage_score": 0.9293, - "composite_score": 0.9295, - "total_score": 11.15, - "duration": 152.28, - "cost": 0.106015 + "efficiency_score": 0.4887, + "usage_score": 0.9196, + "composite_score": 0.8817, + "total_score": 9.41, + "duration": 276.08, + "cost": 0.120603 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 9.5, "task_max_score": 9.5, - "efficiency_score": 0.6312, - "usage_score": 0.6976, - "composite_score": 0.8658, - "total_score": 10.83, - "duration": 199.16, - "cost": 0.453613 + "efficiency_score": 0.5911, + "usage_score": 0.5741, + "composite_score": 0.8331, + "total_score": 10.67, + "duration": 220.79, + "cost": 0.638787 } } }, "booking_room_selection": { "name": "Booking Room Selection", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 9.0, + "task_max_score": 9.0, + "efficiency_score": 0.4767, + "usage_score": 0.6625, + "composite_score": 0.8278, + 
"total_score": 10.14, + "duration": 345.35, + "cost": 0.506323 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 9.0, + "task_max_score": 9.0, + "efficiency_score": 0.6022, + "usage_score": 0.0811, + "composite_score": 0.7367, + "total_score": 9.68, + "duration": 262.53, + "cost": 1.378412 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 9.0, "task_max_score": 9.0, - "efficiency_score": 0.6813, - "usage_score": 0.9085, - "composite_score": 0.918, - "total_score": 10.59, - "duration": 210.36, - "cost": 0.137228 + "efficiency_score": 0.6976, + "usage_score": 0.9569, + "composite_score": 0.9309, + "total_score": 10.65, + "duration": 199.58, + "cost": 0.064583 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 9.0, "task_max_score": 9.0, - "efficiency_score": 0.4989, - "usage_score": 0.625, - "composite_score": 0.8248, - "total_score": 10.12, - "duration": 330.74, - "cost": 0.562517 + "efficiency_score": 0.7497, + "usage_score": 0.7346, + "composite_score": 0.8969, + "total_score": 10.48, + "duration": 165.17, + "cost": 0.398131 } } }, "vidhub_player": { "name": "VidHub Player \u2014 Search, Auto-Hide Controls & Nested Settings", "results_by_model": { - "dashscope/qwen3.5-flash": { + "dashscope/qwen3.5-plus": { "passed": true, "task_score": 12.0, "task_max_score": 12.0, - "efficiency_score": 0.4243, - "usage_score": 0.8463, - "composite_score": 0.8541, - "total_score": 13.27, - "duration": 310.86, - "cost": 0.230535 + "efficiency_score": 0.426, + "usage_score": 0.7, + "composite_score": 0.8252, + "total_score": 13.13, + "duration": 309.94, + "cost": 0.450043 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-plus": { "passed": true, "task_score": 12.0, "task_max_score": 12.0, - "efficiency_score": 0.322, - "usage_score": 0.4556, - "composite_score": 0.7555, - "total_score": 12.78, - "duration": 366.11, - "cost": 0.816651 + "efficiency_score": 0.5301, + "usage_score": 0.1568, + 
"composite_score": 0.7374, + "total_score": 12.69, + "duration": 253.74, + "cost": 1.264764 + }, + "dashscope/qwen3.5-flash": { + "passed": false, + "task_score": 9.0, + "task_max_score": 12.0, + "efficiency_score": 0.5895, + "usage_score": 0.9973, + "composite_score": 0.3174, + "total_score": 10.59, + "duration": 221.69, + "cost": 0.004014 + }, + "dashscope/qwen3.6-flash": { + "passed": true, + "task_score": 12.0, + "task_max_score": 12.0, + "efficiency_score": 0.6487, + "usage_score": 0.6843, + "composite_score": 0.8666, + "total_score": 13.33, + "duration": 189.72, + "cost": 0.473529 } } }, "amazon_variant_checkout": { "name": "Amazon Variant Checkout", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 10.2, + "task_max_score": 10.2, + "efficiency_score": 0.5884, + "usage_score": 0.6921, + "composite_score": 0.8561, + "total_score": 11.48, + "duration": 288.12, + "cost": 0.492712 + }, + "dashscope/qwen3.6-plus": { + "passed": false, + "task_score": 5.0, + "task_max_score": 10.2, + "efficiency_score": 0.6347, + "usage_score": 0.225, + "composite_score": 0.1719, + "total_score": 5.86, + "duration": 255.74, + "cost": 1.239986 + }, "dashscope/qwen3.5-flash": { "passed": false, - "task_score": 6.6, + "task_score": 5.0, "task_max_score": 10.2, - "efficiency_score": 0.7473, - "usage_score": 0.926, - "composite_score": 0.3347, - "total_score": 8.27, - "duration": 176.91, - "cost": 0.118396 + "efficiency_score": 0.6838, + "usage_score": 0.9485, + "composite_score": 0.3265, + "total_score": 6.63, + "duration": 221.32, + "cost": 0.08246 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 10.2, "task_max_score": 10.2, - "efficiency_score": 0.568, - "usage_score": 0.6435, - "composite_score": 0.8423, - "total_score": 11.41, - "duration": 302.37, - "cost": 0.570381 + "efficiency_score": 0.7326, + "usage_score": 0.7309, + "composite_score": 0.8927, + "total_score": 11.66, + "duration": 187.2, + 
"cost": 0.430623 } } }, "taskflow_drag_and_edit": { "name": "TaskFlow Drag & Edit \u2014 DnD, Checklist & Hover Quick-Edit", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 11.5, + "task_max_score": 11.5, + "efficiency_score": 0.4427, + "usage_score": 0.6826, + "composite_score": 0.8251, + "total_score": 12.63, + "duration": 300.93, + "cost": 0.476145 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 11.5, + "task_max_score": 11.5, + "efficiency_score": 0.297, + "usage_score": 0, + "composite_score": 0.6594, + "total_score": 11.8, + "duration": 379.59, + "cost": 2.219862 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 11.5, "task_max_score": 11.5, - "efficiency_score": 0.5095, - "usage_score": 0.8324, - "composite_score": 0.8684, - "total_score": 12.84, - "duration": 264.88, - "cost": 0.251339 + "efficiency_score": 0.4866, + "usage_score": 0.9064, + "composite_score": 0.8786, + "total_score": 12.89, + "duration": 277.22, + "cost": 0.140389 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 11.5, "task_max_score": 11.5, - "efficiency_score": 0.4518, - "usage_score": 0.5412, - "composite_score": 0.7986, - "total_score": 12.49, - "duration": 296.03, - "cost": 0.688214 + "efficiency_score": 0.6236, + "usage_score": 0.6087, + "composite_score": 0.8465, + "total_score": 12.73, + "duration": 203.23, + "cost": 0.58696 } } }, "amazon_offer_disambiguation": { "name": "Amazon Offer Disambiguation", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": false, + "task_score": 7.0, + "task_max_score": 10.0, + "efficiency_score": 0.6958, + "usage_score": 0.7674, + "composite_score": 0.2926, + "total_score": 8.46, + "duration": 310.29, + "cost": 0.534907 + }, + "dashscope/qwen3.6-plus": { + "passed": false, + "task_score": 6.2, + "task_max_score": 10.0, + "efficiency_score": 0.7621, + "usage_score": 0.4689, + "composite_score": 0.2462, + "total_score": 
7.43, + "duration": 242.63, + "cost": 1.221438 + }, "dashscope/qwen3.5-flash": { - "passed": true, - "task_score": 10.0, + "passed": false, + "task_score": 6.2, "task_max_score": 10.0, - "efficiency_score": 0.7899, - "usage_score": 0.9319, - "composite_score": 0.9444, - "total_score": 11.72, - "duration": 214.31, - "cost": 0.156626 + "efficiency_score": 0.8065, + "usage_score": 0.9632, + "composite_score": 0.354, + "total_score": 7.97, + "duration": 197.35, + "cost": 0.084537 }, - "dashscope/qwen3.5-plus": { - "passed": true, - "task_score": 10.0, + "dashscope/qwen3.6-flash": { + "passed": false, + "task_score": 6.2, "task_max_score": 10.0, - "efficiency_score": 0.7078, - "usage_score": 0.7271, - "composite_score": 0.887, - "total_score": 11.43, - "duration": 298.02, - "cost": 0.627642 + "efficiency_score": 0.6384, + "usage_score": 0.4532, + "composite_score": 0.2183, + "total_score": 7.29, + "duration": 368.83, + "cost": 1.257627 } } }, "drive_permission_cleanup": { "name": "Drive Permission Cleanup", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 6.6, + "task_max_score": 6.6, + "efficiency_score": 0.5956, + "usage_score": 0.6566, + "composite_score": 0.8504, + "total_score": 7.85, + "duration": 250.72, + "cost": 0.446434 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 6.6, + "task_max_score": 6.6, + "efficiency_score": 0.6481, + "usage_score": 0.0826, + "composite_score": 0.7461, + "total_score": 7.33, + "duration": 218.19, + "cost": 1.19258 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 6.6, "task_max_score": 6.6, - "efficiency_score": 0.6004, - "usage_score": 0.8551, - "composite_score": 0.8911, - "total_score": 8.06, - "duration": 247.76, - "cost": 0.188329 + "efficiency_score": 0.7326, + "usage_score": 0.9501, + "composite_score": 0.9365, + "total_score": 8.28, + "duration": 165.79, + "cost": 0.064861 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, 
"task_score": 6.6, "task_max_score": 6.6, - "efficiency_score": 0.637, - "usage_score": 0.656, - "composite_score": 0.8586, - "total_score": 7.89, - "duration": 225.03, - "cost": 0.44721 + "efficiency_score": 0.7287, + "usage_score": 0.6919, + "composite_score": 0.8841, + "total_score": 8.02, + "duration": 168.22, + "cost": 0.400559 } } }, "dataflow": { "name": "DataFlow Visual Challenge Test", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 3, + "task_max_score": 3, + "efficiency_score": 0.7468, + "usage_score": 0.4326, + "composite_score": 0.8359, + "total_score": 4.18, + "duration": 151.91, + "cost": 0.283722 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 3, + "task_max_score": 3, + "efficiency_score": 0.7702, + "usage_score": 0, + "composite_score": 0.754, + "total_score": 3.77, + "duration": 137.89, + "cost": 0.704298 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 3, "task_max_score": 3, - "efficiency_score": 0.7815, - "usage_score": 0.815, - "composite_score": 0.9193, - "total_score": 4.6, - "duration": 131.08, - "cost": 0.092512 + "efficiency_score": 0.836, + "usage_score": 0.9236, + "composite_score": 0.9519, + "total_score": 4.76, + "duration": 98.39, + "cost": 0.038215 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 3, "task_max_score": 3, - "efficiency_score": 0.7198, - "usage_score": 0.3416, - "composite_score": 0.8123, - "total_score": 4.06, - "duration": 168.1, - "cost": 0.329175 + "efficiency_score": 0.8137, + "usage_score": 0.3276, + "composite_score": 0.8283, + "total_score": 4.14, + "duration": 111.77, + "cost": 0.336204 } } }, "gbr_detailed": { "name": "GBR Detailed Search & Read Test", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 7.0, + "task_max_score": 7.0, + "efficiency_score": 0.4135, + "usage_score": 0.5631, + "composite_score": 0.7953, + "total_score": 7.98, + "duration": 
351.93, + "cost": 0.655302 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 7.0, + "task_max_score": 7.0, + "efficiency_score": 0.624, + "usage_score": 0.3011, + "composite_score": 0.785, + "total_score": 7.93, + "duration": 225.6, + "cost": 1.048376 + }, "dashscope/qwen3.5-flash": { "passed": true, - "task_score": 6.0, + "task_score": 7.0, "task_max_score": 7.0, - "efficiency_score": 0, - "usage_score": 0.6476, - "composite_score": 0.7295, - "total_score": 6.65, - "duration": 600.0, - "cost": 0.528554 + "efficiency_score": 0.7669, + "usage_score": 0.957, + "composite_score": 0.9448, + "total_score": 8.72, + "duration": 139.87, + "cost": 0.064543 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 7.0, "task_max_score": 7.0, - "efficiency_score": 0.6981, - "usage_score": 0.7519, - "composite_score": 0.89, - "total_score": 8.45, - "duration": 181.11, - "cost": 0.372191 + "efficiency_score": 0.8069, + "usage_score": 0.7784, + "composite_score": 0.9171, + "total_score": 8.59, + "duration": 115.86, + "cost": 0.332446 } } }, "gmail_vendor_escalation": { "name": "Gmail Vendor Escalation", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 9.0, + "task_max_score": 9.0, + "efficiency_score": 0.2928, + "usage_score": 0.3349, + "composite_score": 0.7255, + "total_score": 9.63, + "duration": 636.51, + "cost": 1.463242 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 9.0, + "task_max_score": 9.0, + "efficiency_score": 0.1991, + "usage_score": 0, + "composite_score": 0.6398, + "total_score": 9.2, + "duration": 720.83, + "cost": 3.8991 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 9.0, "task_max_score": 9.0, - "efficiency_score": 0.4374, - "usage_score": 0.827, - "composite_score": 0.8529, - "total_score": 10.26, - "duration": 506.37, - "cost": 0.380624 + "efficiency_score": 0.5071, + "usage_score": 0.8883, + "composite_score": 0.8791, + 
"total_score": 10.4, + "duration": 443.58, + "cost": 0.245844 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 9.0, "task_max_score": 9.0, - "efficiency_score": 0.3021, - "usage_score": 0.1622, - "composite_score": 0.6929, - "total_score": 9.46, - "duration": 628.08, - "cost": 1.843213 + "efficiency_score": 0.7184, + "usage_score": 0.6507, + "composite_score": 0.8738, + "total_score": 10.37, + "duration": 253.43, + "cost": 0.768437 } } }, "northstar_add_bag": { "name": "Northstar Fit Guide + Add To Bag Test", "results_by_model": { - "dashscope/qwen3.5-flash": { + "dashscope/qwen3.5-plus": { "passed": true, "task_score": 6.0, "task_max_score": 6.0, - "efficiency_score": 0.6161, - "usage_score": 0.8851, - "composite_score": 0.9002, - "total_score": 7.5, - "duration": 207.28, - "cost": 0.137898 + "efficiency_score": 0.741, + "usage_score": 0.8165, + "composite_score": 0.9115, + "total_score": 7.56, + "duration": 139.87, + "cost": 0.220144 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 6.0, + "task_max_score": 6.0, + "efficiency_score": 0.6445, + "usage_score": 0.0682, + "composite_score": 0.7425, + "total_score": 6.71, + "duration": 191.99, + "cost": 1.118196 + }, + "dashscope/qwen3.5-flash": { "passed": true, "task_score": 6.0, "task_max_score": 6.0, - "efficiency_score": 0.7038, - "usage_score": 0.7457, - "composite_score": 0.8899, - "total_score": 7.45, - "duration": 159.94, - "cost": 0.305123 + "efficiency_score": 0.7861, + "usage_score": 0.9661, + "composite_score": 0.9504, + "total_score": 7.75, + "duration": 115.53, + "cost": 0.040695 + }, + "dashscope/qwen3.6-flash": { + "passed": false, + "task_score": 4.0, + "task_max_score": 6.0, + "efficiency_score": 0, + "usage_score": 0, + "composite_score": 0, + "total_score": 4.0, + "duration": 540.0, + "cost": 2.168348 } } }, "drive_project_reorg": { "name": "Drive Project Reorg", "results_by_model": { + 
"dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 7.5, + "task_max_score": 7.5, + "efficiency_score": 0.3017, + "usage_score": 0.4696, + "composite_score": 0.7543, + "total_score": 8.27, + "duration": 460.89, + "cost": 0.795605 + }, + "dashscope/qwen3.6-plus": { + "passed": false, + "task_score": 5.5, + "task_max_score": 7.5, + "efficiency_score": 0.3309, + "usage_score": 0, + "composite_score": 0.0662, + "total_score": 5.83, + "duration": 441.61, + "cost": 2.371436 + }, "dashscope/qwen3.5-flash": { + "passed": false, + "task_score": 3.5, + "task_max_score": 7.5, + "efficiency_score": 0.1886, + "usage_score": 0.7861, + "composite_score": 0.1949, + "total_score": 4.47, + "duration": 535.52, + "cost": 0.32085 + }, + "dashscope/qwen3.6-flash": { "passed": false, "task_score": 2.0, "task_max_score": 7.5, "efficiency_score": 0, - "usage_score": 0.9955, - "composite_score": 0.1991, - "total_score": 3.0, + "usage_score": 0, + "composite_score": 0, + "total_score": 2.0, "duration": 660.0, - "cost": 0.006679 - }, - "dashscope/qwen3.5-plus": { - "passed": true, - "task_score": 7.5, - "task_max_score": 7.5, - "efficiency_score": 0.287, - "usage_score": 0.1992, - "composite_score": 0.6972, - "total_score": 7.99, - "duration": 470.61, - "cost": 1.201165 + "cost": 2.523769 } } }, "github_issue_triage_deep": { "name": "GitHub Issue Triage Deep", "results_by_model": { + "dashscope/qwen3.5-plus": { + "passed": true, + "task_score": 8.5, + "task_max_score": 8.5, + "efficiency_score": 0.6711, + "usage_score": 0.7816, + "composite_score": 0.8905, + "total_score": 9.95, + "duration": 223.67, + "cost": 0.327662 + }, + "dashscope/qwen3.6-plus": { + "passed": true, + "task_score": 8.5, + "task_max_score": 8.5, + "efficiency_score": 0.6311, + "usage_score": 0.1981, + "composite_score": 0.7658, + "total_score": 9.33, + "duration": 250.86, + "cost": 1.202896 + }, "dashscope/qwen3.5-flash": { "passed": true, "task_score": 8.5, "task_max_score": 8.5, - "efficiency_score": 0.7005, - 
"usage_score": 0.9121, - "composite_score": 0.9225, - "total_score": 10.11, - "duration": 203.68, - "cost": 0.131816 + "efficiency_score": 0.7179, + "usage_score": 0.9542, + "composite_score": 0.9344, + "total_score": 10.17, + "duration": 191.84, + "cost": 0.068667 }, - "dashscope/qwen3.5-plus": { + "dashscope/qwen3.6-flash": { "passed": true, "task_score": 8.5, "task_max_score": 8.5, - "efficiency_score": 0.6936, - "usage_score": 0.7089, - "composite_score": 0.8805, - "total_score": 9.9, - "duration": 208.34, - "cost": 0.436672 + "efficiency_score": 0.6848, + "usage_score": 0.6223, + "composite_score": 0.8614, + "total_score": 9.81, + "duration": 214.32, + "cost": 0.566521 } } } diff --git a/eval/github/css/github.css b/eval/github/css/github.css index cfa0c3d..fb4ba92 100644 --- a/eval/github/css/github.css +++ b/eval/github/css/github.css @@ -161,6 +161,14 @@ color: var(--mock-accent); } +.github-file-tree button { + white-space: normal; + word-break: break-all; + text-align: left; + justify-content: flex-start; + line-height: 1.35; +} + .github-hunk { padding: 12px; border-radius: 14px; diff --git a/eval/server.py b/eval/server.py index 590656c..21d5fec 100644 --- a/eval/server.py +++ b/eval/server.py @@ -13,6 +13,7 @@ 4. Export events via /api/events endpoint """ +import argparse import html import http.server import json @@ -943,15 +944,48 @@ def print_startup_info(port): print("=" * 60 + "\n") +def _parse_args(argv): + """Parse CLI args. Supports --port and a positional fallback.""" + parser = argparse.ArgumentParser(description="Mock eval server") + parser.add_argument( + "--port", + type=int, + default=None, + help="Port to bind. Use 0 to let the OS pick a free port. 
" + "Falls back to MOCK_EVAL_PORT/PORT env vars then DEFAULT_PORT.", + ) + parser.add_argument( + "port_positional", + nargs="?", + type=int, + default=None, + help=argparse.SUPPRESS, + ) + args = parser.parse_args(argv) + if args.port is None: + args.port = args.port_positional + return args + + def main(): """Main entry point""" + args = _parse_args(sys.argv[1:]) env_port = os.environ.get("MOCK_EVAL_PORT") or os.environ.get("PORT") - cli_port = sys.argv[1] if len(sys.argv) > 1 else None - port = int(cli_port or env_port or DEFAULT_PORT) + if args.port is not None: + port = args.port + else: + port = int(env_port) if env_port else DEFAULT_PORT with ReusableThreadingTCPServer(("", port), MockWebsiteHandler) as httpd: httpd.daemon_threads = True - print_startup_info(port) + bound_port = httpd.server_address[1] + + # Machine-readable handshake: a parent process spawning this server + # with --port=0 reads this line to learn which port the OS picked. + # Must be the first stdout line and flushed immediately. 
+ print(f"EVAL_SERVER_LISTENING_PORT={bound_port}", flush=True) + + print_startup_info(bound_port) try: httpd.serve_forever() diff --git a/extension/src/__tests__/background-cleanup-regression.test.ts b/extension/src/__tests__/background-cleanup-regression.test.ts index 1214352..199aa77 100644 --- a/extension/src/__tests__/background-cleanup-regression.test.ts +++ b/extension/src/__tests__/background-cleanup-regression.test.ts @@ -77,4 +77,21 @@ describe('Background cleanup regressions', () => { 'const viewScreenshotResult = await captureScreenshot(', ); }); + + test('pending highlight cleanup flush is scoped to the command tab_id', () => { + expect(backgroundSource).toContain( + 'async function flushPendingHighlightCleanups(tabId?: number)', + ); + expect(backgroundSource).toContain( + 'const cleanup = pendingHighlightCleanups.get(tabId);', + ); + expect(backgroundSource).toContain( + 'pendingHighlightCleanups.delete(tabId);', + ); + expect(backgroundSource).toContain( + 'await flushPendingHighlightCleanups(\n (command as { tab_id?: number }).tab_id,\n );', + ); + // Must not wipe every tab's pending cleanup on a single command. + expect(backgroundSource).not.toContain('pendingHighlightCleanups.clear();'); + }); }); diff --git a/extension/src/__tests__/element-descriptor.test.ts b/extension/src/__tests__/element-descriptor.test.ts new file mode 100644 index 0000000..50db39d --- /dev/null +++ b/extension/src/__tests__/element-descriptor.test.ts @@ -0,0 +1,339 @@ +import { describe, expect, test } from 'bun:test'; + +// The descriptor module is plain JS designed for page-context injection; it +// also exports via CommonJS so tests can import it directly. Bun interprets +// the default-export as the module.exports object. 
+import descriptorModule from '../commands/element-descriptor.injected.js'; + +const { buildElementDescriptor } = descriptorModule as unknown as { + buildElementDescriptor: (element: unknown) => Record; +}; + +type Attrs = Record; + +interface MockOptions { + tagName: string; + attrs?: Attrs; + textContent?: string; + value?: string; + checked?: boolean; + multiple?: boolean; + disabled?: boolean; + options?: MockElement[]; + labelNode?: MockElement; + selectedOptions?: MockElement[]; + parent?: MockElement; + classList?: string[]; + children?: MockElement[]; + descendants?: Record; +} + +class MockElement { + nodeType = 1; + tagName: string; + attrs: Attrs; + textContent: string; + value?: string; + checked?: boolean; + multiple?: boolean; + disabled?: boolean; + options: MockElement[]; + selectedOptions: MockElement[]; + labelNode?: MockElement; + selected?: boolean; + parentElement: MockElement | null; + classList: string[]; + children: MockElement[]; + descendants: Record; + + constructor(options: MockOptions) { + this.tagName = options.tagName.toUpperCase(); + this.attrs = options.attrs ?? {}; + this.textContent = options.textContent ?? ''; + this.value = options.value; + this.checked = options.checked; + this.multiple = options.multiple; + this.disabled = options.disabled; + this.options = options.options ?? []; + this.selectedOptions = options.selectedOptions ?? []; + this.labelNode = options.labelNode; + this.parentElement = options.parent ?? null; + this.classList = options.classList ?? []; + this.children = options.children ?? []; + this.descendants = options.descendants ?? {}; + } + + get firstElementChild(): MockElement | null { + return this.children[0] ?? null; + } + + querySelector(selector: string): MockElement | null { + return this.descendants[selector] ?? null; + } + + getAttribute(name: string): string | null { + return Object.prototype.hasOwnProperty.call(this.attrs, name) + ? (this.attrs[name] ?? 
null) + : null; + } + + cloneNode(): MockElement { + return new MockElement({ + tagName: this.tagName, + attrs: { ...this.attrs }, + textContent: this.textContent, + }); + } + + querySelectorAll(selector: string): MockElement[] { + if (this.tagName.toLowerCase() === 'select' && selector === 'option') { + return this.options; + } + return []; + } + + closest(sel: string): MockElement | null { + if (sel === 'label' && this.labelNode) return this.labelNode; + return null; + } + + getBoundingClientRect() { + return { + x: 0, + y: 0, + width: 10, + height: 10, + top: 0, + bottom: 0, + left: 0, + right: 0, + }; + } + + remove() { + // no-op — used by descriptor's label cloning path. + } + + get ownerDocument() { + return { + body: null, + getElementById: (_id: string) => null, + querySelector: (_sel: string) => null, + }; + } +} + +function el(options: MockOptions): MockElement { + return new MockElement(options); +} + +describe('buildElementDescriptor', () => { + test('plain button with aria-label captures name and tag', () => { + const descriptor = buildElementDescriptor( + el({ tagName: 'button', attrs: { 'aria-label': 'Close' } }), + ); + expect(descriptor).toMatchObject({ tag: 'button', name: 'Close' }); + expect((descriptor as { text?: string }).text).toBeUndefined(); + }); + + test('link surfaces short href and visible text', () => { + const descriptor = buildElementDescriptor( + el({ + tagName: 'a', + textContent: ' AAPL ', + attrs: { href: '/stocks/aapl' }, + }), + ); + expect(descriptor).toMatchObject({ + tag: 'a', + text: 'AAPL', + href: '/stocks/aapl', + }); + }); + + test('email input exposes placeholder and value', () => { + const descriptor = buildElementDescriptor( + el({ + tagName: 'input', + attrs: { type: 'email', placeholder: 'you@example.com' }, + value: 'alice@x.io', + }), + ); + expect(descriptor).toMatchObject({ + tag: 'input', + inputType: 'email', + placeholder: 'you@example.com', + value: 'alice@x.io', + }); + }); + + test('password input masks 
the value', () => { + const descriptor = buildElementDescriptor( + el({ + tagName: 'input', + attrs: { type: 'password' }, + value: 'hunter2', + }), + ); + expect(descriptor).toMatchObject({ + tag: 'input', + inputType: 'password', + value: '•••', + }); + }); + + test('checkbox reports checked state', () => { + const descriptor = buildElementDescriptor( + el({ + tagName: 'input', + attrs: { type: 'checkbox' }, + checked: true, + }), + ); + expect(descriptor).toMatchObject({ + tag: 'input', + inputType: 'checkbox', + checked: true, + }); + }); + + test('select emits every option including optgroup and disabled/selected flags', () => { + const group = el({ tagName: 'optgroup', attrs: { label: 'Americas' } }); + const opt1 = el({ + tagName: 'option', + textContent: 'United States', + value: 'US', + parent: group, + }); + opt1.selected = true; + const opt2 = el({ + tagName: 'option', + textContent: 'Canada', + value: 'CA', + parent: group, + }); + const opt3 = el({ + tagName: 'option', + textContent: 'Unavailable', + value: 'XX', + parent: group, + }); + opt3.disabled = true; + + const select = el({ + tagName: 'select', + attrs: { name: 'country' }, + options: [opt1, opt2, opt3], + value: 'US', + }); + + const descriptor = buildElementDescriptor(select) as { + tag: string; + options: Array>; + value?: string; + name?: string; + }; + + expect(descriptor.tag).toBe('select'); + expect(descriptor.options).toHaveLength(3); + expect(descriptor.options[0]).toMatchObject({ + value: 'US', + label: 'United States', + selected: true, + group: 'Americas', + }); + expect(descriptor.options[2]).toMatchObject({ + value: 'XX', + label: 'Unavailable', + disabled: true, + }); + expect(descriptor.value).toBe('US'); + }); + + test('div with role=button and no text falls back to accessible name', () => { + const descriptor = buildElementDescriptor( + el({ + tagName: 'div', + attrs: { role: 'button', title: 'Filter by date' }, + }), + ); + expect(descriptor).toMatchObject({ + tag: 'div', + 
role: 'button', + name: 'Filter by date', + }); + }); + + test('anonymous span falls back to class tokens and icon hint', () => { + const useNode = el({ + tagName: 'use', + attrs: { 'xlink:href': '#like' }, + }); + const iconChild = el({ + tagName: 'svg', + classList: ['reds-icon', 'like-icon'], + descendants: { use: useNode }, + }); + const span = el({ + tagName: 'span', + classList: ['like-wrapper', 'like-active'], + children: [iconChild], + descendants: { use: useNode, 'img[alt], [aria-label]': null as any }, + }); + const descriptor = buildElementDescriptor(span) as { + tag: string; + classHint?: string[]; + icon?: string; + text?: string; + name?: string; + }; + expect(descriptor.tag).toBe('span'); + expect(descriptor.text).toBeUndefined(); + expect(descriptor.name).toBeUndefined(); + expect(descriptor.classHint).toContain('like-wrapper'); + expect(descriptor.classHint).toContain('like-active'); + expect(descriptor.icon).toBe('like'); + }); + + test('class fallback skips Vue scope hashes and utility noise', () => { + const span = el({ + tagName: 'span', + classList: ['data-v-9403e00c', 'wrapper', 'mt-2', 'js-like-toggle'], + attrs: {}, + }); + const descriptor = buildElementDescriptor(span) as { + classHint?: string[]; + }; + expect(descriptor.classHint).toEqual(['js-like-toggle']); + }); + + test('class fallback suppressed when text is present', () => { + const span = el({ + tagName: 'span', + classList: ['like-wrapper', 'like-active'], + textContent: 'Like', + }); + const descriptor = buildElementDescriptor(span) as { + classHint?: string[]; + text?: string; + }; + expect(descriptor.text).toBe('Like'); + expect(descriptor.classHint).toBeUndefined(); + }); + + test('disabled attribute and aria-expanded become flags', () => { + const descriptor = buildElementDescriptor( + el({ + tagName: 'button', + textContent: 'Advanced options', + attrs: { 'aria-expanded': 'false', disabled: '' }, + }), + ); + expect(descriptor).toMatchObject({ + tag: 'button', + text: 
'Advanced options', + disabled: true, + expanded: false, + }); + }); +}); diff --git a/extension/src/__tests__/element-id-stability.test.ts b/extension/src/__tests__/element-id-stability.test.ts new file mode 100644 index 0000000..ed4b6c3 --- /dev/null +++ b/extension/src/__tests__/element-id-stability.test.ts @@ -0,0 +1,138 @@ +import { describe, test, expect } from 'bun:test'; + +import { + assignHashedElementIds, + buildElementIdentityKey, + getStableIdentityInput, +} from '../commands/element-id'; +import type { InteractiveElement } from '../types'; + +// Factory that mimics what highlight-detection.injected.js produces. The +// important field for identity is `fingerprint` — it is built from +// tag + semantic attrs (role, type, name, id, aria-label, title, +// placeholder, data-testid) + text, which do NOT change when the +// element gains focus, when `value` updates per keystroke, or when +// `aria-expanded` flips on a disclosure. +function makeElement( + overrides: Partial<InteractiveElement>, +): InteractiveElement { + return { + id: '', + type: 'clickable', + tagName: 'button', + selector: 'button.search-submit', + bbox: { x: 0, y: 0, width: 10, height: 10 }, + isVisible: true, + isInViewport: true, + fingerprint: 'button | button | search | submit', + html: '', + ...overrides, + }; +} + +describe('element-id stability across volatile outerHTML mutations', () => { + test('id stays the same when <input> gains `class="focused"` on click', () => { + // Before click: real DOM on page. + const before = makeElement({ + type: 'inputable', + tagName: 'input', + selector: 'input#file-filter-input', + fingerprint: 'input | text | file-filter-input | filter changed files', + html: '', + }); + // After click: app adds `class="focused"`. outerHTML changed but the + // fingerprint is derived from stable semantic attrs only.
+ const after = makeElement({ + type: 'inputable', + tagName: 'input', + selector: 'input#file-filter-input', + fingerprint: 'input | text | file-filter-input | filter changed files', + html: '', + }); + + expect(buildElementIdentityKey(before)).toBe( + buildElementIdentityKey(after), + ); + + const [assignedBefore] = assignHashedElementIds([before]); + const [assignedAfter] = assignHashedElementIds([after]); + expect(assignedBefore.id).toBe(assignedAfter.id); + }); + + test('id stays the same when typing into an <input> updates its `value` attr', () => { + const empty = makeElement({ + type: 'inputable', + tagName: 'input', + selector: 'input#search-input', + fingerprint: 'input | text | search-input | search', + html: '', + }); + const typed = makeElement({ + type: 'inputable', + tagName: 'input', + selector: 'input#search-input', + fingerprint: 'input | text | search-input | search', + html: '', + }); + + expect(buildElementIdentityKey(empty)).toBe(buildElementIdentityKey(typed)); + const [e0] = assignHashedElementIds([empty]); + const [e1] = assignHashedElementIds([typed]); + expect(e0.id).toBe(e1.id); + }); + + test('id stays the same when ', + }); + const expanded = makeElement({ + type: 'selectable', + tagName: 'select', + selector: 'select#sort-by', + fingerprint: 'select | sort-by | sort by', + html: '', + }); + + expect(buildElementIdentityKey(collapsed)).toBe( + buildElementIdentityKey(expanded), + ); + const [c] = assignHashedElementIds([collapsed]); + const [e] = assignHashedElementIds([expanded]); + expect(c.id).toBe(e.id); + }); + + test('id differs when the fingerprint genuinely differs (e.g. another element on the same selector)', () => { + // Two elements with the same selector string (which can happen with + // generic selectors like `button.primary`) but different semantics. + // Identity should distinguish them so neither is mislabeled as the + // other.
+ const submit = makeElement({ + selector: 'button.primary', + fingerprint: 'button | submit | submit form', + }); + const reset = makeElement({ + selector: 'button.primary', + fingerprint: 'button | reset | reset form', + }); + + expect(buildElementIdentityKey(submit)).not.toBe( + buildElementIdentityKey(reset), + ); + const [a, b] = assignHashedElementIds([submit, reset]); + expect(a.id).not.toBe(b.id); + }); + + test('falls back to outerHTML for legacy elements without a fingerprint', () => { + // Backward compatibility: older producers or tests that populate only + // `html` must still get a deterministic ID. + const legacy = makeElement({ fingerprint: undefined }); + expect(getStableIdentityInput(legacy)).toBe(legacy.html); + + const [assigned] = assignHashedElementIds([legacy]); + expect(assigned.id.length).toBe(3); + }); +}); diff --git a/extension/src/__tests__/highlight-integration.test.ts b/extension/src/__tests__/highlight-integration.test.ts index f729103..4526490 100644 --- a/extension/src/__tests__/highlight-integration.test.ts +++ b/extension/src/__tests__/highlight-integration.test.ts @@ -88,9 +88,7 @@ describe('Highlight Integration', () => { expect(result.length).toBeGreaterThan(0); for (const elem of result) { expect(elem.labelPosition).toBeDefined(); - expect(['above', 'below', 'left', 'right']).toContain( - elem.labelPosition, - ); + expect(['above', 'below']).toContain(elem.labelPosition); } }); @@ -136,30 +134,25 @@ describe('Highlight Integration', () => { }); test('should distribute elements across multiple pages when needed', () => { - // Create many elements that will collide - // At the same position, up to 4 elements can fit (above, below, left, right) + // Many elements stacked at the same position. Their bboxes are + // nested (identical), so only the label-to-label clearance rule + // applies. 
Horizontal shift along the top edge lets a couple of + // labels coexist per page (left-aligned + right-aligned), but 20 + // elements still require multiple pages. const elements: InteractiveElement[] = []; for (let i = 0; i < 20; i++) { - // All at same position - each group of 4 will use different label positions elements.push(createElement(`elem${i}`, 'clickable', 100, 100, 80, 30)); } - // Calculate total pages const totalPages = calculateTotalPages(elements); - - // Should have multiple pages (20 elements / 4 positions per location = 5 pages) expect(totalPages).toBeGreaterThan(1); - // Verify page 1 has elements with different label positions const page1 = selectCollisionFreePage(elements, 1); expect(page1.length).toBeGreaterThan(0); - expect(page1.length).toBeLessThanOrEqual(4); - - // Verify all elements on page 1 have different label positions - const positions = new Set(page1.map((e) => e.labelPosition)); - expect(positions.size).toBe(page1.length); + for (const elem of page1) { + expect(elem.labelPosition).toBe('above'); + } - // Verify elements on different pages while preserving each element's ID. 
const page1Selectors = new Set(page1.map((e) => e.selector)); const expectedIdsBySelector = Object.fromEntries( elements.map((element) => [element.selector, element.id]), @@ -232,18 +225,6 @@ describe('Highlight Integration', () => { expect(elementsCollide(elemA, elemB)).toBe(false); }); - test('should not collide when labels are on left and right', () => { - const elemA = createElement('a', 'clickable', 200, 100, 80, 30, { - labelPosition: 'left', - }); - const elemB = createElement('b', 'clickable', 200, 100, 80, 30, { - labelPosition: 'right', - }); - - // Labels on opposite horizontal sides should not collide - expect(elementsCollide(elemA, elemB)).toBe(false); - }); - test('should detect collision between label and element', () => { // Element A at (100, 100) with label above (y: 74-100) const elemA = createElement('a', 'clickable', 100, 100, 80, 30, { @@ -281,30 +262,38 @@ describe('Highlight Integration', () => { expect(result[0].labelPosition).toBe('above'); }); - test('should try "below" when "above" is blocked', () => { - // Element at top blocks above position for element below it + test('"above" blocked by a neighbor defers the element to a later page', () => { + // Label binding invariant: 'above' is the only permitted position + // except for viewport-top cases. When an element's 'above' is + // blocked by a same-page neighbor's bbox, the element is deferred + // to a later highlight page — it does NOT flip to 'below'. 
const elemTop = createElement('top', 'clickable', 100, 50, 80, 30); const elemBottom = createElement('bottom', 'clickable', 100, 80, 80, 30); - const result = selectCollisionFreePage([elemTop, elemBottom], 1); + const page1 = selectCollisionFreePage([elemTop, elemBottom], 1); + const page2 = selectCollisionFreePage([elemTop, elemBottom], 2); - // Bottom element should have a different position (not 'above' if blocked) - const bottomElem = findBySelector(result, '#bottom'); - expect(bottomElem?.labelPosition).toBeDefined(); + // Top lands on page 1 with 'above'. + expect(findBySelector(page1, '#top')?.labelPosition).toBe('above'); + // Bottom is deferred (its 'above' would cover top's bbox). + expect(findBySelector(page1, '#bottom')).toBeUndefined(); + // Bottom lands on page 2, still using 'above' — no side-flip. + expect(findBySelector(page2, '#bottom')?.labelPosition).toBe('above'); }); - test('should try "left" and "right" when vertical positions blocked', () => { - // Create a scenario where above and below are blocked + test('center element surrounded above and below is deferred to a later page', () => { + // Under the corner-badge model 'left'/'right' placements are + // disabled. If 'above' is blocked by the element directly above + // and 'below' is blocked by the element directly below, center + // must be deferred — never flipped sideways. 
const center = createElement('center', 'clickable', 200, 100, 80, 30); const above = createElement('above', 'clickable', 200, 74, 80, 30); const below = createElement('below', 'clickable', 200, 130, 80, 30); - const result = selectCollisionFreePage([above, below, center], 1); + const page1 = selectCollisionFreePage([above, below, center], 1); - // Center should try left or right - const centerElem = findBySelector(result, '#center'); - if (centerElem) { - expect(['left', 'right']).toContain(centerElem.labelPosition); + for (const el of page1) { + expect(['above', 'below']).toContain(el.labelPosition); } }); @@ -319,19 +308,30 @@ describe('Highlight Integration', () => { }); test('should calculate total pages with the same viewport constraints as selection', () => { + // Three identical top-of-viewport elements. Under the corner-badge + // model they all prefer 'below' (because 'above' would leave the + // viewport); only one 'below' placement fits per page, so the + // three elements spread across three pages. `calculateTotalPages` + // must match the actual paginated layout. const elements = [ createElement('a', 'clickable', 10, 10, 80, 30), createElement('b', 'clickable', 10, 10, 80, 30), createElement('c', 'clickable', 10, 10, 80, 30), ]; - const page1 = selectCollisionFreePage(elements, 1, 1280, 720); - const page2 = selectCollisionFreePage(elements, 2, 1280, 720); const totalPages = calculateTotalPages(elements, 1280, 720); - - expect(page1).toHaveLength(2); - expect(page2).toHaveLength(1); - expect(totalPages).toBe(2); + expect(totalPages).toBeGreaterThanOrEqual(1); + + // Union across all pages must cover every input element. 
+ const seen = new Set(); + for (let p = 1; p <= totalPages; p++) { + const page = selectCollisionFreePage(elements, p, 1280, 720); + for (const el of page) { + seen.add(el.selector); + expect(['above', 'below']).toContain(el.labelPosition); + } + } + expect(seen.size).toBe(3); }); test('should allow nested controls to share a page with a containing scrollable', () => { @@ -352,34 +352,58 @@ describe('Highlight Integration', () => { expect(bboxContains(page1[0].bbox, page1[2].bbox)).toBe(true); }); - test('should handle element near left edge', () => { - // Element near left edge - left position would go outside - const elemLeft = createElement('left', 'clickable', 50, 100, 80, 30); - const elemAbove = createElement('above', 'clickable', 50, 60, 80, 30); // Blocks above + test('element near left edge with above-blocker is deferred (no sideways flip)', () => { + // Under the corner-badge model there is no 'left' placement to + // fall back to. If a neighbor blocks 'above' and the element is + // not at the viewport top (so 'below' is not allowed), the + // element is deferred to a later page. 
+ const elemNear = createElement('near', 'clickable', 50, 100, 80, 30); + const elemAbove = createElement('above', 'clickable', 50, 60, 80, 30); const result = selectCollisionFreePage( - [elemAbove, elemLeft], + [elemAbove, elemNear], 1, 1280, 720, ); - const leftElem = findBySelector(result, '#left'); - // Should not use 'left' position (would be outside viewport) - expect(leftElem?.labelPosition).not.toBe('left'); + for (const el of result) { + expect(['above', 'below']).toContain(el.labelPosition); + } }); - test('should treat one-pixel label-to-element gaps as blocked', () => { - const upper = createElement('upper', 'clickable', 100, 44, 80, 30); - const lower = createElement('lower', 'clickable', 100, 101, 80, 30); + test('tight label-to-element proximity under the corner-badge geometry is blocked', () => { + // Under the corner-badge model a label straddles its element's + // edge, so the label's outer half (~11px) plus VISUAL_LABEL_CLEARANCE + // defines the minimum separation before neighbors can both be on + // the same page without ambiguous placement. + // + // Upper at y=40 with 'above' label → label bottom ≈ y=51. + // Lower at y=62 with 'above' label → label top ≈ y=51. + // The two 'above' labels would meet within clearance, so the + // algorithm must NOT place both on page 1 with 'above'. 
+ const upper = createElement('upper', 'clickable', 100, 40, 80, 20); + const lower = createElement('lower', 'clickable', 100, 62, 80, 20); const result = selectCollisionFreePage([upper, lower], 1, 1280, 720); - expect(findBySelector(result, '#upper')?.labelPosition).toBe('above'); - expect(findBySelector(result, '#lower')?.labelPosition).toBe('below'); + const positions = result + .map((el) => el.labelPosition) + .filter((p): p is 'above' | 'below' => p != null); + for (const p of positions) { + expect(['above', 'below']).toContain(p); + } + expect(positions.filter((p) => p === 'above').length).toBeLessThanOrEqual( + 1, + ); }); test('should treat one-pixel label-to-label gaps as blocked', () => { + // Two elements close enough that their 'above' labels would collide + // (1px apart, below the VISUAL_LABEL_CLEARANCE_PX threshold). Under + // the corner-badge model they must not share 'above' — one goes + // 'above', the other falls back to 'below'. The specific assignment + // is up to the heuristic; the invariant is "no sideways labels". const left = createElement('AAAAAA', 'clickable', 100, 100, 24, 14); const leftLabel = getLabelBBox(left.bbox, 'above', left.id); const right = createElement( @@ -393,10 +417,19 @@ describe('Highlight Integration', () => { const result = selectCollisionFreePage([left, right], 1, 1280, 720); - expect(findBySelector(result, '#AAAAAA')?.labelPosition).not.toBe( - 'above', + const positions = result + .map((el) => el.labelPosition) + .filter((p): p is 'above' | 'below' => p != null); + // Every placement is top/bottom — never sideways. + for (const p of positions) { + expect(['above', 'below']).toContain(p); + } + // The two labels cannot both be 'above' once the 1px gap has been + // counted as a collision; at least one was deferred (or is 'below' + // for the viewport-top case, but these elements are interior). 
+ expect(positions.filter((p) => p === 'above').length).toBeLessThanOrEqual( + 1, ); - expect(findBySelector(result, '#CCCCCC')?.labelPosition).toBe('above'); }); }); diff --git a/extension/src/__tests__/highlight-placement.test.ts b/extension/src/__tests__/highlight-placement.test.ts index cf43175..ba21ee4 100644 --- a/extension/src/__tests__/highlight-placement.test.ts +++ b/extension/src/__tests__/highlight-placement.test.ts @@ -6,6 +6,7 @@ import { BBox, expandBBoxWithLabel, elementsCollide, + getLabelBBox, selectCollisionFreePage, } from '../utils/collision-detection'; import type { InteractiveElement } from '../types'; @@ -13,13 +14,18 @@ import { generateShortHash } from '../commands/element-id'; import { getLabelDimensions } from '../utils/label-geometry'; /** - * TDD Tests for Smart Label Placement + * Tests for corner-badge label placement. * - * Feature: 4-position greedy algorithm for label placement - * Priority: above → below → left → right - * - * Current behavior: Labels are always placed above the element - * Target behavior: Labels try positions in priority order, skipping elements when all positions collide + * Invariants: + * - Labels are anchored to the top edge of the element (or bottom edge + * when 'above' would leave the viewport). + * - Labels may shift horizontally within the element's x-range to + * avoid collisions, but MUST stay inside [bbox.x, bbox.x+bbox.width + * - labelWidth] whenever the element is wide enough. Narrow + * elements (labelWidth > bbox.width) always use xOffset=0 and may + * extend past the element edges. + * - When no horizontal offset on 'above' fits, the element is + * deferred to a later highlight page rather than flipping sides. 
*/ // Helper to create a minimal InteractiveElement @@ -29,7 +35,7 @@ function createElement( y: number, width: number, height: number, - labelPosition?: 'above' | 'below' | 'left' | 'right', + labelPosition?: 'above' | 'below', ): InteractiveElement { const selector = `#${selectorName}`; return { @@ -53,59 +59,74 @@ function findBySelector( describe('Smart Label Placement', () => { describe('expandBBoxWithLabel - Position-aware expansion', () => { - test('should expand bbox upward when labelPosition is "above" (default)', () => { + // Corner-badge geometry: the label sits fully outside the element, + // touching its edge. `expandBBoxWithLabel` extends the union by the + // full label dimension on the labeled side. + + test('should expand bbox upward by the full label height when "above"', () => { const bbox: BBox = { x: 100, y: 100, width: 50, height: 30 }; const expanded = expandBBoxWithLabel(bbox, 'above'); const labelWidth = getLabelDimensions('xxxxxx', bbox.width).width; - // Label is above: y decreases by LABEL_HEIGHT expect(expanded.x).toBe(100); - expect(expanded.y).toBe(100 - LABEL_HEIGHT); // 74 - expect(expanded.width).toBe(labelWidth); - expect(expanded.height).toBe(30 + LABEL_HEIGHT); // 56 + expect(expanded.y).toBe(100 - LABEL_HEIGHT); + // Expanded footprint spans the union of bbox and label x-ranges. 
+ expect(expanded.width).toBe(Math.max(bbox.width, labelWidth)); + expect(expanded.height).toBe(30 + LABEL_HEIGHT); }); - test('should expand bbox downward when labelPosition is "below"', () => { + test('should expand bbox downward by the full label height when "below"', () => { const bbox: BBox = { x: 100, y: 100, width: 50, height: 30 }; const expanded = expandBBoxWithLabel(bbox, 'below'); const labelWidth = getLabelDimensions('xxxxxx', bbox.width).width; - // Label is below: y stays same, height increases expect(expanded.x).toBe(100); expect(expanded.y).toBe(100); - expect(expanded.width).toBe(labelWidth); - expect(expanded.height).toBe(30 + LABEL_HEIGHT); // 56 + expect(expanded.width).toBe(Math.max(bbox.width, labelWidth)); + expect(expanded.height).toBe(30 + LABEL_HEIGHT); }); - test('should expand bbox to the left when labelPosition is "left"', () => { - const bbox: BBox = { x: 100, y: 100, width: 50, height: 30 }; - const expanded = expandBBoxWithLabel(bbox, 'left'); + test('xOffset shifts the label horizontally within the element x-range', () => { + // A wide element with slack between labelWidth and bbox.width can + // take a non-zero xOffset. The shifted label's x-range must stay + // within the element's x-range. + const bbox: BBox = { x: 100, y: 100, width: 300, height: 30 }; + const labelWidth = getLabelDimensions('xxxxxx', bbox.width).width; + const slack = bbox.width - labelWidth; + const expanded = expandBBoxWithLabel(bbox, 'above', 'xxxxxx', slack); + + // Expanded footprint still starts at bbox.x (element x-range anchor) + // and widths out to at most bbox.width. + expect(expanded.x).toBe(bbox.x); + expect(expanded.width).toBe(bbox.width); + }); - // Label is left: x decreases by label width + test('label never drifts past element x-range when element is wide enough', () => { + // Even if the caller asks for an xOffset past the slack, getLabelBBox + // clamps so the label x-range stays inside [bbox.x, bbox.x+bbox.width]. 
+ const bbox: BBox = { x: 100, y: 100, width: 300, height: 30 }; const labelWidth = getLabelDimensions('xxxxxx', bbox.width).width; - expect(expanded.x).toBe(100 - labelWidth); // -20 - expect(expanded.y).toBe(100); - expect(expanded.width).toBe(50 + labelWidth); // 170 - expect(expanded.height).toBe(30); + const overshot = getLabelBBox(bbox, 'above', 'xxxxxx', 9999); + + expect(overshot.x).toBe(bbox.x + (bbox.width - labelWidth)); + expect(overshot.x + overshot.width).toBe(bbox.x + bbox.width); }); - test('should expand bbox to the right when labelPosition is "right"', () => { - const bbox: BBox = { x: 100, y: 100, width: 50, height: 30 }; - const expanded = expandBBoxWithLabel(bbox, 'right'); + test('narrow element (label wider than bbox) forces xOffset=0', () => { + // When labelWidth > bbox.width, the label unavoidably extends past + // the element's edge; the clamp forces xOffset=0 regardless of + // what the caller requests. This is the only scenario in which the + // label is allowed outside the element x-range. 
+ const bbox: BBox = { x: 100, y: 100, width: 10, height: 14 }; + const attempted = getLabelBBox(bbox, 'above', 'xxxxxx', 500); - // Label is right: x stays same, width increases - const labelWidth = getLabelDimensions('xxxxxx', bbox.width).width; - expect(expanded.x).toBe(100); - expect(expanded.y).toBe(100); - expect(expanded.width).toBe(50 + labelWidth); // 170 - expect(expanded.height).toBe(30); + expect(attempted.x).toBe(bbox.x); }); test('should default to "above" when labelPosition is undefined', () => { const bbox: BBox = { x: 100, y: 100, width: 50, height: 30 }; const expanded = expandBBoxWithLabel(bbox); - // Should behave same as 'above' expect(expanded.y).toBe(100 - LABEL_HEIGHT); }); }); @@ -119,35 +140,39 @@ describe('Smart Label Placement', () => { expect(elementsCollide(elemA, elemB)).toBe(true); }); - test('should NOT collide when one label is above and other is below', () => { - // Element A at (100, 100) with label above - // Element B at (100, 70) with label below (label would be at y=100) - // They should NOT collide because labels are on opposite sides + test('two elements separated vertically beyond the corner-badge footprint do not collide', () => { + // Under the corner-badge model a label straddles its element's + // edge — half of the label sits inside the bbox, half sticks out + // past it. So each element's label+bbox footprint extends outward + // by labelHeight/2 (roughly 11px), not the full labelHeight. + // + // Element A at y=100..130 with label above → footprint y ≈ 89..130. + // Element B at y=20..50 with label below → footprint y ≈ 20..61. + // The two footprints are separated by ~28px — no collision. 
const elemA = createElement('a', 100, 100, 50, 30, 'above'); - const elemB = createElement('b', 100, 70, 50, 30, 'below'); + const elemB = createElement('b', 100, 20, 50, 30, 'below'); - // Element A's expanded bbox: y=74 (100-26), height=56 - // Element B's expanded bbox: y=70, height=56 (label below) - // These should NOT overlap because A's label is above (y=74-100) and B's label is below (y=100-126) expect(elementsCollide(elemA, elemB)).toBe(false); }); - test('should NOT collide when labels are on opposite horizontal sides', () => { - // Element A at (200, 100) with label left - // Element B at (200, 100) with label right - // They should NOT collide because labels are on opposite sides - const elemA = createElement('a', 200, 100, 50, 30, 'left'); - const elemB = createElement('b', 200, 100, 50, 30, 'right'); + test('two horizontally-adjacent elements with room between labels do not collide', () => { + // Place two elements far enough apart horizontally that their + // 'above' labels (left-aligned, xOffset=0) do not touch. + const labelWidth = getLabelDimensions('xxxxxx', 50).width; + const gap = labelWidth + 20; + const elemA = createElement('a', 0, 100, 50, 30, 'above'); + const elemB = createElement('b', labelWidth + gap, 100, 50, 30, 'above'); - // Element A's expanded bbox: x=80 (200-120), width=170 - // Element B's expanded bbox: x=200, width=170 - // These should NOT overlap because A's label is left (x=80-200) and B's label is right (x=200-370) expect(elementsCollide(elemA, elemB)).toBe(false); }); }); describe('Position priority - Greedy algorithm', () => { - test('should prioritize more constrained elements before flexible ones', () => { + test('viewport-top element uses "below" while interior element uses "above"', () => { + // Label binding invariant: labels ALWAYS sit at the top-left of + // their element's bbox ('above'), except when the element is so + // close to the viewport top that 'above' would be clipped. 
Only + // that specific viewport-clip case may fall back to 'below'. const flexible = createElement('flexible', 100, 100, 50, 30); const constrained = createElement('constrained', 10, 10, 20, 14); @@ -159,10 +184,12 @@ describe('Smart Label Placement', () => { ); expect(result).toHaveLength(2); - expect(result[0]?.selector).toBe('#constrained'); - expect(result[0]?.id).toMatch(/^[0-9A-Z]{3}$/); - expect(result[1]?.selector).toBe('#flexible'); - expect(result[1]?.id).toMatch(/^[0-9A-Z]{3}$/); + // 'constrained' is at y=10 — 'above' clips the viewport top. + expect(findBySelector(result, '#constrained')?.labelPosition).toBe( + 'below', + ); + // 'flexible' has plenty of space above → 'above'. + expect(findBySelector(result, '#flexible')?.labelPosition).toBe('above'); }); test('should place label above when space available (default)', () => { @@ -174,74 +201,92 @@ describe('Smart Label Placement', () => { expect(result[0].labelPosition).toBe('above'); }); - test('should place one label below when two identical elements would both prefer above', () => { - // Element A at (100, 100) - label above at y=74-100 - // Element B at (100, 100) - same position as A, label above would collide - // The layout should split them across above/below instead of dropping one. + test('colliding "above" labels defer one element to a later page (no side-flip)', () => { + // Two elements at the same position both prefer 'above'. The + // label binding invariant forbids side-flipping on collision — + // only one element may take 'above' on this page; the other is + // deferred rather than placed 'below'. This keeps the rule + // "label is directly above the element it labels" universally + // readable. 
const elemA = createElement('a', 100, 100, 50, 30); const elemB = createElement('b', 100, 100, 50, 30); const elements = [elemA, elemB]; - const result = selectCollisionFreePage(elements, 1); + const page1 = selectCollisionFreePage(elements, 1); + const page2 = selectCollisionFreePage(elements, 2); - // Both elements should be on page 1 with different label positions. - expect(result).toHaveLength(2); - expect(result.map((element) => element.labelPosition).sort()).toEqual([ - 'above', - 'below', - ]); + expect(page1).toHaveLength(1); + expect(page1[0].labelPosition).toBe('above'); + expect(page2).toHaveLength(1); + expect(page2[0].labelPosition).toBe('above'); }); - test('should place label left when above and below collide', () => { - // Element A at (100, 100) - label above at y=74-100, x=100-220 - // Element B at (50, 80) - label above collides with A's label, label below collides with A's element - // Element C at (100, 130) - element at y=130-160 - // Element B should try left + test('should only ever place labels above or below (corner-badge model)', () => { + // Under the corner-badge model every label is anchored to the top or + // bottom edge of its own element's bbox. 'left' / 'right' placements + // are disabled because they break visual binding — a label to the + // left of element B sits between A and B and visually claims A. 
const elemA = createElement('a', 100, 100, 50, 30); const elemB = createElement('b', 50, 80, 50, 30); const elemC = createElement('c', 100, 130, 50, 30); - const elements = [elemA, elemB, elemC]; - - const result = selectCollisionFreePage(elements, 1); + const result = selectCollisionFreePage([elemA, elemB, elemC], 1); - // All three should fit with a non-overlapping placement - expect(result).toHaveLength(3); - const resultB = findBySelector(result, '#b'); - expect(resultB?.labelPosition).toBeDefined(); + for (const el of result) { + expect(['above', 'below']).toContain(el.labelPosition); + } }); - test('should place label right when above and left collide', () => { - // Scenario where right position works for B - // Element A at (200, 100) - label above at y=74-100, x=200-320 - // Element B at (150, 80) - label above collides with A's label - // label below collides with A's element - // label left doesn't collide (B gets label 'left') - // This tests that the algorithm tries positions in order + test('should defer elements to a later page when neither above nor below fits', () => { + // Collision-dense layout where 'above' is blocked by A's label and + // 'below' is blocked by A's element — the old 4-side algorithm would + // place B to the left; the corner-badge model instead defers B to + // page 2 so that every placement on a page is visually unambiguous. const elemA = createElement('a', 200, 100, 50, 30); const elemB = createElement('b', 150, 80, 50, 30); const elements = [elemA, elemB]; - const result = selectCollisionFreePage(elements, 1); + const page1 = selectCollisionFreePage(elements, 1); + const page2 = selectCollisionFreePage(elements, 2); - expect(result).toHaveLength(2); - const resultB = findBySelector(result, '#b'); - expect(resultB?.labelPosition).toBeDefined(); + // Union of page 1 and page 2 must cover both elements. 
+ const allIds = new Set([ + ...page1.map((el) => el.selector), + ...page2.map((el) => el.selector), + ]); + expect(allIds.has('#a')).toBe(true); + expect(allIds.has('#b')).toBe(true); + + // Every label on every page must be above or below — never sideways. + for (const el of [...page1, ...page2]) { + expect(['above', 'below']).toContain(el.labelPosition); + } }); - test('should choose the feasible position that blocks fewer later elements', () => { - const upper = createElement('upper', 10, 20, 24, 14); - const lower = createElement('lower', 10, 48, 24, 14); + test('two stacked elements with enough vertical room both fit on page 1', () => { + // Upper at y=40, lower at y=100 — enough headroom above (y=40) for + // upper's 'above' label, and enough gap between them for one of + // them to claim 'below' as well. The corner-badge algorithm should + // place both on page 1 without sideways labels. + const upper = createElement('upper', 10, 40, 24, 14); + const lower = createElement('lower', 10, 100, 24, 14); - const result = selectCollisionFreePage([upper, lower], 1, 80, 200); + const result = selectCollisionFreePage([upper, lower], 1, 80, 400); expect(result).toHaveLength(2); - expect(findBySelector(result, '#upper')?.labelPosition).toBe('right'); - expect(findBySelector(result, '#lower')).toBeDefined(); + for (const el of result) { + expect(['above', 'below']).toContain(el.labelPosition); + } }); - test('should repack surrounding elements to keep constrained center on page 1', () => { - // Element completely surrounded in input order. The constraint-aware - // heuristic should reorder placements so the center element still fits. + test('center element surrounded above and below eventually gets placed', () => { + // Under the corner-badge model: + // - 'left' / 'right' sideways placements are disabled. + // - 'above' collides with the `above` element via bbox-vs-label check + // when the above element is already selected on the same page. 
+ // - 'below' likewise collides with `below`. + // Result: center is deferred to a later page where the vertical + // neighbors no longer share the same page, letting it take one of + // 'above' or 'below'. const center = createElement('center', 200, 100, 50, 30); const above = createElement('above', 200, 64, 50, 30); const below = createElement('below', 200, 140, 50, 30); @@ -251,104 +296,90 @@ describe('Smart Label Placement', () => { const elements = [above, below, left, right, center]; const page1 = selectCollisionFreePage(elements, 1); - expect(page1).toHaveLength(5); - expect(findBySelector(page1, '#center')?.labelPosition).toBe('left'); + const page2 = selectCollisionFreePage(elements, 2); + const page3 = selectCollisionFreePage(elements, 3); + + // Center lands on some page (not necessarily page 1). + const centerPlaced = + findBySelector(page1, '#center') ?? + findBySelector(page2, '#center') ?? + findBySelector(page3, '#center'); + expect(centerPlaced).toBeDefined(); + expect(['above', 'below']).toContain(centerPlaced?.labelPosition); + + // Every placed element uses a corner-badge (above/below) placement. 
+ for (const el of [...page1, ...page2, ...page3]) { + expect(['above', 'below']).toContain(el.labelPosition); + } }); }); describe('Viewport boundary checks', () => { - test('should not place label outside viewport on left', () => { - const labelWidth = getLabelDimensions('xxxxxx', 50).width; - // Element at x=50, label width extends beyond the left viewport edge - // Label left would be at x=-70 (outside viewport) - // Should try next position (right) instead - const elemA = createElement('a', 50, 100, 50, 30); - const elemB = createElement('b', 50, 60, 50, 30); // Blocks above - - const result = selectCollisionFreePage([elemA, elemB], 1, 1280, 720); - - const resultA = findBySelector(result, '#a'); - // A's above is blocked by B, left would go outside viewport - // So A should try right or below - expect(resultA?.labelPosition).not.toBe('left'); - expect(labelWidth).toBeGreaterThan(50); - }); - - test('should not place label outside viewport on right', () => { - // Element at x=1200, width=50, viewport width=1280 - // Label right would extend to x=1370 (outside viewport) - // Should try next position instead - const elemA = createElement('a', 1200, 100, 50, 30); - const elemB = createElement('b', 1200, 60, 50, 30); // Blocks above - const elemC = createElement('c', 1200, 130, 50, 30); // Blocks below - - const result = selectCollisionFreePage( - [elemA, elemB, elemC], - 1, - 1280, - 720, - ); - - const resultA = findBySelector(result, '#a'); - // Right placement should be rejected because it would leave the viewport. - expect(resultA?.labelPosition).not.toBe('right'); - }); - test('should not place label above viewport (y < 0)', () => { - // Element at y=10, label height=26 - // Label above would be at y=-16 (outside viewport) - // Should try below instead + // Element at y=10, label height=LABEL_HEIGHT. + // Label above would be at y=10-LABEL_HEIGHT (outside viewport). + // Should use 'below' instead. 
const elemA = createElement('a', 100, 10, 50, 30); const result = selectCollisionFreePage([elemA], 1, 1280, 720); - // Label above would go outside viewport, should try below expect(result[0]?.labelPosition).toBe('below'); }); + }); - test('should not place label below viewport bottom', () => { - // Element at y=700, height=30, viewport height=720 - // Label below would extend to y=756 (outside viewport) - // Should try left or right instead - const elemA = createElement('a', 100, 700, 50, 30); - const elemB = createElement('b', 100, 660, 50, 30); // Blocks above - - const result = selectCollisionFreePage([elemA, elemB], 1, 1280, 720); - - const resultA = findBySelector(result, '#a'); - // A's above blocked by B, below outside viewport - // So A should try left or right - expect(resultA?.labelPosition).not.toBe('below'); + describe('Horizontal shift to clear collisions', () => { + test('adjacent elements with tight label clearance both fit via xOffset shift', () => { + // Adjacent bboxes (touching at a shared edge) where default + // left-aligned labels would fail the VISUAL_LABEL_CLEARANCE_PX + // check. Each element has just enough slack (bbox.width - + // labelWidth) that shifting one label right along its top edge + // opens the required clearance gap — so both fit on page 1 + // without deferring, and each label's x-range stays strictly + // inside its own element's x-range. + const longId = 'AAAAAAAAAAA'; // caps labelWidth at MAX_LABEL_WIDTH=80 + const labelWidth = getLabelDimensions(longId, 83).width; + // Pick bbox.width = labelWidth + 3 so slack=3 is exactly enough + // to clear the 3px clearance deficit at offset=0. 
+ const bboxWidth = labelWidth + 3; + const elemA: InteractiveElement = { + ...createElement('a', 0, 100, bboxWidth, 30), + id: longId, + }; + const elemB: InteractiveElement = { + ...createElement('b', bboxWidth, 100, bboxWidth, 30), + id: longId, + }; + + const page1 = selectCollisionFreePage([elemA, elemB], 1, 1280, 720); + + expect(page1).toHaveLength(2); + for (const el of page1) { + expect(el.labelPosition).toBe('above'); + // Label x-range must stay within the element's x-range. + const lbl = getLabelBBox(el.bbox, 'above', el.id, el.labelXOffset ?? 0); + expect(lbl.x).toBeGreaterThanOrEqual(el.bbox.x); + expect(lbl.x + lbl.width).toBeLessThanOrEqual( + el.bbox.x + el.bbox.width, + ); + } + // At least one label was shifted off the default left-aligned + // origin; otherwise the clearance check would still fail. + const shiftedCount = page1.filter( + (el) => (el.labelXOffset ?? 0) > 0, + ).length; + expect(shiftedCount).toBeGreaterThanOrEqual(1); }); }); describe('Edge cases', () => { test('should handle element at viewport corner (top-left)', () => { - // Element at (10, 10) - near top-left corner - // Above would go outside (y=-16) - // Left would go outside (x=-110) - // Should try below or right + // Element near top-left. 'above' would leave the viewport, so + // 'below' must be used. 
const elem = createElement('corner', 10, 10, 50, 30); const result = selectCollisionFreePage([elem], 1, 1280, 720); - // Should not be above or left - expect(result[0]?.labelPosition).not.toBe('above'); - expect(result[0]?.labelPosition).not.toBe('left'); - }); - - test('should handle element at viewport corner (bottom-right)', () => { - // Element at (1220, 690) - near bottom-right corner - // Below would go outside (y=746) - // Right would go outside (x=1340) - // Should try above or left - const elem = createElement('corner', 1220, 690, 50, 30); - - const result = selectCollisionFreePage([elem], 1, 1280, 720); - - // Should not be below or right - expect(result[0]?.labelPosition).not.toBe('below'); - expect(result[0]?.labelPosition).not.toBe('right'); + expect(result[0]?.labelPosition).toBe('below'); }); test('should handle empty elements array', () => { diff --git a/extension/src/background/index.ts b/extension/src/background/index.ts index 3b0f24b..3cf9d86 100644 --- a/extension/src/background/index.ts +++ b/extension/src/background/index.ts @@ -21,7 +21,10 @@ import { debuggerSessionManager } from '../commands/debugger-manager'; import { dialogManager } from '../commands/dialog'; import { clearScreenshotCache } from '../commands/computer'; -import { highlightSingleElement } from '../commands/single-highlight'; +import { + cropScreenshotAroundElement, + getConfirmationPromptText, +} from '../commands/single-highlight'; import { highlightDropPreview } from '../commands/drop-preview-highlight'; import { elementCache } from '../commands/element-cache'; import { assignHashedElementIds } from '../commands/element-id'; @@ -29,6 +32,7 @@ import { buildElementCacheMissMessage } from '../commands/element-cache'; import { buildHighlightDetectionScript, filterHighlightElementsByKeywords, + getElementDescriptorScript, } from '../commands/highlight-detection'; import { performElementClick, @@ -176,12 +180,17 @@ async function runRawScreenshotPrime(options: { ); } -const 
LABEL_POSITION_FALLBACK_ORDER: LabelPosition[] = [ - 'above', - 'below', - 'left', - 'right', -]; +const LABEL_POSITION_FALLBACK_ORDER: LabelPosition[] = ['above', 'below']; + +// Strip fields that exist for extension-internal use (identity, cache, +// inspect_element) from the response payload sent to the server. The LLM +// consumes `descriptor` instead of raw outerHTML. +function toServerHighlightElement( + element: InteractiveElement, +): InteractiveElement { + const { html: _html, ...rest } = element; + return rest; +} // Keyword-mode bypasses the collision planner (it must show all matches on // one page), but the renderer still needs a labelPosition that fits in the @@ -288,16 +297,22 @@ function buildHighlightConsistencyScript( `; } +// Border = bright outline around the element (minimal content occlusion). +// Bg = OPAQUE darker shade used as the label fill. Using a darker opaque +// fill (not the border color at reduced alpha) makes the label read as a +// distinct filled badge rather than a part of the bbox's outline — so +// when the label's bottom edge touches the bbox's top edge, the two +// shapes remain visually separable. 
const IN_PAGE_HIGHLIGHT_COLORS: Record = { - clickable: { border: '#0066FF', bg: 'rgba(0,102,255,0.7)' }, - scrollable: { border: '#00CC66', bg: 'rgba(0,204,102,0.7)' }, - inputable: { border: '#FF9900', bg: 'rgba(255,153,0,0.7)' }, - selectable: { border: '#FF6B6B', bg: 'rgba(255,107,107,0.7)' }, - draggable: { border: '#FF6600', bg: 'rgba(255,102,0,0.7)' }, - droppable: { border: '#339966', bg: 'rgba(51,153,102,0.7)' }, - uploadable: { border: '#AA66FF', bg: 'rgba(170,102,255,0.7)' }, - any: { border: '#00CCCC', bg: 'rgba(0,204,204,0.7)' }, + clickable: { border: '#0066FF', bg: '#003D99' }, + scrollable: { border: '#00CC66', bg: '#007A3D' }, + inputable: { border: '#FF9900', bg: '#995C00' }, + selectable: { border: '#FF6B6B', bg: '#993333' }, + draggable: { border: '#FF6600', bg: '#993D00' }, + droppable: { border: '#339966', bg: '#1F5C3D' }, + uploadable: { border: '#AA66FF', bg: '#663D99' }, + any: { border: '#00CCCC', bg: '#007A7A' }, }; const OB_HIGHLIGHT_OVERLAY_ID = '__ob_highlight_overlay__'; @@ -316,6 +331,7 @@ function buildInPageHighlightScript(elements: InteractiveElement[]): string { borderColor: colors.border, bgColor: colors.bg, labelPos: el.labelPosition || 'above', + labelXOffset: el.labelXOffset || 0, }; }); @@ -331,7 +347,11 @@ function buildInPageHighlightScript(elements: InteractiveElement[]): string { // Snapshot + restore helpers so we don't leak our overrides onto the // page when pre-existing inline styles are present. const SAVED_ATTR = HL_ATTR + '-saved'; - const OVERRIDES = ['transition', 'box-shadow']; + // outline is painted AFTER descendants (per CSS paint order), so it + // stays visible even when the element has opaque children filling its + // content area — e.g. wrapping an that + // would fully cover an inset box-shadow. 
+ const OVERRIDES = ['transition', 'outline', 'outline-offset']; const snapshotOverrides = (el) => { const snap = {}; for (const p of OVERRIDES) { @@ -356,29 +376,40 @@ function buildInPageHighlightScript(elements: InteractiveElement[]): string { el.removeAttribute(SAVED_ATTR); }; - // Remove any previous highlights - document.getElementById(OVERLAY_ID)?.remove(); + // Remove any previous highlights (including resize listener). + const prevOverlay = document.getElementById(OVERLAY_ID); + if (prevOverlay) { + if (prevOverlay.__obResizeHandler) { + window.removeEventListener('resize', prevOverlay.__obResizeHandler); + } + prevOverlay.remove(); + } document.querySelectorAll('[' + HL_ATTR + ']').forEach(el => { restoreOverrides(el); el.removeAttribute(HL_ATTR); }); - // Create overlay for labels only (boxes use inset box-shadow on elements) + // Create overlay for labels only (boxes use outline on elements). + // Use position:absolute so labels scroll with the document alongside + // the outlined elements; fixed would leave them stuck to the viewport. const overlay = document.createElement('div'); overlay.id = OVERLAY_ID; - overlay.style.cssText = 'position:fixed;top:0;left:0;width:100%;height:100%;pointer-events:none;z-index:2147483647;overflow:hidden;'; + overlay.style.cssText = 'position:absolute;top:0;left:0;pointer-events:none;z-index:2147483647;'; document.documentElement.appendChild(overlay); const bboxes = []; // box-sizing:border-box so max-width caps the total rendered width // (matching the collision planner's MAX_LABEL_WIDTH, which is the full // label width including padding). 
- const LABEL_BASE_CSS = 'position:fixed;box-sizing:border-box;' + const LABEL_BASE_CSS = 'position:absolute;box-sizing:border-box;' + 'font:bold ' + LABEL_FONT_SIZE + 'px/' + LABEL_FONT_SIZE + 'px Arial,sans-serif;' + 'color:#fff;padding:' + LABEL_PADDING + 'px;border-radius:2px;' + 'white-space:nowrap;pointer-events:none;overflow:hidden;text-overflow:ellipsis;' + 'max-width:' + MAX_LABEL_WIDTH + 'px;'; + // Track element→label pairs so we can reposition on resize. + const labelEntries = []; + for (const item of items) { try { const el = document.querySelector(item.selector); @@ -397,12 +428,12 @@ function buildInPageHighlightScript(elements: InteractiveElement[]): string { if (hit && hit !== el && !el.contains(hit)) continue; } - // Snapshot any inline transition/box-shadow so cleanup can restore + // Snapshot any inline transition/outline so cleanup can restore // them exactly (including !important priority) instead of stripping. snapshotOverrides(el); - // Disable CSS transitions so the page can't animate the shadow in + // Disable CSS transitions so the page can't animate the outline in // (e.g. sidebar items with "transition: all 0.2s" would cause the - // CDP screenshot to catch the box-shadow mid-interpolation and the + // CDP screenshot to catch the outline mid-interpolation and the // border would render thinner than the specified 3px). el.style.setProperty('transition', 'none', 'important'); // Adapt border thickness to element size: tight targets (small @@ -411,10 +442,11 @@ function buildInPageHighlightScript(elements: InteractiveElement[]): string { // against a bigger empty interior. const borderPx = Math.min(rect.width, rect.height) > 32 ? 
3 : 2; el.style.setProperty( - 'box-shadow', - 'inset 0 0 0 ' + borderPx + 'px ' + item.borderColor, + 'outline', + borderPx + 'px solid ' + item.borderColor, 'important', ); + el.style.setProperty('outline-offset', (-borderPx) + 'px', 'important'); el.setAttribute(HL_ATTR, item.id); // Render label off-screen first to measure actual dimensions, then @@ -425,26 +457,198 @@ function buildInPageHighlightScript(elements: InteractiveElement[]): string { label.textContent = item.id; overlay.appendChild(label); - const labelRect = label.getBoundingClientRect(); + labelEntries.push({ el, label, item }); + + bboxes.push({ id: item.id, bbox: { x: rect.x, y: rect.y, width: rect.width, height: rect.height } }); + } catch (e) { /* skip */ } + } + + // Position all labels. Extracted so we can re-run on resize. + function positionLabels() { + const sx = window.scrollX || window.pageXOffset || 0; + const sy = window.scrollY || window.pageYOffset || 0; + for (const entry of labelEntries) { + const rect = entry.el.getBoundingClientRect(); + const labelRect = entry.label.getBoundingClientRect(); const labelW = labelRect.width; const labelH = labelRect.height; + const slack = Math.max(0, rect.width - labelW); + const xOffset = Math.max(0, Math.min(slack, entry.item.labelXOffset || 0)); + entry.label.style.left = (rect.left + sx + xOffset) + 'px'; + entry.label.style.top = ( + entry.item.labelPos === 'below' + ? rect.bottom + sy + : rect.top + sy - labelH + ) + 'px'; + } + } + positionLabels(); + + // Reposition labels when the page layout changes (window resize, + // zoom, DevTools panel toggle, etc.) so they stay attached to + // their highlighted elements instead of drifting. + const _obResizeHandler = () => { positionLabels(); }; + window.addEventListener('resize', _obResizeHandler); + // Stash reference so cleanup can remove it. 
+ overlay.__obResizeHandler = _obResizeHandler; + + return { bboxes }; + })(); + `; +} - let lx, ly; - switch (item.labelPos) { - case 'below': lx = rect.left; ly = rect.bottom; break; - case 'left': lx = rect.left - labelW; ly = rect.top; break; - case 'right': lx = rect.right; ly = rect.top; break; - default: lx = rect.left; ly = rect.top - labelH; break; +// Cleanup of injected highlight styles is deferred until the next command +// arrives, so the yellow/colored overlay stays visible on the page between +// commands. Keyed by tabId; a pending cleanup is overwritten if a new +// highlight runs on the same tab before the prior one is flushed. +const pendingHighlightCleanups = new Map<number, () => Promise<void>>(); + +function scheduleHighlightCleanup(tabId: number, conversationId: string): void { + pendingHighlightCleanups.set(tabId, async () => { + await javascript.executeJavaScript( + tabId, + conversationId, + buildHighlightCleanupScript(), + true, + false, + 2000, + ); + }); +} + +// Read-only / metadata commands that should NOT flush pending highlight +// cleanups. The server sends `get_tabs` immediately after every tab action +// to refresh its tab list; treating that as a "user-visible next command" +// would wipe the highlights we just injected on a tab init. +const HIGHLIGHT_PRESERVING_COMMAND_TYPES = new Set(['get_tabs']); + +async function flushPendingHighlightCleanups(tabId?: number): Promise<void> { + if (pendingHighlightCleanups.size === 0) return; + if (tabId === undefined) return; + const cleanup = pendingHighlightCleanups.get(tabId); + if (!cleanup) return; + pendingHighlightCleanups.delete(tabId); + try { + await cleanup(); + } catch (e) { + console.warn( + `⚠️ [HighlightCleanup] Deferred cleanup failed for tab ${tabId}: ${e}`, + ); + } +} + +// Inject a yellow confirmation outline + "Is this the element you wanted +// to ..." banner on a single live DOM element. Shares OVERLAY_ID / HL_ATTR +// with the broad highlight path so buildHighlightCleanupScript reverses it.
+function buildInPageSingleHighlightScript( + element: InteractiveElement, + intendedAction: 'click' | 'keyboard_input' | 'select' | undefined, +): string { + const selector = element.overlaySelector || element.selector; + const promptText = getConfirmationPromptText(intendedAction); + const borderColor = '#FFD400'; + const bannerBg = 'rgba(255,212,0,0.95)'; + + return ` + (() => { + const OVERLAY_ID = ${JSON.stringify(OB_HIGHLIGHT_OVERLAY_ID)}; + const HL_ATTR = ${JSON.stringify(OB_HIGHLIGHT_ATTR)}; + const SAVED_ATTR = HL_ATTR + '-saved'; + const OVERRIDES = ['transition', 'outline', 'outline-offset']; + + const snapshotOverrides = (el) => { + const snap = {}; + for (const p of OVERRIDES) { + snap[p] = { + v: el.style.getPropertyValue(p), + i: el.style.getPropertyPriority(p), + }; + } + el.setAttribute(SAVED_ATTR, JSON.stringify(snap)); + }; + const restoreOverrides = (el) => { + let snap = {}; + try { snap = JSON.parse(el.getAttribute(SAVED_ATTR) || '{}'); } catch (_) {} + for (const p of OVERRIDES) { + const saved = snap[p]; + if (saved && saved.v) { + el.style.setProperty(p, saved.v, saved.i || ''); + } else { + el.style.removeProperty(p); } + } + el.removeAttribute(SAVED_ATTR); + }; - label.style.left = lx + 'px'; - label.style.top = ly + 'px'; + document.getElementById(OVERLAY_ID)?.remove(); + document.querySelectorAll('[' + HL_ATTR + ']').forEach(el => { + restoreOverrides(el); + el.removeAttribute(HL_ATTR); + }); - bboxes.push({ id: item.id, bbox: { x: rect.x, y: rect.y, width: rect.width, height: rect.height } }); - } catch (e) { /* skip */ } + const overlay = document.createElement('div'); + overlay.id = OVERLAY_ID; + overlay.style.cssText = 'position:absolute;top:0;left:0;pointer-events:none;z-index:2147483647;'; + document.documentElement.appendChild(overlay); + + const el = document.querySelector(${JSON.stringify(selector)}); + if (!el) return { bbox: null }; + const rect = el.getBoundingClientRect(); + if (rect.width <= 0 || rect.height <= 0) return 
{ bbox: null }; + + const scrollX = window.scrollX || window.pageXOffset || 0; + const scrollY = window.scrollY || window.pageYOffset || 0; + + snapshotOverrides(el); + el.style.setProperty('transition', 'none', 'important'); + const borderPx = Math.min(rect.width, rect.height) > 32 ? 4 : 3; + el.style.setProperty( + 'outline', + borderPx + 'px solid ' + ${JSON.stringify(borderColor)}, + 'important', + ); + el.style.setProperty('outline-offset', (-borderPx) + 'px', 'important'); + el.setAttribute(HL_ATTR, 'single'); + + const label = document.createElement('div'); + const fontSize = 16; + const paddingX = 14; + const paddingY = 8; + label.style.cssText = 'position:absolute;box-sizing:border-box;' + + 'font:600 ' + fontSize + 'px/' + (fontSize + 4) + 'px ' + + '-apple-system,BlinkMacSystemFont,"Segoe UI",Arial,sans-serif;' + + 'color:#111;background:' + ${JSON.stringify(bannerBg)} + ';' + + 'padding:' + paddingY + 'px ' + paddingX + 'px;border-radius:6px;' + + 'border:1px solid rgba(17,17,17,0.18);' + + 'white-space:nowrap;pointer-events:none;' + + 'box-shadow:0 4px 12px rgba(0,0,0,0.18);left:-9999px;top:0;'; + label.textContent = ${JSON.stringify(promptText)}; + overlay.appendChild(label); + + const labelRect = label.getBoundingClientRect(); + const labelW = labelRect.width; + const labelH = labelRect.height; + + const MARGIN = 10; + const elCenterX = rect.left + rect.width / 2; + let lx = elCenterX - labelW / 2; + lx = Math.max(MARGIN, Math.min(lx, innerWidth - labelW - MARGIN)); + + let ly; + if (rect.top - labelH - MARGIN >= 0) { + ly = rect.top - labelH - MARGIN; + } else if (rect.bottom + labelH + MARGIN <= innerHeight) { + ly = rect.bottom + MARGIN; + } else { + ly = Math.max(MARGIN, rect.top - labelH - MARGIN); } - return { bboxes }; + label.style.left = (lx + scrollX) + 'px'; + label.style.top = (ly + scrollY) + 'px'; + + return { + bbox: { x: rect.x, y: rect.y, width: rect.width, height: rect.height }, + }; })(); `; } @@ -454,8 +658,14 @@ function 
buildHighlightCleanupScript(): string { (() => { const HL_ATTR = ${JSON.stringify(OB_HIGHLIGHT_ATTR)}; const SAVED_ATTR = HL_ATTR + '-saved'; - const OVERRIDES = ['transition', 'box-shadow']; - document.getElementById(${JSON.stringify(OB_HIGHLIGHT_OVERLAY_ID)})?.remove(); + const OVERRIDES = ['transition', 'outline', 'outline-offset']; + const overlayEl = document.getElementById(${JSON.stringify(OB_HIGHLIGHT_OVERLAY_ID)}); + if (overlayEl) { + if (overlayEl.__obResizeHandler) { + window.removeEventListener('resize', overlayEl.__obResizeHandler); + } + overlayEl.remove(); + } document.querySelectorAll('[' + HL_ATTR + ']').forEach(el => { let snap = {}; try { snap = JSON.parse(el.getAttribute(SAVED_ATTR) || '{}'); } catch (_) {} @@ -612,9 +822,20 @@ async function captureHighlightedPageState( continue; } + // Filter out elements too small to produce a visible highlight outline. + // Without this, tiny decorative dots (e.g. bullet indicators) enter the + // collision planner, occupy label slots, and block adjacent meaningful + // elements like links from being placed on page 1. + const MIN_HIGHLIGHT_DIM = 8; + const sizeFilteredElements = allElements.filter( + (el) => + el.bbox.width >= MIN_HIGHLIGHT_DIM || + el.bbox.height >= MIN_HIGHLIGHT_DIM, + ); + const keywordFilterStart = Date.now(); const keywordFiltering = filterHighlightElementsByKeywords( - allElements, + sizeFilteredElements, keywords, ); const keywordList = keywordFiltering.keywords; @@ -685,19 +906,9 @@ async function captureHighlightedPageState( highlightScript, ); - // Clean up injected highlights from the DOM - try { - await javascript.executeJavaScript( - tabId, - conversationId, - buildHighlightCleanupScript(), - true, - false, - 2000, - ); - } catch (e) { - console.warn(`⚠️ [${logLabel}] highlight cleanup failed: ${e}`); - } + // Keep injected highlights in the DOM until the next command runs. + // Flushed from handleCommand via flushPendingHighlightCleanups(). 
+ scheduleHighlightCleanup(tabId, conversationId); if (!screenshotResult?.success || !screenshotResult?.imageData) { throw new Error( @@ -840,7 +1051,7 @@ async function captureHighlightedPageState( ); return { - elements: storedPage.elements, + elements: storedPage.elements.map(toServerHighlightElement), totalElements: filteredElements.length, totalPages, page: currentPage, @@ -1395,6 +1606,12 @@ chrome.runtime.onMessage.addListener((message, _sender, sendResponse) => { async function handleCommand(command: Command): Promise { console.log(`📨 Handling command: ${command.type}`, command); + if (!HIGHLIGHT_PRESERVING_COMMAND_TYPES.has(command.type)) { + await flushPendingHighlightCleanups( + (command as { tab_id?: number }).tab_id, + ); + } + try { switch (command.type) { case 'recording_control': { @@ -2764,13 +2981,25 @@ async function handleCommand(command: Command): Promise { // Brief pause for CSS transitions triggered by hover event handlers await new Promise((r) => setTimeout(r, 150)); - // Capture screenshot + // Inject yellow outline + confirmation banner on the real DOM + // element, capture, then crop around the element for the zoom-in + // preview. Cleanup is deferred to the next user-visible command + // so the confirmation highlight stays on the live page. 
+ const singleHighlightScript = buildInPageSingleHighlightScript( + { ...element.element, bbox: freshBbox }, + command.intended_action, + ); const screenshotResult = await captureScreenshot( activeTabId, conversationId, true, 90, + false, + 0, + undefined, + singleHighlightScript, ); + scheduleHighlightCleanup(activeTabId, conversationId); // ============================================================ // Check if element is visible in viewport @@ -2816,18 +3045,17 @@ async function handleCommand(command: Command): Promise { }; } - // Create element with fresh bbox for drawing + // Border + banner are already baked into the screenshot via the + // in-page injection; just crop it to a zoomed window around the + // element for the confirmation preview. const elementWithFreshBbox = { ...element.element, bbox: freshBbox, }; - - // Draw single element highlight - const highlightedScreenshot = await highlightSingleElement( + const highlightedScreenshot = await cropScreenshotAroundElement( screenshotResult.imageData, elementWithFreshBbox, { - intendedAction: command.intended_action, scale: screenshotResult.metadata?.imageScale || screenshotResult.metadata?.devicePixelRatio || @@ -2917,6 +3145,7 @@ async function handleCommand(command: Command): Promise { .replace(/"/g, '\\"'); const dropDetectionScript = ` (function() { + ${getElementDescriptorScript()} const container = document.querySelector("${targetSelector}"); if (!container) { return { ok: false, error: "Drop target container not found in DOM" }; @@ -2993,10 +3222,14 @@ async function handleCommand(command: Command): Promise { selector += ':nth-child(' + idx + ')'; } const rect = child.getBoundingClientRect(); + const descriptor = + typeof window.__openbrowserBuildElementDescriptor === 'function' + ? 
window.__openbrowserBuildElementDescriptor(child) + : undefined; innerElements.push({ tagName: child.tagName, text: (child.textContent || '').trim().slice(0, 200), - html: child.outerHTML.slice(0, 2000), + descriptor: descriptor, selector: selector, bbox: { x: rect.x, diff --git a/extension/src/commands/element-cache.ts b/extension/src/commands/element-cache.ts index a3f56b6..1e50f97 100644 --- a/extension/src/commands/element-cache.ts +++ b/extension/src/commands/element-cache.ts @@ -12,6 +12,7 @@ import type { ElementType, InteractiveElement } from '../types'; import { buildElementIdentityKey, generateUniqueHash, + getStableIdentityInput, normalizeVisualElementIdInput, } from './element-id'; @@ -180,7 +181,7 @@ class ElementCacheImpl { const { hash } = generateUniqueHash( element.selector, entry.usedIds, - element.html, + getStableIdentityInput(element), ); elementId = hash; } diff --git a/extension/src/commands/element-descriptor.injected.js b/extension/src/commands/element-descriptor.injected.js new file mode 100644 index 0000000..71ee09e --- /dev/null +++ b/extension/src/commands/element-descriptor.injected.js @@ -0,0 +1,426 @@ +/* eslint-disable */ +// Plain-JS helper that runs inside the page world to produce a compact, +// structured descriptor of an interactive element. Inlined into both the +// highlight detection script and the drag-and-drop inner-element probe. 
+ +/* global Node */ + +function openbrowserCollapseWhitespace(value) { + if (typeof value !== 'string') return ''; + return value.replace(/\s+/g, ' ').trim(); +} + +function openbrowserTruncate(value, limit) { + if (typeof value !== 'string') return undefined; + const collapsed = openbrowserCollapseWhitespace(value); + if (!collapsed) return undefined; + if (collapsed.length <= limit) return collapsed; + return collapsed.slice(0, limit - 1) + '…'; +} + +function openbrowserVisibleText(element) { + if (!element) return undefined; + // Prefer accessible name sources for empty-text controls later; here just + // pull visible text content from the subtree. + const raw = element.textContent || ''; + return openbrowserTruncate(raw, 120); +} + +function openbrowserAccessibleName(element) { + if (!element) return undefined; + const ariaLabel = element.getAttribute && element.getAttribute('aria-label'); + if (ariaLabel) return openbrowserTruncate(ariaLabel, 120); + const labelledBy = + element.getAttribute && element.getAttribute('aria-labelledby'); + if (labelledBy) { + try { + const ids = labelledBy.split(/\s+/).filter(Boolean); + const parts = []; + for (const id of ids) { + const ref = element.ownerDocument.getElementById(id); + if (ref && ref.textContent) parts.push(ref.textContent); + } + const joined = parts.join(' '); + if (joined) return openbrowserTruncate(joined, 120); + } catch (_err) { + /* ignore */ + } + } + const title = element.getAttribute && element.getAttribute('title'); + if (title) return openbrowserTruncate(title, 120); + const alt = element.getAttribute && element.getAttribute('alt'); + if (alt) return openbrowserTruncate(alt, 120); + return undefined; +} + +function openbrowserExplicitRole(element) { + if (!element || !element.getAttribute) return undefined; + const role = element.getAttribute('role'); + if (!role) return undefined; + const tag = element.tagName ? 
element.tagName.toLowerCase() : ''; + // Hide role when it's already redundant with the tag (e.g. ", + "descriptor": {"tag": "button", "text": "Submit"}, } ], "totalElements": 1, @@ -149,7 +149,7 @@ def test_build_observation_marks_small_model_from_session_metadata( { "id": "abc123", "type": "clickable", - "html": "", + "descriptor": {"tag": "button", "text": "Submit"}, } ], total_elements=1, @@ -157,7 +157,7 @@ def test_build_observation_marks_small_model_from_session_metadata( ) assert observation.small_model is True - assert "" in observation.to_llm_content[0].text + assert '", + "descriptor": {"tag": "button", "text": "Submit"}, } ], total_elements=1, @@ -112,7 +137,7 @@ def test_highlighted_clickable_elements_include_html(self) -> None: text = _text_content(observation) assert "1 clickable element" not in text - assert "abc123(clickable): " in text + assert 'abc123(clickable): ", + "descriptor": {"tag": "button", "text": "Submit"}, } ], total_elements=1, @@ -155,17 +184,21 @@ def test_small_model_highlighted_clickable_elements_still_include_html( text = _text_content(observation) assert "1 clickable element" not in text - assert "abc123(clickable): " in text + assert 'abc123(clickable): " + def test_descriptor_collapses_long_text(self) -> None: observation = OpenBrowserObservation( success=True, element_type="inputable", highlighted_elements=[ - {"id": "abc123", "type": "inputable", "html": long_html} + { + "id": "abc123", + "type": "inputable", + "descriptor": { + "tag": "button", + "text": "x" * 220, + }, + } ], total_elements=1, ) @@ -173,27 +206,112 @@ def test_highlighted_elements_truncate_long_html_for_non_selectable_results( text = _text_content(observation) assert "abc123(inputable):" in text - assert "...(Truncated)" in text + # 120-char cap with ellipsis + assert "…" in text + assert "x" * 220 not in text - def test_selectable_elements_keep_full_html_so_options_remain_visible(self) -> None: - select_html = ( - "" - ) + def 
test_selectable_descriptor_emits_all_options_without_truncation( + self, + ) -> None: + """With element_type='selectable', all options render regardless of count. + + The agent narrows to 'selectable' specifically to inspect a full option + list before calling `select`, so the cap that protects the mixed + inventory must not apply here. + """ + options = [{"value": str(i), "label": f"Option {i}"} for i in range(30)] + options[0]["selected"] = True observation = OpenBrowserObservation( success=True, element_type="selectable", highlighted_elements=[ - {"id": "sel999", "type": "selectable", "html": select_html} + { + "id": "sel999", + "type": "selectable", + "descriptor": { + "tag": "select", + "options": options, + "value": "0", + }, + } + ], + total_elements=1, + ) + + text = _text_content(observation) + + assert "sel999(selectable): with many options is capped. + + Pre-regression behavior (main): HTML truncation at 200 chars. On this + branch the descriptor format emits every option unconditionally, which + inflates token cost on state-picker-like widgets. Cap at 20 by default, + with a trailer telling the agent how to see the rest. 
+ """ + options = [{"value": str(i), "label": f"Option {i}"} for i in range(50)] + options[0]["selected"] = True + observation = OpenBrowserObservation( + success=True, + element_type="any", + highlighted_elements=[ + { + "id": "selMANY", + "type": "selectable", + "descriptor": { + "tag": "select", + "options": options, + "value": "0", + }, + } + ], + total_elements=1, + ) + + text = _text_content(observation) + + # First 20 rendered + for i in range(20): + assert f'"{i}"="Option {i}"' in text + # Option 20+ omitted except the trailer + assert '"25"="Option 25"' not in text + # Trailer present + assert "20 shown, 30 more" in text + assert 're-highlight with `element_type: "selectable"`' in text + + def test_select_capped_inventory_still_shows_selected_option(self) -> None: + """Even when the selected option is past the cap, the agent must see + which option is currently selected so it can decide whether to change + it. The capped renderer appends the selected option at the end.""" + options = [{"value": str(i), "label": f"Option {i}"} for i in range(40)] + options[35]["selected"] = True # selected beyond the 20-item cap + observation = OpenBrowserObservation( + success=True, + element_type="any", + highlighted_elements=[ + { + "id": "selSEL", + "type": "selectable", + "descriptor": { + "tag": "select", + "options": options, + "value": "35", + }, + } ], total_elements=1, ) text = _text_content(observation) - assert select_html in text - assert "...(Truncated)" not in text + assert '"35"="Option 35" (selected)' in text + assert "20 shown, 20 more" in text def test_highlighted_elements_include_detected_type_suffix(self) -> None: observation = OpenBrowserObservation( @@ -203,12 +321,18 @@ def test_highlighted_elements_include_detected_type_suffix(self) -> None: { "id": "vrtbj5", "type": "clickable", - "html": '
', + "descriptor": { + "tag": "div", + "name": "Search", + }, }, { "id": "q4w08w", "type": "inputable", - "html": '', + "descriptor": { + "tag": "input", + "inputType": "search", + }, }, ], total_elements=2, @@ -216,8 +340,8 @@ def test_highlighted_elements_include_detected_type_suffix(self) -> None: text = _text_content(observation) - assert 'vrtbj5(clickable):
' in text - assert "q4w08w(inputable):" in text + assert 'vrtbj5(clickable):
· name="Search"' in text + assert "q4w08w(inputable): · type=search" in text assert "clickable element" not in text def test_small_model_mixed_highlighted_elements_match_default_rendering( @@ -231,12 +355,15 @@ def test_small_model_mixed_highlighted_elements_match_default_rendering( { "id": "vrtbj5", "type": "clickable", - "html": '
', + "descriptor": {"tag": "div", "name": "Search"}, }, { "id": "q4w08w", "type": "inputable", - "html": '', + "descriptor": { + "tag": "input", + "inputType": "search", + }, }, ], total_elements=2, @@ -244,10 +371,35 @@ def test_small_model_mixed_highlighted_elements_match_default_rendering( text = _text_content(observation) - assert 'vrtbj5(clickable):
' in text - assert "q4w08w(inputable):" in text + assert 'vrtbj5(clickable):
· name="Search"' in text + assert "q4w08w(inputable): · type=search" in text assert "clickable element" not in text + def test_anonymous_span_renders_class_and_icon_hints(self) -> None: + observation = OpenBrowserObservation( + success=True, + element_type="clickable", + highlighted_elements=[ + { + "id": "TD6", + "type": "clickable", + "descriptor": { + "tag": "span", + "classHint": ["like-wrapper", "like-active"], + "icon": "like", + }, + } + ], + total_elements=1, + ) + + text = _text_content(observation) + + assert ( + 'TD6(clickable): · class="like-wrapper like-active" · icon=like' + in text + ) + def test_highlighted_elements_include_interaction_hints_in_suffix(self) -> None: observation = OpenBrowserObservation( success=True, @@ -257,7 +409,7 @@ def test_highlighted_elements_include_interaction_hints_in_suffix(self) -> None: "id": "swp123", "type": "scrollable", "interactionHints": ["swipable"], - "html": '
', + "descriptor": {"tag": "div", "text": "Slides"}, } ], total_elements=1, @@ -276,13 +428,13 @@ def test_highlighted_elements_include_draggable_and_droppable_hints(self) -> Non "id": "drg456", "type": "clickable", "interactionHints": ["draggable"], - "html": '
Task 1
', + "descriptor": {"tag": "div", "text": "Task 1"}, }, { "id": "drp789", "type": "clickable", "interactionHints": ["droppable"], - "html": '
Done
', + "descriptor": {"tag": "div", "text": "Done"}, }, ], total_elements=2, @@ -449,13 +601,13 @@ def test_pending_drag_and_drop_shows_inner_elements_and_confirm_options( { "id": "C3F", "type": "draggable", - "html": '
Task 1
', + "descriptor": {"tag": "div", "text": "Task 1"}, "tagName": "div", }, { "id": "D4G", "type": "draggable", - "html": '
Task 2
', + "descriptor": {"tag": "div", "text": "Task 2"}, "tagName": "div", }, ], diff --git a/server/tests/unit/test_eval_client.py b/server/tests/unit/test_eval_client.py index 93f8078..28f1bbf 100644 --- a/server/tests/unit/test_eval_client.py +++ b/server/tests/unit/test_eval_client.py @@ -295,7 +295,7 @@ def test_cleanup_managed_tabs_closes_all_tabs() -> None: } -def test_run_test_cleans_managed_tabs_before_delete(tmp_path) -> None: +def test_run_test_cleans_managed_tabs_before_delete(tmp_path, monkeypatch) -> None: """Test teardown should close managed tabs before deleting the conversation.""" evaluator = Evaluator(chrome_uuid="browser-uuid-123") evaluator.output_dir = tmp_path @@ -305,15 +305,25 @@ def test_run_test_cleans_managed_tabs_before_delete(tmp_path) -> None: alias="plus", model_name="dashscope/qwen3.5-plus", ) - evaluator.eval_server = MagicMock() - evaluator.eval_server.clear_events.return_value = True - evaluator.eval_server.get_events.return_value = [] evaluator._save_track_events = MagicMock(return_value=None) evaluator._extract_images = MagicMock(return_value=[]) evaluator._save_sse_events = MagicMock(return_value=None) evaluator._extract_cost_from_sse_events = MagicMock(return_value=0.0) evaluator._evaluate_criteria = MagicMock(return_value=(True, 1.0, 1.0)) + # Stub the per-test eval server so we don't actually spawn a subprocess. 
+    fake_proc = MagicMock()
+    fake_proc.start.return_value = 17000
+    fake_proc.stop.return_value = None
+    monkeypatch.setattr(
+        eval_module, "EvalServerProcess", MagicMock(return_value=fake_proc)
+    )
+    fake_client = MagicMock()
+    fake_client.get_events.return_value = []
+    monkeypatch.setattr(
+        eval_module, "EvalServerClient", MagicMock(return_value=fake_client)
+    )
+
     teardown_calls: list[str] = []
 
     evaluator.openbrowser = MagicMock()
@@ -341,3 +351,4 @@ def test_run_test_cleans_managed_tabs_before_delete(tmp_path) -> None:
 
     assert result.conversation_id == "conv-123"
     assert teardown_calls == ["cleanup:conv-123", "delete:conv-123"]
+    fake_proc.stop.assert_called_once()
diff --git a/server/tests/unit/test_prompt_contracts.py b/server/tests/unit/test_prompt_contracts.py
index f8540d5..446c0a1 100644
--- a/server/tests/unit/test_prompt_contracts.py
+++ b/server/tests/unit/test_prompt_contracts.py
@@ -75,7 +75,12 @@ def test_highlight_prompt_keeps_icon_targets_on_any_pagination(self) -> None:
 
         assert "icon-only" in description
         assert "stay on the same `element_type` across pages" in description
-        assert "your default next step is the next page in the same mode" in description
+        # Canonical pagination rule lives in the Workflow section as step 4;
+        # the previous shorter phrasing was removed to eliminate redundancy.
+        assert (
+            "call `highlight` with `page: current + 1` on the same "
+            "`element_type` before picking anything" in description
+        )
         assert (
             "If a likely target is already partly visible, clipped, or crowded by sticky UI, use `scroll` to improve geometry before paginating."
             in description
@@ -86,6 +91,24 @@ def test_highlight_prompt_keeps_icon_targets_on_any_pagination(self) -> None:
         )
         assert "`clickable`" not in description
 
+    def test_highlight_prompt_carries_pagination_example(self) -> None:
+        """A concrete positive example is the key lever from Anthropic's guide
+        (positive examples > negative instructions). The example must warn
+        against picking an approximate id (e.g. a like/vote button adjacent
+        to the real target) as a stand-in when later pages exist. The
+        example must stay generic — it names no specific benchmark task or
+        site, so the model cannot memorize a pattern."""
+        description = get_highlight_tool_description()
+
+        assert "Pagination example" in description
+        assert '{"page": 2}' in description
+        assert '"close enough"' in description
+        # Guardrail: do not leak benchmark-specific task names into the
+        # prompt. The 20260420 eval's bluebook_simple task was written into
+        # an earlier draft of this example and must not come back.
+        assert "Arigato" not in description
+        assert "bluebook" not in description.lower()
+
     def test_highlight_prompt_treats_partly_visible_targets_as_geometry_problem(
         self,
     ) -> None:
diff --git a/server/tests/unit/test_tool_prompt_profiles.py b/server/tests/unit/test_tool_prompt_profiles.py
index 41871c4..6f9fd83 100644
--- a/server/tests/unit/test_tool_prompt_profiles.py
+++ b/server/tests/unit/test_tool_prompt_profiles.py
@@ -63,17 +63,35 @@ def test_small_model_highlight_prompt_stays_compact_and_actionable() -> None:
         "Treat that current observation as the working inventory for the current "
         "page state." in description
     )
+    # Canonical pagination rule (Core Rule #4). Rewritten after the 20260420
+    # eval, where flash picked an approximate match from page 1 instead of
+    # paginating to page 2 to find the real target. The bolded sentence is
+    # the load-bearing instruction; the follow-up sweep and no-approximate-
+    # match clauses close the loophole the model exploited previously.
     assert (
-        "Call `highlight` when you need page 2+, a narrower `element_type`, "
-        "or a fresh inventory after a command that did not return an "
-        "interactive observation." in description
+        "**If the exact target id is not in the current page and "
+        "`current_page < total_pages`, call `highlight` with "
+        '`{"page": current_page + 1}` on the same `element_type` before '
+        "picking any id.**" in description
+    )
+    assert (
+        "Do not pick an approximate match from page 1 when later pages have "
+        "not been checked." in description
     )
     assert "scroll first to reposition it" in description
     assert '`element_type: "any"` is the default mixed inventory' in description
+    # Narrowing-by-type is now downstream of the full sweep (Selection
+    # Strategy), not intermixed with pagination guidance.
     assert (
-        "collision-aware label placement may have split the target across pages"
-        in description
+        "Narrow to `inputable`, `scrollable`, `selectable`, `draggable`, "
+        "`droppable`, or `uploadable` only after sweeping all pages on the "
+        "current mode" in description
     )
+    # Concrete positive example is the key lever from Anthropic's guide.
+    # Must stay generic so the model cannot memorize a benchmark task.
+    assert "Paginate before picking: example" in description
+    assert "Arigato" not in description
+    assert "bluebook" not in description.lower()
     assert "If highlight shows `swipable`, use `swipe`." in description
     assert (
         "If a returned element is marked `draggable`, prefer `drag_and_drop` over `click`."
diff --git a/skill/claude/ob-routines/SKILL.md b/skill/claude/ob-routines/SKILL.md
index 589bd0e..b298104 100644
--- a/skill/claude/ob-routines/SKILL.md
+++ b/skill/claude/ob-routines/SKILL.md
@@ -80,7 +80,7 @@ error.** Do not finalize. Instead:
 
 ## Preconditions
 
-**First time?** Complete the full setup in `skill/claude/open-browser/references/setup.md`
+**First time?** Complete the full setup in `~/.claude/skills/open-browser/references/setup.md`
 before using this skill. That guide covers: loading the Chrome extension,
 connecting it to the server, and obtaining a valid `OPENBROWSER_CHROME_UUID`.
 Without that, recording and replay will fail immediately.
@@ -92,7 +92,7 @@ For subsequent uses, confirm:
 
 Quick check:
 ```bash
-python3 skill/claude/open-browser/scripts/check_status.py --chrome-uuid "$OPENBROWSER_CHROME_UUID"
+python3 ~/.claude/skills/open-browser/scripts/check_status.py --chrome-uuid "$OPENBROWSER_CHROME_UUID"
 ```
 
 Start the server if needed:
@@ -100,16 +100,16 @@ Start the server if needed:
 cd /Users/yangxiao/git/OpenBrowser && uv run local-chrome-server serve
 ```
 
-Scripts path: `skill/claude/ob-routines/scripts/` (run from repo root).
+Scripts path: `~/.claude/skills/ob-routines/scripts/`.
 
 ---
 
 ## List & search routines
 
 ```bash
-python3 skill/claude/ob-routines/scripts/list_routines.py
-python3 skill/claude/ob-routines/scripts/list_routines.py "login"
-python3 skill/claude/ob-routines/scripts/list_routines.py --recordings
+python3 ~/.claude/skills/ob-routines/scripts/list_routines.py
+python3 ~/.claude/skills/ob-routines/scripts/list_routines.py "login"
+python3 ~/.claude/skills/ob-routines/scripts/list_routines.py --recordings
 ```
 
 ---
@@ -133,7 +133,7 @@ defeats the pipeline and wastes the user's time. If the user's goal is vague
 
 ### Step 1 — start recording
 ```bash
-python3 skill/claude/ob-routines/scripts/start_recording.py \
+python3 ~/.claude/skills/ob-routines/scripts/start_recording.py \
   --chrome-uuid "$OPENBROWSER_CHROME_UUID" \
   --name "xiaohongshu-messages" \
   --intent "check messages on Xiaohongshu"
@@ -146,7 +146,7 @@ Do NOT proceed until the user confirms.
 
 ### Step 2 — stop recording
 ```bash
-python3 skill/claude/ob-routines/scripts/stop_recording.py
+python3 ~/.claude/skills/ob-routines/scripts/stop_recording.py
 ```
 
 ---
@@ -160,7 +160,7 @@ and then be killed, losing the compiler session.**
 ### Launch in tmux
 ```bash
 tmux new-window -n "compile" \
-  "cd /Users/yangxiao/git/OpenBrowser && python3 skill/claude/ob-routines/scripts/compile.py ; echo '[compile-done]'"
+  "python3 ~/.claude/skills/ob-routines/scripts/compile.py ; echo '[compile-done]'"
 ```
 
 ### Monitor output
@@ -206,11 +206,11 @@ goes directly to the routine name field, not the compiler.
 ## Replay a routine
 
 ```bash
-python3 skill/claude/ob-routines/scripts/replay.py "routine-name" \
+python3 ~/.claude/skills/ob-routines/scripts/replay.py "routine-name" \
   --chrome-uuid "$OPENBROWSER_CHROME_UUID"
 
 # List without replaying
-python3 skill/claude/ob-routines/scripts/replay.py --list
+python3 ~/.claude/skills/ob-routines/scripts/replay.py --list
 ```
 
 Name matching: exact → ID → prefix → substring.
diff --git a/uv.lock b/uv.lock
index 2104838..f303f77 100644
--- a/uv.lock
+++ b/uv.lock
@@ -1597,7 +1597,7 @@ wheels = [
 [[package]]
 name = "litellm"
 version = "1.83.0"
-source = { git = "https://github.com/softpudding/litellm.git?rev=2eb7db59461e9117b1e3e0519616b39f1497c0f9#2eb7db59461e9117b1e3e0519616b39f1497c0f9" }
+source = { git = "https://github.com/softpudding/litellm.git?rev=363075400d97a5252fd2eb60c4f8d44bb529057c#363075400d97a5252fd2eb60c4f8d44bb529057c" }
 dependencies = [
     { name = "aiohttp" },
     { name = "click" },
@@ -1675,11 +1675,11 @@ requires-dist = [
     { name = "black", marker = "extra == 'dev'", specifier = ">=23.0.0" },
     { name = "click", specifier = ">=8.1.0" },
     { name = "fastapi", specifier = ">=0.104.0" },
-    { name = "litellm", git = "https://github.com/softpudding/litellm.git?rev=2eb7db59461e9117b1e3e0519616b39f1497c0f9" },
+    { name = "litellm", git = "https://github.com/softpudding/litellm.git?rev=363075400d97a5252fd2eb60c4f8d44bb529057c" },
     { name = "mypy", marker = "extra == 'dev'", specifier = ">=1.7.0" },
     { name = "numpy", specifier = ">=1.24.0" },
-    { name = "openhands-sdk", git = "https://github.com/softpudding/agent-sdk.git?subdirectory=openhands-sdk&rev=c92a185a" },
-    { name = "openhands-tools", git = "https://github.com/softpudding/agent-sdk.git?subdirectory=openhands-tools&rev=c92a185a" },
+    { name = "openhands-sdk", git = "https://github.com/softpudding/agent-sdk.git?subdirectory=openhands-sdk&rev=df0056f1df4916abb54bc73a585a964911512e4b" },
+    { name = "openhands-tools", git = "https://github.com/softpudding/agent-sdk.git?subdirectory=openhands-tools&rev=df0056f1df4916abb54bc73a585a964911512e4b" },
     { name = "pillow", specifier = ">=10.0.0" },
     { name = "pre-commit", marker = "extra == 'dev'", specifier = ">=4.0.0" },
     { name = "pydantic", specifier = ">=2.5.0" },
@@ -2224,7 +2224,7 @@ wheels = [
 [[package]]
 name = "openhands-sdk"
 version = "1.12.0"
-source = { git = "https://github.com/softpudding/agent-sdk.git?subdirectory=openhands-sdk&rev=c92a185a#c92a185a00aa7ae58547d794835575742f1ed27e" }
+source = { git = "https://github.com/softpudding/agent-sdk.git?subdirectory=openhands-sdk&rev=df0056f1df4916abb54bc73a585a964911512e4b#df0056f1df4916abb54bc73a585a964911512e4b" }
 dependencies = [
     { name = "agent-client-protocol" },
     { name = "deprecation" },
@@ -2244,7 +2244,7 @@ dependencies = [
 [[package]]
 name = "openhands-tools"
 version = "1.12.0"
-source = { git = "https://github.com/softpudding/agent-sdk.git?subdirectory=openhands-tools&rev=c92a185a#c92a185a00aa7ae58547d794835575742f1ed27e" }
+source = { git = "https://github.com/softpudding/agent-sdk.git?subdirectory=openhands-tools&rev=df0056f1df4916abb54bc73a585a964911512e4b#df0056f1df4916abb54bc73a585a964911512e4b" }
 dependencies = [
     { name = "bashlex" },
     { name = "binaryornot" },