Copyright 2026 Google LLC

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
A lightweight framework for benchmarking browser agents. It runs tasks, records the screen, and scores performance using code assertions and Gemini.
This pipeline is designed for developers to evaluate and test the Gemini Computer Use model with zero friction.
It is not an agentic harness for production automation. Instead, it is a testing and evaluation tool that lets you:
- Objectively measure agent performance on specific UI tasks.
- Rapidly iterate on system prompts and context strategies.
- Verify stability and safety across multiple screen resolutions.
- Conduct Root-Cause Analysis (RCA) via multimodal visual and log judging to understand why an agent failed.
The goal is to provide a standardized "exam" environment for browser agents, ensuring they are reliable for enterprise workflows (e.g., Enterprise Resource Planning (ERP)) before deployment.
The pipeline provides a high-efficiency Default System Prompt that handles turn-batching and smart input. When defining your own prompt in YAML, use the `{{DEFAULT}}` placeholder to inherit these core rules:

```yaml
agent:
  system_prompt: |
    {{DEFAULT}}
    # Task-specific instructions
    Stay within the ERP workflow...
```

Note: Omitting `{{DEFAULT}}` completely replaces the agent's operating rules.
Visual agents often get stuck in repetitive loops when they can't visually locate elements on the screen. To solve this, the pipeline includes an advanced Auto-Injection Middleware that bridges the "Vision-DOM Gap" with zero developer overhead.
Instead of forcing the agent to use complex developer tools (which breaks its "standard user" persona), the middleware automatically intercepts failures and injects context directly into the prompt:
- Semantic Location Hints: Deterministic viewport-relative coordinates calculated by the browser engine (e.g., `[Location: Off-screen below (requires scrolling DOWN)]`).
- Supervisor Advice: A lightweight LLM (Flash Lite) analyzes the DOM and the exact Playwright error to provide a single sentence of behavioral coaching, helping the agent navigate complex UI paradigms.
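To illustrate the first mechanism, here is a minimal sketch of how such a location hint could be derived from an element's bounding box. The function name and parameters are illustrative, not the pipeline's actual API, and it assumes viewport-relative pixel coordinates:

```python
def location_hint(box_top: float, box_bottom: float, viewport_height: float) -> str:
    """Classify an element's vertical position relative to the visible viewport.

    box_top/box_bottom are viewport-relative pixel coordinates of the element's
    bounding box (hypothetical inputs for illustration).
    """
    if box_bottom < 0:
        return "[Location: Off-screen above (requires scrolling UP)]"
    if box_top > viewport_height:
        return "[Location: Off-screen below (requires scrolling DOWN)]"
    return "[Location: Visible in current viewport]"

# An element whose box starts 2400px down with a 1080px viewport is below
# the fold, so the injected hint tells the agent to scroll down.
print(location_hint(2400, 2460, 1080))
```

Because the hint is computed by the browser engine rather than guessed by the model, it stays deterministic across runs and resolutions.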
The framework is built around a modern, modular architecture:
- `core/`, `browser/`, `actions/` Structure: Clean separation of concerns replacing legacy monolithic classes (like the deprecated `ActionSpace`).
- Gemini 3 Integration: Native support for the Gemini 3 family (e.g., Flash models) for both the main agent and the reasoning judges.
- Aria Hashing & Smart Batching: High-performance DOM state stagnation detection (Aria Hashing) and optimized action batching to reduce latency.
- PerceptionService & CoordinateScaler: Advanced hitbox normalization and visual state processing to ensure the agent's spatial understanding aligns perfectly with the actual viewport.
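One plausible way to implement the stagnation detection described above is to fingerprint each accessibility-tree snapshot and count consecutive repeats. This is a sketch under assumptions: the class and function names are invented here, and the real pipeline's Aria Hashing may hash different inputs:

```python
import hashlib

def aria_hash(aria_snapshot: str) -> str:
    """Fingerprint an accessibility-tree snapshot for cheap equality checks."""
    return hashlib.sha256(aria_snapshot.encode("utf-8")).hexdigest()

class StagnationDetector:
    """Flags when consecutive steps leave the page state unchanged."""

    def __init__(self, limit: int = 3):
        self.limit = limit
        self.last_hash = None
        self.repeats = 0

    def observe(self, aria_snapshot: str) -> bool:
        """Return True once the same state has been seen `limit` times in a row."""
        h = aria_hash(aria_snapshot)
        if h == self.last_hash:
            self.repeats += 1
        else:
            self.last_hash = h
            self.repeats = 1
        return self.repeats >= self.limit
```

Hashing the snapshot instead of storing it keeps per-step memory constant, which matters when runs span hundreds of turns.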
Every run is evaluated by:
- Assertion Judge: Deterministic URL/DOM/Script checks.
- Video Judge: Visual success analysis by Gemini (multimodal).
- Log Judge: Reasoning and safety audit.
Initialize the environment and install dependencies:

```shell
uv sync
uv run playwright install chromium
```

Create a `.env` file (see the Setup Guide):

```shell
GCP_PROJECT_ID="your-project-id"
GCP_REGION="us-central1"
```

**Step 1: Create a Benchmark**
Generate a new benchmark directory from a template. This creates a folder in config/benchmarks/ with a starter benchmark.yaml.
```shell
uv run computer-eval create "My First Test"
# Output: ✅ Created standard benchmark structure at: config/benchmarks/my_first_test/
```
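The starter `benchmark.yaml` from Step 1 can then be edited to describe your task. As a rough illustration only (these keys are hypothetical; see the Configuration reference for the real schema), a benchmark definition might look like:

```yaml
# Hypothetical sketch of a benchmark.yaml, not the actual schema.
task:
  goal: "Log in and open the Invoices page"
  start_url: "https://example.com/login"
judges:
  assertions:
    - type: url_contains
      value: "/invoices"
```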
**Step 2: Run It**

Execute the evaluation. The pipeline will open a browser (if not in headless mode), run the task, and generate a report in `artifacts/`.
```shell
uv run computer-eval --benchmark config/benchmarks/my_first_test/benchmark.yaml
```

**Step 3: View Results**

Open the latest run summary to see success scores and judge reasoning.
```shell
cat artifacts/my-first-test/latest/result.json
```

The pipeline evaluates agents using three methods:
| Method | Type | Description |
|---|---|---|
| Assertion | Code | Validates the final URL, DOM elements, and JavaScript state. |
| Video | Vision | Uses Gemini to watch the screen recording and verify the goal was reached. |
| Trace | Logic | Audits the agent's internal logs for reasoning and safety issues. |
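To make the Assertion row concrete, here is a minimal, self-contained sketch of deterministic final-state checks. The `FinalState` structure and check helpers are invented for illustration; the real pipeline runs its checks against a live browser via Playwright:

```python
from dataclasses import dataclass, field

@dataclass
class FinalState:
    """Snapshot of a run's end state (illustrative stand-in for live browser checks)."""
    url: str
    dom_text: str
    js_values: dict = field(default_factory=dict)

def assert_url_contains(state: FinalState, fragment: str) -> bool:
    return fragment in state.url

def assert_dom_contains(state: FinalState, text: str) -> bool:
    return text in state.dom_text

def assert_js_equals(state: FinalState, expr: str, expected) -> bool:
    return state.js_values.get(expr) == expected

state = FinalState(
    url="https://example.com/invoices?page=1",
    dom_text="Invoices — 12 results",
    js_values={"document.title": "Invoices"},
)
checks = [
    assert_url_contains(state, "/invoices"),
    assert_dom_contains(state, "Invoices"),
    assert_js_equals(state, "document.title", "Invoices"),
]
print(all(checks))  # → True: every deterministic check must pass for the Assertion judge
```

Code assertions like these are cheap and repeatable, which is why they run alongside the more expensive multimodal judges rather than instead of them.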
Control how the agent handles high-stakes actions:
- `auto_approve`: Automatically allows actions (for CI/CD).
- `auto_deny`: Automatically blocks actions (for safety testing).
- `interactive` (default): Pauses for human confirmation.
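Assuming the policy is set in the benchmark YAML alongside the other agent options (the `approval_policy` key name here is a guess, not the documented schema), configuration might look like:

```yaml
# Hypothetical placement; consult the Configuration reference for the real key.
agent:
  approval_policy: auto_deny   # auto_approve | auto_deny | interactive
```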
We track:
- Autonomy Score: Percentage of steps completed without human help.
- Cost: Input/output token usage.
- Success Rate: Did the agent complete the task?
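For instance, the Autonomy Score above reduces to a simple ratio. This is a sketch of the arithmetic only; the parameter names are illustrative:

```python
def autonomy_score(total_steps: int, human_assisted_steps: int) -> float:
    """Percentage of steps the agent completed without human help."""
    if total_steps == 0:
        return 0.0
    return 100.0 * (total_steps - human_assisted_steps) / total_steps

# A 20-step run where a human had to confirm 2 actions:
print(autonomy_score(20, 2))  # → 90.0
```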
- 🚀 Start Here: Hello World & Experiments: Zero-to-hero guide.
- Setup Guide: Installation and Authentication.
- 🐳 Docker Setup & Usage: Running reproducible benchmarks in containers.
- Usage Guide: CLI commands and Batch Evaluation.
- Architecture: How the system works.
- Creating Benchmarks: Defining new tasks.
- Configuration: YAML schema reference.
- Extending the Pipeline: Adding custom tools and hooks.
- Prompting: Best practices for agent instructions.
- Performance: Latency and optimization.
See our documentation for guidelines on adding new features or benchmarks.