Computer Use Evaluation Pipeline

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

A lightweight framework for benchmarking browser agents. It runs tasks, records the screen, and scores performance using code assertions and Gemini.

🎯 Purpose

This pipeline is designed for developers to evaluate and test the Gemini Computer Use model with zero friction.

It is not an agentic harness for production automation. Instead, it is a test-and-evaluation tool that allows you to:

Objectively measure agent performance on specific UI tasks.
Rapidly iterate on system prompts and context strategies.
Verify stability and safety across multiple screen resolutions.
Conduct Root-Cause Analysis (RCA) via multimodal visual and log judging to understand why an agent failed.

The goal is to provide a standardized "exam" environment for browser agents, ensuring they are reliable for enterprise workflows (e.g., Enterprise Resource Planning (ERP)) before deployment.

💡 Key Concepts

System Prompting (Inheritance)

The pipeline provides a high-efficiency Default System Prompt that handles turn-batching and smart input. When defining your own prompt in YAML, use the {{DEFAULT}} placeholder to inherit these core rules:

agent:
  system_prompt: |
    {{DEFAULT}}
    
    # Task-specific instructions
    Stay within the ERP workflow...

Note: Omitting {{DEFAULT}} will completely replace the agent's operating rules.
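
The inheritance rule can be pictured as a plain string substitution. The sketch below is illustrative only: `resolve_system_prompt` and `DEFAULT_PLACEHOLDER` are hypothetical names, not the pipeline's actual API, and falling back to the default prompt when no custom prompt is given is an assumption.

```python
from typing import Optional

# Placeholder token as written in the YAML example above.
DEFAULT_PLACEHOLDER = "{{DEFAULT}}"


def resolve_system_prompt(custom_prompt: Optional[str], default_prompt: str) -> str:
    """Hypothetical helper: expand {{DEFAULT}} inside a user-supplied prompt."""
    if not custom_prompt:
        # Assumption: with no custom prompt, the default rules apply unchanged.
        return default_prompt
    # If the placeholder is absent, nothing is substituted, so the custom
    # prompt fully replaces the default rules (matching the note above).
    return custom_prompt.replace(DEFAULT_PLACEHOLDER, default_prompt)


if __name__ == "__main__":
    default = "Batch actions where possible."
    custom = "{{DEFAULT}}\n\n# Task-specific instructions\nStay within the ERP workflow..."
    print(resolve_system_prompt(custom, default))
```

A prompt without the placeholder comes back untouched, which is why omitting `{{DEFAULT}}` discards the default operating rules.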

Zero-Friction Auto-Injection (Self-Healing)

Visual agents often get stuck in repetitive loops when they can't visually locate elements on the screen. To solve this, the pipeline includes an advanced Auto-Injection Middleware that bridges the "Vision-DOM Gap" with zero developer overhead.

Instead of forcing the agent to use complex developer tools (which breaks its "standard user" persona), the middleware automatically intercepts failures and injects context directly into the prompt:

Semantic Location Hints: Deterministic viewport-relative coordinates calculated by the browser engine (e.g., [Location: Off-screen below (requires scrolling DOWN)]).
Supervisor Advice: A lightweight LLM (Flash Lite) analyzes the DOM and exact Playwright errors to provide a single sentence of behavioral coaching, helping the agent navigate complex UI paradigms.
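
To make the idea of a semantic location hint concrete, here is a minimal sketch that classifies an element's bounding box against the viewport. It is not the middleware's real code: `Box` and `location_hint` are hypothetical, and only the "Off-screen below" string comes from this README; the other hint texts are invented for illustration. A box of this shape could, for example, come from Playwright's `locator.bounding_box()`.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Viewport-relative bounding box in CSS pixels (hypothetical type)."""
    x: float
    y: float
    width: float
    height: float


def location_hint(box: Box, viewport_width: int, viewport_height: int) -> str:
    """Return a deterministic hint describing where the element sits."""
    if box.y >= viewport_height:
        # This wording is the example given in the README.
        return "[Location: Off-screen below (requires scrolling DOWN)]"
    # The remaining hint strings are assumptions made for this sketch.
    if box.y + box.height <= 0:
        return "[Location: Off-screen above (requires scrolling UP)]"
    if box.x >= viewport_width:
        return "[Location: Off-screen right (requires scrolling RIGHT)]"
    if box.x + box.width <= 0:
        return "[Location: Off-screen left (requires scrolling LEFT)]"
    return "[Location: In viewport]"


if __name__ == "__main__":
    print(location_hint(Box(10, 900, 100, 20), 1280, 720))
```

Because the hint is computed from geometry rather than from the model's reading of a screenshot, it stays the same for the same page state.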

Advanced Perception & Architecture

The framework is built around a modern, modular architecture:

core/, browser/, actions/ Structure: Clean separation of concerns replacing legacy monolithic classes (like the deprecated ActionSpace).
Gemini 3 Integration: Native support for the Gemini 3 family (e.g., Flash models) for both the main agent and reasoning judges.
Aria Hashing & Smart Batching: High-performance DOM state stagnation detection (Aria Hashing) and optimized action batching to reduce latency.
PerceptionService & CoordinateScaler: Advanced hitbox normalization and visual state processing to ensure the agent's spatial understanding aligns perfectly with the actual viewport.
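
Two of the ideas above, stagnation detection and coordinate scaling, can be sketched in a few lines. This is a rough illustration under stated assumptions, not the framework's implementation: the function names are hypothetical, the ARIA snapshot is assumed to be available as text, the three-step window is arbitrary, and the model is assumed to emit coordinates on a normalized 1000-unit grid.

```python
import hashlib
from typing import List, Tuple


def aria_hash(aria_snapshot: str) -> str:
    """Hash a text ARIA snapshot, ignoring indentation and blank lines."""
    lines = [line.strip() for line in aria_snapshot.splitlines() if line.strip()]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


def is_stagnant(hashes: List[str], window: int = 3) -> bool:
    """True when the last `window` snapshots hashed to the same value."""
    if len(hashes) < window:
        return False
    return len(set(hashes[-window:])) == 1


def scale_point(nx: float, ny: float, width: int, height: int,
                grid: int = 1000) -> Tuple[int, int]:
    """Map a point on an assumed normalized grid to viewport pixels."""
    return int(nx / grid * width), int(ny / grid * height)


if __name__ == "__main__":
    history = [aria_hash("button 'Save'") for _ in range(3)]
    print(is_stagnant(history))
    print(scale_point(500, 500, 1440, 900))
```

Hashing the accessibility tree instead of comparing screenshots keeps the check cheap, at the cost of missing purely visual changes that leave the tree untouched.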

Evaluation Judges

Every run is evaluated by:

Assertion Judge: Deterministic URL/DOM/Script checks.
Video Judge: Visual success analysis by Gemini (multimodal).
Log Judge: Reasoning and safety audit.

⚡ Quick Start

1. Install

Initialize the environment and install dependencies.

uv sync
uv run playwright install chromium

2. Configure

Create a .env file (see Setup Guide):

GCP_PROJECT_ID="your-project-id"
GCP_REGION="us-central1"

3. Run a Benchmark

Step 1: Create a Benchmark. Generate a new benchmark directory from a template. This creates a folder in config/benchmarks/ with a starter benchmark.yaml.

uv run computer-eval create "My First Test"
# Output: ✅ Created standard benchmark structure at: config/benchmarks/my_first_test/

Step 2: Run it. Execute the evaluation. The pipeline will open a browser (if not in headless mode), run the task, and generate a report in artifacts/.

uv run computer-eval --benchmark config/benchmarks/my_first_test/benchmark.yaml

Step 3: View Results. Open the latest run summary to see success scores and judge reasoning.

cat artifacts/my-first-test/latest/result.json

⚖️ How it Works

The pipeline evaluates agents using three methods:

Method	Type	Description
Assertion	Code	Validates the final URL, DOM elements, and JavaScript state.
Video	Vision	Uses Gemini to watch the screen recording and verify the goal was reached.
Trace	Logic	Audits the agent's internal logs for reasoning and safety issues.
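
As an illustration of the deterministic "Assertion" row, the sketch below checks a final URL and some DOM text against expectations. `FinalState` and `run_assertions` are hypothetical names and the input shape is an assumption; the real benchmark.yaml schema is described in the Configuration reference.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FinalState:
    """Hypothetical snapshot of the browser once the agent stops."""
    url: str
    visible_text: Dict[str, str]  # CSS selector -> text found at that selector


def run_assertions(state: FinalState, expected_url_fragment: str,
                   expected_text: Dict[str, str]) -> List[str]:
    """Return a list of failure messages; an empty list means every check passed."""
    failures: List[str] = []
    if expected_url_fragment not in state.url:
        failures.append(f"URL does not contain {expected_url_fragment!r}")
    for selector, needle in expected_text.items():
        actual = state.visible_text.get(selector)
        if actual is None:
            failures.append(f"missing element: {selector}")
        elif needle not in actual:
            failures.append(f"{selector} does not contain {needle!r}")
    return failures


if __name__ == "__main__":
    state = FinalState("https://erp.example/orders/42", {"#status": "Order submitted"})
    print(run_assertions(state, "/orders/", {"#status": "submitted"}))
```

Checks of this kind give the same verdict on every run, which complements the model-based Video and Trace judgments.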

🛡️ Safety & Telemetry

Safety Modes

Control how the agent handles high-stakes actions:

auto_approve: Automatically allows actions (for CI/CD).
auto_deny: Automatically blocks actions (for safety testing).
interactive (Default): Pauses for human confirmation.
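
The three modes amount to a small decision gate in front of each high-stakes action. The following is a sketch only: `SafetyMode` and `confirm_action` are hypothetical names, and the prompt wording is invented. The mode values match the names listed above.

```python
from enum import Enum
from typing import Callable


class SafetyMode(str, Enum):
    AUTO_APPROVE = "auto_approve"
    AUTO_DENY = "auto_deny"
    INTERACTIVE = "interactive"


def confirm_action(description: str,
                   mode: SafetyMode = SafetyMode.INTERACTIVE,
                   ask: Callable[[str], str] = input) -> bool:
    """Decide whether a high-stakes action may proceed."""
    if mode is SafetyMode.AUTO_APPROVE:
        return True   # e.g. unattended CI/CD runs
    if mode is SafetyMode.AUTO_DENY:
        return False  # e.g. checking how the agent reacts to a refusal
    # Interactive: pause and wait for a human answer.
    answer = ask(f"Allow high-stakes action '{description}'? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}


if __name__ == "__main__":
    print(confirm_action("Submit purchase order", SafetyMode.AUTO_DENY))
```

Passing the `ask` callable in as a parameter keeps the interactive branch testable without a terminal.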

Metrics

We track:

Autonomy Score: Percentage of steps completed without human help.
Cost: Input/output token usage.
Success Rate: Whether the agent completed the task.
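
As a worked example of the first and last metrics, the sketch below computes an autonomy score from step counts and a success rate over several runs. Both function names are hypothetical, and aggregating success across runs is an assumption; the pipeline's own result format may differ.

```python
from typing import Sequence


def autonomy_score(total_steps: int, human_assisted_steps: int) -> float:
    """Percentage of steps completed without human help."""
    if total_steps <= 0:
        return 0.0
    return 100.0 * (total_steps - human_assisted_steps) / total_steps


def success_rate(run_outcomes: Sequence[bool]) -> float:
    """Percentage of runs in which the agent completed the task."""
    if not run_outcomes:
        return 0.0
    return 100.0 * sum(run_outcomes) / len(run_outcomes)


if __name__ == "__main__":
    print(autonomy_score(20, 1))                     # 95.0
    print(success_rate([True, False, True, True]))   # 75.0
```

For instance, a 20-step run that needed one human confirmation scores 95% autonomy.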

📚 Documentation

🚀 Start Here: Hello World & Experiments: Zero-to-hero guide.
Setup Guide: Installation and Authentication.
🐳 Docker Setup & Usage: Running reproducible benchmarks in containers.
Usage Guide: CLI commands and Batch Evaluation.
Architecture: How the system works.
Creating Benchmarks: Defining new tasks.
Configuration: YAML schema reference.
Extending the Pipeline: Adding custom tools and hooks.
Prompting: Best practices for agent instructions.
Performance: Latency and optimization.

🤝 Contributing

See our documentation for guidelines on adding new features or benchmarks.

📄 License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

https://www.apache.org/licenses/LICENSE-2.0
