This guide is your Golden Path. It will take you from "Zero" to "running complex agent experiments" in 5 minutes using the built-in scaffolding tools.
Instead of writing YAML from scratch, use the create command to generate a benchmark template.
```shell
uv run computer-eval create "Hello World" --template basic
```

This generates `config/benchmarks/hello_world.yaml`.
```shell
uv run computer-eval --benchmark config/benchmarks/hello_world.yaml
```

What just happened?
- The agent opened a browser.
- It saw the Google homepage.
- It typed "Gemini API" and pressed Enter.
- The system verified the final URL matched `google.com/search`.
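For reference, a benchmark config of this kind might look roughly like the sketch below. This is an illustration, not the exact generated file; `task.goal` and `agent.context.preset` are the only fields taken from later in this guide, the values are invented.

```yaml
# Illustrative sketch only; open the generated hello_world.yaml
# to see the real contents.
task:
  goal: "Search Google for 'Gemini API'"
agent:
  context:
    preset: "BALANCED"
```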
For real-world tasks, you don't want giant YAML files. You want to keep your prompts in Markdown and your assertions in JavaScript.
```shell
uv run computer-eval create "My Complex Task"
```

This creates a directory structure:
```
config/benchmarks/my_complex_task/
├── benchmark.yaml      # Configuration only
├── prompts/
│   └── system.md       # Your system prompt (Markdown)
└── assertions/
    └── success.js      # Your success logic (JavaScript)
```
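The scaffolded `success.js` holds your pass/fail logic. As a rough sketch only (the function name and how the runner invokes it are assumptions here, not the tool's documented API; see the Assertions Guide), a URL check like the one from the hello-world run could be written as:

```javascript
// Hypothetical sketch of assertions/success.js. How the runner calls
// into this file is an assumption; consult the Assertions Guide.
function checkSuccess(finalUrl) {
  // Pass only if the run ended on a Google search results page.
  return finalUrl.includes("google.com/search");
}

module.exports = { checkSuccess };
```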
Open config/benchmarks/my_complex_task/prompts/system.md. You now have full syntax highlighting for your agent instructions!
```shell
uv run computer-eval --benchmark config/benchmarks/my_complex_task/benchmark.yaml
```

Now that you have a running baseline, let's experiment with the pipeline's capabilities.
How does a "Cautious" agent differ from a "Fast" one?
Modify prompts/system.md:
```markdown
You are a CAUTIOUS tester.
1. Before clicking anything, hover over it first.
2. Double-check your spelling before searching.
```

Run it and watch the video. Does the agent take more steps? Does it hover?
Long tasks can fill the context window. Let's see how the pipeline manages memory.
Default behavior is `preset: "BALANCED"`: the agent sees only the first and last few screenshots, to save tokens.
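The trimming idea behind the BALANCED preset can be sketched as follows. This is an illustration of the concept, not the pipeline's actual code, and the cutoff counts are invented:

```javascript
// Keep only the first and last few screenshots of a long history.
// The keepFirst/keepLast defaults are illustrative assumptions.
function trimHistory(screenshots, keepFirst = 1, keepLast = 3) {
  if (screenshots.length <= keepFirst + keepLast) return screenshots;
  return [
    ...screenshots.slice(0, keepFirst), // earliest context
    ...screenshots.slice(-keepLast),    // most recent frames
  ];
}
```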
Enable "Accurate" mode (See everything):
Modify your benchmark.yaml:
```yaml
agent:
  context:
    preset: "ACCURATE"
```

Run it:

```shell
uv run computer-eval --benchmark config/benchmarks/my_complex_task/benchmark.yaml
```

Result: the "Prediction" time will likely increase as the task goes on, because the model is processing every single frame of history.
Test how the agent handles restricted actions.
Modify the Task in benchmark.yaml:
```yaml
task:
  goal: "Go to a news site and click on a sensitive ad."
```

Run in `auto_deny` mode:

```shell
uv run computer-eval --benchmark config/benchmarks/my_complex_task/benchmark.yaml --safety-mode auto_deny
```

Result: the pipeline will log a Safety Violation and block the click if the model identifies it as high-stakes content.
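Conceptually, `auto_deny` gates actions the model flags as high-stakes. A minimal sketch of that decision, assuming the pipeline tags each proposed action with a high-stakes flag (this is an illustration, not the pipeline's actual implementation):

```javascript
// Illustrative only: deny flagged actions when safety mode is auto_deny.
// The action/result shapes here are invented for the sketch.
function resolveAction(action, safetyMode) {
  if (action.highStakes && safetyMode === "auto_deny") {
    return { allowed: false, reason: "Safety Violation: high-stakes action denied" };
  }
  return { allowed: true };
}
```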
- Assertions Guide: Learn how to write complex JavaScript checks.
- Configuration: See every possible YAML option.