Computer-use agent guided by the quantum potential of LLM reasoning.
Pilot observes the screen through screenshots and accessibility trees, reasons about what to do next, and executes actions at the OS level — clicks, keystrokes, scrolls, drags — like a human sitting at the keyboard. No browser automation APIs, no CDP, no Playwright. Everything happens through the same interfaces a person would use.
Named after the pilot wave in de Broglie-Bohm quantum mechanics: the LLM is the wave function that guides the particle (cursor) through configuration space. See docs/NAME.md.
+------------------+
| Task Interface |
| CLI / API / WS |
+--------+---------+
|
+--------v---------+
| Orchestration |
| Temporal Workflows |
| bBoN / Playbooks |
+--------+---------+
|
+----------------------+----------------------+
| |
+---------v----------+ +----------v---------+
| Manager | | Shell Agent |
| Task Decomposition | | CLI/Script Tasks |
| Web Knowledge | +--------------------+
+--------+-----------+
|
+--------v-----------+
| Worker | Per-step flat loop:
| Observe → Reason | 1. Screenshot + a11y + OCR
| → Act → Verify | 2. LLM generates action
+--------+--+---------+ 3. Ground → Execute → Learn
| |
+--------v--v--------+
| Grounding Cascade | 8 layers: atlas → spatial → structural
| a11y → OCR → MoG | → a11y → anticipation → vision
| → replan → keyboard | → replan → keyboard fallback
+--------+--+---------+
| |
+--------v--v--------+ +-------------------+
| Memory | | Verification |
| Narrative (ChromaDB) | | Programmatic |
| Episodic (SQLite) | | Screenshot Diff |
| Spatial Cache | | Adversarial Agent |
| Element Atlas | +-------------------+
| Replay Cache |
| Working Memory | +-------------------+
+---------------------+ | Bridges |
| LibreOffice UNO |
+---------------------+ | GIMP Python-Fu |
| Humanization | +-------------------+
| Bezier Mouse |
| Natural Typing | +-------------------+
| Scroll Physics | | Dashboard |
| Idle Wander | | React + WebSocket |
| Imperfection Engine | | 7 views |
+---------------------+ +-------------------+
|
+--------v-----------+
| OS Execution |
| pyautogui / pynput |
| macOS: pyobjc a11y |
| Linux: AT-SPI + X11 |
+---------------------+
Each task runs as a flat loop — no hierarchical planning, no subtask decomposition. The agent observes, reasons, acts, and verifies on every step until the task is done or the budget is exhausted.
For each step (0 to max_steps):
1. OBSERVE
├── Take screenshot (mss) + resize for grounding model
├── Build accessibility tree (AT-SPI on Linux, AXUIElement on macOS)
├── Run OCR (Tesseract) and merge with a11y elements
├── Annotate screenshot with numbered element markers (SoM)
├── Extract current URL, focused element, window state
├── Detect and dismiss popup dialogs (3-layer cascade)
└── Track widget state changes (checkboxes, filters, toggles)
2. REASON
├── Build context: system prompt + a11y text + screenshot + history
├── [Optional] Generate N candidate actions in parallel (elevated temp)
├── [Optional] LLM judge picks best candidate if divergent
├── Call reasoning LLM (Claude) with full conversational history
└── Parse structured response: VERIFY → OBSERVATION → THOUGHT → ACTION
3. GROUND
├── Resolve action target to pixel coordinates
├── 8-layer cascade (cheapest first, most expensive last):
│ Layer -1: Element Atlas (cross-task sighting history)
│ Layer 0: Spatial Cache (page-scoped position memory)
│ Layer 1: Structural Map (DOM-like layout relationships)
│ Layer 2.5: A11y Tree (OS accessibility element bounds)
│ Layer 3: Anticipation (crop around expected region)
│ Layer 4: MoG Vision (UI-TARS 72B via OpenRouter)
│ Layer 5: Adaptive Replan (vision + context re-planning)
│ Layer 6: Keyboard Fallback (hardcoded shortcuts)
└── Return: (x, y, confidence, layer, cost, latency)
4. EXECUTE
├── Apply humanization (Bezier mouse curves, typing cadence, hesitation)
├── Execute via pyautogui: click, type, scroll, drag, press, etc.
├── Safety: modifier key release (finally-based), emergency stop (F12)
└── Take post-action screenshot
5. VERIFY + LEARN
├── Compare before/after screenshots (deferred verdict for next step)
├── Update spatial cache (hit/miss feedback → confidence adjustment)
├── Record to episodic memory (action, result, element, position)
├── Check loop detection (repeated failures → recovery prompt)
└── Update working memory (facts, blockers, completed items)
6. DONE GATE (when agent calls done())
├── Stage 1: Visual check — screenshot comparison (initial vs final)
├── Stage 2: Programmatic verification — LLM writes Python script,
│ runs on target system, reads actual file/app state
├── Stage 3: Store episode, update replay cache, record learnings
└── Return: task summary + success/failure
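The control flow above can be sketched as a single loop. This is a minimal illustration, not the code in cli.py: `run_flat_loop`, `LoopResult`, and the five callables are hypothetical names, and the done gate is reduced to one boolean check.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple

History = List[Tuple[str, str]]


@dataclass
class LoopResult:
    success: bool
    steps_used: int
    history: History = field(default_factory=list)


def run_flat_loop(
    observe: Callable[[], Any],
    reason: Callable[[Any, History], str],
    ground: Callable[[str, Any], Any],
    execute: Callable[[str, Any], None],
    done_gate: Callable[[Any], bool],
    max_steps: int = 30,
) -> LoopResult:
    """Observe, reason, ground, execute, verify; repeat until done or out of budget."""
    history: History = []
    for step in range(max_steps):
        before = observe()
        action = reason(before, history)
        if action == "done()":
            # The agent claims completion: let the gate accept or reject it.
            if done_gate(before):
                return LoopResult(True, step + 1, history)
            history.append((action, "rejected"))
            continue
        target = ground(action, before)
        execute(action, target)
        # Stand-in for the screenshot diff: did the observation change?
        after = observe()
        history.append((action, "changed" if after != before else "unchanged"))
    return LoopResult(False, max_steps, history)
```

Passing the stages in as callables keeps the loop itself free of any LLM or OS dependency, so it can be exercised with stubs.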
The agent can emit these actions on each step:
| Action | Description |
|---|---|
| `click(element=N)` | Click numbered element from a11y/OCR list |
| `click(x=, y=)` | Click absolute coordinates |
| `click(description="...")` | Click by visual description (grounded by cascade) |
| `type(text, enter=True)` | Type text, optionally press Enter |
| `type(text, overwrite=True)` | Select-all then type (replace field content) |
| `type(text, paste=True)` | Paste via clipboard (fast for long text) |
| `scroll(direction, amount)` | Scroll up/down/left/right |
| `press(key)` | Press keyboard key or combo (e.g., ctrl+c) |
| `press(key, repeat=N)` | Repeat key N times |
| `hold_key(keys, duration)` | Long-press with safety release |
| `drag_to(to_x, to_y)` | Drag from current position |
| `triple_click(element=N)` | Select entire line/paragraph |
| `middle_click(element=N)` | Open link in new tab |
| `call_code_agent(task)` | Delegate to iterative code agent (file manipulation) |
| `bridge(method, **params)` | Call domain-specific bridge (UNO, GIMP) |
| `shell(command)` | Run shell command on target system |
| `todo_write(items)` | Write/update task plan (mutable checklist) |
| `done()` | Declare task complete (triggers verification) |
| `infeasible(reason)` | Declare task impossible |
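Because every action is written as a Python-style call, one way to turn the LLM's text into a structured action is to parse it with the standard `ast` module and accept literal arguments only. The sketch below is an assumption about how such a parser could look, not the contents of agent/actions.py; `ParsedAction` and `parse_action` are hypothetical names.

```python
import ast
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple


@dataclass
class ParsedAction:
    name: str
    args: Tuple[Any, ...] = ()
    kwargs: Dict[str, Any] = field(default_factory=dict)


def parse_action(text: str) -> ParsedAction:
    """Parse e.g. 'click(element=3)' without evaluating any code."""
    # ast.parse raises SyntaxError on malformed input.
    node = ast.parse(text.strip(), mode="eval").body
    # Accept only a plain call such as name(...); reject attribute calls etc.
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"not a simple action call: {text!r}")
    if any(kw.arg is None for kw in node.keywords):
        raise ValueError("**kwargs expansion is not allowed in an action")
    # literal_eval accepts AST nodes and rejects anything that is not a literal.
    args = tuple(ast.literal_eval(arg) for arg in node.args)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return ParsedAction(node.func.id, args, kwargs)
```

Restricting arguments to literals means text such as `__import__("os").system(...)` is rejected rather than executed.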
pilot/
agent/ # Cognitive core
worker.py # Action generation (LLM call + parse)
manager.py # Task decomposition (optional, not in default loop)
reflection.py # Step-level insight generation
code_agent.py # Iterative code execution agent (15-step budget)
cca_runner.py # CCA orchestration (execute + verify + retry)
shell_agent.py # Shell command execution agent
web_knowledge.py # Web search for how-to instructions
narrator.py # Before/after screenshot narration
observe.py # Shared observation pipeline
context.py # Context/prompt builder
actions.py # Action parser (LLM text → structured action)
action_registry.py # Action type registry (auto-generates prompt docs)
prompt_builder.py # Composable system prompt from sections
trajectory_replay.py # Replay past successful trajectories
widget_state.py # Track UI widget state changes
handlers/
done_gate.py # Task completion verification (3-stage)
cca_handler.py # Code agent + bridge dispatch
gui_executor.py # GUI action execution
routing.py # Action type → handler routing
grounding/ # Description → pixel coordinates
cascade.py # 8-layer grounding cascade
mog.py # Mixture-of-Grounders (vision model router)
som.py # Set-of-Mark annotation (numbered element overlay)
ocr.py # Tesseract OCR with element merging
refinement.py # Click position refinement
scaling.py # Resolution scaling between screen and model
structural.py # Structural map (DOM-like layout)
a11y/
base.py # A11y provider protocol
macos.py # macOS AXUIElement provider
linux.py # Linux AT-SPI provider
atspi.py # AT-SPI XML parsing + active frame scoping
budget.py # A11y tree size budgeting
memory/ # Multi-layer memory system
narrative.py # ChromaDB vector store (task-level summaries)
episodic.py # SQLite episode store (step-level trajectories)
spatial.py # SQLite spatial cache (element position memory)
element_atlas.py # Cross-task element sighting history
replay_cache.py # Task-level replay blueprints
failure_patterns.py # Chronic failure pattern store
working.py # Per-task working memory (facts, blockers)
vector_store.py # ChromaDB wrapper
workflow_store.py # Workflow execution history
episode_builder.py # Episode step construction
execution/ # OS-level action execution
actions.py # Full action executor (click, type, scroll, drag, etc.)
macos.py # macOS executor (pyautogui + pyobjc)
linux.py # Linux executor (pyautogui + X11/wmctrl)
screenshot.py # Screenshot capture (mss)
focus.py # Window focus management
system_state.py # System state reader (running apps, windows, z-order)
base.py # Executor protocol
humanization/ # Make actions look human
mouse.py # Bezier curve mouse movement
typing.py # Natural typing cadence with variance
scroll.py # Scroll physics (momentum, overshoot)
idle.py # Idle wander (micro-movements between actions)
browsing.py # Pre-task browsing simulation
imperfection.py # Persona-based mistake injection
verification/ # Task completion verification
programmatic.py # LLM-generated Python verification scripts
screenshot.py # Before/after screenshot comparison
agent.py # Adversarial verification agent
router.py # Verification channel router
bridges/ # Domain-specific command bridges
registry.py # Bridge registration and dispatch
base.py # BridgeMethod/BridgeParam definitions
libreoffice_uno.py # LibreOffice Impress UNO (22 functions)
libreoffice_uno_writer_calc.py # Writer (12) + Calc (15) + Cross-app (3)
libreoffice.py # File-level LibreOffice operations
gimp.py # GIMP Script-Fu/Python-Fu bridges
safety/ # Safety and reliability
bbon.py # Best-of-N parallel rollouts
emergency.py # Emergency stop (F12) + modifier key release
loop_detect.py # Stuck detection (repeated failures)
popup.py # Popup dialog detection and dismissal
live_judge.py # Per-step live verification judge
verification.py # Safety verification utilities
reasoning/ # Action selection
candidates.py # Step-level N-candidate generation
judge.py # LLM judge for candidate selection
llm/ # LLM integration
engine.py # LiteLLM wrapper with provider routing
providers.py # Provider configs (Anthropic, OpenRouter, etc.)
tokens.py # Token counting and cost estimation
throttle.py # Rate limiting
canary.py # API key validation at startup
anthropic_oauth.py # OAuth token management
platform/remote/ # Remote execution (VMs, benchmarks)
osworld_adapter.py # OSWorld benchmark adapter
macosworld_adapter.py # macOSWorld benchmark adapter
executor.py # VNC-based remote executor
pyautogui_executor.py # Remote pyautogui via HTTP
ssh_executor.py # SSH command executor
a11y_client.py # Remote a11y tree client
a11y_server.py # Remote a11y tree server
workflows/ # Temporal workflow definitions
task_workflow.py # Single-task workflow
bbon_workflow.py # Best-of-N parallel workflow
playbook_workflow.py # Playbook execution workflow
activities.py # Temporal activity implementations
runtime.py # Shared workflow runtime
tracing/ # Observability
spans.py # Span-based tracing (SQLite)
analysis.py # Trace analysis queries
core/ # Shared types
types.py # TaskState, HandlerResult, HandlerContext
pipeline.py # Pipeline utilities
task_state_builder.py # TaskState construction
config/ # Configuration
settings.py # Pydantic Settings (all config classes)
dotenv.py # .env file loading
events/ # Event system
bus.py # Pub/sub event bus
api/ # API server
server.py # FastAPI app factory
routes.py # REST endpoints
ws.py # WebSocket handler
models.py # API request/response models
history.py # Task history store
state.py # Runtime state exposure
cli.py # CLI entry point + main agent loop
runtime.py # OperatorRuntime (component initialization)
pilotpb/ # Playbook models and storage
models.py # Playbook, PlaybookStep, StepAction
store.py # SQLite playbook store
extractor.py # Extract playbooks from episodes
coach.py # Playbook coaching (step adaptation)
planner.py # Playbook planning
prompts.py # Playbook-related prompts
dashboard/ # React dashboard
src/
App.tsx # Main app with sidebar navigation
views/
TraceView.tsx # Live screenshot + SoM + action history
TaskView.tsx # Task status and step counter
CascadeView.tsx # Grounding layer performance charts
MemoryView.tsx # Episodic/spatial memory browser
ChatView.tsx # Operator chat interface
PlaybooksView.tsx # Playbook management
SettingsView.tsx # Runtime config editor
hooks/
useWebSocket.ts # WebSocket connection hook
components/
Sidebar.tsx # Navigation sidebar
tests/
unit/ # 550+ unit tests across 20+ files
integration/ # Integration tests (LLM contracts, flow)
benchmark/ # Benchmark harness
- Python 3.11+
- Tesseract OCR (brew install tesseract / apt install tesseract-ocr)
- Node.js 18+ (for dashboard)
- macOS: Accessibility permissions for the terminal app
- Linux: AT-SPI2 (apt install at-spi2-core)
# Clone
git clone https://github.com/Ruya-AI/Pilot.git
cd Pilot
# Create virtualenv
python3.11 -m venv .venv
source .venv/bin/activate
# Install with all extras
pip install -e ".[dev,macos,temporal]" # macOS
pip install -e ".[dev,temporal]" # Linux
# Dashboard
cd dashboard && npm install && cd ..

Create a .env file in the project root:
# Required: reasoning model
ANTHROPIC_API_KEY=sk-ant-...
# Required: grounding model (UI-TARS via OpenRouter)
OPENROUTER_API_KEY=sk-or-...
# Optional: alternative providers
AZURE_API_KEY=...
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...

Default config lives in config/operator.yaml. Override any setting via environment variables:
# Use a different reasoning model
OPERATOR_LLM__REASONING_MODEL=claude-opus-4-6
# Use a different grounding model
OPERATOR_LLM__GROUNDING_MODEL=bytedance/ui-tars-2-72b
# Disable humanization (faster, for benchmarks)
OPERATOR_EXECUTION__HUMANIZATION_ENABLED=false
# Enable step-level candidate selection
OPERATOR_AGENT__ENABLE_STEP_CANDIDATES=true
OPERATOR_AGENT__STEP_CANDIDATE_N=3
# Increase step budget
OPERATOR_AGENT__MAX_STEPS=30

# Run a task
pilot --task "Open Safari and go to github.com"
# With specific model
pilot --task "Post a tweet saying hello" --provider anthropic --model claude-opus-4-6
# Benchmark mode (no humanization, faster)
pilot --task "Change Chrome homepage" --benchmark_mode
# Verbose logging
pilot --task "Open System Preferences" -v

# Terminal 1: Start Temporal server
temporal server start-dev
# Terminal 2: Start worker
python temporal-worker.py
# Terminal 3: Submit task
pilot --task "Book a flight to Tokyo" --use-temporal
# Best-of-N (3 parallel rollouts, pick best)
pilot --task "Configure dark mode" --use-temporal --bbon 3

# Start server
uvicorn pilot.api.server:app --port 8420
# Submit task via API
curl -X POST http://localhost:8420/api/tasks \
-H "Content-Type: application/json" \
-d '{"instruction": "Open Calculator and compute 42 * 17"}'
# Check status
curl http://localhost:8420/api/tasks/{task_id}

cd dashboard
npm run dev
# Open http://localhost:5173

The dashboard connects via WebSocket to the API server and shows:
- Trace: Live screenshot feed with SoM annotations and action markers
- Task: Current task status, step counter, action history
- Cascade: Grounding layer hit rates, latency, and cost charts
- Memory: Episodic episodes, spatial cache entries, failure patterns
- Chat: Operator-to-agent communication channel
- Playbooks: Recorded playbooks and execution history
- Settings: Runtime configuration editor
Pilot maintains six memory layers. Working memory is scoped to a single task; the other five persist across tasks:
| Layer | Storage | Purpose | Lifetime |
|---|---|---|---|
| Working | In-memory | Current task facts, blockers, completed items | Per-task |
| Episodic | SQLite | Step-level trajectories (action, element, position, result) | Permanent |
| Spatial | SQLite | Element positions per page (confidence-weighted, decays on miss) | Permanent |
| Element Atlas | SQLite | Cross-task element sighting counts (high-sighting = fast bypass) | Permanent |
| Narrative | ChromaDB | Task-level summaries (vector search for similar past tasks) | Permanent |
| Replay Cache | SQLite | Full task blueprints (replay successful trajectories) | Permanent |
Episodic memory feeds blueprints to the agent: "Here's how a similar task was solved before." The spatial cache provides instant grounding: if the agent clicked "Save" at (1450, 32) on this page before and it worked, skip the expensive vision call.
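The confidence-weighted behaviour of the spatial cache can be illustrated with a small SQLite table. This is a sketch under stated assumptions: the schema, the `SpatialCache` class, the +0.1 reward, the halving on a miss, and the 0.5 threshold are all invented for illustration and are not taken from memory/spatial.py.

```python
import sqlite3
from typing import Optional, Tuple


class SpatialCache:
    """Page-scoped element positions whose confidence rises on hits and decays on misses."""

    def __init__(self, path: str = ":memory:", min_confidence: float = 0.5) -> None:
        self.db = sqlite3.connect(path)
        self.min_confidence = min_confidence
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS positions ("
            "page TEXT, label TEXT, x INTEGER, y INTEGER, confidence REAL, "
            "PRIMARY KEY (page, label))"
        )

    def remember(self, page: str, label: str, x: int, y: int, confidence: float = 0.6) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO positions VALUES (?, ?, ?, ?, ?)",
            (page, label, x, y, confidence),
        )

    def lookup(self, page: str, label: str) -> Optional[Tuple[int, int]]:
        row = self.db.execute(
            "SELECT x, y, confidence FROM positions WHERE page = ? AND label = ?",
            (page, label),
        ).fetchone()
        # Low-confidence entries are treated as misses so the cascade escalates.
        if row is None or row[2] < self.min_confidence:
            return None
        return row[0], row[1]

    def feedback(self, page: str, label: str, hit: bool) -> None:
        if hit:
            sql = "UPDATE positions SET confidence = MIN(1.0, confidence + 0.1)"
        else:
            sql = "UPDATE positions SET confidence = confidence * 0.5"
        self.db.execute(sql + " WHERE page = ? AND label = ?", (page, label))
```

With these illustrative numbers a single miss drops a fresh entry below the threshold, so a stale position stops being served until later hits rebuild its confidence.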
The cascade resolves a text description (e.g., "the Save button") to pixel coordinates. It tries cheap methods first and escalates to expensive vision models only when needed.
| Layer | Method | Latency | Cost | How It Works |
|---|---|---|---|---|
| -1 | Element Atlas | <1ms | Free | Cross-task sighting history — elements seen 5+ times bypass everything |
| 0 | Spatial Cache | ~1ms | Free | Page-scoped position memory with confidence decay |
| 1 | Structural Map | ~10ms | Free | DOM-like layout relationships between elements |
| 2.5 | A11y Tree | ~10ms | Free | OS accessibility API element bounds (AT-SPI / AXUIElement) |
| 3 | Anticipation | ~5ms | Free | Crop around expected region based on action context |
| 4 | MoG Vision | 2-3s | ~$0.01 | UI-TARS 72B identifies element in screenshot |
| 5 | Adaptive Replan | ~3s | ~$0.03 | Vision model + context → alternative approach |
| 6 | Keyboard Fallback | <0.5s | Free | Hardcoded shortcuts for common actions (Ctrl+S, etc.) |
On a warm cache (returning user, familiar app), 70-80% of groundings resolve at Layer 0-2.5 for free. Vision calls (Layer 4+) happen only for novel elements.
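The cheapest-first escalation can be sketched as an ordered list of resolvers, where the first layer to return a sufficiently confident hit wins and later (more expensive) layers are never called. The names `Grounding` and `resolve_target` and the 0.7 threshold are assumptions made for this example; the real cascade in grounding/cascade.py also reports cost and latency.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A resolver returns (x, y, confidence), or None when it cannot find the target.
Resolver = Callable[[str], Optional[Tuple[int, int, float]]]


@dataclass
class Grounding:
    x: int
    y: int
    confidence: float
    layer: str


def resolve_target(
    description: str,
    layers: List[Tuple[str, Resolver]],
    min_confidence: float = 0.7,
) -> Optional[Grounding]:
    """Try each layer in order and stop at the first confident hit."""
    for name, resolver in layers:
        hit = resolver(description)
        if hit is None:
            continue
        x, y, confidence = hit
        if confidence >= min_confidence:
            return Grounding(x, y, confidence, name)
    # Every layer missed or was too uncertain.
    return None
```

Ordering the list from free lookups to paid vision calls is what keeps the common case cheap: a vision resolver placed last runs only when everything before it has failed.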
Pilot verifies task completion through three independent channels:
- Screenshot Comparison: Compare the initial screenshot (step 0) against the final screenshot. An LLM judges whether the visual state reflects task completion. Fast, catches obvious successes and failures.
- Programmatic Verification: The LLM writes a Python script that reads actual state from disk (files, app configs, system settings). The script runs on the target system and returns structured pass/fail results. This catches cases where the screen looks right but the underlying data is wrong.
- Adversarial Agent: A separate LLM instance with read-only tools tries to disprove that the task was completed. It has access to screenshots, file system, and shell commands. If it can't find evidence of failure, the task passes.
The done gate orchestrates these channels: visual check first (cheap), then programmatic verification (thorough), with up to 3 rejection cycles before accepting the agent's claim.
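One possible reading of the gate's rejection budget is sketched below: the cheap visual check runs first, the programmatic check runs only if the visual check passes, and once three rejections have been issued the agent's claim is accepted. Both the short-circuit and the accept-after-budget behaviour are assumptions; `DoneGate` and its callables are hypothetical and do not mirror handlers/done_gate.py.

```python
from typing import Callable


class DoneGate:
    """Accept or reject a done() claim, with a bounded number of rejections."""

    def __init__(
        self,
        visual_check: Callable[[], bool],
        programmatic_check: Callable[[], bool],
        max_rejections: int = 3,
    ) -> None:
        self.visual_check = visual_check
        self.programmatic_check = programmatic_check
        self.max_rejections = max_rejections
        self.rejections = 0

    def review(self) -> bool:
        """Return True to accept the claim, False to send the agent back to work."""
        if self.rejections >= self.max_rejections:
            # Rejection budget spent: stop arguing and accept the claim.
            return True
        # `and` short-circuits, so the expensive check is skipped on visual failure.
        if self.visual_check() and self.programmatic_check():
            return True
        self.rejections += 1
        return False
```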
Bridges provide direct API access to applications, bypassing the GUI entirely. They're faster and more reliable than clicking through menus for data manipulation tasks.
Bridges communicate with LibreOffice via the UNO API (socket port 2002), operating on the live document — no close/reopen cycle, no file format conversion issues.
Impress (22 functions): get_slide_info, find_shape_by_text, write_text, set_style, set_background_color, insert_image, duplicate_slide, export_to_image, etc.
Writer (12 functions): write_text, find_and_replace, set_font, set_color, set_line_spacing, set_paragraph_alignment, insert_page_break, set_subscript, etc.
Calc (15 functions): get_workbook_info, get_active_sheet_data, set_column_values, sort_column, merge_cells, format_range, freeze_panes, rename_sheet, insert_chart, etc.
Cross-app (3 functions): get_open_documents, save_all, close_document.
The GIMP bridges (Script-Fu/Python-Fu) cover image manipulation: brightness/contrast, color balance, crop, resize, export, and layer operations.
The agent calls bridges via the action format:
ACTION: agent.bridge("calc_sort_column", column_name="Revenue", ascending=False)
The prompt builder automatically lists available bridges based on the running application.
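Registration, per-application listing, and dispatch could be organised roughly as follows. This is a hypothetical sketch: `BridgeRegistry`, its methods, and the stub body of `calc_sort_column` are illustrative and do not reproduce bridges/registry.py or any UNO call.

```python
from typing import Any, Callable, Dict, List, Tuple


class BridgeRegistry:
    """Maps bridge method names to (application, callable) pairs."""

    def __init__(self) -> None:
        self._methods: Dict[str, Tuple[str, Callable[..., Any]]] = {}

    def register(self, name: str, app: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
        def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
            self._methods[name] = (app, fn)
            return fn
        return decorator

    def available_for(self, running_app: str) -> List[str]:
        # What a prompt builder could list for the application in focus.
        return sorted(n for n, (app, _) in self._methods.items() if app == running_app)

    def call(self, name: str, **params: Any) -> Any:
        if name not in self._methods:
            raise KeyError(f"unknown bridge method: {name}")
        return self._methods[name][1](**params)


registry = BridgeRegistry()


@registry.register("calc_sort_column", app="calc")
def calc_sort_column(column_name: str, ascending: bool = True) -> str:
    # Stand-in body: a real bridge would talk to LibreOffice over UNO here.
    order = "ascending" if ascending else "descending"
    return f"sorted {column_name} {order}"
```

With this shape, `registry.call("calc_sort_column", column_name="Revenue", ascending=False)` routes the agent's keyword arguments straight to the registered function.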
When enabled, the agent samples N candidate actions per step at elevated temperature, deduplicates, and uses an LLM judge to select the best one before grounding. This improves decision quality on ambiguous steps.
Same context → N parallel LLM calls (asyncio.gather)
→ Parse each → Deduplicate → All candidates agree? → Skip judge (free)
→ Candidates diverge? → 1 judge call → Pick best → Ground winner → Execute
Selective mode (default): triggers on ~30% of steps where alternatives matter most:
- First 2 steps (wrong start = wrong trajectory)
- After a RECOVER verdict
- After vision grounding (ambiguous element)
- After loop detection (stuck = need different approach)
Cost: +13% per task in selective mode. Unanimous rate ~60-70% (judge skipped when all candidates agree).
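The sample, deduplicate, judge flow can be sketched with `asyncio.gather`. The functions `sample` and `judge` stand in for LLM calls and are hypothetical, as is `select_action`; deduplication here is plain string equality after stripping whitespace, which may be simpler than what reasoning/candidates.py does.

```python
import asyncio
from typing import Awaitable, Callable, List


async def select_action(
    sample: Callable[[], Awaitable[str]],
    judge: Callable[[List[str]], Awaitable[str]],
    n: int = 3,
) -> str:
    """Sample n candidate actions in parallel; call the judge only if they differ."""
    candidates = await asyncio.gather(*(sample() for _ in range(n)))
    # dict.fromkeys removes duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(c.strip() for c in candidates))
    if len(unique) == 1:
        # Unanimous: no judge call, so no extra cost.
        return unique[0]
    return await judge(unique)
```

Skipping the judge on unanimous samples is what keeps the overhead bounded: the extra judge call is paid only on steps where the model actually disagrees with itself.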
OPERATOR_AGENT__ENABLE_STEP_CANDIDATES=true
OPERATOR_AGENT__STEP_CANDIDATE_N=3
OPERATOR_AGENT__STEP_CANDIDATE_TEMP=0.6

Task-level parallel rollouts via Temporal workflows. Run N independent attempts at the same task, verify each, pick the best result.
pilot --task "Configure proxy settings" --use-temporal --bbon 3

Each rollout gets its own worker, its own memory context, and runs the full agent loop independently. The orchestrator compares verification scores and returns the highest-quality completion.
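The selection step can be illustrated without Temporal by running the rollouts as plain asyncio tasks. This is a simplified stand-in, not the workflow in workflows/bbon_workflow.py: `Rollout`, `best_of_n`, and the rule "prefer verified successes, then higher score" are assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class Rollout:
    rollout_id: int
    success: bool
    score: float  # verification score; higher is better


async def best_of_n(
    run_rollout: Callable[[int], Awaitable[Rollout]],
    n: int = 3,
) -> Rollout:
    """Run n independent attempts concurrently and return the best one."""
    results = await asyncio.gather(*(run_rollout(i) for i in range(n)))
    # Tuples compare element-wise: success outranks score, score breaks ties.
    return max(results, key=lambda r: (r.success, r.score))
```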
# All tests
pytest tests/ -v
# Unit tests only
pytest tests/unit/ -v
# Specific module
pytest tests/unit/test_cascade.py -v
# With coverage
pytest tests/ --cov=pilot --cov-report=html

Tests are organized by module: test_a11y.py, test_actions.py, test_cascade.py, test_done_gate.py, test_humanization.py, test_memory.py, test_verification.py, etc.
| Role | Default Model | Provider | Purpose |
|---|---|---|---|
| Reasoning | Claude Sonnet 4.5 | Anthropic | Action generation, verification, planning |
| Grounding | UI-TARS-2 72B | OpenRouter | Element identification in screenshots |
| Embeddings | sentence-transformers | Local | Memory retrieval (narrative, episodic) |
| OCR | Tesseract | Local | Text detection in screenshots |
Override models via environment:
OPERATOR_LLM__REASONING_MODEL=claude-opus-4-6
OPERATOR_LLM__GROUNDING_MODEL=bytedance/ui-tars-2-72b

See docs/IDEOLOGY.md for the full philosophy. Key points:
- Human-like interaction — Click first, hotkeys second. No hardcoded sequences. Visual-first reasoning.
- Intelligence over rigidity — Teach principles, not procedures. Trust the model. Fewer instructions > more instructions.
- Fail fast, recover smart — 2 failures = try something different. Reflection kills dead strategies.
- Generalizable fixes only — A fix for one app that breaks another is not a fix. Universal over specific.
- OS-level everything — No CDP, no Playwright, no browser APIs. Same interfaces a human uses.