Ruya-AI/Pilot

Pilot

Computer-use agent guided by the quantum potential of LLM reasoning.

Pilot observes the screen through screenshots and accessibility trees, reasons about what to do next, and executes actions at the OS level — clicks, keystrokes, scrolls, drags — like a human sitting at the keyboard. No browser automation APIs, no CDP, no Playwright. Everything happens through the same interfaces a person would use.

Named after the pilot wave in de Broglie-Bohm quantum mechanics: the LLM is the wave function that guides the particle (cursor) through configuration space. See docs/NAME.md.


Architecture

                            +------------------+
                            |  Task Interface  |
                            |  CLI / API / WS  |
                            +--------+---------+
                                     |
                            +--------v---------+
                            |   Orchestration   |
                            | Temporal Workflows |
                            | bBoN / Playbooks  |
                            +--------+---------+
                                     |
              +----------------------+----------------------+
              |                                             |
    +---------v----------+                       +----------v---------+
    |      Manager       |                       |    Shell Agent     |
    | Task Decomposition |                       | CLI/Script Tasks   |
    | Web Knowledge      |                       +--------------------+
    +--------+-----------+
             |
    +--------v-----------+
    |       Worker        |     Per-step flat loop:
    |  Observe → Reason   |     1. Screenshot + a11y + OCR
    |  → Act → Verify     |     2. LLM generates action
    +--------+--+---------+     3. Ground → Execute → Learn
             |  |
    +--------v--v--------+
    |  Grounding Cascade  |     8 layers: atlas → spatial → structural
    |  a11y → OCR → MoG   |     → a11y → anticipation → vision
    |  → replan → keyboard |     → replan → keyboard fallback
    +--------+--+---------+
             |  |
    +--------v--v--------+      +-------------------+
    |      Memory         |      |   Verification    |
    | Narrative (ChromaDB) |      | Programmatic      |
    | Episodic (SQLite)   |      | Screenshot Diff   |
    | Spatial Cache       |      | Adversarial Agent |
    | Element Atlas       |      +-------------------+
    | Replay Cache        |
    | Working Memory      |      +-------------------+
    +---------------------+      |    Bridges        |
                                 | LibreOffice UNO   |
    +---------------------+      | GIMP Python-Fu    |
    |   Humanization      |      +-------------------+
    | Bezier Mouse        |
    | Natural Typing      |      +-------------------+
    | Scroll Physics      |      |    Dashboard      |
    | Idle Wander         |      | React + WebSocket |
    | Imperfection Engine |      | 7 views           |
    +---------------------+      +-------------------+
             |
    +--------v-----------+
    |   OS Execution      |
    | pyautogui / pynput  |
    | macOS: pyobjc a11y  |
    | Linux: AT-SPI + X11 |
    +---------------------+

How the Agent Loop Works

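By default, each task runs as a flat loop with no hierarchical planning and no subtask decomposition (the Manager shown in the diagram is optional and not part of the default loop). The agent observes, reasons, acts, and verifies on every step until the task is done or the step budget is exhausted.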

Step-by-Step Cycle

For each step (0 to max_steps):

  1. OBSERVE
     ├── Take screenshot (mss) + resize for grounding model
     ├── Build accessibility tree (AT-SPI on Linux, AXUIElement on macOS)
     ├── Run OCR (Tesseract) and merge with a11y elements
     ├── Annotate screenshot with numbered element markers (SoM)
     ├── Extract current URL, focused element, window state
     ├── Detect and dismiss popup dialogs (3-layer cascade)
     └── Track widget state changes (checkboxes, filters, toggles)

  2. REASON
     ├── Build context: system prompt + a11y text + screenshot + history
     ├── [Optional] Generate N candidate actions in parallel (elevated temp)
     ├── [Optional] LLM judge picks best candidate if divergent
     ├── Call reasoning LLM (Claude) with full conversational history
     └── Parse structured response: VERIFY → OBSERVATION → THOUGHT → ACTION

  3. GROUND
     ├── Resolve action target to pixel coordinates
     ├── 8-layer cascade (cheapest first, most expensive last):
     │   Layer -1: Element Atlas (cross-task sighting history)
     │   Layer  0: Spatial Cache (page-scoped position memory)
     │   Layer  1: Structural Map (DOM-like layout relationships)
     │   Layer 2.5: A11y Tree (OS accessibility element bounds)
     │   Layer  3: Anticipation (crop around expected region)
     │   Layer  4: MoG Vision (UI-TARS 72B via OpenRouter)
     │   Layer  5: Adaptive Replan (vision + context re-planning)
     │   Layer  6: Keyboard Fallback (hardcoded shortcuts)
     └── Return: (x, y, confidence, layer, cost, latency)

  4. EXECUTE
     ├── Apply humanization (Bezier mouse curves, typing cadence, hesitation)
     ├── Execute via pyautogui: click, type, scroll, drag, press, etc.
     ├── Safety: modifier key release (finally-based), emergency stop (F12)
     └── Take post-action screenshot

  5. VERIFY + LEARN
     ├── Compare before/after screenshots (deferred verdict for next step)
     ├── Update spatial cache (hit/miss feedback → confidence adjustment)
     ├── Record to episodic memory (action, result, element, position)
     ├── Check loop detection (repeated failures → recovery prompt)
     └── Update working memory (facts, blockers, completed items)

  6. DONE GATE (when agent calls done())
     ├── Stage 1: Visual check — screenshot comparison (initial vs final)
     ├── Stage 2: Programmatic verification — LLM writes Python script,
     │            runs on target system, reads actual file/app state
     ├── Stage 3: Store episode, update replay cache, record learnings
     └── Return: task summary + success/failure
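
The cycle above can be sketched as a single loop. This is a minimal illustration, not Pilot's actual code: `run_task`, `LoopResult`, and the five callables (`observe`, `reason`, `ground`, `execute`, `verify_done`) are hypothetical names standing in for the stages, and the action dictionaries are an assumed shape.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopResult:
    success: bool
    steps_used: int
    history: list = field(default_factory=list)


def run_task(
    observe: Callable[[], dict],
    reason: Callable[[dict, list], dict],
    ground: Callable[[dict, dict], Optional[tuple]],
    execute: Callable[[dict, Optional[tuple]], None],
    verify_done: Callable[[], bool],
    max_steps: int,
) -> LoopResult:
    """Flat observe -> reason -> ground -> execute loop with a done gate."""
    history: list = []
    for step in range(max_steps):
        observation = observe()
        action = reason(observation, history)
        if action["type"] == "done":
            # The agent's claim is only accepted if verification agrees.
            if verify_done():
                return LoopResult(True, step + 1, history)
            history.append({"action": action, "result": "done rejected"})
            continue
        if action["type"] == "infeasible":
            return LoopResult(False, step + 1, history)
        target = ground(action, observation)  # description -> coordinates
        execute(action, target)
        history.append({"action": action, "target": target})
    # Budget exhausted without a verified done().
    return LoopResult(False, max_steps, history)
```

Passing the stages in as callables keeps the sketch testable with stubs; a rejected `done()` still consumes a step, which is one plausible way to keep the budget bounded.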

Action Types

The agent can emit these actions on each step:

| Action | Description |
| --- | --- |
| click(element=N) | Click numbered element from a11y/OCR list |
| click(x=, y=) | Click absolute coordinates |
| click(description="...") | Click by visual description (grounded by cascade) |
| type(text, enter=True) | Type text, optionally press Enter |
| type(text, overwrite=True) | Select-all then type (replace field content) |
| type(text, paste=True) | Paste via clipboard (fast for long text) |
| scroll(direction, amount) | Scroll up/down/left/right |
| press(key) | Press keyboard key or combo (e.g., ctrl+c) |
| press(key, repeat=N) | Repeat key N times |
| hold_key(keys, duration) | Long-press with safety release |
| drag_to(to_x, to_y) | Drag from current position |
| triple_click(element=N) | Select entire line/paragraph |
| middle_click(element=N) | Open link in new tab |
| call_code_agent(task) | Delegate to iterative code agent (file manipulation) |
| bridge(method, **params) | Call domain-specific bridge (UNO, GIMP) |
| shell(command) | Run shell command on target system |
| todo_write(items) | Write/update task plan (mutable checklist) |
| done() | Declare task complete (triggers verification) |
| infeasible(reason) | Declare task impossible |
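
Because the actions are written as call expressions, one way to turn the LLM's text into a structured action is to parse it with Python's `ast` module rather than evaluate it. The sketch below is an assumption about how such a parser could look, not the contents of `actions.py`; `ParsedAction` and `parse_action` are hypothetical names, and it assumes string arguments (including key names) arrive quoted.

```python
import ast
from dataclasses import dataclass, field


@dataclass
class ParsedAction:
    name: str
    args: list = field(default_factory=list)
    kwargs: dict = field(default_factory=dict)


def parse_action(text: str) -> ParsedAction:
    """Parse one call such as 'click(element=3)' without executing anything."""
    call = ast.parse(text.strip(), mode="eval").body
    if not isinstance(call, ast.Call):
        raise ValueError(f"not an action call: {text!r}")
    if isinstance(call.func, ast.Name):
        name = call.func.id            # click(...)
    elif isinstance(call.func, ast.Attribute):
        name = call.func.attr          # agent.bridge(...) -> "bridge"
    else:
        raise ValueError(f"unsupported call target: {text!r}")
    # literal_eval only accepts literals, so arbitrary expressions are rejected.
    args = [ast.literal_eval(arg) for arg in call.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
    return ParsedAction(name, args, kwargs)
```

For example, `parse_action("click(element=3)")` yields the name `click` with `kwargs == {"element": 3}`, while an unquoted expression such as `press(ctrl+c)` raises `ValueError` instead of being evaluated.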

Module Map

pilot/
  agent/                    # Cognitive core
    worker.py               # Action generation (LLM call + parse)
    manager.py              # Task decomposition (optional, not in default loop)
    reflection.py           # Step-level insight generation
    code_agent.py           # Iterative code execution agent (15-step budget)
    cca_runner.py           # CCA orchestration (execute + verify + retry)
    shell_agent.py          # Shell command execution agent
    web_knowledge.py        # Web search for how-to instructions
    narrator.py             # Before/after screenshot narration
    observe.py              # Shared observation pipeline
    context.py              # Context/prompt builder
    actions.py              # Action parser (LLM text → structured action)
    action_registry.py      # Action type registry (auto-generates prompt docs)
    prompt_builder.py       # Composable system prompt from sections
    trajectory_replay.py    # Replay past successful trajectories
    widget_state.py         # Track UI widget state changes
    handlers/
      done_gate.py          # Task completion verification (3-stage)
      cca_handler.py        # Code agent + bridge dispatch
      gui_executor.py       # GUI action execution
      routing.py            # Action type → handler routing

  grounding/                # Description → pixel coordinates
    cascade.py              # 8-layer grounding cascade
    mog.py                  # Mixture-of-Grounders (vision model router)
    som.py                  # Set-of-Mark annotation (numbered element overlay)
    ocr.py                  # Tesseract OCR with element merging
    refinement.py           # Click position refinement
    scaling.py              # Resolution scaling between screen and model
    structural.py           # Structural map (DOM-like layout)
    a11y/
      base.py               # A11y provider protocol
      macos.py              # macOS AXUIElement provider
      linux.py              # Linux AT-SPI provider
      atspi.py              # AT-SPI XML parsing + active frame scoping
      budget.py             # A11y tree size budgeting

  memory/                   # Multi-layer memory system
    narrative.py            # ChromaDB vector store (task-level summaries)
    episodic.py             # SQLite episode store (step-level trajectories)
    spatial.py              # SQLite spatial cache (element position memory)
    element_atlas.py        # Cross-task element sighting history
    replay_cache.py         # Task-level replay blueprints
    failure_patterns.py     # Chronic failure pattern store
    working.py              # Per-task working memory (facts, blockers)
    vector_store.py         # ChromaDB wrapper
    workflow_store.py       # Workflow execution history
    episode_builder.py      # Episode step construction

  execution/                # OS-level action execution
    actions.py              # Full action executor (click, type, scroll, drag, etc.)
    macos.py                # macOS executor (pyautogui + pyobjc)
    linux.py                # Linux executor (pyautogui + X11/wmctrl)
    screenshot.py           # Screenshot capture (mss)
    focus.py                # Window focus management
    system_state.py         # System state reader (running apps, windows, z-order)
    base.py                 # Executor protocol

  humanization/             # Make actions look human
    mouse.py                # Bezier curve mouse movement
    typing.py               # Natural typing cadence with variance
    scroll.py               # Scroll physics (momentum, overshoot)
    idle.py                 # Idle wander (micro-movements between actions)
    browsing.py             # Pre-task browsing simulation
    imperfection.py         # Persona-based mistake injection

  verification/             # Task completion verification
    programmatic.py         # LLM-generated Python verification scripts
    screenshot.py           # Before/after screenshot comparison
    agent.py                # Adversarial verification agent
    router.py               # Verification channel router

  bridges/                  # Domain-specific command bridges
    registry.py             # Bridge registration and dispatch
    base.py                 # BridgeMethod/BridgeParam definitions
    libreoffice_uno.py      # LibreOffice Impress UNO (22 functions)
    libreoffice_uno_writer_calc.py  # Writer (12) + Calc (15) + Cross-app (3)
    libreoffice.py          # File-level LibreOffice operations
    gimp.py                 # GIMP Script-Fu/Python-Fu bridges

  safety/                   # Safety and reliability
    bbon.py                 # Best-of-N parallel rollouts
    emergency.py            # Emergency stop (F12) + modifier key release
    loop_detect.py          # Stuck detection (repeated failures)
    popup.py                # Popup dialog detection and dismissal
    live_judge.py           # Per-step live verification judge
    verification.py         # Safety verification utilities

  reasoning/                # Action selection
    candidates.py           # Step-level N-candidate generation
    judge.py                # LLM judge for candidate selection

  llm/                      # LLM integration
    engine.py               # LiteLLM wrapper with provider routing
    providers.py            # Provider configs (Anthropic, OpenRouter, etc.)
    tokens.py               # Token counting and cost estimation
    throttle.py             # Rate limiting
    canary.py               # API key validation at startup
    anthropic_oauth.py      # OAuth token management

  platform/remote/          # Remote execution (VMs, benchmarks)
    osworld_adapter.py      # OSWorld benchmark adapter
    macosworld_adapter.py   # macOSWorld benchmark adapter
    executor.py             # VNC-based remote executor
    pyautogui_executor.py   # Remote pyautogui via HTTP
    ssh_executor.py         # SSH command executor
    a11y_client.py          # Remote a11y tree client
    a11y_server.py          # Remote a11y tree server

  workflows/                # Temporal workflow definitions
    task_workflow.py         # Single-task workflow
    bbon_workflow.py         # Best-of-N parallel workflow
    playbook_workflow.py     # Playbook execution workflow
    activities.py           # Temporal activity implementations
    runtime.py              # Shared workflow runtime

  tracing/                  # Observability
    spans.py                # Span-based tracing (SQLite)
    analysis.py             # Trace analysis queries

  core/                     # Shared types
    types.py                # TaskState, HandlerResult, HandlerContext
    pipeline.py             # Pipeline utilities
    task_state_builder.py   # TaskState construction

  config/                   # Configuration
    settings.py             # Pydantic Settings (all config classes)
    dotenv.py               # .env file loading

  events/                   # Event system
    bus.py                  # Pub/sub event bus

  api/                      # API server
    server.py               # FastAPI app factory
    routes.py               # REST endpoints
    ws.py                   # WebSocket handler
    models.py               # API request/response models
    history.py              # Task history store
    state.py                # Runtime state exposure

  cli.py                    # CLI entry point + main agent loop
  runtime.py                # OperatorRuntime (component initialization)

pilotpb/                    # Playbook models and storage
  models.py                 # Playbook, PlaybookStep, StepAction
  store.py                  # SQLite playbook store
  extractor.py              # Extract playbooks from episodes
  coach.py                  # Playbook coaching (step adaptation)
  planner.py                # Playbook planning
  prompts.py                # Playbook-related prompts

dashboard/                  # React dashboard
  src/
    App.tsx                 # Main app with sidebar navigation
    views/
      TraceView.tsx         # Live screenshot + SoM + action history
      TaskView.tsx          # Task status and step counter
      CascadeView.tsx       # Grounding layer performance charts
      MemoryView.tsx        # Episodic/spatial memory browser
      ChatView.tsx          # Operator chat interface
      PlaybooksView.tsx     # Playbook management
      SettingsView.tsx      # Runtime config editor
    hooks/
      useWebSocket.ts       # WebSocket connection hook
    components/
      Sidebar.tsx           # Navigation sidebar

tests/
  unit/                     # 550+ unit tests across 20+ files
  integration/              # Integration tests (LLM contracts, flow)
  benchmark/                # Benchmark harness

Setup

Prerequisites

  • Python 3.11+
  • Tesseract OCR (brew install tesseract / apt install tesseract-ocr)
  • Node.js 18+ (for dashboard)
  • macOS: Accessibility permissions for the terminal app
  • Linux: AT-SPI2 (apt install at-spi2-core)

Install

# Clone
git clone https://github.com/Ruya-AI/Pilot.git
cd Pilot

# Create virtualenv
python3.11 -m venv .venv
source .venv/bin/activate

# Install with all extras (run only the line for your platform)
pip install -e ".[dev,macos,temporal]"  # macOS
pip install -e ".[dev,temporal]"         # Linux

# Dashboard
cd dashboard && npm install && cd ..

API Keys

Create a .env file in the project root:

# Required: reasoning model
ANTHROPIC_API_KEY=sk-ant-...

# Required: grounding model (UI-TARS via OpenRouter)
OPENROUTER_API_KEY=sk-or-...

# Optional: alternative providers
AZURE_API_KEY=...
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...

Configuration

Default config lives in config/operator.yaml. Override any setting via environment variables:

# Use a different reasoning model
OPERATOR_LLM__REASONING_MODEL=claude-opus-4-6

# Use a different grounding model
OPERATOR_LLM__GROUNDING_MODEL=bytedance/ui-tars-2-72b

# Disable humanization (faster, for benchmarks)
OPERATOR_EXECUTION__HUMANIZATION_ENABLED=false

# Enable step-level candidate selection
OPERATOR_AGENT__ENABLE_STEP_CANDIDATES=true
OPERATOR_AGENT__STEP_CANDIDATE_N=3

# Increase step budget
OPERATOR_AGENT__MAX_STEPS=30
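
The variable names above suggest an `OPERATOR_` prefix followed by a settings group and a field separated by a double underscore. Pydantic Settings supports this layout through its `env_prefix` and `env_nested_delimiter` options, but whether Pilot configures it exactly that way is an assumption. The stdlib-only sketch below just illustrates the mapping; `nested_overrides` is a hypothetical helper and it leaves values as strings.

```python
def nested_overrides(environ: dict, prefix: str = "OPERATOR_", delimiter: str = "__") -> dict:
    """Map OPERATOR_AGENT__MAX_STEPS=30 to {'agent': {'max_steps': '30'}}."""
    result: dict = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue  # unrelated environment variable
        path = key[len(prefix):].lower().split(delimiter)
        node = result
        for part in path[:-1]:
            node = node.setdefault(part, {})  # descend, creating groups as needed
        node[path[-1]] = value
    return result
```

Calling `nested_overrides(dict(os.environ))` (after `import os`) would collect every `OPERATOR_*` override in the current environment; type coercion, such as turning `"30"` into an integer, is left to the settings layer.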

Running

CLI (Direct Mode)

# Run a task
pilot --task "Open Safari and go to github.com"

# With specific model
pilot --task "Post a tweet saying hello" --provider anthropic --model claude-opus-4-6

# Benchmark mode (no humanization, faster)
pilot --task "Change Chrome homepage" --benchmark_mode

# Verbose logging
pilot --task "Open System Preferences" -v

Temporal (Workflow Mode)

# Terminal 1: Start Temporal server
temporal server start-dev

# Terminal 2: Start worker
python temporal-worker.py

# Terminal 3: Submit task
pilot --task "Book a flight to Tokyo" --use-temporal

# Best-of-N (3 parallel rollouts, pick best)
pilot --task "Configure dark mode" --use-temporal --bbon 3

API Server

# Start server
uvicorn pilot.api.server:app --port 8420

# Submit task via API
curl -X POST http://localhost:8420/api/tasks \
  -H "Content-Type: application/json" \
  -d '{"instruction": "Open Calculator and compute 42 * 17"}'

# Check status
curl http://localhost:8420/api/tasks/{task_id}
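
The same two calls can be made from Python with the standard library. This sketch mirrors the curl commands above; the helper names are hypothetical, and it assumes the server replies with a JSON body, since the response schema is not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8420"


def build_submit_request(instruction: str) -> urllib.request.Request:
    """POST /api/tasks with a JSON body, as in the curl example."""
    body = json.dumps({"instruction": instruction}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/tasks",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_status_request(task_id: str) -> urllib.request.Request:
    """GET /api/tasks/{task_id}."""
    return urllib.request.Request(f"{BASE_URL}/api/tasks/{task_id}", method="GET")


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode the body (assumed to be JSON)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API server running, `send(build_submit_request("Open Calculator and compute 42 * 17"))` submits the task; building and sending are kept separate so the request can be inspected without a live server.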

Dashboard

cd dashboard
npm run dev
# Open http://localhost:5173

The dashboard connects via WebSocket to the API server and shows:

  • Trace: Live screenshot feed with SoM annotations and action markers
  • Task: Current task status, step counter, action history
  • Cascade: Grounding layer hit rates, latency, and cost charts
  • Memory: Episodic episodes, spatial cache entries, failure patterns
  • Chat: Operator-to-agent communication channel
  • Playbooks: Recorded playbooks and execution history
  • Settings: Runtime configuration editor

Memory Systems

Pilot maintains six memory layers. Five persist across tasks; working memory is held in memory and lasts only for the current task:

| Layer | Storage | Purpose | Lifetime |
| --- | --- | --- | --- |
| Working | In-memory | Current task facts, blockers, completed items | Per-task |
| Episodic | SQLite | Step-level trajectories (action, element, position, result) | Permanent |
| Spatial | SQLite | Element positions per page (confidence-weighted, decays on miss) | Permanent |
| Element Atlas | SQLite | Cross-task element sighting counts (high-sighting = fast bypass) | Permanent |
| Narrative | ChromaDB | Task-level summaries (vector search for similar past tasks) | Permanent |
| Replay Cache | SQLite | Full task blueprints (replay successful trajectories) | Permanent |

Episodic memory feeds blueprints to the agent: "Here's how a similar task was solved before." The spatial cache provides instant grounding: if the agent clicked "Save" at (1450, 32) on this page before and it worked, skip the expensive vision call.
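
The spatial cache behaviour described above (store a position, trust it while confidence is high, decay it on a miss) can be sketched with SQLite. The table schema, the `SpatialCache` class, and the boost, decay, and threshold constants are all illustrative assumptions, not Pilot's actual values.

```python
import sqlite3


class SpatialCache:
    """Toy page-scoped position cache with hit/miss confidence feedback."""

    def __init__(self, path: str = ":memory:", boost: float = 0.1, decay: float = 0.5):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS positions ("
            "page TEXT, label TEXT, x INTEGER, y INTEGER, confidence REAL, "
            "PRIMARY KEY (page, label))"
        )
        self.boost, self.decay = boost, decay

    def record(self, page: str, label: str, x: int, y: int, confidence: float = 0.7):
        self.db.execute(
            "INSERT OR REPLACE INTO positions VALUES (?, ?, ?, ?, ?)",
            (page, label, x, y, confidence),
        )

    def lookup(self, page: str, label: str, min_confidence: float = 0.6):
        """Return (x, y, confidence), or None if unknown or not trusted enough."""
        row = self.db.execute(
            "SELECT x, y, confidence FROM positions WHERE page = ? AND label = ?",
            (page, label),
        ).fetchone()
        if row is None or row[2] < min_confidence:
            return None
        return row

    def feedback(self, page: str, label: str, hit: bool):
        """Add to confidence on a hit (capped at 1.0); multiply it down on a miss."""
        if hit:
            sql = ("UPDATE positions SET confidence = MIN(1.0, confidence + ?) "
                   "WHERE page = ? AND label = ?")
            param = self.boost
        else:
            sql = "UPDATE positions SET confidence = confidence * ? WHERE page = ? AND label = ?"
            param = self.decay
        self.db.execute(sql, (param, page, label))
```

A multiplicative penalty on a miss with a small additive reward on a hit means a single stale position quickly drops below the lookup threshold, while regaining trust takes several confirmed hits.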


Grounding Cascade

The cascade resolves a text description (e.g., "the Save button") to pixel coordinates. It tries cheap methods first and escalates to expensive vision models only when needed.

| Layer | Method | Latency | Cost | How It Works |
| --- | --- | --- | --- | --- |
| -1 | Element Atlas | <1ms | Free | Cross-task sighting history; elements seen 5+ times bypass everything |
| 0 | Spatial Cache | ~1ms | Free | Page-scoped position memory with confidence decay |
| 1 | Structural Map | ~10ms | Free | DOM-like layout relationships between elements |
| 2.5 | A11y Tree | ~10ms | Free | OS accessibility API element bounds (AT-SPI / AXUIElement) |
| 3 | Anticipation | ~5ms | Free | Crop around expected region based on action context |
| 4 | MoG Vision | 2-3s | ~$0.01 | UI-TARS 72B identifies element in screenshot |
| 5 | Adaptive Replan | ~3s | ~$0.03 | Vision model + context → alternative approach |
| 6 | Keyboard Fallback | <0.5s | Free | Hardcoded shortcuts for common actions (Ctrl+S, etc.) |

With a warm cache (returning user, familiar app), 70-80% of groundings resolve at Layers 0 through 2.5 at no cost. Vision calls (Layer 4 and above) happen only for novel elements.
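
The cheapest-first escalation can be sketched as an ordered list of layers, each of which either returns a position with a confidence or declines. This is a simplified illustration under assumed names (`GroundingResult`, `ground`, the layer labels) and an assumed confidence threshold; the real cascade also reports cost and latency.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GroundingResult:
    x: int
    y: int
    confidence: float
    layer: str


def ground(description: str, layers: list, min_confidence: float = 0.5) -> Optional[GroundingResult]:
    """Try (name, layer) pairs in order and stop at the first confident hit.

    Each layer is a callable taking the description and returning
    (x, y, confidence) or None.
    """
    for name, layer in layers:
        hit = layer(description)
        if hit is None:
            continue  # this layer cannot resolve the target; escalate
        x, y, confidence = hit
        if confidence >= min_confidence:
            return GroundingResult(x, y, confidence, name)
    return None  # every layer declined or was not confident enough
```

Because the loop returns on the first confident hit, a later and more expensive layer such as a vision model is never called when an earlier cache layer already knows the element.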


Verification

Pilot verifies task completion through three independent channels:

  1. Screenshot Comparison — Compare the initial screenshot (step 0) against the final screenshot. An LLM judges whether the visual state reflects task completion. Fast, catches obvious successes and failures.

  2. Programmatic Verification — The LLM writes a Python script that reads actual state from disk (files, app configs, system settings). The script runs on the target system and returns structured pass/fail results. This catches cases where the screen looks right but the underlying data is wrong.

  3. Adversarial Agent — A separate LLM instance with read-only tools tries to disprove that the task was completed. It has access to screenshots, file system, and shell commands. If it can't find evidence of failure, the task passes.

The done gate runs the visual check first (cheap), then programmatic verification (thorough), and allows up to 3 rejection cycles before accepting the agent's claim.
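
One reading of the rejection-cycle rule is sketched below: checks run cheapest first, a failed check rejects the `done()` call, and after the allowed number of rejections the claim is accepted anyway. The `DoneGate` class and this interpretation of "up to 3 rejection cycles" are assumptions, not a description of `done_gate.py`.

```python
class DoneGate:
    """Reject unverified done() calls, but at most max_rejections times."""

    def __init__(self, checks: list, max_rejections: int = 3):
        self.checks = checks  # (name, zero-argument callable) pairs, cheapest first
        self.max_rejections = max_rejections
        self.rejections = 0

    def evaluate(self) -> tuple:
        """Return (accepted, reason) for the current done() call."""
        for name, check in self.checks:
            if check():
                continue  # this channel is satisfied; try the next one
            if self.rejections < self.max_rejections:
                self.rejections += 1
                return False, f"rejected by {name}"
            # Rejection budget spent: accept the claim without verification.
            return True, f"accepted unverified after {self.max_rejections} rejections"
        return True, "verified"
```

Capping the number of rejections keeps a mistaken or overly strict verifier from trapping the agent in an endless done/reject cycle.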


Bridges

Bridges provide direct API access to applications, bypassing the GUI entirely. They're faster and more reliable than clicking through menus for data manipulation tasks.

LibreOffice UNO (52 functions)

Bridges communicate with LibreOffice via the UNO API (socket port 2002), operating on the live document — no close/reopen cycle, no file format conversion issues.

Impress (22 functions): get_slide_info, find_shape_by_text, write_text, set_style, set_background_color, insert_image, duplicate_slide, export_to_image, etc.

Writer (12 functions): write_text, find_and_replace, set_font, set_color, set_line_spacing, set_paragraph_alignment, insert_page_break, set_subscript, etc.

Calc (15 functions): get_workbook_info, get_active_sheet_data, set_column_values, sort_column, merge_cells, format_range, freeze_panes, rename_sheet, insert_chart, etc.

Cross-app (3 functions): get_open_documents, save_all, close_document.

GIMP Python-Fu

Bridges for image manipulation: brightness/contrast, color balance, crop, resize, export, layer operations.

Usage

The agent calls bridges via the action format:

ACTION: agent.bridge("calc_sort_column", column_name="Revenue", ascending=False)

The prompt builder automatically lists available bridges based on the running application.
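
Registration, per-application listing, and dispatch could be sketched with a decorator-based registry. Everything here is hypothetical (the `bridge_method` decorator, the `_BRIDGES` dictionary, the `app` labels), and the bridge body is a placeholder rather than a real UNO call.

```python
from typing import Callable

_BRIDGES: dict = {}  # bridge name -> (app, function)


def bridge_method(name: str, app: str) -> Callable:
    """Decorator that registers a function under a bridge name for one app."""
    def decorator(func: Callable) -> Callable:
        _BRIDGES[name] = (app, func)
        return func
    return decorator


def available_bridges(running_app: str) -> list:
    """Names a prompt builder could list for the currently running app."""
    return sorted(name for name, (app, _) in _BRIDGES.items() if app == running_app)


def dispatch(name: str, **params):
    """Look up a bridge by name and call it with keyword parameters."""
    if name not in _BRIDGES:
        raise KeyError(f"unknown bridge method: {name}")
    _, func = _BRIDGES[name]
    return func(**params)


@bridge_method("calc_sort_column", app="calc")
def calc_sort_column(column_name: str, ascending: bool = True) -> dict:
    # Placeholder: a real bridge would talk to LibreOffice over UNO here.
    return {"sorted_by": column_name, "ascending": ascending}
```

With this in place, `dispatch("calc_sort_column", column_name="Revenue", ascending=False)` corresponds to the action shown above, and `available_bridges("calc")` gives the list that would be advertised when Calc is running.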


Step-Level Candidate Selection

When enabled, the agent samples N candidate actions per step at elevated temperature, deduplicates, and uses an LLM judge to select the best one before grounding. This improves decision quality on ambiguous steps.

Same context → N parallel LLM calls (asyncio.gather)
→ Parse each → Deduplicate → All candidates agree? → Skip judge (free)
→ Candidates diverge? → 1 judge call → Pick best → Ground winner → Execute

Selective mode (default): triggers on ~30% of steps where alternatives matter most:

  • First 2 steps (wrong start = wrong trajectory)
  • After a RECOVER verdict
  • After vision grounding (ambiguous element)
  • After loop detection (stuck = need different approach)

Cost: +13% per task in selective mode. Unanimous rate ~60-70% (judge skipped when all candidates agree).

OPERATOR_AGENT__ENABLE_STEP_CANDIDATES=true
OPERATOR_AGENT__STEP_CANDIDATE_N=3
OPERATOR_AGENT__STEP_CANDIDATE_TEMP=0.6
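
The sample, deduplicate, and judge flow can be sketched with `asyncio.gather`. The `sample` and `judge` coroutines are hypothetical stand-ins for the LLM calls, and comparing candidates by their stripped text is a simplifying assumption about how deduplication works.

```python
import asyncio
from typing import Awaitable, Callable


async def select_action(
    sample: Callable[[], Awaitable[str]],
    judge: Callable[[list], Awaitable[str]],
    n: int = 3,
) -> tuple:
    """Sample n candidates concurrently; call the judge only if they disagree.

    Returns (chosen_action, judge_was_called).
    """
    candidates = await asyncio.gather(*(sample() for _ in range(n)))
    # dict.fromkeys deduplicates while preserving first-seen order.
    unique = list(dict.fromkeys(c.strip() for c in candidates))
    if len(unique) == 1:
        return unique[0], False  # unanimous: skip the judge call
    return await judge(unique), True
```

Returning whether the judge was called makes it easy to measure the unanimous rate, which determines how much of the extra cost comes from judge calls rather than sampling.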

Best-of-N (bBoN)

Task-level parallel rollouts via Temporal workflows. Run N independent attempts at the same task, verify each, pick the best result.

pilot --task "Configure proxy settings" --use-temporal --bbon 3

Each rollout gets its own worker, its own memory context, and runs the full agent loop independently. The orchestrator compares verification scores and returns the highest-quality completion.
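
The selection step reduces to running N attempts and keeping the one with the highest verification score. In the sketch below a thread pool stands in for Temporal workflows, which is a deliberate substitution to keep the example self-contained; `Rollout`, `best_of_n`, and the numeric `score` field are assumed names.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rollout:
    index: int
    score: float  # verification score, higher is better
    summary: str


def best_of_n(run_rollout: Callable[[int], Rollout], n: int = 3) -> Rollout:
    """Run n independent attempts in parallel and keep the highest-scoring one."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        rollouts = list(pool.map(run_rollout, range(n)))
    # max() returns the first maximum, so ties go to the lowest rollout index.
    return max(rollouts, key=lambda r: r.score)
```

`pool.map` preserves input order, so the tie-breaking rule is deterministic even though the attempts finish at different times.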


Testing

# All tests
pytest tests/ -v

# Unit tests only
pytest tests/unit/ -v

# Specific module
pytest tests/unit/test_cascade.py -v

# With coverage
pytest tests/ --cov=pilot --cov-report=html

Tests are organized by module: test_a11y.py, test_actions.py, test_cascade.py, test_done_gate.py, test_humanization.py, test_memory.py, test_verification.py, etc.


Models

| Role | Default Model | Provider | Purpose |
| --- | --- | --- | --- |
| Reasoning | Claude Sonnet 4.5 | Anthropic | Action generation, verification, planning |
| Grounding | UI-TARS-2 72B | OpenRouter | Element identification in screenshots |
| Embeddings | sentence-transformers | Local | Memory retrieval (narrative, episodic) |
| OCR | Tesseract | Local | Text detection in screenshots |

Override models via environment:

OPERATOR_LLM__REASONING_MODEL=claude-opus-4-6
OPERATOR_LLM__GROUNDING_MODEL=bytedance/ui-tars-2-72b

Design Principles

See docs/IDEOLOGY.md for the full philosophy. Key points:

  1. Human-like interaction — Click first, hotkeys second. No hardcoded sequences. Visual-first reasoning.
  2. Intelligence over rigidity — Teach principles, not procedures. Trust the model. Fewer instructions > more instructions.
  3. Fail fast, recover smart — 2 failures = try something different. Reflection kills dead strategies.
  4. Generalizable fixes only — A fix for one app that breaks another is not a fix. Universal over specific.
  5. OS-level everything — No CDP, no Playwright, no browser APIs. Same interfaces a human uses.
