Commit 293580c

Browse files
committed
druids: multi-agent orchestration for coding agents
1 parent fd874ce commit 293580c

356 files changed: +31291 −19635 lines
.claude/skills/debug/SKILL.md

Lines changed: 162 additions & 0 deletions
---
name: debug
description: >
  Diagnose a running or completed Druids execution. Pulls agent traces,
  activity logs, and diffs, then produces a structured diagnostic covering
  communication health, errors, goal progress, agent performance, and
  behavioral bottlenecks.
user-invocable: true
---

# Debug an Execution

The user wants to understand what is happening (or what went wrong) inside a Druids execution. This skill produces a diagnostic report by pulling every available signal and analyzing it systematically.

## 1. Identify the execution

The user may pass a slug directly (`/debug gentle-nocturne`) or say something like "debug the current run". If no slug is given, call `list_executions` with `active_only=true` and pick the most recent one. If there are multiple active executions, ask which one.
## 2. Gather all available data

Make these calls in parallel where possible:

**a. Execution state** -- `get_execution` for the slug. Record: status, agent names, agent types, connections, topology edges, exposed services, PR URL, branch name.

**b. Full activity log** -- `get_execution_activity` with `n=200` and `compact=false`. This is the richest signal. It contains every tool call, message, error, connection event, and response across all agents. Request the full (non-compact) version so you can see tool arguments and outputs.

**c. Per-agent traces** -- For each agent in the execution, call `get_agent_trace`. This gives you the coalesced view: messages, thoughts, tool calls with status, and plans. Pull traces for all agents in parallel.

**d. Diff** -- `get_execution_diff`. If no diff exists yet, note that.

**e. Spec** -- from the execution data. You need this to evaluate whether agents are achieving the goal.
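The fan-out above can be sketched as one `asyncio.gather` pass. The tool names (`get_execution`, `get_execution_activity`, `get_agent_trace`, `get_execution_diff`) come from this skill; the async `call_tool` wrapper, its signature, and the result field names are assumptions for illustration:

```python
import asyncio

# Hypothetical sketch: `call_tool(name, **kwargs)` stands in for however
# the driver invokes MCP tools. Only the tool names are from this skill.
async def gather_signals(call_tool, slug: str) -> dict:
    # Step a first -- we need the agent list before pulling per-agent traces.
    execution = await call_tool("get_execution", slug=slug)
    # Steps b, c, d in parallel: activity log, diff, one trace per agent.
    activity, diff, *traces = await asyncio.gather(
        call_tool("get_execution_activity", slug=slug, n=200, compact=False),
        call_tool("get_execution_diff", slug=slug),
        *(call_tool("get_agent_trace", slug=slug, agent_name=a["name"])
          for a in execution["agents"]),
    )
    # Step e: the spec rides along on the execution record.
    return {"execution": execution, "activity": activity,
            "diff": diff, "traces": traces, "spec": execution.get("spec")}
```

The serial-then-parallel shape matters: everything except the agent list is independent, so only `get_execution` needs to finish first.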
## 3. Analyze

Work through each dimension below. Do not skip dimensions even if they seem fine -- explicitly confirming health is part of the diagnostic.

### 3a. Communication health

Questions to answer:

- Are all agents connected? Check for `connected` and `disconnected` events. An agent that connected and then disconnected has a problem.
- Is the topology correct for the program? Do agents that need to talk to each other have edges between them?
- Are messages actually flowing? Look for `message` tool calls in the activity. Check that messages sent by one agent show up as received by the target.
- How long between a message being sent and the receiver acting on it? Gaps longer than 30 seconds suggest the receiver is stuck or not listening.
- Are any agents talking to themselves or sending messages that go nowhere?
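The 30-second latency check can be mechanized over the activity events. This is a sketch under assumed field names (`ts` as ISO 8601, `agent`, `type`, `receiver`) -- the real event schema may differ:

```python
from datetime import datetime

def message_gaps(events, threshold_s=30.0):
    """Pair each sent message with the receiver's next activity event and
    flag gaps longer than `threshold_s` seconds (None = no response at all)."""
    flagged = []
    for i, ev in enumerate(events):
        if ev["type"] != "message":
            continue
        sent = datetime.fromisoformat(ev["ts"])
        # First event emitted by the receiver after the send.
        nxt = next((e for e in events[i + 1:] if e["agent"] == ev["receiver"]), None)
        if nxt is None:
            flagged.append((ev, None))  # message went nowhere
            continue
        gap = (datetime.fromisoformat(nxt["ts"]) - sent).total_seconds()
        if gap > threshold_s:
            flagged.append((ev, gap))
    return flagged
```

A `None` gap covers the last bullet above: a message whose receiver never acted again.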
### 3b. Errors

Questions to answer:

- Are there any `error` type events in the activity? What do they say?
- Are there tool calls that returned errors? Look at `tool_result` events with error indicators.
- Did any agent disconnect unexpectedly?
- Are there repeated failures on the same tool call? This usually means the agent is stuck in a retry loop.
- Did any agent hit a timeout?
- Are there permission errors (git push failures, file access denied, port already in use)?

### 3c. Agent performance

For each agent, characterize:

- **Activity level**: How many tool calls has it made? Is it actively working or idle?
- **Focus**: What is it spending its time on? (e.g., 80% file edits, 10% git, 10% messages)
- **Progress**: Based on its trace, what has it accomplished relative to its role?
- **Stuck indicators**: Is it repeating the same action? Has it gone silent? Is it producing long stretches of thinking without action?
- **Tool usage patterns**: Which tools does it use most? Are there tools it should be using but isn't?

Then compare across agents:

- Which agent is furthest along?
- Which agent is the weakest link (blocking others or making no progress)?
- Is any agent doing redundant work that another agent already did?
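The activity-level and focus characterizations reduce to a per-agent histogram over tool calls. A minimal sketch, again assuming generic event dicts with `agent`, `type`, and `tool` fields:

```python
from collections import Counter, defaultdict

def agent_focus(events):
    """Per-agent tool-call totals plus a focus breakdown like
    {"Edit": 0.8, "Bash": 0.2}, most-used tool first."""
    calls = defaultdict(Counter)
    for ev in events:
        if ev.get("type") == "tool_call":
            calls[ev["agent"]][ev["tool"]] += 1
    summary = {}
    for agent, counter in calls.items():
        total = sum(counter.values())
        summary[agent] = {
            "tool_calls": total,
            "focus": {tool: round(n / total, 2) for tool, n in counter.most_common()},
        }
    return summary
```

An agent with zero entries here is idle by definition; an agent whose focus is one tool at ~1.0 is a candidate stuck indicator.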
### 3d. Goal progress

Compare the current state against the spec:

- What did the spec ask for?
- What has actually been built? (Use the diff and exposed services.)
- Has anyone attempted the demo from the spec?
- What percentage of the requirements are met?
- What is left to do?

### 3e. Behavioral bottlenecks

These are the pragmatic, structural problems that slow executions down:

- **File sharing**: Are agents trying to read files another agent is still writing? Look for file-not-found errors or stale reads.
- **Info aggregation**: In programs with sub-agents, is the orchestrator actually collecting and using sub-agent output? Or is information getting lost?
- **Messaging timeliness**: Are there long gaps where an agent should have sent a message but didn't? Calculate the longest gap between activity events for each agent.
- **Hanging**: Is any agent completely silent for more than 2 minutes? This usually means it's stuck waiting for something or has crashed.
- **Serialization**: Are agents doing work sequentially that could be parallel? (e.g., one agent waiting for another to finish before starting its own independent work)
- **Scope creep**: Is any agent doing work outside its assigned role? (e.g., the reviewer starting to implement instead of reviewing)
- **Thrashing**: Is any agent undoing and redoing work? Look for patterns like edit-revert-edit on the same files.
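Two of the checks above -- retry loops (the highest-priority finding, per the Notes) and hanging -- are mechanical enough to sketch. Field names (`agent`, `tool`, `status`, `ts`) are assumptions about the trace shape:

```python
from datetime import datetime
from itertools import groupby

def find_retry_loops(events, min_repeats=3):
    """Flag runs of 3+ consecutive failing calls to the same tool by the
    same agent -- the classic retry-loop signature."""
    loops = []
    for (agent, tool), run in groupby(events, key=lambda e: (e["agent"], e.get("tool"))):
        run = list(run)
        if tool and len(run) >= min_repeats and all(e.get("status") == "error" for e in run):
            loops.append({"agent": agent, "tool": tool, "count": len(run)})
    return loops

def find_hangs(events, now, threshold_s=120.0):
    """Agents silent for longer than the 2-minute hanging threshold."""
    last_seen = {}
    for ev in events:
        last_seen[ev["agent"]] = datetime.fromisoformat(ev["ts"])
    return [agent for agent, ts in last_seen.items()
            if (now - ts).total_seconds() > threshold_s]
```

`groupby` only merges consecutive events, which is exactly what a retry loop looks like in an ordered activity log.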
## 4. Present the diagnostic

Structure the output as follows. Be concrete and specific -- cite actual tool names, file paths, message contents, and timestamps from the trace. Do not hedge or generalize.

```
## Diagnostic: {slug}

**Status**: {status} | **Agents**: {count} | **Duration**: {time since start}
**Spec**: {one-line summary of what was asked for}
**Branch**: {branch} | **PR**: {url or "none yet"} | **Diff**: +{added}/-{removed} lines

### Communication
{2-4 sentences on topology health, message flow, latency. Flag any issues.}

### Errors
{List each error with agent name and context. Or "No errors detected."}

### Agent Performance

#### {agent_name} ({agent_type})
- Activity: {active/idle/stuck} -- {N} tool calls, last active {time}
- Focus: {what it's spending time on}
- Progress: {what it's accomplished}
- Issues: {any problems, or "none"}

(repeat for each agent)

### Weakest link
{Which agent is the bottleneck and why. Be direct.}

### Goal Progress
- Spec asks for: {requirements list}
- Completed: {what's done}
- Remaining: {what's left}
- Estimated completion: {close / far / stuck}

### Bottlenecks
{List each bottleneck found, with evidence from the trace. Or "No structural bottlenecks detected."}

### Recommended actions
{Concrete next steps. Examples:
- "Send builder a message: the tests are failing because X, try Y"
- "Stop agent Z, it's been hanging for 5 minutes"
- "The reviewer hasn't received the builder's submission -- check topology"
- "Everything looks healthy, just needs more time"}
```

## 5. Offer to act

After presenting the diagnostic, ask the user if they want to take any of the recommended actions. You can:

- Send a message to an agent via `send_message`
- Stop a stuck agent via `stop_agent`
- Run a command on an agent's VM via `remote_exec` to inspect state
- Check specific files or processes on the VM
- SSH into the VM for the user via `get_agent_ssh`

Do not take action without the user's confirmation.

## Notes

- For running executions, the trace is live. If the user asks to "keep watching", re-pull activity after a minute and report changes.
- If the execution has already completed or failed, the diagnostic is a post-mortem. Shift language accordingly: "what happened" instead of "what's happening".
- The activity log is the primary signal. Agent traces are secondary -- they show the agent's perspective but miss inter-agent dynamics.
- When citing evidence, include the agent name and a brief quote or description of the event. "builder called `Write` on `src/app.py` at 14:32" is better than "an agent edited a file".
- If you see an agent in a retry loop (same tool call 3+ times with errors), that is the highest-priority finding. Flag it first.
- Token usage from the execution record can indicate whether an agent is doing real work (high token usage) or stuck early (low token usage).

Lines changed: 102 additions & 0 deletions
---
name: druids-driver
description: >
  Reference for driving Druids: launching agent executions, monitoring
  progress, writing specs, and reviewing results. Loaded automatically
  so Claude Code always knows how to use Druids.
user-invocable: false
---

# Druids

Druids runs coding agents on remote VMs. You send a program and a spec, agents implement on isolated sandboxes, and the results come back as pull requests. Your role as the driver is to translate user intent into specs, launch executions, monitor progress, and review output.

Use Druids when the user asks to build a feature, fix a bug, or do work that benefits from delegation to background agents. Do not use it for quick edits, questions about the codebase, or tasks that require real-time back-and-forth with the user.

## Concepts

A **devbox** is a VM snapshot with the user's repo cloned and dependencies installed. Executions fork from it so each agent starts with a working environment. Created via `druids devbox create` and `druids devbox snapshot`.

A **program** is a Python file that defines `async def program(ctx, ...)`. It creates agents, registers tool handlers, and manages lifecycle. Programs live in `.druids/` in the repo. The driver reads the file and sends its source to `create_execution`.

An **execution** is a running instance of a program. It gets a slug like `gentle-nocturne` and contains one or more agents working on VMs. When agents finish, they push a branch and open a PR.

An **agent** runs on a VM inside an execution. Agents are created by programs via `ctx.agent(name, ...)`; each has a bridge process connecting it back to the server.
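Putting the concepts together, a minimal program might look like the sketch below. The pieces this document confirms are the `async def program(ctx, ...)` entry point, `ctx.agent(name, ...)`, and `@agent.on("tool_name")` handlers; the keyword arguments, handler signature, and the `report_done` tool are hypothetical -- do not treat this as the real SDK surface:

```python
# Hypothetical .druids/ program sketch. Only the entry-point shape,
# ctx.agent(), and @agent.on() are confirmed by this reference; the
# `prompt` kwarg, handler signature, and tool name are assumptions.
async def program(ctx, task_spec: str = ""):
    builder = ctx.agent("builder", prompt=task_spec)  # spawns a VM-backed agent

    @builder.on("report_done")  # agent invokes via `druids tool report_done ...`
    async def report_done(args):
        # Tool handlers return a result back to the calling agent.
        return {"ok": True, "summary": args.get("summary", "")}
```

The driver never runs this locally; it sends the file's source text to `create_execution`.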
## Workflow

1. User asks to build something.
2. Explore the codebase. Understand conventions, test patterns, relevant files.
3. Write a spec describing what to change. The spec is the primary input to the agent. Include file paths, function signatures, and concrete demo commands. If the `write-spec` skill is available, use it.
4. Choose a program. `basher.py` in direct mode handles most implementation tasks (implementor + reviewer). `main.py` runs Claude and Codex in parallel for comparison. `review.py` demo-reviews an existing PR.
5. Read the program source and call `create_execution` with `program_source` set to the file contents. Pass the spec and other parameters in `args`.
6. Monitor with `get_execution`. Check status, agents, connections, PR URL.
7. If an agent is stuck, check `get_execution_activity` and send guidance via `send_message` with `sender="driver"`.
8. When agents finish, review the diff with `get_execution_diff` and report to the user with a link to the PR.
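Step 5 has one easy mistake: `program_source` takes the file's contents, not its path. A sketch of assembling the request -- the parameter names (`program_source`, `repo_full_name`, `args`) match this document, while the helper function itself and the `task_name`/`task_spec` keys (from `basher.py` direct mode) are just one plausible shape:

```python
from pathlib import Path

def build_create_execution_request(program_path, task_name, task_spec, repo_full_name):
    """Assemble the arguments for the `create_execution` MCP tool."""
    return {
        "program_source": Path(program_path).read_text(),  # source text, not the path
        "repo_full_name": repo_full_name,  # lets the server find the devbox
        "args": {"task_name": task_name, "task_spec": task_spec},  # strings only
    }
```

The returned dict is what gets passed through whatever MCP client the driver uses.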
## Programs in `.druids/`

- `basher.py`: Implementation with review. Direct mode: pass `task_name` and `task_spec` to spawn an implementor+reviewer pair. The implementor builds on a feature branch, the reviewer demos the change and creates a PR if it works (up to 3 review rounds). Full mode: a finder agent scans for tasks and spawns pairs automatically.
- `main.py`: Parallel comparison. Spawns a Claude agent and a Codex agent on the same spec. Both implement independently, both submit when done.
- `build.py`: Spec-driven build with auditing. A builder implements, a critic reviews each commit for simplicity, and an auditor verifies the demo evidence is real.
- `review.py`: Demo-review of a PR. A demo agent checks out the PR, runs the system, and tests every changed behavior from the outside. A monitor watches for lazy behavior. Takes `pr_number`, `pr_title`, `pr_body`, `repo_full_name`.
## MCP tools

These tools are exposed by the Druids server. They are your interface for creating and managing executions.

### Creating work

`create_execution`: start an execution. Required: `program_source` (Python source string). Optional: `devbox_name`, `repo_full_name` (finds the devbox by repo when no name is set), `git_branch`, `args` (dict of string key-value pairs). Returns `execution_slug` and `execution_id`.

### Monitoring

`list_executions`: list executions. Pass `active_only=false` to include stopped ones.

`get_execution`: get an execution by slug. Returns status, agents, connections, branch name, PR URL, exposed services, client events.

`get_execution_activity`: recent trace events for an execution. Optional: `n` (default 50), `compact` (default true). Shows agent names, event count, recent activity.

`get_execution_diff`: git diff from an execution's VM. Optional: `agent` (defaults to the first agent with a machine).

`get_agent_events`: event stream for a specific agent. Required: `slug`, `agent_name`. Optional: `after_sequence` (resume from a sequence number), `limit` (default 100).

### Interacting with agents

`send_message`: message a running agent. Required: `execution_slug`, `receiver` (agent name), `message`. Set `sender` to `"driver"`.

`remote_exec`: run a shell command on a VM. Target by `repo` (devbox) or `execution_slug` + `agent_name` (agent VM). Required: `command`. Returns `stdout`, `stderr`, `exit_code`.

`stop_agent`: stop an agent. Required: `agent_name`, `execution_slug`.

`get_agent_ssh`: SSH credentials for an agent's VM. Required: `agent_name`, `execution_slug`. Returns host, port, username, private_key, password.

### Stopping work

`update_execution`: change execution status. Set `status` to `"stopped"`, `"completed"`, or `"failed"`.

## CLI commands

The `druids` CLI runs on the driver's local machine:

- `druids exec <program> [--devbox NAME] [--branch BRANCH] [key=value ...]`: run a program. Bare names resolve against `.druids/` (e.g. `druids exec build`).
- `druids execution ls [--all]`: list executions.
- `druids execution status SLUG`: check execution status.
- `druids execution stop SLUG`: stop an execution.
- `druids execution send SLUG MESSAGE [--agent NAME]`: send a message to a running agent.
- `druids execution ssh SLUG [--agent NAME]`: open a shell on a VM.
- `druids execution connect SLUG [--agent NAME]`: resume an agent's coding session.
- `druids devbox create [--name NAME] [--repo OWNER/REPO]`: provision a devbox.
- `druids devbox snapshot [--name NAME] [--repo OWNER/REPO]`: snapshot and save.
- `druids devbox ls`: list all devboxes.
- `druids devbox secret set/ls/rm`: manage devbox secrets.
- `druids auth set-key KEY`: set the auth key.
- `druids init`: initialize a repo (programs, `.mcp.json`, `llms.txt`).

## Reference

- Execution status values: `created`, `running`, `completed`, `stopped`, `failed`.
- `remote_exec` can target a devbox directly (pass `repo`) without a running execution. Use it for setup, debugging, or ad-hoc commands.
- Agents call tools on the VM via `druids tool <tool_name> key=value`. Tools are defined by `@agent.on("tool_name")` in the program.
- Built-in agent tools: `expose` (expose a port as a public HTTPS URL), `message` (send a message to another agent), `list_agents` (list agents in the execution).
- Programs can use `share_machine_with=other_agent` to run two agents on the same VM.