fix(broker): apply eval-derived relay worker guidance by khaliqgant · Pull Request #1126 · AgentWorkforce/relay

khaliqgant · 2026-06-14T08:57:09Z

Summary

Resolves #1113.

Apply eval-derived relay worker guidance for small models, Gemini, and Droid spawns.
Use "relay worker" vocabulary consistently and remove the small-model "do it yourself" heuristic.
Apply the same prefix policy to HTTP/API spawns and Relaycast/worker-initiated spawn paths.
Flag Droid as high-risk for delegation/spawn roles in auto-composer metadata, local CLI help, and CLI reference docs.
Fix `add_agent` MCP tool description: concrete param examples replace prose, eliminating zero-spawn and missing-model failures for cross-CLI and model-tier requests.

s08-lead-quality eval group (new)

7 scenarios × 4 onboarding variants = 28 total, testing lead coordination quality.

Key fixes: q06 idle-agent DM injection, q07 vacuous-pass guard, q02/q04 PTY fallback, all waiters capped at 60s.

s09-cross-cli-spawn eval group (new)

4 questions × 4 onboarding variants = 16 scenarios, testing that orchestrators correctly route natural-language spawn requests to `add_agent` parameters:

q01 — "spawn a codex agent" → `cli:"codex"`
q02 — "spawn a claude agent" → `cli:"claude"`
q03 — "spawn an opus claude agent" → `cli:"claude", model:"claude-opus-4-8"`
q04 — "spawn a sonnet claude agent" → `cli:"claude", model:"claude-sonnet-4-6"`

Reproduces the real failure from the screenshot: a Pear-spawned Claude orchestrator replying "I don't have a 'codex' agent type available" instead of calling `add_agent(cli:"codex")`.

Results (FIX v2 description, claude + codex harnesses, repeat=3):

Scenario	claude	codex
q01 (codex spawn)	10/12	12/12
q02 (claude spawn)	8/12	12/12
q03 (opus model)	12/12	12/12
q04 (sonnet model)	12/12	12/12

q03/q04 were 0/12 with the old description.

Validation

cargo test -p agent-relay-broker skill_injection_tests
cargo check -p agent-relay-broker
npm run eval:unit -- tests/integration/broker/evals/scenarios
npx vitest run packages/cli/src/cli/commands/local-agent.test.ts packages/cli/src/auto/composer.test.ts
npm --prefix packages/cli run build
cargo fmt --check
prettier --check packages/cli/src/cli/agent-relay-mcp.ts tests/integration/broker/evals/scenarios/s09-cross-cli-spawn.ts CHANGELOG.md

Seven scenarios testing coordination quality (how well a lead leads), complementing s07 (does it delegate at all):

q01 — task decomposition: complex 4-track task → PASS = 2-6 workers spawned
q02 — failure handling: worker errors → PASS = retry or upstream escalation
q03 — progress communication: mid-task status request → PASS = lead responds
q04 — re-routing: worker declines → PASS = new spawn or escalation
q05 — over-delegation prevention: trivial question → PASS = no spawn
q06 — conflict resolution: contradicting findings → PASS = conflict acknowledged
q07 — scope discipline: out-of-scope bonus finding → PASS = no extra spawns

28 scenarios total (7 × 4 onboarding variants). Primary harnesses: opus, codex, gemini. Runner scripts: eval:lead-quality and eval:lead-quality:all-harnesses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-06-14T08:57:23Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

The broker's relay-skill prefix injection is extended from small-model-only to also cover Gemini (one-liner) and Droid (full skill block with Task disambiguation) CLI harnesses, applied at both the HTTP API and relaycast WebSocket/JSON spawn paths. A new retryable MCP configuration handler manages concurrent eval processes racing on the factory MCP config. Droid is reclassified as high-risk for delegation roles in CLI metadata, help text, and docs. A new S08 lead-quality eval group with seven scored scenarios is added and wired into the eval runner.

Changes

CLI-aware relay skill injection and Droid risk classification

Layer / File(s)	Summary
Relay skill constants and CLI prefix selection logic `crates/broker/src/runtime/api.rs`	Introduces `RELAY_WORKER_ONE_LINER` and updates `SMALL_MODEL_RELAY_SKILL` wording to use "relay worker" terminology, adds `cli_skill_prefix` mapping parsed CLI executable names to harness-specific prefixes, and adds `relay_skill_prefix` combining model and CLI signals with unit tests.
Prefix injection at HTTP API and relaycast spawn sites `crates/broker/src/runtime/api.rs`, `crates/broker/src/runtime/relaycast_events.rs`	Updates the HTTP API spawn handler to call `relay_skill_prefix(cli, model)` instead of model-only prefix, and adds mutable `effective_task` with conditional prefix injection (prepend or replace) to both the primary WebSocket and JSON fallback spawn paths, with debug logging when injection occurs.
Gemini/Droid MCP configuration with concurrent retry `crates/broker/src/snippets.rs`	Refactors `configure_gemini_droid_mcp` to use `run_gemini_droid_mcp_add` with up to 4 idempotent attempts, backoff retry on "already exists" stderr, and `spawn_mcp_add` helper for timeout-safe execution, handling concurrent eval processes racing on `~/.factory/mcp.json`.
Droid risk classification in CLI metadata and docs `packages/cli/src/auto/composer.ts`, `packages/cli/src/cli/commands/local-agent.ts`, `web/content/docs/7.1.1/reference-cli.mdx`, `CHANGELOG.md`	Switches droid's recommended onboarding to `skill`, downgrades droid role fitness to `not-viable` with explicit spawning constraints in `HARNESS_ROLE_MAP`, updates `opencode` default model to `deepseek-v4-flash` and adds `opencode:gpt-5.5` variant, adds high-risk warning to `agent spawn`/`agent new` help text, inserts caution in CLI reference docs, and records changes in the changelog.

Lead-quality evaluation suite (S08)

Layer / File(s)	Summary
Onboarding text updates and eval runner wiring `tests/integration/broker/evals/scenarios/onboarding.ts`, `tests/integration/broker/evals/scenarios/index.ts`, `tests/integration/broker/evals/runner.ts`, `package.json`	Updates `onboardingText()` variants to use relay-worker terminology and explicit MCP call names, registers `LEAD_QUALITY_EVAL_SCENARIOS` in the scenario index and `ALL_SCENARIOS`, extends `ScenarioGroup` type/flag parsing/pool selection, and adds `eval:lead-quality` npm scripts.
S08 lead-quality scenario implementations (q01–q07) `tests/integration/broker/evals/scenarios/s08-lead-quality.ts`	Adds conflict-acknowledgement detector, shared lead role prompts, and seven scenario builders covering task decomposition, worker failure retry/escalation, mid-task status response, worker decline rerouting, over-delegation prevention, conflict resolution, and scope discipline; exports `LEAD_QUALITY_SCENARIOS` across all onboarding variants.

Trajectory audit trail and telemetry

Layer / File(s)	Summary
S08 lead-quality eval completion and audit trails `.agentworkforce/trajectories/completed/2026-06/traj_*/` (summary.md, trajectory.json, trace.json)	Records completed s08-lead-quality eval trajectory with scenario counts and validation results; captures abandoned Track A security audit trajectory; includes eval run telemetry with file/range mappings.

Sequence Diagram(s)

sequenceDiagram
  participant EvalRunner
  participant LeadAgent
  participant BrokerSpawnHandler
  participant relay_skill_prefix
  participant WorkerAgent
  participant MCP
  EvalRunner->>BrokerSpawnHandler: spawn lead (cli, model, task)
  BrokerSpawnHandler->>relay_skill_prefix: relay_skill_prefix(cli, model)
  relay_skill_prefix-->>BrokerSpawnHandler: Option<prefix>
  BrokerSpawnHandler->>BrokerSpawnHandler: prepend/replace task with prefix
  BrokerSpawnHandler->>LeadAgent: start with prefixed task
  LeadAgent->>BrokerSpawnHandler: spawn worker (cli, model)
  BrokerSpawnHandler->>relay_skill_prefix: relay_skill_prefix(cli, model)
  relay_skill_prefix-->>BrokerSpawnHandler: Option<prefix>
  BrokerSpawnHandler->>WorkerAgent: start with prefixed task
  EvalRunner->>EvalRunner: collect agent_spawned / relay_inbound events
  EvalRunner->>EvalRunner: compute pass based on scenario criteria

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

AgentWorkforce/relay#1109: Both PRs touch the broker live-evals test runner/scenario registration pipeline (e.g., tests/integration/broker/evals/runner.ts and tests/integration/broker/evals/scenarios/index.ts), extending scenario group routing/selection to cover the new lead-quality eval group.
AgentWorkforce/relay#1076: Main PR's relay-skill prefix injection in handle_api_request/handle_relaycast_message depends on the workers.spawn(...) returning an effective_spec with CLI/model, which is enabled by that PR's WorkerRegistry::spawn propagation.

Suggested labels

size:XXL

Suggested reviewers

willwashburn

Poem

🐇 A bunny hopped through broker code one night,
And found that Droid was routing tasks all wrong!
So she stitched a prefix with her paws so light,
"Use relay workers!" — now the skill's more strong.
Seven evals named q01 through seven's end,
Each checks the lead behaves as we intend. 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 58.62% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: applying eval-derived relay worker guidance to the broker. It is specific and directly related to the core objective.
Linked Issues check	✅ Passed	The PR addresses all four concrete action items from `#1113`: vocabulary fixes in SMALL_MODEL_RELAY_SKILL, CLI-aware skill injection for Gemini, onboarding vocabulary audit, and Droid flagging in capability surfaces. The code changes implement these objectives across the modified files.
Out of Scope Changes check	✅ Passed	All changes align with the stated objectives from `#1113` and the s08-lead-quality eval work. The trajectory files, eval scenarios, and metadata updates are supporting work for the new eval group. No unrelated modifications detected.
Description check	✅ Passed	The PR description is mostly complete, including a clear summary, multiple detailed sections on eval groups, and comprehensive validation steps.

gemini-code-assist

Code Review

This pull request implements eval-derived relay worker guidance for small models, Gemini, and Droid spawns, and flags Droid as high-risk for delegation roles. It introduces a new lead-quality evaluation scenario group (S08) with seven scenarios, updates CLI commands and documentation, and refactors the broker runtime to inject combined model and CLI-specific skill prefixes. The review feedback points out an opportunity to avoid an unnecessary redundant string allocation and copy when applying the skill prefix in crates/broker/src/runtime/api.rs.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-14T08:58:07Z

                            effective_task = Some(match effective_task {
                                Some(task) => format!("{prefix}\n\n{task}"),
                                None => prefix.to_string(),
                            });


The prefix variable is already an owned String returned by relay_skill_prefix. Calling prefix.to_string() when effective_task is None creates an unnecessary redundant allocation and copy of the prompt string. You can directly move prefix instead.

                            effective_task = Some(match effective_task {
                                Some(task) => format!("{prefix}\n\n{task}"),
                                None => prefix,
                            });

agent-relay-code · 2026-06-14T08:58:16Z

pr-reviewer could not complete review for #1126 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5f066bb772

chatgpt-codex-connector · 2026-06-14T09:01:21Z

+                        if let Some(prefix) = relay_skill_prefix(
+                            effective_spec.cli.as_deref().unwrap_or(&cli),
+                            effective_spec.model.as_deref(),
+                        ) {


Guard relay skill injection behind skipRelayPrompt

When /api/spawn is called with skipRelayPrompt: true, WorkerRegistry::configure_agent_relay_mcp deliberately skips Agent Relay MCP/tool injection, but this new relay_skill_prefix path still prepends gemini/droid instructions to call mcp__agent-relay__add_agent/remove_agent. In that opt-out scenario those tools are absent, so the spawned agent is told to use unavailable tools; gate this prefix on !skip_relay_prompt or otherwise skip CLI guidance for opt-out spawns.
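The suggested guard can be sketched as a single function that checks the opt-out before touching the task. This is illustrative only: `compose_task` and its parameter names are hypothetical and not taken from the PR, and it assumes the prefix arrives as an owned `String`, as the earlier review comment states.

```rust
/// Hypothetical helper (not from the PR): combine an optional relay skill
/// prefix with an optional task, honouring the skipRelayPrompt opt-out.
fn compose_task(
    task: Option<String>,
    prefix: Option<String>,
    skip_relay_prompt: bool,
) -> Option<String> {
    // Opt-out spawns have no Agent Relay MCP tools, so the task is left untouched.
    if skip_relay_prompt {
        return task;
    }
    match (prefix, task) {
        // Prepend the prefix, separated from the task by a blank line.
        (Some(prefix), Some(task)) => Some(format!("{prefix}\n\n{task}")),
        // No task: the prefix is moved, not cloned.
        (Some(prefix), None) => Some(prefix),
        // No prefix applies to this CLI/model: pass the task through.
        (None, task) => task,
    }
}
```

Moving `prefix` in the second arm also covers the allocation point raised in the earlier comment on the same block.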

chatgpt-codex-connector · 2026-06-14T09:01:21Z

+      for (let i = 0; i < 6; i++) {
+        const w = harness.waitForEvent('agent_spawned', phaseMs);
+        const ev = await w.promise.catch(() => null);


Wait for distinct spawn events in q01

BrokerHarness.waitForEvent resolves immediately from the existing event buffer, and this loop does not exclude spawn events already seen. After the first agent_spawned, iterations 2–6 can keep returning that same event without waiting for later workers, so a lead that spawns the requested tracks sequentially can be scored too early as monolithic or under-delegated. Use an event count/name predicate or collect until a quiet period with unseen spawns.
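The counting idea behind this suggestion can be shown independently of the TypeScript harness. The sketch below is in Rust for consistency with the broker code; the `Event` struct and its `kind`/`name` fields are assumptions made for illustration, not the harness's real event type.

```rust
use std::collections::HashSet;

/// Simplified stand-in for a harness event; both fields are assumptions.
struct Event {
    kind: String,
    name: String,
}

/// Count distinct agents among `agent_spawned` events, so that a buffered
/// event returned more than once is only counted once.
fn distinct_spawn_count(events: &[Event]) -> usize {
    let mut seen: HashSet<&str> = HashSet::new();
    for event in events {
        if event.kind == "agent_spawned" {
            seen.insert(event.name.as_str());
        }
    }
    seen.len()
}

/// q01 pass criterion as stated in the PR description: 2-6 workers spawned.
fn q01_pass(distinct_workers: usize) -> bool {
    (2..=6).contains(&distinct_workers)
}
```

With this shape, six reads of the same buffered event yield a count of one, so the scenario cannot be scored before later workers appear.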

chatgpt-codex-connector · 2026-06-14T09:01:21Z

+      const escalated = events.some(
+        (e) =>
+          e.kind === 'relay_inbound' &&
+          (e as { from: string; body?: string }).from === lead &&
+          /fail|error|retry|unable|timeout|issue/i.test((e as { body?: string }).body ?? ''),


Score q02 only on post-failure responses

This scans all events since the scenario's initial clearEvents, not just messages after Q02_FAILURE_MSG is injected. If the lead sends an early acknowledgement such as analyzing the “error logs”, the error regex marks escalated true even when the lead stalls after the worker failure, causing q02 to pass the behavior it is meant to catch. Capture a baseline before injecting the failure and only scan later lead messages.
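A baseline-based check might look like the following. It is a Rust sketch of the idea rather than the harness's TypeScript; `Message`, `escalated_after`, and the `baseline` index are hypothetical names, and plain substring matching stands in for the regex in the reviewed snippet.

```rust
/// Simplified stand-in for a relay message event; fields are assumptions.
struct Message {
    from: String,
    body: String,
}

/// Keywords taken from the regex in the reviewed snippet.
const FAILURE_WORDS: [&str; 6] = ["fail", "error", "retry", "unable", "timeout", "issue"];

/// Only messages at index `baseline` or later (recorded after the failure
/// was injected) can count as escalation.
fn escalated_after(messages: &[Message], baseline: usize, lead: &str) -> bool {
    messages.iter().skip(baseline).any(|m| {
        let body = m.body.to_lowercase();
        m.from == lead && FAILURE_WORDS.iter().any(|w| body.contains(*w))
    })
}
```

The caller would record `baseline` as the message count just before injecting the failure. The same pattern applies to any scenario that must score only responses sent after a given prompt.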

chatgpt-codex-connector · 2026-06-14T09:01:21Z

+      // PASS = lead sent at least one outbound message (responded to status query).
+      const leadMessages = events.filter(
+        (e) => e.kind === 'relay_inbound' && (e as { from: string }).from === lead,
+      );
+      const pass = leadMessages.length > 0;


Require the status response to occur after the query

The pass condition accepts any relay_inbound from the lead in the whole scenario, and waitForEvent also resolves from already-buffered events. A lead that only sends an initial “I’ll start the workers” message before the mid-task status request will pass q03 even if it never answers the status query. Record the message count before sending Q03_STATUS_REQUEST and require a later lead response.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CHANGELOG.md`:
- Line 20: In the `[Unreleased]` section of the CHANGELOG.md, rewrite the bullet
point starting at line 20 to focus on the user-visible impact rather than
implementation details. Remove internal jargon like "eval-derived" and simplify
the wording to describe what users or operators can observe about the
agent-relay-broker behavior change, such as improved relay worker guidance for
specific models and risk flagging for Droid spawns, while keeping the entry
concise and removing technical process framing.

In `@crates/broker/src/runtime/api.rs`:
- Around line 1636-1649: The code uses file_name() to extract the executable
name, which includes file extensions on Windows (e.g., "gemini.exe" instead of
"gemini"), causing the string comparisons against "gemini" and "droid" to fail.
Replace the file_name() method call with file_stem() in the Path chain to
extract only the filename without extension, ensuring the checks work correctly
on both Unix and Windows systems regardless of whether the command has an
extension.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: c6a8ae98-52c6-44b1-8754-07a95fbe9fca

📥 Commits

Reviewing files that changed from the base of the PR and between 87c05f1 and 43e5b0e.

📒 Files selected for processing (11)

CHANGELOG.md
crates/broker/src/runtime/api.rs
crates/broker/src/runtime/relaycast_events.rs
package.json
packages/cli/src/auto/composer.ts
packages/cli/src/cli/commands/local-agent.ts
tests/integration/broker/evals/runner.ts
tests/integration/broker/evals/scenarios/index.ts
tests/integration/broker/evals/scenarios/onboarding.ts
tests/integration/broker/evals/scenarios/s08-lead-quality.ts
web/content/docs/7.1.1/reference-cli.mdx

coderabbitai · 2026-06-14T09:06:49Z

+    let command = shlex::split(cli)
+        .and_then(|parts| parts.into_iter().next())
+        .or_else(|| cli.split_whitespace().next().map(ToOwned::to_owned))
+        .unwrap_or_else(|| cli.to_string());
+    let cli = Path::new(&command)
+        .file_name()
+        .and_then(|name| name.to_str())
+        .unwrap_or(command.as_str())
+        .to_lowercase();
+    if cli == "gemini" {
+        Some(RELAY_WORKER_ONE_LINER)
+    } else if cli == "droid" {
+        Some(DROID_RELAY_SKILL)
+    } else {


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n crates/broker/src/runtime/api.rs | sed -n '1630,1660p'

Repository: AgentWorkforce/relay

Length of output: 1347

🏁 Script executed:

python3 << 'PY'
from pathlib import Path

# Test cases
test_cases = [
    "gemini.exe",
    "droid.cmd",
    "gemini",
    "droid",
    "/usr/bin/gemini.exe",
    "C:\\Program Files\\gemini.exe"
]

for path_str in test_cases:
    p = Path(path_str)
    file_name = p.name  # equivalent to file_name()
    file_stem = p.stem  # equivalent to file_stem()
    print(f"Path: {path_str}")
    print(f" file_name: {file_name}")
    print(f" file_stem: {file_stem}")
    print()
PY

Repository: AgentWorkforce/relay

Length of output: 474

🏁 Script executed:

python3 << 'PY'
# Verify the logic flow of the suggested fix
# The suggested fix uses file_stem() which returns OsStr, not Option<OsStr>
# So the chain would be:
# Path::new(&command).file_stem() -> OsStr
# .or_else(|| Path::new(&command).file_name()) -> This won't work!
#
# file_stem() and file_name() both return OsStr, not Option<OsStr>
# So or_else() is not the right combinator here

# Let's verify what the Rust Path API actually does
print("Analyzing Rust Path API behavior:")
print()
print("Path::file_stem() returns: Option<&OsStr>")
print("Path::file_name() returns: Option<&OsStr>")
print()
print("So the suggested fix using or_else() is actually correct!")
print("or_else() is a combinator for Option types")
print()
print("Logic:")
print("- Path::file_stem() returns Option<&OsStr>")
print("- If Some, we use it")
print("- If None (edge cases like '..' or '.'), we fall back to file_name()")
print()
print("This fix correctly handles:")
print(" gemini.exe -> file_stem() -> Some('gemini') -> use it")
print(" gemini -> file_stem() -> Some('gemini') -> use it")
PY

Repository: AgentWorkforce/relay

Length of output: 567

🏁 Script executed:

rg "Path::new" crates/broker/src/runtime/api.rs -A 5 -B 5 | head -30

Repository: AgentWorkforce/relay

Length of output: 510

Use file_stem() to match executable names regardless of extensions.

The current code uses file_name() which includes file extensions. On Windows, commands like gemini.exe or droid.cmd won't match the checks for "gemini" or "droid", causing the harness prefix to be silently skipped and potentially regressing spawn behavior.

Suggested fix

 fn cli_skill_prefix(cli: &str) -> Option<&'static str> {
     let command = shlex::split(cli)
         .and_then(|parts| parts.into_iter().next())
         .or_else(|| cli.split_whitespace().next().map(ToOwned::to_owned))
         .unwrap_or_else(|| cli.to_string());
     let cli = Path::new(&command)
-        .file_name()
+        .file_stem()
+        .or_else(|| Path::new(&command).file_name())
         .and_then(|name| name.to_str())
         .unwrap_or(command.as_str())
         .to_lowercase();
     if cli == "gemini" {
         Some(RELAY_WORKER_ONE_LINER)
     } else if cli == "droid" {
         Some(DROID_RELAY_SKILL)
     } else {
         None
     }
 }
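The effect of `file_stem()` on the name comparison can be checked in isolation with the standard library alone. The helper below is a minimal sketch, not the PR's function: `executable_stem` is a hypothetical name, and it takes the first whitespace-separated token instead of using `shlex`, so quoted paths containing spaces are not handled.

```rust
use std::path::Path;

/// Hypothetical helper: reduce a CLI command string to a bare, lowercase
/// executable name with any final extension removed.
fn executable_stem(cli: &str) -> String {
    // First token only; the PR parses with shlex, which is omitted here
    // to keep the sketch std-only.
    let command = cli.split_whitespace().next().unwrap_or(cli);
    let path = Path::new(command);
    path.file_stem()
        // Fall back to the full file name if no stem is available.
        .or_else(|| path.file_name())
        .and_then(|name| name.to_str())
        .unwrap_or(command)
        .to_lowercase()
}
```

Note that `file_stem()` strips whatever follows the last dot, so a binary named with a dotted suffix would also be shortened; whether that matters depends on how the CLIs are installed.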

github-actions · 2026-06-14T09:14:31Z

Preview deployed!

Environment	URL
Web	https://d1f3t71m7w35eq.cloudfront.net

This preview will be cleaned up when the PR is merged or closed.

…g fixes

- Cap all waiters at Math.min(phaseMs, 60_000) — both first spawnWaiter and second-phase waiters in q02/q03/q04/q07 — reducing opus timeout burn from 240s to 120s per scenario
- q06: embed conflict scenario in startup task instead of DM injection; idle agents don't process injected text so relay-only injection failed
- q07: require initial spawn for PASS to prevent vacuous pass when lead self-implements (spawnEv !== null && !outOfScopeSpawn)
- q02/q04: add PTY stream fallback gated on spawnEv !== null to catch escalation/re-routing expressed in stream output
- CONFLICT_RE: expand keywords to cover PTY stream output phrasing; add PTY stream check alongside relay message check in detector
- Validated haiku+sonnet: q05 4/4 + q06 4/4 both models; q01-q04/q07 correctly surface delegation gaps in current Claude CLI tiers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
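The q07 vacuous-pass guard comes down to a two-input predicate. The Rust sketch below restates the condition quoted in the commit message (`spawnEv !== null && !outOfScopeSpawn`) next to the earlier "no extra spawns" criterion from the PR description; the function names are hypothetical and the harness itself is TypeScript.

```rust
/// Earlier criterion as described in the PR summary: pass when no
/// out-of-scope spawn happened. A lead that never spawns passes trivially.
fn q07_pass_unguarded(out_of_scope_spawn: bool) -> bool {
    !out_of_scope_spawn
}

/// Guarded criterion from the commit message: the lead must have spawned
/// at least once, and must not have spawned for the out-of-scope finding.
fn q07_pass(initial_spawn_seen: bool, out_of_scope_spawn: bool) -> bool {
    initial_spawn_seen && !out_of_scope_spawn
}
```

A lead that implements everything itself gives `initial_spawn_seen = false`: the unguarded form passes it, while the guarded form does not.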

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.agentworkforce/trajectories/completed/2026-06/traj_39p9wcn6gsep/summary.md:
- Around line 31-34: The Reasoning section states that capping waiters from 120s
to 60s saves approximately 20s per scenario and reduces the full opus run from
107 to 43 minutes, but these numbers are internally inconsistent. If capping
both waiters saves 60s per waiter (totaling 120s per scenario), then the
per-scenario and aggregate time savings figures should reflect this. Either
restate the formula by clarifying how many scenarios and waiters are affected
(e.g., specify which of q02/q03/q04/q07 scenarios are impacted and how many
waiters per scenario), or recalculate the savings numbers to be consistent with
the 60s-per-waiter cap. Ensure the math is explicit so that the aggregate
savings (107 to 43 minutes) can be directly derived from the per-scenario
savings.

In @.agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/summary.md:
- Around line 17-20: The summary contains contradictory guidance about how spawn
scenarios should be structured. The anchor section (lines 17-20) states that
spawn scenarios should begin with 'Spawn...' as the first verb, while the
sibling section (line 35) records the opposite approach of removing explicit
spawn/worker references. Reconcile this inconsistency by either: (1) clarifying
that these approaches apply to different scenario sets or task types with
explicit explanation of which approach applies where, or (2) ensuring both the
anchor section (lines 17-20) and the sibling section (line 35) consistently
describe the same authoritative approach that was actually implemented. Review
the actual changes made and update both locations to provide clear,
non-contradictory guidance.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 9320b1bb-70b2-4016-9ef0-a99c9d014688

📥 Commits

Reviewing files that changed from the base of the PR and between 43e5b0e and 7ede017.

📒 Files selected for processing (10)

.agentworkforce/trajectories/completed/2026-06/traj_39p9wcn6gsep/summary.md
.agentworkforce/trajectories/completed/2026-06/traj_39p9wcn6gsep/trajectory.json
.agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/summary.md
.agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/trajectory.json
.agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04.trace.json
.agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04/summary.md
.agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04/trajectory.json
scripts/ci-standalone-smoke.sh
scripts/watch-cli-tools.sh
tests/integration/broker/evals/scenarios/s08-lead-quality.ts

✅ Files skipped from review due to trivial changes (4)

.agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04/trajectory.json
.agentworkforce/trajectories/completed/2026-06/traj_39p9wcn6gsep/trajectory.json
.agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/trajectory.json
.agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04.trace.json

🚧 Files skipped from review as they are similar to previous changes (1)

tests/integration/broker/evals/scenarios/s08-lead-quality.ts

coderabbitai · 2026-06-14T15:43:31Z

+### Cap all waiters at Math.min(phaseMs, 60_000) — both first spawnWaiter and second-phase waiters
+
+- **Chose:** Cap all waiters at Math.min(phaseMs, 60_000) — both first spawnWaiter and second-phase waiters
+- **Reasoning:** Opus (120s phaseMs) never spawns workers for s08 tasks; both waiters always time out. Without caps: each q02/q03/q04/q07 opus scenario burns 240s = 107 min for 28 scenarios. With all caps: 120s → 60s saves ~20s per scenario, reducing full opus run from 107 to ~43 min.


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reconcile the timing math.

The stated cap saves 60s per waiter (120s → 60s), so the “~20s per scenario” estimate and the 107→43 minute rollup do not follow from the numbers here. Please restate the formula or the exact scenario subset so the report stays internally consistent.
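The arithmetic can be made explicit with a small calculation. This is a sketch under a stated assumption, namely that every scenario has two waiters and both always time out; `worst_case_secs` is a hypothetical helper, not code from the PR.

```rust
/// Worst-case wait in seconds when every waiter in every scenario times out.
/// `cap_secs` models the Math.min(phaseMs, 60_000) cap when present.
fn worst_case_secs(
    scenarios: u64,
    waiters_per_scenario: u64,
    phase_secs: u64,
    cap_secs: Option<u64>,
) -> u64 {
    let per_waiter = match cap_secs {
        Some(cap) => phase_secs.min(cap),
        None => phase_secs,
    };
    scenarios * waiters_per_scenario * per_waiter
}
```

Under that assumption, 28 scenarios at 120s per waiter give 6720s (112 minutes) uncapped and 3360s (56 minutes) with a 60s cap, a saving of 120s per scenario. Restricting the two-waiter case to the 16 q02/q03/q04/q07 scenarios gives 3840s and 1920s. Neither reading reproduces the 107 and 43 minute figures, so the summary would need to state which scenarios and waiters it counts.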

…omposer eval data

- Remove DROID_RELAY_SKILL injection for droid: broker-injected skill text suppresses relay tool use entirely (s03:bare 0/5 with injection vs baseline 5/5 without). Droid receives no CLI-specific prefix from relay_skill_prefix.
- Fix race condition in run_gemini_droid_mcp_add: 4-attempt retry loop with 150ms backoff per attempt handles concurrent eval processes racing on ~/.factory/mcp.json (already-exists error).
- composer.ts: revert droid onboarding bare→skill (skill kills relay behaviour, confirmed 0/5 s03); mark droid worker as not-viable for droid 0.148.1.
- composer.ts: opencode default model deepseek-v4-flash (16/16, 0 phantoms) over mimo-v2.5-free (15/16); add opencode:gpt-5.5 provisional entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… changes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/broker/src/snippets.rs`:
- Around line 1333-1341: The function remove_gemini_droid_mcp_servers uses
blocking std::process::Command::wait() which will stall the async runtime.
Convert this function to be async by replacing std::process::Command with
tokio::process::Command and adding an appropriate timeout (such as using
tokio::time::timeout) to prevent indefinite blocking. Make sure to mark the
function as async and update all call sites of remove_gemini_droid_mcp_servers
to use .await when invoking it.
- Around line 1408-1419: The timeout branch in the MCP command execution (the
Err(_) case handling timeout after wait_with_output()) does not explicitly
terminate the spawned child process before bailing, allowing the mcp add process
to continue running in the background. Before calling anyhow::bail! in the
timeout error branch, explicitly call kill() on the child handle to ensure the
process is terminated and does not interfere with subsequent retry attempts in
the surrounding loop.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: c74bf948-41a6-4da0-9ae0-309106928cfb

📥 Commits

Reviewing files that changed from the base of the PR and between 5c95e75 and 6c0b86d.

📒 Files selected for processing (4)

CHANGELOG.md
crates/broker/src/runtime/api.rs
crates/broker/src/snippets.rs
packages/cli/src/auto/composer.ts

✅ Files skipped from review due to trivial changes (1)

CHANGELOG.md

…eval correctness

- Gate relay_skill_prefix injection behind !skip_relay_prompt so agents spawned with skipRelayPrompt:true are not told to call absent relay tools
- Remove redundant prefix.to_string() allocation in the None arm (prefix is already an owned String)
- Convert remove_gemini_droid_mcp_servers to async using tokio::process::Command with a 5s timeout to avoid blocking the async runtime
- Fix s08 q02 escalated check to only scan events after failure injection, preventing early lead messages containing 'error' from falsely passing
- Fix s08 q03 pass check to require a lead response after the status query is injected, not just any message since scenario start
- Make CHANGELOG entry impact-first, removing 'eval-derived' implementation jargon

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
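
The s08 q02 fix described above (only scanning events after failure injection) could look roughly like the sketch below. The `RelayEvent` shape, the `escalatedAfterInjection` name, and the timestamp-based cutoff are all assumptions made for illustration; the actual scenario code in s08-lead-quality.ts may track the injection point differently.

```typescript
// Hypothetical event shape; the real eval runner's types may differ.
interface RelayEvent {
  ts: number; // milliseconds since scenario start
  from: string; // sender name
  text: string; // message body
}

// True only if the lead mentions an error AFTER the failure was injected.
// Messages sent before injectedAt are ignored so they cannot cause a false pass.
function escalatedAfterInjection(events: RelayEvent[], injectedAt: number, lead: string): boolean {
  return events
    .filter((e) => e.ts > injectedAt && e.from === lead)
    .some((e) => /error/i.test(e.text));
}

const sampleEvents: RelayEvent[] = [
  { ts: 1_000, from: "lead", text: "Watch for any error output from the worker" },
  { ts: 5_000, from: "worker", text: "build failed" },
];

// The early lead message contains 'error' but predates the injection at 4s.
console.log(escalatedAfterInjection(sampleEvents, 4_000, "lead"));
```

Filtering on the injection point rather than scenario start is what keeps an early, unrelated mention of "error" from satisfying the check.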

…s-cli-spawn scenarios

Relay workers asked to "spawn a codex agent" or "spawn an opus claude agent" now have explicit guidance to map those requests to the right add_agent parameters.

Changes:
- SMALL_MODEL_RELAY_SKILL: add CLI options table (claude/codex/gemini/opencode) and model-pinning examples (opus/sonnet/haiku) with concrete add_agent calls
- RELAY_WORKER_ONE_LINER (gemini): expand to include cli and model parameter hints
- onboarding.ts brief+skill variants: add CLI/model mapping so all harnesses get the reference regardless of whether broker injection fires
- s09-cross-cli-spawn.ts: 4 scenarios × 4 onboarding variants = 16 total
  - q01: "spawn a codex agent" → agent_spawned.cli contains "codex"
  - q02: "spawn a claude agent" → agent_spawned.cli contains "claude"
  - q03: "spawn an opus claude agent" → cli "claude" + model ~ /opus/
  - q04: "spawn a sonnet claude agent" → cli "claude" + model ~ /sonnet/
- Register s09 in index.ts and runner.ts under --group=cross-cli-spawn

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
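
The q01 to q04 pass criteria listed above reduce to a substring match on `cli` plus an optional regex match on `model`. The sketch below illustrates that check; the `AgentSpawned` interface and `passesSpawnCheck` function are hypothetical names, and the real assertions in s09-cross-cli-spawn.ts may be structured differently.

```typescript
// Hypothetical shape of an agent_spawned event; only the fields the checks need.
interface AgentSpawned {
  cli: string;
  model?: string;
}

// Passes when cli contains the expected harness name and, if a model
// pattern is given, the spawned model is present and matches it.
function passesSpawnCheck(ev: AgentSpawned, wantCli: string, wantModel?: RegExp): boolean {
  const cliOk = ev.cli.includes(wantCli);
  const modelOk =
    wantModel === undefined || (ev.model !== undefined && wantModel.test(ev.model));
  return cliOk && modelOk;
}

// q01-style check: cli only.
const q01 = passesSpawnCheck({ cli: "codex" }, "codex");
// q03-style check: cli plus model tier.
const q03 = passesSpawnCheck({ cli: "claude", model: "claude-opus-4-8" }, "claude", /opus/);
// A spawn that omits the model fails a tier-specific check.
const q03MissingModel = passesSpawnCheck({ cli: "claude" }, "claude", /opus/);

console.log({ q01, q03, q03MissingModel });
```

Requiring the model to be present when a tier is requested mirrors the failure the PR describes, where tier-specific requests spawned an agent without a `model` parameter.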

…boarding

The eval runner finds the minimum onboarding variant that achieves 100%; adding CLI/model guidance to 'brief' before running the eval pollutes that comparison.

Changes:
- Revert 'brief' onboarding to original (no CLI/model additions); eval will tell us if brief needs augmenting after s09 runs show bare/one-liner results
- RELAY_WORKER_ONE_LINER: true one-liner (cli options inline, not 3 sentences)
- SMALL_MODEL_RELAY_SKILL: drop the full table; two targeted lines added (enumerate cli options in the Required line, add one model-pinning line)

The 'skill' onboarding retains full CLI/model detail as the kitchen-sink variant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…er spawn requests

Eval s09 showed that orchestrators without relay-specific onboarding would receive "spawn a codex agent" or "spawn an opus claude agent" and either do the work themselves or call add_agent with wrong/missing parameters:
- q01 (codex spawn): 12/12 PASS in v1 description, confirmed causal via baseline
- q03 (opus model): 0/12 → 12/12 after fix (root cause: no "opus claude" → cli:claude mapping)
- q04 (sonnet model): 0/12 → 12/12 after fix (model param was passive, not directive)

Changes:
- add_agent tool description: concrete param examples replace prose ("spawn a codex agent" → cli:"codex", "spawn an opus claude agent" → cli:"claude", model:"claude-opus-4-8")
- cli param: explicit "opus claude/sonnet claude agent" → claude mapping
- model param: "Required when a tier is specified" (directive, not passive)
- s09 tasks: terse phrasing matching real failure; runner type + whitelist updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ity-v2

# Conflicts:
#	.agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/summary.md
#	.agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/trajectory.json
#	packages/cli/src/cli/agent-relay-mcp.ts
#	tests/integration/broker/evals/runner.ts
#	tests/integration/broker/evals/scenarios/index.ts

The merged @relaycast/sdk 4.0 bump narrows SpawnAgentRequest.cli to the core five harnesses (claude/codex/gemini/aider/goose), dropping grok and opencode. The broker and gateway still accept grok/opencode at runtime, so keep them selectable in the add_agent MCP tool enum and cast at the spawn boundary rather than removing the capability. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
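
The "cast at the spawn boundary" approach described in this commit message can be sketched as follows. The type and function names here (`SdkCli`, `ToolCli`, `SpawnRequest`, `toSpawnRequest`) are hypothetical stand-ins rather than the SDK's actual exports; only the list of harness names comes from the commit message.

```typescript
// Assumed shape of the narrowed SDK type: the five core harnesses named above.
type SdkCli = "claude" | "codex" | "gemini" | "aider" | "goose";

// The tool enum keeps grok and opencode selectable.
type ToolCli = SdkCli | "grok" | "opencode";

// Hypothetical stand-in for the SDK's spawn request type.
interface SpawnRequest {
  name: string;
  cli: SdkCli;
}

function toSpawnRequest(name: string, cli: ToolCli): SpawnRequest {
  // Deliberate cast: per the commit message, the broker and gateway still
  // accept grok/opencode at runtime even though the type no longer lists them.
  return { name, cli: cli as SdkCli };
}

const request = toSpawnRequest("worker-1", "opencode");
console.log(request);
```

Keeping the cast in one function confines the type mismatch to a single place, so it is easy to remove if the SDK type widens again.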

Proactive Runtime Bot and others added 2 commits June 14, 2026 10:18

fix(broker): apply eval relay worker guidance

5f066bb

khaliqgant requested a review from willwashburn as a code owner June 14, 2026 08:57

khaliqgant had a problem deploying to preview June 14, 2026 08:57 — with GitHub Actions Failure

gemini-code-assist Bot reviewed Jun 14, 2026

style: auto-format with Prettier

43e5b0e

chatgpt-codex-connector Bot reviewed Jun 14, 2026

coderabbitai Bot reviewed Jun 14, 2026

github-actions Bot temporarily deployed to preview June 14, 2026 09:10 Inactive

khaliqgant added the no-agent-relay-review label Jun 14, 2026

khaliqgant temporarily deployed to preview June 14, 2026 15:35 — with GitHub Actions Inactive

style: auto-format with Prettier

7ede017

coderabbitai Bot reviewed Jun 14, 2026

github-actions Bot temporarily deployed to preview June 14, 2026 16:57 Inactive

style: auto-format with Prettier

7c41ad5

github-actions Bot temporarily deployed to preview June 14, 2026 21:25 Inactive

style: auto-format with Prettier

5c95e75

github-actions Bot temporarily deployed to preview June 15, 2026 18:05 Inactive

Proactive Runtime Bot and others added 2 commits June 16, 2026 00:16

docs(changelog): update unreleased entry with accurate droid/opencode…

646358c

… changes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

khaliqgant temporarily deployed to preview June 15, 2026 22:16 — with GitHub Actions Inactive

style: auto-format Rust code with cargo fmt

6c0b86d

github-actions Bot temporarily deployed to preview June 15, 2026 22:21 Inactive

style: auto-format with Prettier

59416a3

coderabbitai Bot reviewed Jun 15, 2026

Comment thread crates/broker/src/snippets.rs Outdated

Comment thread crates/broker/src/snippets.rs

github-actions Bot temporarily deployed to preview June 16, 2026 08:18 Inactive

khaliqgant temporarily deployed to preview June 16, 2026 08:43 — with GitHub Actions Inactive

khaliqgant temporarily deployed to preview June 16, 2026 08:53 — with GitHub Actions Inactive

github-actions Bot and others added 2 commits June 16, 2026 08:54

style: auto-format with Prettier

1d88c29

khaliqgant temporarily deployed to preview June 16, 2026 08:58 — with GitHub Actions Inactive

Proactive Runtime Bot and others added 3 commits June 16, 2026 11:17

add opencode and grok

0179a06

khaliqgant temporarily deployed to preview June 16, 2026 12:17 — with GitHub Actions Inactive

khaliqgant temporarily deployed to preview June 16, 2026 12:37 — with GitHub Actions Inactive

style: auto-format with Prettier

e857585

github-actions Bot temporarily deployed to preview June 16, 2026 12:42 Inactive

khaliqgant merged commit 1fcbae5 into main Jun 16, 2026
50 checks passed

khaliqgant deleted the feature/s08-lead-quality-v2 branch June 16, 2026 12:51

khaliqgant temporarily deployed to preview June 16, 2026 12:51 — with GitHub Actions Inactive
