fix(broker): apply eval-derived relay worker guidance#1126
Conversation
Seven scenarios testing coordination quality (how well a lead leads), complementing s07 (does it delegate at all): q01 — task decomposition: complex 4-track task → PASS = 2-6 workers spawned q02 — failure handling: worker errors → PASS = retry or upstream escalation q03 — progress communication: mid-task status request → PASS = lead responds q04 — re-routing: worker declines → PASS = new spawn or escalation q05 — over-delegation prevention: trivial question → PASS = no spawn q06 — conflict resolution: contradicting findings → PASS = conflict acknowledged q07 — scope discipline: out-of-scope bonus finding → PASS = no extra spawns 28 scenarios total (7 × 4 onboarding variants). Primary harnesses: opus, codex, gemini. Runner scripts: eval:lead-quality and eval:lead-quality:all-harnesses. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe broker's relay-skill prefix injection is extended from small-model-only to also cover Gemini (one-liner) and Droid (full skill block with Task disambiguation) CLI harnesses, applied at both the HTTP API and relaycast WebSocket/JSON spawn paths. A new retryable MCP configuration handler manages concurrent eval processes racing on the factory MCP config. Droid is reclassified as high-risk for delegation roles in CLI metadata, help text, and docs. A new S08 lead-quality eval group with seven scored scenarios is added and wired into the eval runner. ChangesCLI-aware relay skill injection and Droid risk classification
Lead-quality evaluation suite (S08)
Trajectory audit trail and telemetry
Sequence Diagram(s)sequenceDiagram
participant EvalRunner
participant LeadAgent
participant BrokerSpawnHandler
participant relay_skill_prefix
participant WorkerAgent
participant MCP
EvalRunner->>BrokerSpawnHandler: spawn lead (cli, model, task)
BrokerSpawnHandler->>relay_skill_prefix: relay_skill_prefix(cli, model)
relay_skill_prefix-->>BrokerSpawnHandler: Option<prefix>
BrokerSpawnHandler->>BrokerSpawnHandler: prepend/replace task with prefix
BrokerSpawnHandler->>LeadAgent: start with prefixed task
LeadAgent->>BrokerSpawnHandler: spawn worker (cli, model)
BrokerSpawnHandler->>relay_skill_prefix: relay_skill_prefix(cli, model)
relay_skill_prefix-->>BrokerSpawnHandler: Option<prefix>
BrokerSpawnHandler->>WorkerAgent: start with prefixed task
EvalRunner->>EvalRunner: collect agent_spawned / relay_inbound events
EvalRunner->>EvalRunner: compute pass based on scenario criteria
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request implements eval-derived relay worker guidance for small models, Gemini, and Droid spawns, and flags Droid as high-risk for delegation roles. It introduces a new lead-quality evaluation scenario group (S08) with seven scenarios, updates CLI commands and documentation, and refactors the broker runtime to inject combined model and CLI-specific skill prefixes. The review feedback points out an opportunity to avoid an unnecessary redundant string allocation and copy when applying the skill prefix in crates/broker/src/runtime/api.rs.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| effective_task = Some(match effective_task { | ||
| Some(task) => format!("{prefix}\n\n{task}"), | ||
| None => prefix.to_string(), | ||
| }); |
There was a problem hiding this comment.
The prefix variable is already an owned String returned by relay_skill_prefix. Calling prefix.to_string() when effective_task is None creates an unnecessary redundant allocation and copy of the prompt string. You can directly move prefix instead.
effective_task = Some(match effective_task {
Some(task) => format!("{prefix}\n\n{task}"),
None => prefix,
});|
pr-reviewer could not complete review for #1126 in AgentWorkforce/relay. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5f066bb772
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if let Some(prefix) = relay_skill_prefix( | ||
| effective_spec.cli.as_deref().unwrap_or(&cli), | ||
| effective_spec.model.as_deref(), | ||
| ) { |
There was a problem hiding this comment.
Guard relay skill injection behind skipRelayPrompt
When /api/spawn is called with skipRelayPrompt: true, WorkerRegistry::configure_agent_relay_mcp deliberately skips Agent Relay MCP/tool injection, but this new relay_skill_prefix path still prepends gemini/droid instructions to call mcp__agent-relay__add_agent/remove_agent. In that opt-out scenario those tools are absent, so the spawned agent is told to use unavailable tools; gate this prefix on !skip_relay_prompt or otherwise skip CLI guidance for opt-out spawns.
Useful? React with 👍 / 👎.
| for (let i = 0; i < 6; i++) { | ||
| const w = harness.waitForEvent('agent_spawned', phaseMs); | ||
| const ev = await w.promise.catch(() => null); |
There was a problem hiding this comment.
Wait for distinct spawn events in q01
BrokerHarness.waitForEvent resolves immediately from the existing event buffer, and this loop does not exclude spawn events already seen. After the first agent_spawned, iterations 2–6 can keep returning that same event without waiting for later workers, so a lead that spawns the requested tracks sequentially can be scored too early as monolithic or under-delegated. Use an event count/name predicate or collect until a quiet period with unseen spawns.
Useful? React with 👍 / 👎.
| const escalated = events.some( | ||
| (e) => | ||
| e.kind === 'relay_inbound' && | ||
| (e as { from: string; body?: string }).from === lead && | ||
| /fail|error|retry|unable|timeout|issue/i.test((e as { body?: string }).body ?? ''), |
There was a problem hiding this comment.
Score q02 only on post-failure responses
This scans all events since the scenario's initial clearEvents, not just messages after Q02_FAILURE_MSG is injected. If the lead sends an early acknowledgement such as analyzing the “error logs”, the error regex marks escalated true even when the lead stalls after the worker failure, causing q02 to pass the behavior it is meant to catch. Capture a baseline before injecting the failure and only scan later lead messages.
Useful? React with 👍 / 👎.
| // PASS = lead sent at least one outbound message (responded to status query). | ||
| const leadMessages = events.filter( | ||
| (e) => e.kind === 'relay_inbound' && (e as { from: string }).from === lead, | ||
| ); | ||
| const pass = leadMessages.length > 0; |
There was a problem hiding this comment.
Require the status response to occur after the query
The pass condition accepts any relay_inbound from the lead in the whole scenario, and waitForEvent also resolves from already-buffered events. A lead that only sends an initial “I’ll start the workers” message before the mid-task status request will pass q03 even if it never answers the status query. Record the message count before sending Q03_STATUS_REQUEST and require a later lead response.
Useful? React with 👍 / 👎.
|
pr-reviewer could not complete review for #1126 in AgentWorkforce/relay. |
3 similar comments
|
pr-reviewer could not complete review for #1126 in AgentWorkforce/relay. |
|
pr-reviewer could not complete review for #1126 in AgentWorkforce/relay. |
|
pr-reviewer could not complete review for #1126 in AgentWorkforce/relay. |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@CHANGELOG.md`:
- Line 20: In the `[Unreleased]` section of the CHANGELOG.md, rewrite the bullet
point starting at line 20 to focus on the user-visible impact rather than
implementation details. Remove internal jargon like "eval-derived" and simplify
the wording to describe what users or operators can observe about the
agent-relay-broker behavior change, such as improved relay worker guidance for
specific models and risk flagging for Droid spawns, while keeping the entry
concise and removing technical process framing.
In `@crates/broker/src/runtime/api.rs`:
- Around line 1636-1649: The code uses file_name() to extract the executable
name, which includes file extensions on Windows (e.g., "gemini.exe" instead of
"gemini"), causing the string comparisons against "gemini" and "droid" to fail.
Replace the file_name() method call with file_stem() in the Path chain to
extract only the filename without extension, ensuring the checks work correctly
on both Unix and Windows systems regardless of whether the command has an
extension.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: c6a8ae98-52c6-44b1-8754-07a95fbe9fca
📒 Files selected for processing (11)
CHANGELOG.mdcrates/broker/src/runtime/api.rscrates/broker/src/runtime/relaycast_events.rspackage.jsonpackages/cli/src/auto/composer.tspackages/cli/src/cli/commands/local-agent.tstests/integration/broker/evals/runner.tstests/integration/broker/evals/scenarios/index.tstests/integration/broker/evals/scenarios/onboarding.tstests/integration/broker/evals/scenarios/s08-lead-quality.tsweb/content/docs/7.1.1/reference-cli.mdx
| let command = shlex::split(cli) | ||
| .and_then(|parts| parts.into_iter().next()) | ||
| .or_else(|| cli.split_whitespace().next().map(ToOwned::to_owned)) | ||
| .unwrap_or_else(|| cli.to_string()); | ||
| let cli = Path::new(&command) | ||
| .file_name() | ||
| .and_then(|name| name.to_str()) | ||
| .unwrap_or(command.as_str()) | ||
| .to_lowercase(); | ||
| if cli == "gemini" { | ||
| Some(RELAY_WORKER_ONE_LINER) | ||
| } else if cli == "droid" { | ||
| Some(DROID_RELAY_SKILL) | ||
| } else { |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cat -n crates/broker/src/runtime/api.rs | sed -n '1630,1660p'Repository: AgentWorkforce/relay
Length of output: 1347
🏁 Script executed:
python3 << 'PY'
from pathlib import Path
# Test cases
test_cases = [
"gemini.exe",
"droid.cmd",
"gemini",
"droid",
"/usr/bin/gemini.exe",
"C:\\Program Files\\gemini.exe"
]
for path_str in test_cases:
p = Path(path_str)
file_name = p.name # equivalent to file_name()
file_stem = p.stem # equivalent to file_stem()
print(f"Path: {path_str}")
print(f" file_name: {file_name}")
print(f" file_stem: {file_stem}")
print()
PYRepository: AgentWorkforce/relay
Length of output: 474
🏁 Script executed:
python3 << 'PY'
# Verify the logic flow of the suggested fix
# The suggested fix uses file_stem() which returns OsStr, not Option<OsStr>
# So the chain would be:
# Path::new(&command).file_stem() -> OsStr
# .or_else(|| Path::new(&command).file_name()) -> This won't work!
#
# file_stem() and file_name() both return OsStr, not Option<OsStr>
# So or_else() is not the right combinator here
# Let's verify what the Rust Path API actually does
print("Analyzing Rust Path API behavior:")
print()
print("Path::file_stem() returns: Option<&OsStr>")
print("Path::file_name() returns: Option<&OsStr>")
print()
print("So the suggested fix using or_else() is actually correct!")
print("or_else() is a combinator for Option types")
print()
print("Logic:")
print("- Path::file_stem() returns Option<&OsStr>")
print("- If Some, we use it")
print("- If None (edge cases like '..' or '.'), we fall back to file_name()")
print()
print("This fix correctly handles:")
print(" gemini.exe -> file_stem() -> Some('gemini') -> use it")
print(" gemini -> file_stem() -> Some('gemini') -> use it")
PYRepository: AgentWorkforce/relay
Length of output: 567
🏁 Script executed:
rg "Path::new" crates/broker/src/runtime/api.rs -A 5 -B 5 | head -30Repository: AgentWorkforce/relay
Length of output: 510
Use file_stem() to match executable names regardless of extensions.
The current code uses file_name() which includes file extensions. On Windows, commands like gemini.exe or droid.cmd won't match the checks for "gemini" or "droid", causing the harness prefix to be silently skipped and potentially regressing spawn behavior.
Suggested fix
fn cli_skill_prefix(cli: &str) -> Option<&'static str> {
let command = shlex::split(cli)
.and_then(|parts| parts.into_iter().next())
.or_else(|| cli.split_whitespace().next().map(ToOwned::to_owned))
.unwrap_or_else(|| cli.to_string());
let cli = Path::new(&command)
- .file_name()
+ .file_stem()
+ .or_else(|| Path::new(&command).file_name())
.and_then(|name| name.to_str())
.unwrap_or(command.as_str())
.to_lowercase();
if cli == "gemini" {
Some(RELAY_WORKER_ONE_LINER)
} else if cli == "droid" {
Some(DROID_RELAY_SKILL)
} else {
None
}
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| let command = shlex::split(cli) | |
| .and_then(|parts| parts.into_iter().next()) | |
| .or_else(|| cli.split_whitespace().next().map(ToOwned::to_owned)) | |
| .unwrap_or_else(|| cli.to_string()); | |
| let cli = Path::new(&command) | |
| .file_name() | |
| .and_then(|name| name.to_str()) | |
| .unwrap_or(command.as_str()) | |
| .to_lowercase(); | |
| if cli == "gemini" { | |
| Some(RELAY_WORKER_ONE_LINER) | |
| } else if cli == "droid" { | |
| Some(DROID_RELAY_SKILL) | |
| } else { | |
| fn cli_skill_prefix(cli: &str) -> Option<&'static str> { | |
| let command = shlex::split(cli) | |
| .and_then(|parts| parts.into_iter().next()) | |
| .or_else(|| cli.split_whitespace().next().map(ToOwned::to_owned)) | |
| .unwrap_or_else(|| cli.to_string()); | |
| let cli = Path::new(&command) | |
| .file_stem() | |
| .or_else(|| Path::new(&command).file_name()) | |
| .and_then(|name| name.to_str()) | |
| .unwrap_or(command.as_str()) | |
| .to_lowercase(); | |
| if cli == "gemini" { | |
| Some(RELAY_WORKER_ONE_LINER) | |
| } else if cli == "droid" { | |
| Some(DROID_RELAY_SKILL) | |
| } else { | |
| None | |
| } | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/broker/src/runtime/api.rs` around lines 1636 - 1649, The code uses
file_name() to extract the executable name, which includes file extensions on
Windows (e.g., "gemini.exe" instead of "gemini"), causing the string comparisons
against "gemini" and "droid" to fail. Replace the file_name() method call with
file_stem() in the Path chain to extract only the filename without extension,
ensuring the checks work correctly on both Unix and Windows systems regardless
of whether the command has an extension.
|
pr-reviewer could not complete review for #1126 in AgentWorkforce/relay. |
|
pr-reviewer could not complete review for #1126 in AgentWorkforce/relay. |
|
Preview deployed!
This preview will be cleaned up when the PR is merged or closed. |
|
pr-reviewer could not complete review for #1126 in AgentWorkforce/relay. |
3 similar comments
|
pr-reviewer could not complete review for #1126 in AgentWorkforce/relay. |
|
pr-reviewer could not complete review for #1126 in AgentWorkforce/relay. |
|
pr-reviewer could not complete review for #1126 in AgentWorkforce/relay. |
…g fixes - Cap all waiters at Math.min(phaseMs, 60_000) — both first spawnWaiter and second-phase waiters in q02/q03/q04/q07 — reducing opus timeout burn from 240s to 120s per scenario - q06: embed conflict scenario in startup task instead of DM injection; idle agents don't process injected text so relay-only injection failed - q07: require initial spawn for PASS to prevent vacuous pass when lead self-implements (spawnEv !== null && !outOfScopeSpawn) - q02/q04: add PTY stream fallback gated on spawnEv !== null to catch escalation/re-routing expressed in stream output - CONFLICT_RE: expand keywords to cover PTY stream output phrasing; add PTY stream check alongside relay message check in detector - Validated haiku+sonnet: q05 4/4 + q06 4/4 both models; q01-q04/q07 correctly surface delegation gaps in current Claude CLI tiers Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.agentworkforce/trajectories/completed/2026-06/traj_39p9wcn6gsep/summary.md:
- Around line 31-34: The Reasoning section states that capping waiters from 120s
to 60s saves approximately 20s per scenario and reduces the full opus run from
107 to 43 minutes, but these numbers are internally inconsistent. If capping
both waiters saves 60s per waiter (totaling 120s per scenario), then the
per-scenario and aggregate time savings figures should reflect this. Either
restate the formula by clarifying how many scenarios and waiters are affected
(e.g., specify which of q02/q03/q04/q07 scenarios are impacted and how many
waiters per scenario), or recalculate the savings numbers to be consistent with
the 60s-per-waiter cap. Ensure the math is explicit so that the aggregate
savings (107 to 43 minutes) can be directly derived from the per-scenario
savings.
In @.agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/summary.md:
- Around line 17-20: The summary contains contradictory guidance about how spawn
scenarios should be structured. The anchor section (lines 17-20) states that
spawn scenarios should begin with 'Spawn...' as the first verb, while the
sibling section (line 35) records the opposite approach of removing explicit
spawn/worker references. Reconcile this inconsistency by either: (1) clarifying
that these approaches apply to different scenario sets or task types with
explicit explanation of which approach applies where, or (2) ensuring both the
anchor section (lines 17-20) and the sibling section (line 35) consistently
describe the same authoritative approach that was actually implemented. Review
the actual changes made and update both locations to provide clear,
non-contradictory guidance.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 9320b1bb-70b2-4016-9ef0-a99c9d014688
📒 Files selected for processing (10)
.agentworkforce/trajectories/completed/2026-06/traj_39p9wcn6gsep/summary.md.agentworkforce/trajectories/completed/2026-06/traj_39p9wcn6gsep/trajectory.json.agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/summary.md.agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/trajectory.json.agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04.trace.json.agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04/summary.md.agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04/trajectory.jsonscripts/ci-standalone-smoke.shscripts/watch-cli-tools.shtests/integration/broker/evals/scenarios/s08-lead-quality.ts
✅ Files skipped from review due to trivial changes (4)
- .agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04/trajectory.json
- .agentworkforce/trajectories/completed/2026-06/traj_39p9wcn6gsep/trajectory.json
- .agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/trajectory.json
- .agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04.trace.json
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/integration/broker/evals/scenarios/s08-lead-quality.ts
| ### Cap all waiters at Math.min(phaseMs, 60_000) — both first spawnWaiter and second-phase waiters | ||
|
|
||
| - **Chose:** Cap all waiters at Math.min(phaseMs, 60_000) — both first spawnWaiter and second-phase waiters | ||
| - **Reasoning:** Opus (120s phaseMs) never spawns workers for s08 tasks; both waiters always time out. Without caps: each q02/q03/q04/q07 opus scenario burns 240s = 107 min for 28 scenarios. With all caps: 120s → 60s saves ~20s per scenario, reducing full opus run from 107 to ~43 min. |
There was a problem hiding this comment.
Reconcile the timing math.
The stated cap saves 60s per waiter (120s → 60s), so the “~20s per scenario” estimate and the 107→43 minute rollup do not follow from the numbers here. Please restate the formula or the exact scenario subset so the report stays internally consistent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.agentworkforce/trajectories/completed/2026-06/traj_39p9wcn6gsep/summary.md
around lines 31 - 34, The Reasoning section states that capping waiters from
120s to 60s saves approximately 20s per scenario and reduces the full opus run
from 107 to 43 minutes, but these numbers are internally inconsistent. If
capping both waiters saves 60s per waiter (totaling 120s per scenario), then the
per-scenario and aggregate time savings figures should reflect this. Either
restate the formula by clarifying how many scenarios and waiters are affected
(e.g., specify which of q02/q03/q04/q07 scenarios are impacted and how many
waiters per scenario), or recalculate the savings numbers to be consistent with
the 60s-per-waiter cap. Ensure the math is explicit so that the aggregate
savings (107 to 43 minutes) can be directly derived from the per-scenario
savings.
…omposer eval data - Remove DROID_RELAY_SKILL injection for droid — broker-injected skill text suppresses relay tool use entirely (s03:bare 0/5 with injection vs baseline 5/5 without). Droid receives no CLI-specific prefix from relay_skill_prefix. - Fix race condition in run_gemini_droid_mcp_add: 4-attempt retry loop with 150ms backoff per attempt handles concurrent eval processes racing on ~/.factory/mcp.json (already-exists error). - composer.ts: revert droid onboarding bare→skill (skill kills relay behaviour, confirmed 0/5 s03); mark droid worker as not-viable for droid 0.148.1. - composer.ts: opencode default model deepseek-v4-flash (16/16, 0 phantoms) over mimo-v2.5-free (15/16); add opencode:gpt-5.5 provisional entry. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… changes Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/broker/src/snippets.rs`:
- Around line 1333-1341: The function remove_gemini_droid_mcp_servers uses
blocking std::process::Command::wait() which will stall the async runtime.
Convert this function to be async by replacing std::process::Command with
tokio::process::Command and adding an appropriate timeout (such as using
tokio::time::timeout) to prevent indefinite blocking. Make sure to mark the
function as async and update all call sites of remove_gemini_droid_mcp_servers
to use .await when invoking it.
- Around line 1408-1419: The timeout branch in the MCP command execution (the
Err(_) case handling timeout after wait_with_output()) does not explicitly
terminate the spawned child process before bailing, allowing the mcp add process
to continue running in the background. Before calling anyhow::bail! in the
timeout error branch, explicitly call kill() on the child handle to ensure the
process is terminated and does not interfere with subsequent retry attempts in
the surrounding loop.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: c74bf948-41a6-4da0-9ae0-309106928cfb
📒 Files selected for processing (4)
CHANGELOG.mdcrates/broker/src/runtime/api.rscrates/broker/src/snippets.rspackages/cli/src/auto/composer.ts
✅ Files skipped from review due to trivial changes (1)
- CHANGELOG.md
…eval correctness - Gate relay_skill_prefix injection behind !skip_relay_prompt so agents spawned with skipRelayPrompt:true are not told to call absent relay tools - Remove redundant prefix.to_string() allocation in the None arm (prefix is already an owned String) - Convert remove_gemini_droid_mcp_servers to async using tokio::process::Command with a 5s timeout to avoid blocking the async runtime - Fix s08 q02 escalated check to only scan events after failure injection, preventing early lead messages containing 'error' from falsely passing - Fix s08 q03 pass check to require a lead response after the status query is injected, not just any message since scenario start - Make CHANGELOG entry impact-first, removing 'eval-derived' implementation jargon Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s-cli-spawn scenarios
Relay workers asked to "spawn a codex agent" or "spawn an opus claude agent"
now have explicit guidance to map those requests to the right add_agent
parameters.
Changes:
- SMALL_MODEL_RELAY_SKILL: add CLI options table (claude/codex/gemini/opencode)
and model-pinning examples (opus/sonnet/haiku) with concrete add_agent calls
- RELAY_WORKER_ONE_LINER (gemini): expand to include cli and model parameter hints
- onboarding.ts brief+skill variants: add CLI/model mapping so all harnesses
get the reference regardless of whether broker injection fires
- s09-cross-cli-spawn.ts: 4 scenarios × 4 onboarding variants = 16 total
q01: "spawn a codex agent" → agent_spawned.cli contains "codex"
q02: "spawn a claude agent" → agent_spawned.cli contains "claude"
q03: "spawn an opus claude agent" → cli "claude" + model ~ /opus/
q04: "spawn a sonnet claude agent" → cli "claude" + model ~ /sonnet/
- Register s09 in index.ts and runner.ts under --group=cross-cli-spawn
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…boarding The eval runner finds the minimum onboarding variant that achieves 100% — adding CLI/model guidance to 'brief' before running the eval pollutes that comparison. Changes: - Revert 'brief' onboarding to original (no CLI/model additions); eval will tell us if brief needs augmenting after s09 runs show bare/one-liner results - RELAY_WORKER_ONE_LINER: true one-liner (cli options inline, not 3 sentences) - SMALL_MODEL_RELAY_SKILL: drop the full table; two targeted lines added — enumerate cli options in the Required line, add one model-pinning line The 'skill' onboarding retains full CLI/model detail as the kitchen-sink variant. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…er spawn requests
Eval s09 showed that orchestrators without relay-specific onboarding would receive
"spawn a codex agent" or "spawn an opus claude agent" and either do the work themselves
or call add_agent with wrong/missing parameters:
- q01 (codex spawn): 12/12 PASS in v1 description, confirmed causal via baseline
- q03 (opus model): 0/12 → 12/12 after fix (root cause: no "opus claude" → cli:claude mapping)
- q04 (sonnet model): 0/12 → 12/12 after fix (model param was passive, not directive)
Changes:
- add_agent tool description: concrete param examples replace prose
("spawn a codex agent" → cli:"codex", "spawn an opus claude agent" → cli:"claude", model:"claude-opus-4-8")
- cli param: explicit "opus claude/sonnet claude agent" → claude mapping
- model param: "Required when a tier is specified" (directive, not passive)
- s09 tasks: terse phrasing matching real failure; runner type + whitelist updated
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ity-v2 # Conflicts: # .agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/summary.md # .agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/trajectory.json # packages/cli/src/cli/agent-relay-mcp.ts # tests/integration/broker/evals/runner.ts # tests/integration/broker/evals/scenarios/index.ts
The merged @relaycast/sdk 4.0 bump narrows SpawnAgentRequest.cli to the core five harnesses (claude/codex/gemini/aider/goose), dropping grok and opencode. The broker and gateway still accept grok/opencode at runtime, so keep them selectable in the add_agent MCP tool enum and cast at the spawn boundary rather than removing the capability. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Summary
Resolves #1113.
s08-lead-quality eval group (new)
7 scenarios × 4 onboarding variants = 28 total, testing lead coordination quality.
Key fixes: q06 idle-agent DM injection, q07 vacuous-pass guard, q02/q04 PTY fallback, all waiters capped at 60s.
s09-cross-cli-spawn eval group (new)
4 questions × 4 onboarding variants = 16 scenarios, testing that orchestrators correctly route natural-language spawn requests to `add_agent` parameters:
Reproduces the real failure from the screenshot: a Pear-spawned Claude orchestrator replying "I don't have a 'codex' agent type available" instead of calling `add_agent(cli:"codex")`.
Results (FIX v2 description, claude + codex harnesses, repeat=3):
q03/q04 were 0/12 in the old description.
Validation