Skip to content

fix(broker): apply eval-derived relay worker guidance#1126

Merged
khaliqgant merged 20 commits into
mainfrom
feature/s08-lead-quality-v2
Jun 16, 2026
Merged

fix(broker): apply eval-derived relay worker guidance#1126
khaliqgant merged 20 commits into
mainfrom
feature/s08-lead-quality-v2

Conversation

@khaliqgant

@khaliqgant khaliqgant commented Jun 14, 2026

Copy link
Copy Markdown
Member

Summary

Resolves #1113.

  • Apply eval-derived relay worker guidance for small models, Gemini, and Droid spawns.
  • Use "relay worker" vocabulary consistently and remove the small-model "do it yourself" heuristic.
  • Apply the same prefix policy to HTTP/API spawns and Relaycast/worker-initiated spawn paths.
  • Flag Droid as high-risk for delegation/spawn roles in auto-composer metadata, local CLI help, and CLI reference docs.
  • Fix `add_agent` MCP tool description: concrete param examples replace prose, eliminating zero-spawn and missing-model failures for cross-CLI and model-tier requests.

s08-lead-quality eval group (new)

7 scenarios × 4 onboarding variants = 28 total, testing lead coordination quality.

Key fixes: q06 idle-agent DM injection, q07 vacuous-pass guard, q02/q04 PTY fallback, all waiters capped at 60s.

s09-cross-cli-spawn eval group (new)

4 questions × 4 onboarding variants = 16 scenarios, testing that orchestrators correctly route natural-language spawn requests to `add_agent` parameters:

  • q01 — "spawn a codex agent" → `cli:"codex"`
  • q02 — "spawn a claude agent" → `cli:"claude"`
  • q03 — "spawn an opus claude agent" → `cli:"claude", model:"claude-opus-4-8"`
  • q04 — "spawn a sonnet claude agent" → `cli:"claude", model:"claude-sonnet-4-6"`

Reproduces the real failure from the screenshot: a Pear-spawned Claude orchestrator replying "I don't have a 'codex' agent type available" instead of calling `add_agent(cli:"codex")`.

Results (FIX v2 description, claude + codex harnesses, repeat=3):

claude codex
q01 (codex spawn) 10/12 12/12
q02 (claude spawn) 8/12 12/12
q03 (opus model) 12/12 12/12
q04 (sonnet model) 12/12 12/12

q03/q04 were 0/12 in the old description.

Validation

  • cargo test -p agent-relay-broker skill_injection_tests
  • cargo check -p agent-relay-broker
  • npm run eval:unit -- tests/integration/broker/evals/scenarios
  • npx vitest run packages/cli/src/cli/commands/local-agent.test.ts packages/cli/src/auto/composer.test.ts
  • npm --prefix packages/cli run build
  • cargo fmt --check
  • prettier --check packages/cli/src/cli/agent-relay-mcp.ts tests/integration/broker/evals/scenarios/s09-cross-cli-spawn.ts CHANGELOG.md

Proactive Runtime Bot and others added 2 commits June 14, 2026 10:18
Seven scenarios testing coordination quality (how well a lead leads),
complementing s07 (does it delegate at all):

  q01 — task decomposition: complex 4-track task → PASS = 2-6 workers spawned
  q02 — failure handling: worker errors → PASS = retry or upstream escalation
  q03 — progress communication: mid-task status request → PASS = lead responds
  q04 — re-routing: worker declines → PASS = new spawn or escalation
  q05 — over-delegation prevention: trivial question → PASS = no spawn
  q06 — conflict resolution: contradicting findings → PASS = conflict acknowledged
  q07 — scope discipline: out-of-scope bonus finding → PASS = no extra spawns

28 scenarios total (7 × 4 onboarding variants). Primary harnesses: opus, codex,
gemini. Runner scripts: eval:lead-quality and eval:lead-quality:all-harnesses.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The broker's relay-skill prefix injection is extended from small-model-only to also cover Gemini (one-liner) and Droid (full skill block with Task disambiguation) CLI harnesses, applied at both the HTTP API and relaycast WebSocket/JSON spawn paths. A new retryable MCP configuration handler manages concurrent eval processes racing on the factory MCP config. Droid is reclassified as high-risk for delegation roles in CLI metadata, help text, and docs. A new S08 lead-quality eval group with seven scored scenarios is added and wired into the eval runner.

Changes

CLI-aware relay skill injection and Droid risk classification

Layer / File(s) Summary
Relay skill constants and CLI prefix selection logic
crates/broker/src/runtime/api.rs
Introduces RELAY_WORKER_ONE_LINER and updates SMALL_MODEL_RELAY_SKILL wording to use "relay worker" terminology, adds cli_skill_prefix mapping parsed CLI executable names to harness-specific prefixes, and adds relay_skill_prefix combining model and CLI signals with unit tests.
Prefix injection at HTTP API and relaycast spawn sites
crates/broker/src/runtime/api.rs, crates/broker/src/runtime/relaycast_events.rs
Updates the HTTP API spawn handler to call relay_skill_prefix(cli, model) instead of model-only prefix, and adds mutable effective_task with conditional prefix injection (prepend or replace) to both the primary WebSocket and JSON fallback spawn paths, with debug logging when injection occurs.
Gemini/Droid MCP configuration with concurrent retry
crates/broker/src/snippets.rs
Refactors configure_gemini_droid_mcp to use run_gemini_droid_mcp_add with up to 4 idempotent attempts, backoff retry on "already exists" stderr, and spawn_mcp_add helper for timeout-safe execution, handling concurrent eval processes racing on ~/.factory/mcp.json.
Droid risk classification in CLI metadata and docs
packages/cli/src/auto/composer.ts, packages/cli/src/cli/commands/local-agent.ts, web/content/docs/7.1.1/reference-cli.mdx, CHANGELOG.md
Switches droid's recommended onboarding to skill, downgrades droid role fitness to not-viable with explicit spawning constraints in HARNESS_ROLE_MAP, updates opencode default model to deepseek-v4-flash and adds opencode:gpt-5.5 variant, adds high-risk warning to agent spawn/agent new help text, inserts caution in CLI reference docs, and records changes in the changelog.

Lead-quality evaluation suite (S08)

Layer / File(s) Summary
Onboarding text updates and eval runner wiring
tests/integration/broker/evals/scenarios/onboarding.ts, tests/integration/broker/evals/scenarios/index.ts, tests/integration/broker/evals/runner.ts, package.json
Updates onboardingText() variants to use relay-worker terminology and explicit MCP call names, registers LEAD_QUALITY_EVAL_SCENARIOS in the scenario index and ALL_SCENARIOS, extends ScenarioGroup type/flag parsing/pool selection, and adds eval:lead-quality npm scripts.
S08 lead-quality scenario implementations (q01–q07)
tests/integration/broker/evals/scenarios/s08-lead-quality.ts
Adds conflict-acknowledgement detector, shared lead role prompts, and seven scenario builders covering task decomposition, worker failure retry/escalation, mid-task status response, worker decline rerouting, over-delegation prevention, conflict resolution, and scope discipline; exports LEAD_QUALITY_SCENARIOS across all onboarding variants.

Trajectory audit trail and telemetry

Layer / File(s) Summary
S08 lead-quality eval completion and audit trails
.agentworkforce/trajectories/completed/2026-06/traj_*/ (summary.md, trajectory.json, trace.json)
Records completed s08-lead-quality eval trajectory with scenario counts and validation results; captures abandoned Track A security audit trajectory; includes eval run telemetry with file/range mappings.

Sequence Diagram(s)

sequenceDiagram
  participant EvalRunner
  participant LeadAgent
  participant BrokerSpawnHandler
  participant relay_skill_prefix
  participant WorkerAgent
  participant MCP
  EvalRunner->>BrokerSpawnHandler: spawn lead (cli, model, task)
  BrokerSpawnHandler->>relay_skill_prefix: relay_skill_prefix(cli, model)
  relay_skill_prefix-->>BrokerSpawnHandler: Option<prefix>
  BrokerSpawnHandler->>BrokerSpawnHandler: prepend/replace task with prefix
  BrokerSpawnHandler->>LeadAgent: start with prefixed task
  LeadAgent->>BrokerSpawnHandler: spawn worker (cli, model)
  BrokerSpawnHandler->>relay_skill_prefix: relay_skill_prefix(cli, model)
  relay_skill_prefix-->>BrokerSpawnHandler: Option<prefix>
  BrokerSpawnHandler->>WorkerAgent: start with prefixed task
  EvalRunner->>EvalRunner: collect agent_spawned / relay_inbound events
  EvalRunner->>EvalRunner: compute pass based on scenario criteria
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • AgentWorkforce/relay#1109: Both PRs touch the broker live-evals test runner/scenario registration pipeline (e.g., tests/integration/broker/evals/runner.ts and tests/integration/broker/evals/scenarios/index.ts), extending scenario group routing/selection to cover the new lead-quality eval group.
  • AgentWorkforce/relay#1076: Main PR's relay-skill prefix injection in handle_api_request/handle_relaycast_message depends on the workers.spawn(...) returning an effective_spec with CLI/model, which is enabled by that PR's WorkerRegistry::spawn propagation.

Suggested labels

size:XXL

Suggested reviewers

  • willwashburn

Poem

🐇 A bunny hopped through broker code one night,
And found that Droid was routing tasks all wrong!
So she stitched a prefix with her paws so light,
"Use relay workers!" — now the skill's more strong.
Seven evals named q01 through seven's end,
Each checks the lead behaves as we intend. 🌟

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 58.62% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: applying eval-derived relay worker guidance to the broker. It is specific and directly related to the core objective.
Linked Issues check ✅ Passed The PR addresses all four concrete action items from #1113: vocabulary fixes in SMALL_MODEL_RELAY_SKILL, CLI-aware skill injection for Gemini, onboarding vocabulary audit, and Droid flagging in capability surfaces. The code changes implement these objectives across the modified files.
Out of Scope Changes check ✅ Passed All changes align with the stated objectives from #1113 and the s08-lead-quality eval work. The trajectory files, eval scenarios, and metadata updates are supporting work for the new eval group. No unrelated modifications detected.
Description check ✅ Passed The PR description is mostly complete, including a clear summary, multiple detailed sections on eval groups, and comprehensive validation steps.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feature/s08-lead-quality-v2

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements eval-derived relay worker guidance for small models, Gemini, and Droid spawns, and flags Droid as high-risk for delegation roles. It introduces a new lead-quality evaluation scenario group (S08) with seven scenarios, updates CLI commands and documentation, and refactors the broker runtime to inject combined model and CLI-specific skill prefixes. The review feedback points out an opportunity to avoid an unnecessary redundant string allocation and copy when applying the skill prefix in crates/broker/src/runtime/api.rs.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread crates/broker/src/runtime/api.rs Outdated
Comment on lines 234 to 237
effective_task = Some(match effective_task {
Some(task) => format!("{prefix}\n\n{task}"),
None => prefix.to_string(),
});

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The prefix variable is already an owned String returned by relay_skill_prefix. Calling prefix.to_string() when effective_task is None creates an unnecessary redundant allocation and copy of the prompt string. You can directly move prefix instead.

                            effective_task = Some(match effective_task {
                                Some(task) => format!("{prefix}\n\n{task}"),
                                None => prefix,
                            });

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1126 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5f066bb772

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/broker/src/runtime/api.rs Outdated
Comment on lines +230 to +233
if let Some(prefix) = relay_skill_prefix(
effective_spec.cli.as_deref().unwrap_or(&cli),
effective_spec.model.as_deref(),
) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard relay skill injection behind skipRelayPrompt

When /api/spawn is called with skipRelayPrompt: true, WorkerRegistry::configure_agent_relay_mcp deliberately skips Agent Relay MCP/tool injection, but this new relay_skill_prefix path still prepends gemini/droid instructions to call mcp__agent-relay__add_agent/remove_agent. In that opt-out scenario those tools are absent, so the spawned agent is told to use unavailable tools; gate this prefix on !skip_relay_prompt or otherwise skip CLI guidance for opt-out spawns.

Useful? React with 👍 / 👎.

Comment on lines +122 to +124
for (let i = 0; i < 6; i++) {
const w = harness.waitForEvent('agent_spawned', phaseMs);
const ev = await w.promise.catch(() => null);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Wait for distinct spawn events in q01

BrokerHarness.waitForEvent resolves immediately from the existing event buffer, and this loop does not exclude spawn events already seen. After the first agent_spawned, iterations 2–6 can keep returning that same event without waiting for later workers, so a lead that spawns the requested tracks sequentially can be scored too early as monolithic or under-delegated. Use an event count/name predicate or collect until a quiet period with unseen spawns.

Useful? React with 👍 / 👎.

Comment on lines +251 to +255
const escalated = events.some(
(e) =>
e.kind === 'relay_inbound' &&
(e as { from: string; body?: string }).from === lead &&
/fail|error|retry|unable|timeout|issue/i.test((e as { body?: string }).body ?? ''),

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Score q02 only on post-failure responses

This scans all events since the scenario's initial clearEvents, not just messages after Q02_FAILURE_MSG is injected. If the lead sends an early acknowledgement such as analyzing the “error logs”, the error regex marks escalated true even when the lead stalls after the worker failure, causing q02 to pass the behavior it is meant to catch. Capture a baseline before injecting the failure and only scan later lead messages.

Useful? React with 👍 / 👎.

Comment on lines +356 to +360
// PASS = lead sent at least one outbound message (responded to status query).
const leadMessages = events.filter(
(e) => e.kind === 'relay_inbound' && (e as { from: string }).from === lead,
);
const pass = leadMessages.length > 0;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Require the status response to occur after the query

The pass condition accepts any relay_inbound from the lead in the whole scenario, and waitForEvent also resolves from already-buffered events. A lead that only sends an initial “I’ll start the workers” message before the mid-task status request will pass q03 even if it never answers the status query. Record the message count before sending Q03_STATUS_REQUEST and require a later lead response.

Useful? React with 👍 / 👎.

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1126 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

3 similar comments
@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1126 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1126 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1126 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@CHANGELOG.md`:
- Line 20: In the `[Unreleased]` section of the CHANGELOG.md, rewrite the bullet
point starting at line 20 to focus on the user-visible impact rather than
implementation details. Remove internal jargon like "eval-derived" and simplify
the wording to describe what users or operators can observe about the
agent-relay-broker behavior change, such as improved relay worker guidance for
specific models and risk flagging for Droid spawns, while keeping the entry
concise and removing technical process framing.

In `@crates/broker/src/runtime/api.rs`:
- Around line 1636-1649: The code uses file_name() to extract the executable
name, which includes file extensions on Windows (e.g., "gemini.exe" instead of
"gemini"), causing the string comparisons against "gemini" and "droid" to fail.
Replace the file_name() method call with file_stem() in the Path chain to
extract only the filename without extension, ensuring the checks work correctly
on both Unix and Windows systems regardless of whether the command has an
extension.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: c6a8ae98-52c6-44b1-8754-07a95fbe9fca

📥 Commits

Reviewing files that changed from the base of the PR and between 87c05f1 and 43e5b0e.

📒 Files selected for processing (11)
  • CHANGELOG.md
  • crates/broker/src/runtime/api.rs
  • crates/broker/src/runtime/relaycast_events.rs
  • package.json
  • packages/cli/src/auto/composer.ts
  • packages/cli/src/cli/commands/local-agent.ts
  • tests/integration/broker/evals/runner.ts
  • tests/integration/broker/evals/scenarios/index.ts
  • tests/integration/broker/evals/scenarios/onboarding.ts
  • tests/integration/broker/evals/scenarios/s08-lead-quality.ts
  • web/content/docs/7.1.1/reference-cli.mdx

Comment thread CHANGELOG.md Outdated
Comment on lines +1636 to +1649
let command = shlex::split(cli)
.and_then(|parts| parts.into_iter().next())
.or_else(|| cli.split_whitespace().next().map(ToOwned::to_owned))
.unwrap_or_else(|| cli.to_string());
let cli = Path::new(&command)
.file_name()
.and_then(|name| name.to_str())
.unwrap_or(command.as_str())
.to_lowercase();
if cli == "gemini" {
Some(RELAY_WORKER_ONE_LINER)
} else if cli == "droid" {
Some(DROID_RELAY_SKILL)
} else {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n crates/broker/src/runtime/api.rs | sed -n '1630,1660p'

Repository: AgentWorkforce/relay

Length of output: 1347


🏁 Script executed:

python3 << 'PY'
from pathlib import Path

# Test cases
test_cases = [
    "gemini.exe",
    "droid.cmd",
    "gemini",
    "droid",
    "/usr/bin/gemini.exe",
    "C:\\Program Files\\gemini.exe"
]

for path_str in test_cases:
    p = Path(path_str)
    file_name = p.name  # equivalent to file_name()
    file_stem = p.stem  # equivalent to file_stem()
    print(f"Path: {path_str}")
    print(f"  file_name: {file_name}")
    print(f"  file_stem: {file_stem}")
    print()
PY

Repository: AgentWorkforce/relay

Length of output: 474


🏁 Script executed:

python3 << 'PY'
# Verify the logic flow of the suggested fix
# The suggested fix uses file_stem() which returns OsStr, not Option<OsStr>
# So the chain would be:
# Path::new(&command).file_stem() -> OsStr
# .or_else(|| Path::new(&command).file_name()) -> This won't work!
# 
# file_stem() and file_name() both return OsStr, not Option<OsStr>
# So or_else() is not the right combinator here

# Let's verify what the Rust Path API actually does
print("Analyzing Rust Path API behavior:")
print()
print("Path::file_stem() returns: Option<&OsStr>")
print("Path::file_name() returns: Option<&OsStr>")
print()
print("So the suggested fix using or_else() is actually correct!")
print("or_else() is a combinator for Option types")
print()
print("Logic:")
print("- Path::file_stem() returns Option<&OsStr>")
print("- If Some, we use it")
print("- If None (edge cases like '..' or '.'), we fall back to file_name()")
print()
print("This fix correctly handles:")
print("  gemini.exe -> file_stem() -> Some('gemini') -> use it")
print("  gemini     -> file_stem() -> Some('gemini') -> use it")
PY

Repository: AgentWorkforce/relay

Length of output: 567


🏁 Script executed:

rg "Path::new" crates/broker/src/runtime/api.rs -A 5 -B 5 | head -30

Repository: AgentWorkforce/relay

Length of output: 510


Use file_stem() to match executable names regardless of extensions.

The current code uses file_name() which includes file extensions. On Windows, commands like gemini.exe or droid.cmd won't match the checks for "gemini" or "droid", causing the harness prefix to be silently skipped and potentially regressing spawn behavior.

Suggested fix
 fn cli_skill_prefix(cli: &str) -> Option<&'static str> {
     let command = shlex::split(cli)
         .and_then(|parts| parts.into_iter().next())
         .or_else(|| cli.split_whitespace().next().map(ToOwned::to_owned))
         .unwrap_or_else(|| cli.to_string());
     let cli = Path::new(&command)
-        .file_name()
+        .file_stem()
+        .or_else(|| Path::new(&command).file_name())
         .and_then(|name| name.to_str())
         .unwrap_or(command.as_str())
         .to_lowercase();
     if cli == "gemini" {
         Some(RELAY_WORKER_ONE_LINER)
     } else if cli == "droid" {
         Some(DROID_RELAY_SKILL)
     } else {
         None
     }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
let command = shlex::split(cli)
.and_then(|parts| parts.into_iter().next())
.or_else(|| cli.split_whitespace().next().map(ToOwned::to_owned))
.unwrap_or_else(|| cli.to_string());
let cli = Path::new(&command)
.file_name()
.and_then(|name| name.to_str())
.unwrap_or(command.as_str())
.to_lowercase();
if cli == "gemini" {
Some(RELAY_WORKER_ONE_LINER)
} else if cli == "droid" {
Some(DROID_RELAY_SKILL)
} else {
fn cli_skill_prefix(cli: &str) -> Option<&'static str> {
let command = shlex::split(cli)
.and_then(|parts| parts.into_iter().next())
.or_else(|| cli.split_whitespace().next().map(ToOwned::to_owned))
.unwrap_or_else(|| cli.to_string());
let cli = Path::new(&command)
.file_stem()
.or_else(|| Path::new(&command).file_name())
.and_then(|name| name.to_str())
.unwrap_or(command.as_str())
.to_lowercase();
if cli == "gemini" {
Some(RELAY_WORKER_ONE_LINER)
} else if cli == "droid" {
Some(DROID_RELAY_SKILL)
} else {
None
}
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/broker/src/runtime/api.rs` around lines 1636 - 1649, The code uses
file_name() to extract the executable name, which includes file extensions on
Windows (e.g., "gemini.exe" instead of "gemini"), causing the string comparisons
against "gemini" and "droid" to fail. Replace the file_name() method call with
file_stem() in the Path chain to extract only the filename without extension,
ensuring the checks work correctly on both Unix and Windows systems regardless
of whether the command has an extension.

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1126 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1126 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@github-actions

Copy link
Copy Markdown
Contributor

Preview deployed!

Environment URL
Web https://d1f3t71m7w35eq.cloudfront.net

This preview will be cleaned up when the PR is merged or closed.

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1126 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

3 similar comments
@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1126 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1126 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

@agent-relay-code

Copy link
Copy Markdown
Contributor

pr-reviewer could not complete review for #1126 in AgentWorkforce/relay.
The review harness exited with code 1.
No review was posted; this needs operator attention.

…g fixes

- Cap all waiters at Math.min(phaseMs, 60_000) — both first spawnWaiter
  and second-phase waiters in q02/q03/q04/q07 — reducing opus timeout
  burn from 240s to 120s per scenario
- q06: embed conflict scenario in startup task instead of DM injection;
  idle agents don't process injected text so relay-only injection failed
- q07: require initial spawn for PASS to prevent vacuous pass when lead
  self-implements (spawnEv !== null && !outOfScopeSpawn)
- q02/q04: add PTY stream fallback gated on spawnEv !== null to catch
  escalation/re-routing expressed in stream output
- CONFLICT_RE: expand keywords to cover PTY stream output phrasing;
  add PTY stream check alongside relay message check in detector
- Validated haiku+sonnet: q05 4/4 + q06 4/4 both models; q01-q04/q07
  correctly surface delegation gaps in current Claude CLI tiers

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.agentworkforce/trajectories/completed/2026-06/traj_39p9wcn6gsep/summary.md:
- Around line 31-34: The Reasoning section states that capping waiters from 120s
to 60s saves approximately 20s per scenario and reduces the full opus run from
107 to 43 minutes, but these numbers are internally inconsistent. If capping
both waiters saves 60s per waiter (totaling 120s per scenario), then the
per-scenario and aggregate time savings figures should reflect this. Either
restate the formula by clarifying how many scenarios and waiters are affected
(e.g., specify which of q02/q03/q04/q07 scenarios are impacted and how many
waiters per scenario), or recalculate the savings numbers to be consistent with
the 60s-per-waiter cap. Ensure the math is explicit so that the aggregate
savings (107 to 43 minutes) can be directly derived from the per-scenario
savings.

In @.agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/summary.md:
- Around line 17-20: The summary contains contradictory guidance about how spawn
scenarios should be structured. The anchor section (lines 17-20) states that
spawn scenarios should begin with 'Spawn...' as the first verb, while the
sibling section (line 35) records the opposite approach of removing explicit
spawn/worker references. Reconcile this inconsistency by either: (1) clarifying
that these approaches apply to different scenario sets or task types with
explicit explanation of which approach applies where, or (2) ensuring both the
anchor section (lines 17-20) and the sibling section (line 35) consistently
describe the same authoritative approach that was actually implemented. Review
the actual changes made and update both locations to provide clear,
non-contradictory guidance.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 9320b1bb-70b2-4016-9ef0-a99c9d014688

📥 Commits

Reviewing files that changed from the base of the PR and between 43e5b0e and 7ede017.

📒 Files selected for processing (10)
  • .agentworkforce/trajectories/completed/2026-06/traj_39p9wcn6gsep/summary.md
  • .agentworkforce/trajectories/completed/2026-06/traj_39p9wcn6gsep/trajectory.json
  • .agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/summary.md
  • .agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/trajectory.json
  • .agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04.trace.json
  • .agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04/summary.md
  • .agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04/trajectory.json
  • scripts/ci-standalone-smoke.sh
  • scripts/watch-cli-tools.sh
  • tests/integration/broker/evals/scenarios/s08-lead-quality.ts
✅ Files skipped from review due to trivial changes (4)
  • .agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04/trajectory.json
  • .agentworkforce/trajectories/completed/2026-06/traj_39p9wcn6gsep/trajectory.json
  • .agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/trajectory.json
  • .agentworkforce/trajectories/completed/2026-06/traj_wm2hz3t6dr04.trace.json
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/integration/broker/evals/scenarios/s08-lead-quality.ts

Comment on lines +31 to +34
### Cap all waiters at Math.min(phaseMs, 60_000) — both first spawnWaiter and second-phase waiters

- **Chose:** Cap all waiters at Math.min(phaseMs, 60_000) — both first spawnWaiter and second-phase waiters
- **Reasoning:** Opus (120s phaseMs) never spawns workers for s08 tasks; both waiters always time out. Without caps: each q02/q03/q04/q07 opus scenario burns 240s = 107 min for 28 scenarios. With all caps: 120s → 60s saves ~20s per scenario, reducing full opus run from 107 to ~43 min.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reconcile the timing math.

The stated cap saves 60s per waiter (120s → 60s), so the “~20s per scenario” estimate and the 107→43 minute rollup do not follow from the numbers here. Please restate the formula or the exact scenario subset so the report stays internally consistent.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.agentworkforce/trajectories/completed/2026-06/traj_39p9wcn6gsep/summary.md
around lines 31 - 34, The Reasoning section states that capping waiters from
120s to 60s saves approximately 20s per scenario and reduces the full opus run
from 107 to 43 minutes, but these numbers are internally inconsistent. If
capping both waiters saves 60s per waiter (totaling 120s per scenario), then the
per-scenario and aggregate time savings figures should reflect this. Either
restate the formula by clarifying how many scenarios and waiters are affected
(e.g., specify which of q02/q03/q04/q07 scenarios are impacted and how many
waiters per scenario), or recalculate the savings numbers to be consistent with
the 60s-per-waiter cap. Ensure the math is explicit so that the aggregate
savings (107 to 43 minutes) can be directly derived from the per-scenario
savings.

Comment thread .agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/summary.md Outdated
Proactive Runtime Bot and others added 2 commits June 16, 2026 00:16
…omposer eval data

- Remove DROID_RELAY_SKILL injection for droid — broker-injected skill text
  suppresses relay tool use entirely (s03:bare 0/5 with injection vs baseline
  5/5 without). Droid receives no CLI-specific prefix from relay_skill_prefix.
- Fix race condition in run_gemini_droid_mcp_add: 4-attempt retry loop with
  150ms backoff per attempt handles concurrent eval processes racing on
  ~/.factory/mcp.json (already-exists error).
- composer.ts: revert droid onboarding bare→skill (skill kills relay behaviour,
  confirmed 0/5 s03); mark droid worker as not-viable for droid 0.148.1.
- composer.ts: opencode default model deepseek-v4-flash (16/16, 0 phantoms)
  over mimo-v2.5-free (15/16); add opencode:gpt-5.5 provisional entry.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… changes

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/broker/src/snippets.rs`:
- Around line 1333-1341: The function remove_gemini_droid_mcp_servers uses
blocking std::process::Command::wait() which will stall the async runtime.
Convert this function to be async by replacing std::process::Command with
tokio::process::Command and adding an appropriate timeout (such as using
tokio::time::timeout) to prevent indefinite blocking. Make sure to mark the
function as async and update all call sites of remove_gemini_droid_mcp_servers
to use .await when invoking it.
- Around line 1408-1419: The timeout branch in the MCP command execution (the
Err(_) case handling timeout after wait_with_output()) does not explicitly
terminate the spawned child process before bailing, allowing the mcp add process
to continue running in the background. Before calling anyhow::bail! in the
timeout error branch, explicitly call kill() on the child handle to ensure the
process is terminated and does not interfere with subsequent retry attempts in
the surrounding loop.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: c74bf948-41a6-4da0-9ae0-309106928cfb

📥 Commits

Reviewing files that changed from the base of the PR and between 5c95e75 and 6c0b86d.

📒 Files selected for processing (4)
  • CHANGELOG.md
  • crates/broker/src/runtime/api.rs
  • crates/broker/src/snippets.rs
  • packages/cli/src/auto/composer.ts
✅ Files skipped from review due to trivial changes (1)
  • CHANGELOG.md

Comment thread crates/broker/src/snippets.rs Outdated
Comment thread crates/broker/src/snippets.rs
…eval correctness

- Gate relay_skill_prefix injection behind !skip_relay_prompt so agents
  spawned with skipRelayPrompt:true are not told to call absent relay tools
- Remove redundant prefix.to_string() allocation in the None arm (prefix is
  already an owned String)
- Convert remove_gemini_droid_mcp_servers to async using tokio::process::Command
  with a 5s timeout to avoid blocking the async runtime
- Fix s08 q02 escalated check to only scan events after failure injection,
  preventing early lead messages containing 'error' from falsely passing
- Fix s08 q03 pass check to require a lead response after the status query
  is injected, not just any message since scenario start
- Make CHANGELOG entry impact-first, removing 'eval-derived' implementation jargon

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…s-cli-spawn scenarios

Relay workers asked to "spawn a codex agent" or "spawn an opus claude agent"
now have explicit guidance to map those requests to the right add_agent
parameters.

Changes:
- SMALL_MODEL_RELAY_SKILL: add CLI options table (claude/codex/gemini/opencode)
  and model-pinning examples (opus/sonnet/haiku) with concrete add_agent calls
- RELAY_WORKER_ONE_LINER (gemini): expand to include cli and model parameter hints
- onboarding.ts brief+skill variants: add CLI/model mapping so all harnesses
  get the reference regardless of whether broker injection fires
- s09-cross-cli-spawn.ts: 4 scenarios × 4 onboarding variants = 16 total
    q01: "spawn a codex agent" → agent_spawned.cli contains "codex"
    q02: "spawn a claude agent" → agent_spawned.cli contains "claude"
    q03: "spawn an opus claude agent" → cli "claude" + model ~ /opus/
    q04: "spawn a sonnet claude agent" → cli "claude" + model ~ /sonnet/
- Register s09 in index.ts and runner.ts under --group=cross-cli-spawn

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
github-actions Bot and others added 2 commits June 16, 2026 08:54
…boarding

The eval runner finds the minimum onboarding variant that achieves 100% — adding
CLI/model guidance to 'brief' before running the eval pollutes that comparison.

Changes:
- Revert 'brief' onboarding to original (no CLI/model additions); eval will tell
  us if brief needs augmenting after s09 runs show bare/one-liner results
- RELAY_WORKER_ONE_LINER: true one-liner (cli options inline, not 3 sentences)
- SMALL_MODEL_RELAY_SKILL: drop the full table; two targeted lines added —
  enumerate cli options in the Required line, add one model-pinning line

The 'skill' onboarding retains full CLI/model detail as the kitchen-sink variant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Proactive Runtime Bot and others added 3 commits June 16, 2026 11:17
…er spawn requests

Eval s09 showed that orchestrators without relay-specific onboarding would receive
"spawn a codex agent" or "spawn an opus claude agent" and either do the work themselves
or call add_agent with wrong/missing parameters:
- q01 (codex spawn): 12/12 PASS in v1 description, confirmed causal via baseline
- q03 (opus model): 0/12 → 12/12 after fix (root cause: no "opus claude" → cli:claude mapping)
- q04 (sonnet model): 0/12 → 12/12 after fix (model param was passive, not directive)

Changes:
- add_agent tool description: concrete param examples replace prose
  ("spawn a codex agent" → cli:"codex", "spawn an opus claude agent" → cli:"claude", model:"claude-opus-4-8")
- cli param: explicit "opus claude/sonnet claude agent" → claude mapping
- model param: "Required when a tier is specified" (directive, not passive)
- s09 tasks: terse phrasing matching real failure; runner type + whitelist updated

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ity-v2

# Conflicts:
#	.agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/summary.md
#	.agentworkforce/trajectories/completed/2026-06/traj_aq4edi70h18a/trajectory.json
#	packages/cli/src/cli/agent-relay-mcp.ts
#	tests/integration/broker/evals/runner.ts
#	tests/integration/broker/evals/scenarios/index.ts
The merged @relaycast/sdk 4.0 bump narrows SpawnAgentRequest.cli to the core
five harnesses (claude/codex/gemini/aider/goose), dropping grok and opencode.
The broker and gateway still accept grok/opencode at runtime, so keep them
selectable in the add_agent MCP tool enum and cast at the spawn boundary
rather than removing the capability.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@khaliqgant khaliqgant merged commit 1fcbae5 into main Jun 16, 2026
50 checks passed
@khaliqgant khaliqgant deleted the feature/s08-lead-quality-v2 branch June 16, 2026 12:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Per-harness relay worker skills, vocabulary, and onboarding guidance (from s01–s05 eval data)

1 participant