Stop the resume melt-down on leaked broker agent names #331

Merged
khaliqgant merged 1 commit into main from fix/resume-already-exists
Jun 14, 2026

@khaliqgant

@khaliqgant khaliqgant commented Jun 14, 2026

Member

Problem

Recurring, looping error:

[factory] error HarnessDriverProtocolError: agent 'ar-255-review' already exists
  ... HarnessDriverClient.spawnPty ... InternalFleetClient.resume
  ... #resumeTrackedAgent ... #handleAgentExit
  { code: 'http_500', retryable: true, status: 500 }

An agent exits → #handleAgentExit resumes it via spawnPty with the same name → but the broker never releases the name on exit (relay#1116-family) → re-registering collides → 500. Because the error is retryable: true, every subsequent exit event re-collides — an endless 500 loop (observed on ar-255-impl, then ar-255-review).
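To make the loop concrete, here is a minimal sketch of the failure mode. `LeakyBroker` and `resumeOnExit` are hypothetical stand-ins invented for illustration, not the real broker or factory code; the only details taken from the report are the agent name and the error text.

```typescript
// Hypothetical stand-in for the broker's name registry; not the real broker API.
class LeakyBroker {
  private readonly names = new Set<string>()

  // Registers an agent name; throws if the name is still registered.
  spawn(name: string): void {
    if (this.names.has(name)) {
      throw new Error(`agent '${name}' already exists`)
    }
    this.names.add(name)
  }

  // Models the leak: the agent's process ends, but its name is never released.
  exit(_name: string): void {
    // Intentionally does not remove the name from the registry.
  }
}

// Naive exit handler: always tries to resume under the same name.
const resumeOnExit = (broker: LeakyBroker, name: string): string => {
  broker.exit(name)
  try {
    broker.spawn(name)
    return 'resumed'
  } catch (error) {
    return error instanceof Error ? error.message : String(error)
  }
}

const broker = new LeakyBroker()
broker.spawn('ar-255-review')
// Every exit event hits the same collision, because nothing ever frees the name.
const outcomes = [1, 2, 3].map(() => resumeOnExit(broker, 'ar-255-review'))
console.log(outcomes)
```

As long as the name stays registered, retrying cannot succeed, which is why marking the error as retryable only feeds the loop.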

Fix (deploy-free, pear-side)

On resume, detect the collision (isAgentAlreadyExistsError) and treat it as terminal for that name: record the resume key so later exit events short-circuit, increment resumeNameCollisions, and warn once — instead of surfacing a hard error and retrying forever. The external reaper / a broker restart reclaims the leaked name.

This stops the log spam and the loop for all roles, without waiting on the broker.
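A rough sketch of the handling described above, under stated assumptions: `ExitHandler`, `terminalResumeKeys`, the `name:sessionRef` key format, and the `warnings` array are hypothetical names chosen for illustration. Only `isAgentAlreadyExistsError`, `resumeNameCollisions`, and the idea of recording a resume key come from this PR, and the detector here is simplified.

```typescript
// Illustrative sketch only; the real #handleAgentExit in factory.ts may differ.
type Resume = (name: string) => Promise<void>
type ExitOutcome = 'resumed' | 'skipped' | 'collision'

// Simplified detector: matches the broker's "already exists" message.
const isAgentAlreadyExistsError = (error: unknown): boolean =>
  error instanceof Error && /already exists/iu.test(error.message)

class ExitHandler {
  resumeNameCollisions = 0
  readonly warnings: string[] = []
  // Resume keys that are finished for good; later exits short-circuit on these.
  private readonly terminalResumeKeys = new Set<string>()
  private readonly resume: Resume

  constructor(resume: Resume) {
    this.resume = resume
  }

  async handleAgentExit(name: string, sessionRef: string): Promise<ExitOutcome> {
    const key = `${name}:${sessionRef}` // assumed key format
    if (this.terminalResumeKeys.has(key)) return 'skipped'
    try {
      await this.resume(name)
      return 'resumed'
    } catch (error) {
      // Anything other than a name collision still surfaces as a hard error.
      if (!isAgentAlreadyExistsError(error)) throw error
      // Terminal for this name: record, count, warn once, and do not rethrow.
      this.terminalResumeKeys.add(key)
      this.resumeNameCollisions += 1
      this.warnings.push(`resume skipped: broker still holds '${name}'`)
      return 'collision'
    }
  }
}
```

Because the key is recorded before the handler returns, a second exit event for the same name and session returns early without calling the broker again, so the warning is emitted only once per key.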

What this is NOT

  • The real upstream fix is relay#1116 (agents self-exit after their task, so no resume is needed). This is the defensive pear-side mitigation.
  • I also explored extending the "don't resume a finished agent" guard from #328 (Stop resuming implementers after PR completion) to reviewers, but dropped it: it regressed the PR-state sweep's synthetic-issue completion flow, and part (a) already stops the melt-down regardless of whether we attempt the resume.

Tests

A colliding resume is attempted once, not retried on a second exit event, counted (resumeNameCollisions), and not surfaced as a hard error. Full factory-sdk suite green (346), typecheck clean.

Refs: relay#1116 (upstream), pear#328 (implementer completion guard).

🤖 Generated with Claude Code

Recurring "[factory] error HarnessDriverProtocolError: agent '<name>' already
exists" (http 500): an agent exits, #handleAgentExit resumes it via spawnPty with
the same name, but the broker never released that name on exit (relay#1116-family)
so re-registering collides. The error is marked retryable:true, so every exit
event re-collides — a 500 loop (seen on ar-255-impl then ar-255-review).

Handle it gracefully: on resume, detect the "already exists" collision
(isAgentAlreadyExistsError), record the resume key so subsequent exit events
short-circuit, increment resumeNameCollisions, and warn once — instead of
surfacing a hard error and retrying forever. The external reaper / a broker
restart reclaims the leaked name; the real upstream fix is relay#1116
(agents self-exit so no resume is needed).

Test covers: a resume collision is resumed once, not retried on a second exit
event, counted, and not surfaced as a hard error. Full factory-sdk suite green
(346), typecheck clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@khaliqgant khaliqgant added the no-agent-relay-review Disable agent-relay automated PR review/fixes label Jun 14, 2026
@coderabbitai

coderabbitai Bot commented Jun 14, 2026

Warning

Review limit reached

@khaliqgant, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 1 minute and 53 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 2dc1b589-d46f-4fb9-9827-a9277d919565

📥 Commits

Reviewing files that changed from the base of the PR and between 99b700d and 2f0c010.

📒 Files selected for processing (2)
  • packages/factory-sdk/src/orchestrator/factory.test.ts
  • packages/factory-sdk/src/orchestrator/factory.ts

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request addresses an issue where the broker fails to release an agent's name on exit, causing subsequent resume attempts to fail with an 'already exists' error and loop indefinitely. The changes introduce a terminal error handler for this scenario, preventing further retries and logging a warning instead. A test case has been added to verify this behavior. The reviewer suggests improving the robustness of the error message extraction in isAgentAlreadyExistsError by utilizing the existing describeError helper.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +2937 to +2942
const isAgentAlreadyExistsError = (error: unknown): boolean => {
  const record = asRecord(error)
  const data = asRecord(record?.data)
  const message = stringValue(data?.error) ?? (error instanceof Error ? error.message : '')
  return /already exists/iu.test(message)
}

medium

The current implementation of isAgentAlreadyExistsError only checks error.message if error is an instance of Error. However, in JavaScript/TypeScript, errors can sometimes be thrown as plain objects, strings, or other custom shapes (especially when serialized/deserialized over the network or from API clients).

We can make this check significantly more robust and consistent by leveraging the existing describeError helper, which already handles Error instances, strings, and serializes plain objects to JSON.

Suggested change
- const isAgentAlreadyExistsError = (error: unknown): boolean => {
-   const record = asRecord(error)
-   const data = asRecord(record?.data)
-   const message = stringValue(data?.error) ?? (error instanceof Error ? error.message : '')
-   return /already exists/iu.test(message)
- }
+ const isAgentAlreadyExistsError = (error: unknown): boolean => {
+   const record = asRecord(error)
+   const data = asRecord(record?.data)
+   const message = stringValue(data?.error) ?? describeError(error).errorMessage
+   return /already exists/iu.test(message)
+ }
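For comparison, the sketch below runs both variants against a few thrown shapes. The `asRecord`, `stringValue`, and `describeError` bodies are hypothetical re-implementations written from the descriptions in this comment; the real helpers in factory.ts may behave differently.

```typescript
// Hypothetical helper implementations, for illustration only.
const asRecord = (value: unknown): Record<string, unknown> | undefined =>
  typeof value === 'object' && value !== null ? (value as Record<string, unknown>) : undefined

const stringValue = (value: unknown): string | undefined =>
  typeof value === 'string' ? value : undefined

// Assumed behavior: Error -> message, string -> itself, anything else -> JSON.
const describeError = (error: unknown): { errorMessage: string } => {
  if (error instanceof Error) return { errorMessage: error.message }
  if (typeof error === 'string') return { errorMessage: error }
  const json: string | undefined = JSON.stringify(error)
  return { errorMessage: json ?? String(error) }
}

// Detector as merged: only Error instances contribute a fallback message.
const originalDetector = (error: unknown): boolean => {
  const data = asRecord(asRecord(error)?.data)
  const message = stringValue(data?.error) ?? (error instanceof Error ? error.message : '')
  return /already exists/iu.test(message)
}

// Detector as suggested: any thrown shape contributes a fallback message.
const suggestedDetector = (error: unknown): boolean => {
  const data = asRecord(asRecord(error)?.data)
  const message = stringValue(data?.error) ?? describeError(error).errorMessage
  return /already exists/iu.test(message)
}

const samples: unknown[] = [
  new Error("agent 'ar-255-review' already exists"), // Error instance
  "agent 'ar-255-review' already exists", // thrown string
  { data: { error: "agent 'ar-255-review' already exists" } }, // structured payload
  { detail: 'lock file already exists' }, // unrelated plain object
]
const results = samples.map((sample) => [originalDetector(sample), suggestedDetector(sample)])
console.log(results)
```

Under these assumptions the suggested variant also catches the thrown string, but it matches the unrelated plain object as well, because the serialized JSON contains the phrase.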

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2f0c010354

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +1606 to +1607
} catch (error) {
if (isAgentAlreadyExistsError(error)) {

P2 Badge Share the collision-swallowing promise with duplicate exits

When two exit callbacks for the same issue/name/sessionRef arrive before the first resume settles, the second callback takes the existing branch above and awaits the raw #resumeTrackedAgent promise. If the broker rejects with agent ... already exists, only the creator reaches this new isAgentAlreadyExistsError catch; any waiter still receives the rejection and the outer catch records a hard [factory] error, so replayed/concurrent exit delivery can still surface the 500 that this change is meant to suppress. Store a wrapped in-flight promise that swallows/counts this collision, or apply the same collision handling to waiters.
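One way to read the "wrapped in-flight promise" option is sketched below. `ResumeCoordinator`, `inFlight`, and `ResumeOutcome` are hypothetical names, and the map is keyed by agent name alone for brevity; the real factory tracks in-flight resumes by issue/name/sessionRef and may be structured quite differently.

```typescript
// Illustrative sketch only; not the actual factory.ts implementation.
type ResumeOutcome = 'resumed' | 'collision'

const isAgentAlreadyExistsError = (error: unknown): boolean =>
  error instanceof Error && /already exists/iu.test(error.message)

class ResumeCoordinator {
  resumeNameCollisions = 0
  private readonly inFlight = new Map<string, Promise<ResumeOutcome>>()
  private readonly resume: (name: string) => Promise<void>

  constructor(resume: (name: string) => Promise<void>) {
    this.resume = resume
  }

  handleAgentExit(name: string): Promise<ResumeOutcome> {
    const existing = this.inFlight.get(name)
    // Waiters receive the wrapped promise, so they never see the raw rejection.
    if (existing) return existing

    const wrapped = this.resume(name)
      .then((): ResumeOutcome => 'resumed')
      .catch((error: unknown): ResumeOutcome => {
        // Other failures still propagate to the creator and to every waiter.
        if (!isAgentAlreadyExistsError(error)) throw error
        this.resumeNameCollisions += 1 // counted once per shared attempt
        return 'collision'
      })
      .finally(() => {
        this.inFlight.delete(name)
      })
    this.inFlight.set(name, wrapped)
    return wrapped
  }
}
```

The sketch covers only the concurrent-exit window. It would still need the terminal resume key from this PR so that an exit arriving after the shared attempt settles does not retry.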

@agent-relay-code

Contributor

Findings

  1. packages/factory-sdk/src/orchestrator/factory.ts:2940
    isAgentAlreadyExistsError treats any resume error whose message contains already exists as the leaked-agent-name case. In #handleAgentExit, that then records the resume key as terminal and suppresses the error path at factory.ts:1607-1617. That is broader than the PR’s stated condition of broker agent-name collision and could fail closed on unrelated resume failures that happen to include “already exists”. I would tighten this in an author patch, for example by checking the structured payload/status and/or matching the expected agent name rather than a bare substring.
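A narrower detector along those lines might look like the sketch below. The error shape is an assumption loosely based on the logged `{ code: 'http_500', retryable: true, status: 500 }` and the `data.error` field read by the current detector; `isAgentNameCollision`, `BrokerErrorLike`, and `escapeRegExp` are hypothetical names, and the real HarnessDriverProtocolError fields may differ.

```typescript
// Assumed error shape; not confirmed against the real HarnessDriverProtocolError.
interface BrokerErrorLike {
  message?: unknown
  status?: unknown
  data?: unknown
}

// Escapes regex syntax characters so the agent name is matched literally.
const escapeRegExp = (text: string): string => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')

// Matches only a 500 whose message names the exact agent being resumed.
const isAgentNameCollision = (error: unknown, agentName: string): boolean => {
  if (typeof error !== 'object' || error === null) return false
  const { message, status, data } = error as BrokerErrorLike
  if (status !== 500) return false
  const nested =
    typeof data === 'object' && data !== null ? (data as { error?: unknown }).error : undefined
  const text = typeof nested === 'string' ? nested : typeof message === 'string' ? message : ''
  return new RegExp(`agent '${escapeRegExp(agentName)}' already exists`, 'u').test(text)
}

const collision = Object.assign(new Error("agent 'ar-255-review' already exists"), { status: 500 })
const unrelated = Object.assign(new Error('worktree already exists'), { status: 500 })
console.log(isAgentNameCollision(collision, 'ar-255-review'), isAgentNameCollision(unrelated, 'ar-255-review'))
```

Tying the match to the status and the expected name means an unrelated failure that merely contains "already exists" keeps flowing through the normal error path.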

Addressed Comments

  • coderabbitai[bot]: reported review rate limiting only; no actionable code comment to validate or fix.
  • gemini-code-assist[bot] on packages/factory-sdk/src/orchestrator/factory.ts:2937-2942: suggested using describeError(error).errorMessage to detect string/plain-object throws. I did not apply it because it broadens runtime classification in safety-critical resume handling; the current problem should be fixed with a narrower detector tied to the broker agent-name collision, not by matching arbitrary serialized objects.

Validation

Passed:

  • npm ci
  • npm run verify:mcp-resources-drift
  • npm run lint with warnings only
  • npm run typecheck:web
  • npm run typecheck:node
  • npm test
  • npx vitest run packages/factory-sdk/src/orchestrator/factory.test.ts -t "keeps the live heartbeat fresh while draining a blocking live event burst"
  • npm run build
  • final npm run verify:mcp-resources-drift

Failed:

  • npx vitest run failed twice in existing test packages/factory-sdk/src/orchestrator/factory.test.ts:1196-1200, FactoryLoop > keeps the live heartbeat fresh while draining a blocking live event burst. First failure: heartbeat age 800 >= 500; second failure: observedStaleWhileReading became true. The focused rerun passed, so this looks timing-sensitive under full-suite load. I did not edit the test because that would be a semantic test change outside this PR’s collision fix.

GitHub connector reported PR mergeable and CodeRabbit status success, but local full Vitest is not green, so I am not marking this ready.

@khaliqgant khaliqgant merged commit 9263e0a into main Jun 14, 2026
5 checks passed
@khaliqgant khaliqgant deleted the fix/resume-already-exists branch June 14, 2026 12:17