Stop the resume melt-down on leaked broker agent names#331
Conversation
Recurring "[factory] error HarnessDriverProtocolError: agent '<name>' already exists" (http 500): an agent exits, #handleAgentExit resumes it via spawnPty with the same name, but the broker never released that name on exit (relay#1116-family) so re-registering collides. The error is marked retryable:true, so every exit event re-collides — a 500 loop (seen on ar-255-impl then ar-255-review). Handle it gracefully: on resume, detect the "already exists" collision (isAgentAlreadyExistsError), record the resume key so subsequent exit events short-circuit, increment resumeNameCollisions, and warn once — instead of surfacing a hard error and retrying forever. The external reaper / a broker restart reclaims the leaked name; the real upstream fix is relay#1116 (agents self-exit so no resume is needed). Test covers: a resume collision is resumed once, not retried on a second exit event, counted, and not surfaced as a hard error. Full factory-sdk suite green (346), typecheck clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 1 minute and 53 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (2)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request addresses an issue where the broker fails to release an agent's name on exit, causing subsequent resume attempts to fail with an 'already exists' error and loop indefinitely. The changes introduce a terminal error handler for this scenario, preventing further retries and logging a warning instead. A test case has been added to verify this behavior. The reviewer suggests improving the robustness of the error message extraction in isAgentAlreadyExistsError by utilizing the existing describeError helper.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| const isAgentAlreadyExistsError = (error: unknown): boolean => { | ||
| const record = asRecord(error) | ||
| const data = asRecord(record?.data) | ||
| const message = stringValue(data?.error) ?? (error instanceof Error ? error.message : '') | ||
| return /already exists/iu.test(message) | ||
| } |
There was a problem hiding this comment.
The current implementation of isAgentAlreadyExistsError only checks error.message if error is an instance of Error. However, in JavaScript/TypeScript, errors can sometimes be thrown as plain objects, strings, or other custom shapes (especially when serialized/deserialized over the network or from API clients).
We can make this check significantly more robust and consistent by leveraging the existing describeError helper, which already handles Error instances, strings, and serializes plain objects to JSON.
| const isAgentAlreadyExistsError = (error: unknown): boolean => { | |
| const record = asRecord(error) | |
| const data = asRecord(record?.data) | |
| const message = stringValue(data?.error) ?? (error instanceof Error ? error.message : '') | |
| return /already exists/iu.test(message) | |
| } | |
| const isAgentAlreadyExistsError = (error: unknown): boolean => { | |
| const record = asRecord(error) | |
| const data = asRecord(record?.data) | |
| const message = stringValue(data?.error) ?? describeError(error).errorMessage | |
| return /already exists/iu.test(message) | |
| } |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2f0c010354
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } catch (error) { | ||
| if (isAgentAlreadyExistsError(error)) { |
There was a problem hiding this comment.
Share the collision-swallowing promise with duplicate exits
When two exit callbacks for the same issue/name/sessionRef arrive before the first resume settles, the second callback takes the existing branch above and awaits the raw #resumeTrackedAgent promise. If the broker rejects with agent ... already exists, only the creator reaches this new isAgentAlreadyExistsError catch; any waiter still receives the rejection and the outer catch records a hard [factory] error, so replayed/concurrent exit delivery can still surface the 500 that this change is meant to suppress. Store a wrapped in-flight promise that swallows/counts this collision, or apply the same collision handling to waiters.
Useful? React with 👍 / 👎.
|
Findings
Addressed Comments
Validation Passed:
Failed:
GitHub connector reported PR mergeable and CodeRabbit status success, but local full Vitest is not green, so I am not marking this ready. |
Problem
Recurring, looping error:
An agent exits →
#handleAgentExitresumes it viaspawnPtywith the same name → but the broker never releases the name on exit (relay#1116-family) → re-registering collides → 500. Because the error isretryable: true, every subsequent exit event re-collides — an endless 500 loop (observed onar-255-impl, thenar-255-review).Fix (deploy-free, pear-side)
On resume, detect the collision (
isAgentAlreadyExistsError) and treat it as terminal for that name: record the resume key so later exit events short-circuit, incrementresumeNameCollisions, and warn once — instead of surfacing a hard error and retrying forever. The external reaper / a broker restart reclaims the leaked name.This stops the log spam and the loop for all roles, without waiting on the broker.
What this is NOT
Tests
A resume collision is resumed once, not retried on a second exit event, counted (
resumeNameCollisions), and not surfaced as a hard error. Full factory-sdk suite green (346), typecheck clean.Refs: relay#1116 (upstream), pear#328 (implementer completion guard).
🤖 Generated with Claude Code