-
Notifications
You must be signed in to change notification settings - Fork 1
Stop the resume melt-down on leaked broker agent names #331
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -1603,6 +1603,24 @@ export class FactoryLoop implements Factory { | |||||||||||||||||||||||||
| try { | ||||||||||||||||||||||||||
| await resume | ||||||||||||||||||||||||||
| this.#resumedExitKeys.add(resumeKey) | ||||||||||||||||||||||||||
| } catch (error) { | ||||||||||||||||||||||||||
| if (isAgentAlreadyExistsError(error)) { | ||||||||||||||||||||||||||
| // The broker never released this agent's name on exit | ||||||||||||||||||||||||||
| // (relay#1116-family), so re-registering collides with the stuck | ||||||||||||||||||||||||||
| // name. The error is marked retryable but isn't — retrying just | ||||||||||||||||||||||||||
| // re-collides forever. Treat it as terminal for this name: record | ||||||||||||||||||||||||||
| // the resume key so subsequent exit events short-circuit, count it, | ||||||||||||||||||||||||||
| // and warn once instead of spamming a 500 stack trace. The external | ||||||||||||||||||||||||||
| // reaper / a broker restart reclaims the leaked name. | ||||||||||||||||||||||||||
| this.#resumedExitKeys.add(resumeKey) | ||||||||||||||||||||||||||
| this.#increment('resumeNameCollisions') | ||||||||||||||||||||||||||
| this.#logger.warn?.('[factory] resume skipped: broker still holds agent name (relay#1116); not retrying', { | ||||||||||||||||||||||||||
| issue: record.issue.key, | ||||||||||||||||||||||||||
| name, | ||||||||||||||||||||||||||
| }) | ||||||||||||||||||||||||||
| } else { | ||||||||||||||||||||||||||
| throw error | ||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||
| } finally { | ||||||||||||||||||||||||||
| this.#resumeInFlight.delete(resumeKey) | ||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||
|
|
@@ -2912,6 +2930,17 @@ const labelName = (value: unknown): string | undefined => { | |||||||||||||||||||||||||
| const isCompletionReason = (reason?: string): boolean => | ||||||||||||||||||||||||||
| reason === 'issue-done' || reason === 'done' || reason === 'completed' | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
| // The broker rejects re-registering a name it never released on exit | ||||||||||||||||||||||||||
| // (relay#1116-family) with a 500 "agent '<name>' already exists". Detect it from | ||||||||||||||||||||||||||
| // the structured payload or the message so resume can treat it as terminal | ||||||||||||||||||||||||||
| // rather than retrying the (falsely) "retryable" error forever. | ||||||||||||||||||||||||||
| const isAgentAlreadyExistsError = (error: unknown): boolean => { | ||||||||||||||||||||||||||
| const record = asRecord(error) | ||||||||||||||||||||||||||
| const data = asRecord(record?.data) | ||||||||||||||||||||||||||
| const message = stringValue(data?.error) ?? (error instanceof Error ? error.message : '') | ||||||||||||||||||||||||||
| return /already exists/iu.test(message) | ||||||||||||||||||||||||||
| } | ||||||||||||||||||||||||||
|
Comment on lines
+2937
to
+2942
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The current implementation of We can make this check significantly more robust and consistent by leveraging the existing
Suggested change
|
||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
| const defaultRestartPolicy = (spec: AgentSpec): AgentSpec['restartPolicy'] | undefined => | ||||||||||||||||||||||||||
| spec.role === 'implementer' ? { maxRestarts: 3, strategy: 'resume' } as AgentSpec['restartPolicy'] : spec.restartPolicy | ||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When two exit callbacks for the same issue/name/sessionRef arrive before the first resume settles, the second callback takes the
existingbranch above and awaits the raw#resumeTrackedAgentpromise. If the broker rejects withagent ... already exists, only the creator reaches this newisAgentAlreadyExistsErrorcatch; any waiter still receives the rejection and the outer catch records a hard[factory] error, so replayed/concurrent exit delivery can still surface the 500 that this change is meant to suppress. Store a wrapped in-flight promise that swallows/counts this collision, or apply the same collision handling to waiters.Useful? React with 👍 / 👎.