Skip to content

Recover broker transport failures in workflow runs#10

Merged
khaliqgant merged 1 commit into
mainfrom
codex/issue-9-broker-recovery
Jun 16, 2026
Merged

Recover broker transport failures in workflow runs#10
khaliqgant merged 1 commit into
mainfrom
codex/issue-9-broker-recovery

Conversation

@khaliqgant

@khaliqgant khaliqgant commented Jun 16, 2026

Copy link
Copy Markdown
Member

Summary

  • restart the workflow broker and retry broker-backed operations when transient transport errors occur
  • replay transient agent-step network failures inside the current attempt instead of consuming the step retry budget
  • add regression coverage for broker restart behavior and transient agent-step network retries

Verification

  • npm run typecheck --workspace=packages/core
  • npm run test --workspace=packages/core

Closes #9


Summary by cubic

Adds automatic broker recovery and transient network retries to make workflow runs more reliable. Broker-backed operations are retried with backoff, the broker restarts on retryable relay failures, and agent-step network errors are replayed within the same attempt so they don’t consume the step retry budget. Closes #9.

  • Bug Fixes
    • Auto-restart the broker and retry broker-backed ops (spawn, list, release, send) with backoff; listen for broker exit, rewire event listeners, dedupe restarts, and refuse recovery while any agents are active.
    • Replay transient agent-step network failures within the current attempt (no step:retrying), with small capped retries and delay; release the live PTY before replay and avoid re-emitting step start/owner events.
    • Added tests covering PTY spawn transport failure recovery, broker-exit recovery, recovery safety checks, transient same-attempt retries with PTY release, and agent-step network retries in packages/core.

Written for commit 247aafb. Summary will update on new commits.

Review in cubic

@gemini-code-assist

Copy link
Copy Markdown

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds broker-recovery primitives (BrokerRunContext, withBrokerRecovery(), recoverBroker()) and transient-network same-attempt replay semantics to runner.ts, threads both mechanisms through all broker RPC call sites (PTY spawn, agent listing, stale cleanup, idle nudge), and adds corresponding integration and reliability-contract tests.

Changes

Broker Recovery and Transient Agent-Network Retry

Layer / File(s) Summary
Constants, BrokerRunContext type, and runner fields
packages/core/src/runner.ts
Adds four tuning constants for broker operation retries/backoff and transient agent-network retries/backoff, defines the BrokerRunContext interface, and adds currentBrokerContext and brokerRecoveryPromise instance fields on the runner.
Broker recovery infrastructure, startup wiring, and shutdown cleanup
packages/core/src/runner.ts
Refactors broker helpers (name derivation, event handler creation), consolidates relay listener teardown into clearRelayListeners(), implements withBrokerRecovery() and recoverBroker() with bounded retries and backoff, initializes currentBrokerContext at workflow startup via startBroker(), and clears broker recovery state on shutdown.
Broker recovery applied to PTY spawn, agent list, stale cleanup, and idle nudge
packages/core/src/runner.ts
Wraps relay.spawnPty, relay.listAgents (PID lookup), stale-retry-agent release RPCs, and idle-nudge hub/direct messages in withBrokerRecovery() so retryable broker failures at those sites trigger restart and retry.
Transient agent-network same-attempt replay in agent step loop
packages/core/src/runner.ts
Adds isSameAttemptReplay and transientNetworkRetries controls to the interactive agent step loop; gates step:started events and trajectory bookkeeping on replay status; detects transient failures via isTransientAgentNetworkError() in the error handler and replays the current attempt with backoff without consuming a normal retry slot.
Test harness updates and integration tests
packages/core/src/__tests__/workflow-runner.test.ts, packages/core/src/__tests__/workflow-reliability-contract.test.ts
Makes harness-driver spawning controllable via mockHarnessDriverSpawn, adds onBrokerExit stub to mockRelayInstance, and adds an integration test for broker restart on transient spawnPty failure (verifying spawn/shutdown call counts) and a reliability-contract test confirming agent-step transient failures retry silently without emitting step:retrying even when retries: 0.

Sequence Diagram(s)

sequenceDiagram
  participant Runner
  participant withBrokerRecovery
  participant recoverBroker
  participant HarnessDriverClient
  participant AgentStep

  Runner->>withBrokerRecovery: relay.spawnPty(...)
  withBrokerRecovery-->>Runner: TransientBrokerError

  withBrokerRecovery->>recoverBroker: restart broker (BrokerRunContext)
  recoverBroker->>HarnessDriverClient: spawn new broker
  HarnessDriverClient-->>recoverBroker: newRelay
  recoverBroker->>recoverBroker: clearRelayListeners + re-wire events
  recoverBroker-->>withBrokerRecovery: recovered relay

  withBrokerRecovery->>withBrokerRecovery: retry relay.spawnPty(...)
  withBrokerRecovery-->>Runner: success

  Runner->>AgentStep: execute (attempt N)
  AgentStep-->>Runner: isTransientAgentNetworkError → repeatSameAttempt=true
  Runner->>AgentStep: replay attempt N (suppress step:started)
  AgentStep-->>Runner: success
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 Hops along the relay wire,
When the broker starts to tire —
withBrokerRecovery leaps back in,
Transient errors? Just a spin!
Same attempt, no retry cost,
Not a single workflow lost. 🌟

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Recover broker transport failures in workflow runs' directly and specifically summarizes the main change: implementing broker recovery mechanisms for transport failures.
Linked Issues check ✅ Passed The PR successfully addresses both robustness gaps from issue #9: broker restart/recovery with retries on transient errors, and transient agent-step network error replay without consuming retry budget.
Out of Scope Changes check ✅ Passed All changes are directly scoped to addressing broker recovery and transient network error handling as required by issue #9; no unrelated modifications detected.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed The pull request description is directly related to the changeset, detailing broker recovery, transient network retries, and test additions that match the file modifications.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch codex/issue-9-broker-recovery

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/core/src/runner.ts`:
- Around line 1780-1787: The broker-exit handler at packages/core/src/runner.ts
1780-1787 clears this.relay when the broker exits, but the check at
packages/core/src/runner.ts 6792-6901 throws an error before the
withBrokerRecovery() wrapper can restart the broker. Ensure that code which
checks for broker availability and throws "AgentRelay not initialized" is
wrapped within or deferred to occur after the withBrokerRecovery() mechanism at
line 6896 so that broker-exit events are handled through the recovery path
instead of immediately failing the interactive step.
- Around line 1839-1844: The brokerRecoveryPromise block unconditionally
restarts the broker without checking if agents are actively running, which can
invalidate WorkflowAgentHandle objects for live PTY agents when best-effort RPCs
like listAgents() or sendMessage() trigger recovery. Add a guard condition to
check if agents are currently active before proceeding with the relay shutdown
and broker restart in the brokerRecoveryPromise async function. If agents are
live, either skip the restart recovery or explicitly fail and respawn those
active steps before swapping the broker. Restrict the restart recovery to safe
pre-spawn paths only.
- Around line 5007-5022: The transient network error retry block in the code
does not release any previously spawned agent PTY before replaying the same
attempt, which can result in multiple agents running concurrently for the same
logical attempt. Before setting repeatSameAttempt to true and continuing the
retry loop, add code to release the spawned agent/PTY that may have been created
by spawnAndWait(). This cleanup is necessary because spawnAndWait() does not
release the spawned agent on generic errors during retry scenarios. The same
issue occurs at another location (lines 7053-7089) where similar replay/retry
logic exists, so apply the same cleanup there as well before the corresponding
repeat attempt continues.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 6f7db716-4db9-4efa-aa4e-24d510279bdc

📥 Commits

Reviewing files that changed from the base of the PR and between b415ca4 and 6d97ca5.

📒 Files selected for processing (3)
  • packages/core/src/__tests__/workflow-reliability-contract.test.ts
  • packages/core/src/__tests__/workflow-runner.test.ts
  • packages/core/src/runner.ts

Comment thread packages/core/src/runner.ts Outdated
Comment thread packages/core/src/runner.ts
Comment thread packages/core/src/runner.ts
@khaliqgant khaliqgant force-pushed the codex/issue-9-broker-recovery branch from 6d97ca5 to e0103b6 Compare June 16, 2026 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Shared broker wedges in 'init' (no recovery) ~30min into multi-workflow runs; agent-step transient errors hard-fail

1 participant