Recover broker transport failures in workflow runs #10
📝 Walkthrough
Adds broker-recovery primitives (
Changes
Broker Recovery and Transient Agent-Network Retry
Sequence Diagram(s)
sequenceDiagram
participant Runner
participant withBrokerRecovery
participant recoverBroker
participant HarnessDriverClient
participant AgentStep
Runner->>withBrokerRecovery: relay.spawnPty(...)
withBrokerRecovery-->>Runner: TransientBrokerError
withBrokerRecovery->>recoverBroker: restart broker (BrokerRunContext)
recoverBroker->>HarnessDriverClient: spawn new broker
HarnessDriverClient-->>recoverBroker: newRelay
recoverBroker->>recoverBroker: clearRelayListeners + re-wire events
recoverBroker-->>withBrokerRecovery: recovered relay
withBrokerRecovery->>withBrokerRecovery: retry relay.spawnPty(...)
withBrokerRecovery-->>Runner: success
Runner->>AgentStep: execute (attempt N)
AgentStep-->>Runner: isTransientAgentNetworkError → repeatSameAttempt=true
Runner->>AgentStep: replay attempt N (suppress step:started)
AgentStep-->>Runner: success
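The recovery path in the diagram can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: only the names `withBrokerRecovery`, `recoverBroker`, `TransientBrokerError` and `spawnPty` come from the diagram, while the `Relay` interface, the `RecoveryOptions` fields, the injectable `sleep`, and the exponential backoff schedule are assumptions made for the example.

```typescript
// Sketch only: signatures and option names below are hypothetical.
class TransientBrokerError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "TransientBrokerError";
  }
}

// Check by name so the guard also works when Error subclassing is downleveled.
function isTransientBrokerError(err: unknown): boolean {
  return err instanceof Error && err.name === "TransientBrokerError";
}

interface Relay {
  spawnPty(name: string): Promise<string>;
}

interface RecoveryOptions {
  maxRecoveries: number; // assumed cap on broker restarts per call
  baseDelayMs: number; // assumed backoff base
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withBrokerRecovery<T>(
  getRelay: () => Relay,
  recoverBroker: () => Promise<void>,
  op: (relay: Relay) => Promise<T>,
  opts: RecoveryOptions,
): Promise<T> {
  const sleep = opts.sleep ?? defaultSleep;
  for (let recoveries = 0; ; recoveries++) {
    try {
      // Re-read the relay each time: recovery may have swapped it.
      return await op(getRelay());
    } catch (err) {
      // Non-transient errors and exhausted budgets propagate unchanged.
      if (!isTransientBrokerError(err) || recoveries >= opts.maxRecoveries) {
        throw err;
      }
      await sleep(opts.baseDelayMs * 2 ** recoveries);
      await recoverBroker();
    }
  }
}
```

Reading the relay through `getRelay()` on every iteration, rather than capturing it once, is what lets the retried call reach the relay created by the restart.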
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 5 passed
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/core/src/runner.ts`:
- Around line 1780-1787: The broker-exit handler at packages/core/src/runner.ts
1780-1787 clears this.relay when the broker exits, but the check at
packages/core/src/runner.ts 6792-6901 throws an error before the
withBrokerRecovery() wrapper can restart the broker. Ensure that code which
checks for broker availability and throws "AgentRelay not initialized" is
wrapped within or deferred to occur after the withBrokerRecovery() mechanism at
line 6896 so that broker-exit events are handled through the recovery path
instead of immediately failing the interactive step.
- Around line 1839-1844: The brokerRecoveryPromise block unconditionally
restarts the broker without checking if agents are actively running, which can
invalidate WorkflowAgentHandle objects for live PTY agents when best-effort RPCs
like listAgents() or sendMessage() trigger recovery. Add a guard condition to
check if agents are currently active before proceeding with the relay shutdown
and broker restart in the brokerRecoveryPromise async function. If agents are
live, either skip the restart recovery or explicitly fail and respawn those
active steps before swapping the broker. Restrict the restart recovery to safe
pre-spawn paths only.
- Around lines 5007-5022: The transient network error retry block does not
release a previously spawned agent PTY before replaying the same attempt, so
multiple agents can end up running concurrently for the same logical attempt.
Before setting repeatSameAttempt to true and continuing the retry loop, release
any agent/PTY that spawnAndWait() may have created. This cleanup is necessary
because spawnAndWait() does not release the spawned agent on generic errors.
The similar replay/retry logic at lines 7053-7089 has the same issue, so apply
the same cleanup there before the repeated attempt continues.
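As an illustration of the second finding, the guard could take roughly the following shape. This is a hedged sketch under assumptions, not the runner's code: `BrokerSupervisor`, `trackAgent`, `untrackAgent` and the `restartBroker` callback are hypothetical names, and the single in-flight promise only loosely mirrors the `brokerRecoveryPromise` mentioned above.

```typescript
// Sketch only: all names here are hypothetical stand-ins for runner internals.
class BrokerSupervisor {
  private activeAgents = new Set<string>();
  private recovery: Promise<void> | null = null;

  constructor(private restartBroker: () => Promise<void>) {}

  trackAgent(name: string): void {
    this.activeAgents.add(name);
  }

  untrackAgent(name: string): void {
    this.activeAgents.delete(name);
  }

  recover(): Promise<void> {
    // Guard: a restart would invalidate handles held by live PTY agents,
    // so refuse it outside safe pre-spawn paths.
    if (this.activeAgents.size > 0) {
      return Promise.reject(
        new Error(`refusing broker restart: ${this.activeAgents.size} live agent(s)`),
      );
    }
    // Single flight: concurrent callers share one restart.
    if (this.recovery === null) {
      const clear = (): void => {
        this.recovery = null;
      };
      this.recovery = this.restartBroker().then(
        () => {
          clear();
        },
        (err: unknown) => {
          clear();
          throw err;
        },
      );
    }
    return this.recovery;
  }
}
```

Rejecting instead of silently skipping makes the caller decide what to do with the live steps, which corresponds to the "explicitly fail and respawn" option in the comment; skipping the restart would be the other option it names.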
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 6f7db716-4db9-4efa-aa4e-24d510279bdc
📒 Files selected for processing (3)
packages/core/src/__tests__/workflow-reliability-contract.test.ts
packages/core/src/__tests__/workflow-runner.test.ts
packages/core/src/runner.ts
6d97ca5 to e0103b6
e0103b6 to 247aafb
Summary
Verification
Closes #9
Summary by cubic
Adds automatic broker recovery and transient network retries to make workflow runs more reliable. Broker-backed operations are retried with backoff, the broker restarts on retryable relay failures, and agent-step network errors are replayed within the same attempt so they don’t consume the step retry budget. Closes #9.
packages/core.
Written for commit 247aafb. Summary will update on new commits.
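The "replayed within the same attempt" behaviour described in the summary, combined with the cleanup requested in the review, might look like the sketch below. It is an assumption-laden illustration: `runStepWithReplay`, `SpawnedAgent`, `maxReplays` and the split of `spawnAndWait()` into separate `spawn` and `wait` callbacks are all hypothetical, chosen so the release step is visible.

```typescript
// Sketch only: hypothetical names; the real runner uses spawnAndWait().
interface SpawnedAgent {
  release(): Promise<void>;
}

interface StepOutcome<T> {
  value: T;
  attemptsUsed: number; // retry budget consumed by non-transient failures
}

async function runStepWithReplay<T>(
  spawn: () => Promise<SpawnedAgent>,
  wait: (agent: SpawnedAgent) => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxRetries: number,
  maxReplays: number, // assumed cap so replays cannot loop forever
): Promise<StepOutcome<T>> {
  let attemptsUsed = 0;
  let replays = 0;
  while (true) {
    let agent: SpawnedAgent | undefined;
    try {
      agent = await spawn();
      // On success the agent is left to the caller; only failures clean up here.
      return { value: await wait(agent), attemptsUsed };
    } catch (err) {
      // Release the PTY first so a replay never runs alongside a stale agent.
      if (agent) {
        await agent.release().catch(() => undefined);
      }
      if (isTransient(err) && replays < maxReplays) {
        replays++; // same attempt again: the retry budget is untouched
        continue;
      }
      if (attemptsUsed >= maxRetries) {
        throw err;
      }
      attemptsUsed++;
    }
  }
}
```

Keeping the replay counter separate from `attemptsUsed` is what prevents transient network errors from eating into the step's retry budget, while the cap on replays bounds the loop if the network never recovers.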