fix: wake coordinator sessions reliably#240
Conversation
|
@richardfogaca is attempting to deploy a commit to the Compozy Team on Vercel. A member of the Team first needs to authorize it. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (12)
🚧 Files skipped from review as they are similar to previous changes (12)
WalkthroughAdds PromptSynthetic with InterruptIfAgentWaiting and an interrupt-first synthetic prompt flow, wires it into coordinator bootstrap and harness reentry, updates task enqueue dispatch, and expands tests and stubs to validate prompt and interrupt behaviors. ChangesSynthetic Prompt Interruption with Coordinator Integration
Sequence Diagram(s)sequenceDiagram
participant SessionMgr as Manager.PromptSynthetic
participant Session as Session
participant Driver as AgentDriver
SessionMgr->>Session: isCurrentPromptAgentWaiting()
alt Agent Waiting & Interrupt Enabled
SessionMgr->>Driver: Cancel current prompt
SessionMgr->>Session: Wait for idle
SessionMgr->>SessionMgr: Submit synthetic prompt
else Retryable Error
SessionMgr->>SessionMgr: Fall back to queue/busy logic
else Non-Retryable
SessionMgr-->>SessionMgr: Return error
end
SessionMgr->>Driver: Prompt with synthetic turn source
sequenceDiagram
participant Runtime as CoordinatorRuntime
participant Sessions as SessionManager
participant Coordinator as CoordinatorSession
alt Existing Healthy Coordinator
Runtime->>Sessions: PromptSynthetic (InterruptIfAgentWaiting)
Sessions->>Coordinator: agent event stream (acp.AgentEvent)
end
alt Creating New Coordinator
Runtime->>Sessions: Create session
Runtime->>Sessions: PromptSynthetic (InterruptIfAgentWaiting)
Sessions->>Coordinator: agent event stream (acp.AgentEvent)
Runtime->>Runtime: Dispatch OnCoordinatorSpawned
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/daemon/coordinator_runtime_test.go`:
- Around line 83-97: The test currently verifies prompt count, id, metadata and
message but misses asserting the InterruptIfAgentWaiting flag; update the
synthetic prompt assertions (where you call sessions.promptCount(),
sessions.promptCall(0) and inspect prompt.opts) to add an explicit check that
prompt.opts.InterruptIfAgentWaiting == true (or use a boolean assertion) so the
InterruptIfAgentWaiting bit cannot be dropped; apply the same addition to the
other similar prompt assertion sites that inspect prompt.opts (the other
promptCall checks in this file).
In `@internal/daemon/coordinator_runtime.go`:
- Around line 388-409: The drain goroutine started by coordinatorRuntime calling
drainCoordinatorPromptEvents can leak because it blocks on range events forever;
change drainCoordinatorPromptEvents to accept a context (e.g., ctx
context.Context) and inside the loop use a select to listen for ctx.Done() in
addition to receiving from the events channel so the goroutine exits when the
runtime shuts down, and update the caller in coordinatorRuntime (where
PromptSynthetic is used) to pass the runtime shutdown context (or a managed
worker context) so each wake's drain is bounded to the runtime lifecycle;
alternatively register the goroutine with the runtime's worker manager or a
WaitGroup so it is tracked and cancelled on shutdown.
In `@internal/daemon/heartbeat_wake_runtime_test.go`:
- Around line 318-389: The test currently only asserts
dispatchHeartbeatWake(...) returned false; to exercise the full fallback, after
the handled == false check call the bridge's dispatchWake(...) (the terminal
path that issues the direct PromptSynthetic fallback) with the same
harnessSyntheticWake payload, then assert sessions.syntheticPromptCount() == 1
and re-query db.ListHeartbeatWakeEvents(...) to verify the wake event reflects
the direct reentry (check Result/Reason or new event indicating the fallback was
sent); this ensures dispatchHeartbeatWake + dispatchWake end-to-end fallback
behavior is validated rather than only the helper return value.
In `@internal/session/manager_test.go`:
- Around line 2411-2493: The test
TestPromptSyntheticInterruptsAgentWaitingTurnWhenRequested must be converted to
use the required t.Run("Should ...") subtest pattern: wrap the existing test
body inside t.Run("Should interrupt agent waiting when synthetic prompt requests
it", func(t *testing.T) { ... }), move t.Parallel() into the subtest, and ensure
all uses of t (Cleanup, Fatalf, etc.) remain using the subtest's t; keep the
existing references like h.driver.cancelHook, h.driver.promptHook,
manager.Prompt, manager.PromptSynthetic, collectEvents and managerPromptCalls
unchanged inside that subtest.
- Around line 2416-2418: The cleanup currently ignores the error returned by
h.manager.Stop(testutil.Context(t), session.ID); change the t.Cleanup closure to
handle that error instead of assigning to `_` — call h.manager.Stop with the
same args and if it returns a non-nil error report it via t.Fatalf or t.Errorf
(e.g., t.Fatalf("failed to stop session %s: %v", session.ID, err)) so the test
doesn't silently swallow Stop errors; keep the call inside the existing
t.Cleanup and preserve use of testutil.Context(t), h.manager.Stop, and
session.ID.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 05468ba9-59f6-45aa-96c0-84dfa6a85814
⛔ Files ignored due to path filters (1)
packages/site/content/runtime/core/configuration/config-toml.mdxis excluded by!**/*.mdx
📒 Files selected for processing (12)
internal/api/core/interfaces.gointernal/api/core/session_manager_stub_test.gointernal/api/testutil/session_stub.gointernal/daemon/coordinator_runtime.gointernal/daemon/coordinator_runtime_test.gointernal/daemon/harness_reentry_bridge.gointernal/daemon/heartbeat_wake_runtime_test.gointernal/session/manager_test.gointernal/session/session.gointernal/session/synthetic_prompt.gointernal/task/force_ops.gointernal/task/manager_test.go
f04c843 to
c4e924c
Compare
pedronauck
left a comment
There was a problem hiding this comment.
Automated review — self + Codex (gpt-5.5, xhigh)
Two independent reviews of PR #240 (fix: wake coordinator sessions reliably, commit c4e924c5). The fix wakes an existing detached/waiting coordinator via a synthetic prompt on the enqueue/retry/recovery paths and makes the heartbeat-wake path fall back to a direct synthetic prompt instead of dropping the run. No schema change, no raw claim_token leakage (claim_token_hash-only), %w + errors.Is/As intact, authoritative-primitive exclusivity preserved (the wake instructs the coordinator to run agh task next; it never calls ClaimNextRun).
- Self verdict: FIX_BEFORE_SHIP (2 risks)
- Codex verdict: FIX_BEFORE_SHIP (2 blockers, 4 risks, 0 nits)
Both reviews agree the change is functionally correct in the common case; the lock-hold and unbounded-interrupt paths should be fixed before relying on this under contention.
Blockers
- [codex B-1]
internal/daemon/coordinator_runtime.go:298,308,351—bootstrapRunholdscoordinatorRuntime.muacrosspromptCoordinator, whoseInterruptIfAgentWaitingpath can block up toTimeoutCancelGrace(~30s) inside a synchronous enqueue observer, serializing all coordinator enqueue/retry/recovery bootstrap behind one slow/stuck session. Fix: resolve the target session underr.mu, then release the lock before the wake (or wake on an owned goroutine afterUnlock). - [codex B-2]
internal/session/synthetic_prompt.go:90—interruptAndSubmitSyntheticPromptcallsCancelPromptundercontext.WithoutCancelwith no timeout beforewaitForPromptIdle, so a stuck cancel/tool-interrupt can block the wake indefinitely. Fix: wrap CancelPrompt + interrupt + wait in a bounded context.
Risks
- [self+codex R-1]
coordinator_runtime.go:388—go drainCoordinatorPromptEvents(...)is an untracked goroutine, not joined by any daemon WaitGroup at shutdown. Fix: track via a runtime-owned WaitGroup / lifecycle context. - [codex R-2]
heartbeat_wake_runtime_test.go:318— fallback regression coversWakeReasonSessionPromptActivebut notWakeReasonSessionPromptRace. Fix: add a siblingSessionPromptRacecase. - [codex R-3]
manager_test.go:4874— force-op enqueue-hook assertions check onlypayload.RunID, not the fullTaskRunContext. Fix: assert the fullTaskRunEnqueuedPayloadfor retry and recover. - [codex R-4] concurrency test asserts singleton creation but not the synthetic-prompt count/metadata. Fix: assert exactly one prompt with expected metadata.
Codex ran once via compozy exec --ide codex --model gpt-5.5 --reasoning-effort xhigh (exit 0, schema-valid findings). No source files modified.
| if err != nil { | ||
| return fmt.Errorf("daemon: prompt coordinator session %q: %w", sessionID, err) | ||
| } | ||
| go drainCoordinatorPromptEvents(ctx, r.logger, sessionID, strings.TrimSpace(decision.RunID), events) |
There was a problem hiding this comment.
there's an failure here: G118: Goroutine uses context.Background/TODO while request-scoped context is available (gosec)
What this changes
This PR makes task orchestration continue without manual intervention after a worker run changes state. When a task run is enqueued, retried, or recovered, AGH now wakes the detached/waiting coordinator session so it can route the next step immediately.
Why
In multi-agent task execution, a worker can finish and enqueue the next transition while the coordinator is parked. Without an explicit wake signal, the coordinator may sit idle until someone manually interrupts or prompts it. That made normal child -> receipt -> coordinator continuation unreliable.
Implementation notes
Reviewer focus
Validation
PATH="$HOME/.bun/bin:$PATH" make verifySummary by CodeRabbit
Release Notes
New Features
Bug Fixes