
fix: wake coordinator sessions reliably #240

Open
richardfogaca wants to merge 1 commit into compozy:main from richardfogaca:fix/session-continuation

Conversation

Contributor

@richardfogaca richardfogaca commented May 30, 2026

What this changes

This PR makes task orchestration continue without manual intervention after a worker run changes state. When a task run is enqueued, retried, or recovered, AGH now wakes the detached/waiting coordinator session so it can route the next step immediately.

Why

In multi-agent task execution, a worker can finish and enqueue the next transition while the coordinator is parked. Without an explicit wake signal, the coordinator may sit idle until someone manually interrupts or prompts it. That made normal child -> receipt -> coordinator continuation unreliable.

Implementation notes

  • Adds synthetic prompt support for waking waiting agent sessions.
  • Wakes coordinator sessions from enqueue/retry/recovery paths.
  • Preserves queued prompt semantics, including interrupt behavior for sessions already waiting on input.
  • Adds regression coverage around coordinator wake and heartbeat behavior.
  • Includes the existing MDX formatting cleanup required by repo verification.

Reviewer focus

  • Confirm the wake path only nudges existing coordinator sessions and does not create duplicate coordinators.
  • Confirm queued prompt/interrupt behavior is still conservative for active sessions.

Validation

  • PATH="$HOME/.bun/bin:$PATH" make verify

Summary by CodeRabbit

Release Notes

  • New Features

    • Added synthetic prompt support with ability to interrupt waiting agent sessions
    • Coordinator bootstrap now leverages synthetic prompts to wake existing coordinator sessions
  • Bug Fixes

    • Improved heartbeat wake handling when sessions have active prompts
    • Ensured task enqueue notifications dispatch correctly for retry and recovery operations


Copilot AI review requested due to automatic review settings May 30, 2026 01:11

vercel Bot commented May 30, 2026

@richardfogaca is attempting to deploy a commit to the Compozy Team on Vercel.

A member of the Team first needs to authorize it.


coderabbitai Bot commented May 30, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a84c4a8a-25e9-4124-bf75-c31be19deaec

📥 Commits

Reviewing files that changed from the base of the PR and between f04c843 and c4e924c.

⛔ Files ignored due to path filters (1)
  • packages/site/content/runtime/core/configuration/config-toml.mdx is excluded by !**/*.mdx
📒 Files selected for processing (12)
  • internal/api/core/interfaces.go
  • internal/api/core/session_manager_stub_test.go
  • internal/api/testutil/session_stub.go
  • internal/daemon/coordinator_runtime.go
  • internal/daemon/coordinator_runtime_test.go
  • internal/daemon/harness_reentry_bridge.go
  • internal/daemon/heartbeat_wake_runtime_test.go
  • internal/session/manager_test.go
  • internal/session/session.go
  • internal/session/synthetic_prompt.go
  • internal/task/force_ops.go
  • internal/task/manager_test.go
🚧 Files skipped from review as they are similar to previous changes (12)
  • internal/api/core/interfaces.go
  • internal/session/synthetic_prompt.go
  • internal/task/force_ops.go
  • internal/api/testutil/session_stub.go
  • internal/daemon/heartbeat_wake_runtime_test.go
  • internal/session/session.go
  • internal/api/core/session_manager_stub_test.go
  • internal/daemon/harness_reentry_bridge.go
  • internal/session/manager_test.go
  • internal/daemon/coordinator_runtime.go
  • internal/task/manager_test.go
  • internal/daemon/coordinator_runtime_test.go

Walkthrough

Adds PromptSynthetic with InterruptIfAgentWaiting and an interrupt-first synthetic prompt flow, wires it into coordinator bootstrap and harness reentry, updates task enqueue dispatch, and expands tests and stubs to validate prompt and interrupt behaviors.

Changes

Synthetic Prompt Interruption with Coordinator Integration

| Layer | File(s) | Summary |
|---|---|---|
| Interface contracts and test infrastructure | internal/api/core/interfaces.go, internal/api/core/session_manager_stub_test.go, internal/api/testutil/session_stub.go | SessionManager gains PromptSynthetic; test stubs add callback/hooks and implementations to emulate synthetic prompt behavior in tests. |
| Session manager synthetic prompt with interrupt support | internal/session/synthetic_prompt.go, internal/session/session.go, internal/session/manager_test.go | SyntheticPromptOpts adds InterruptIfAgentWaiting; Manager.PromptSynthetic implements interrupt-cancel-wait-resubmit path with helpers and retryable-error classification; tests assert interrupt, cancel, and synthetic-turn behavior. |
| Coordinator runtime bootstrap and prompting | internal/daemon/coordinator_runtime.go, internal/daemon/coordinator_runtime_test.go | Adds coordinatorSessionManager.PromptSynthetic, implements promptCoordinator to wake coordinators with interrupt enabled, drains agent event streams, and tests record/assert prompt calls and metadata. |
| Harness reentry and heartbeat wake integration | internal/daemon/harness_reentry_bridge.go, internal/daemon/heartbeat_wake_runtime_test.go | dispatchWake sets InterruptIfAgentWaiting:true; heartbeat skip logic avoids finalization for active-prompt/race reasons; integration test verifies skipped heartbeat path and direct reentry prompting. |
| Task retry and recovery enqueue dispatch | internal/task/force_ops.go, internal/task/manager_test.go | RetryRun/RecoverRun now call dispatchTaskRunEnqueued after recording events; tests capture enqueued run IDs via hooks and assert exact order with slices.Equal. |

Sequence Diagram(s)

sequenceDiagram
  participant SessionMgr as Manager.PromptSynthetic
  participant Session as Session
  participant Driver as AgentDriver
  SessionMgr->>Session: isCurrentPromptAgentWaiting()
  alt Agent Waiting & Interrupt Enabled
    SessionMgr->>Driver: Cancel current prompt
    SessionMgr->>Session: Wait for idle
    SessionMgr->>SessionMgr: Submit synthetic prompt
  else Retryable Error
    SessionMgr->>SessionMgr: Fall back to queue/busy logic
  else Non-Retryable
    SessionMgr-->>SessionMgr: Return error
  end
  SessionMgr->>Driver: Prompt with synthetic turn source
sequenceDiagram
  participant Runtime as CoordinatorRuntime
  participant Sessions as SessionManager
  participant Coordinator as CoordinatorSession
  alt Existing Healthy Coordinator
    Runtime->>Sessions: PromptSynthetic (InterruptIfAgentWaiting)
    Sessions->>Coordinator: agent event stream (acp.AgentEvent)
  end
  alt Creating New Coordinator
    Runtime->>Sessions: Create session
    Runtime->>Sessions: PromptSynthetic (InterruptIfAgentWaiting)
    Sessions->>Coordinator: agent event stream (acp.AgentEvent)
    Runtime->>Runtime: Dispatch OnCoordinatorSpawned
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • compozy/agh#44: Main PR’s synthetic prompt feature (including updating harness_reentry_bridge’s dispatchWake to set InterruptIfAgentWaiting=true on PromptSynthetic) directly intersects with PR #44’s harness reentry bridge work that dispatches synthetic wake via sessions.PromptSynthetic.
  • compozy/agh#66: Adds the agent_waiting runtime activity heartbeat signal that the interrupt flow relies on when detecting agent-waiting prompt turns.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'fix: wake coordinator sessions reliably' directly and clearly summarizes the main change (improving coordinator session wake behavior), which aligns with the PR's core objective of ensuring task orchestration continues automatically. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/daemon/coordinator_runtime_test.go`:
- Around line 83-97: The test currently verifies prompt count, id, metadata and
message but misses asserting the InterruptIfAgentWaiting flag; update the
synthetic prompt assertions (where you call sessions.promptCount(),
sessions.promptCall(0) and inspect prompt.opts) to add an explicit check that
prompt.opts.InterruptIfAgentWaiting == true (or use a boolean assertion) so the
InterruptIfAgentWaiting bit cannot be dropped; apply the same addition to the
other similar prompt assertion sites that inspect prompt.opts (the other
promptCall checks in this file).

In `@internal/daemon/coordinator_runtime.go`:
- Around line 388-409: The drain goroutine started by coordinatorRuntime calling
drainCoordinatorPromptEvents can leak because it blocks on range events forever;
change drainCoordinatorPromptEvents to accept a context (e.g., ctx
context.Context) and inside the loop use a select to listen for ctx.Done() in
addition to receiving from the events channel so the goroutine exits when the
runtime shuts down, and update the caller in coordinatorRuntime (where
PromptSynthetic is used) to pass the runtime shutdown context (or a managed
worker context) so each wake's drain is bounded to the runtime lifecycle;
alternatively register the goroutine with the runtime's worker manager or a
WaitGroup so it is tracked and cancelled on shutdown.

In `@internal/daemon/heartbeat_wake_runtime_test.go`:
- Around line 318-389: The test currently only asserts
dispatchHeartbeatWake(...) returned false; to exercise the full fallback, after
the handled == false check call the bridge's dispatchWake(...) (the terminal
path that issues the direct PromptSynthetic fallback) with the same
harnessSyntheticWake payload, then assert sessions.syntheticPromptCount() == 1
and re-query db.ListHeartbeatWakeEvents(...) to verify the wake event reflects
the direct reentry (check Result/Reason or new event indicating the fallback was
sent); this ensures dispatchHeartbeatWake + dispatchWake end-to-end fallback
behavior is validated rather than only the helper return value.

In `@internal/session/manager_test.go`:
- Around line 2411-2493: The test
TestPromptSyntheticInterruptsAgentWaitingTurnWhenRequested must be converted to
use the required t.Run("Should ...") subtest pattern: wrap the existing test
body inside t.Run("Should interrupt agent waiting when synthetic prompt requests
it", func(t *testing.T) { ... }), move t.Parallel() into the subtest, and ensure
all uses of t (Cleanup, Fatalf, etc.) remain using the subtest's t; keep the
existing references like h.driver.cancelHook, h.driver.promptHook,
manager.Prompt, manager.PromptSynthetic, collectEvents and managerPromptCalls
unchanged inside that subtest.
- Around line 2416-2418: The cleanup currently ignores the error returned by
h.manager.Stop(testutil.Context(t), session.ID); change the t.Cleanup closure to
handle that error instead of assigning to `_` — call h.manager.Stop with the
same args and if it returns a non-nil error report it via t.Fatalf or t.Errorf
(e.g., t.Fatalf("failed to stop session %s: %v", session.ID, err)) so the test
doesn't silently swallow Stop errors; keep the call inside the existing
t.Cleanup and preserve use of testutil.Context(t), h.manager.Stop, and
session.ID.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 05468ba9-59f6-45aa-96c0-84dfa6a85814

📥 Commits

Reviewing files that changed from the base of the PR and between cd4b155 and f04c843.

⛔ Files ignored due to path filters (1)
  • packages/site/content/runtime/core/configuration/config-toml.mdx is excluded by !**/*.mdx
📒 Files selected for processing (12)
  • internal/api/core/interfaces.go
  • internal/api/core/session_manager_stub_test.go
  • internal/api/testutil/session_stub.go
  • internal/daemon/coordinator_runtime.go
  • internal/daemon/coordinator_runtime_test.go
  • internal/daemon/harness_reentry_bridge.go
  • internal/daemon/heartbeat_wake_runtime_test.go
  • internal/session/manager_test.go
  • internal/session/session.go
  • internal/session/synthetic_prompt.go
  • internal/task/force_ops.go
  • internal/task/manager_test.go

Comment thread internal/daemon/coordinator_runtime_test.go
Comment thread internal/daemon/coordinator_runtime.go Outdated
Comment thread internal/daemon/heartbeat_wake_runtime_test.go
Comment thread internal/session/manager_test.go
Comment thread internal/session/manager_test.go Outdated
@richardfogaca richardfogaca force-pushed the fix/session-continuation branch from f04c843 to c4e924c Compare May 30, 2026 01:41
Member

@pedronauck pedronauck left a comment


Automated review — self + Codex (gpt-5.5, xhigh)

Two independent reviews of PR #240 (fix: wake coordinator sessions reliably, commit c4e924c5). The fix wakes an existing detached/waiting coordinator via a synthetic prompt on the enqueue/retry/recovery paths and makes the heartbeat-wake path fall back to a direct synthetic prompt instead of dropping the run. No schema change, no raw claim_token leakage (claim_token_hash-only), %w + errors.Is/As intact, authoritative-primitive exclusivity preserved (the wake instructs the coordinator to run agh task next; it never calls ClaimNextRun).

  • Self verdict: FIX_BEFORE_SHIP (2 risks)
  • Codex verdict: FIX_BEFORE_SHIP (2 blockers, 4 risks, 0 nits)

Both reviews agree the change is functionally correct in the common case; the lock-hold and unbounded-interrupt paths should be fixed before relying on this under contention.

Blockers

  • [codex B-1] internal/daemon/coordinator_runtime.go:298,308,351: bootstrapRun holds coordinatorRuntime.mu across promptCoordinator, whose InterruptIfAgentWaiting path can block up to TimeoutCancelGrace (~30s) inside a synchronous enqueue observer, serializing all coordinator enqueue/retry/recovery bootstrap behind one slow/stuck session. Fix: resolve the target session under r.mu, then release the lock before the wake (or wake on an owned goroutine after Unlock).
  • [codex B-2] internal/session/synthetic_prompt.go:90: interruptAndSubmitSyntheticPrompt calls CancelPrompt under context.WithoutCancel with no timeout before waitForPromptIdle, so a stuck cancel/tool-interrupt can block the wake indefinitely. Fix: wrap CancelPrompt + interrupt + wait in a bounded context.

Risks

  • [self+codex R-1] coordinator_runtime.go:388: go drainCoordinatorPromptEvents(...) is an untracked goroutine, not joined by any daemon WaitGroup at shutdown. Fix: track via a runtime-owned WaitGroup / lifecycle context.
  • [codex R-2] heartbeat_wake_runtime_test.go:318 — fallback regression covers WakeReasonSessionPromptActive but not WakeReasonSessionPromptRace. Fix: add a sibling SessionPromptRace case.
  • [codex R-3] manager_test.go:4874 — force-op enqueue-hook assertions check only payload.RunID, not the full TaskRunContext. Fix: assert the full TaskRunEnqueuedPayload for retry and recover.
  • [codex R-4] concurrency test asserts singleton creation but not the synthetic-prompt count/metadata. Fix: assert exactly one prompt with expected metadata.

Codex ran once via compozy exec --ide codex --model gpt-5.5 --reasoning-effort xhigh (exit 0, schema-valid findings). No source files modified.

if err != nil {
	return fmt.Errorf("daemon: prompt coordinator session %q: %w", sessionID, err)
}
go drainCoordinatorPromptEvents(ctx, r.logger, sessionID, strings.TrimSpace(decision.RunID), events)
Member


There's a lint failure here: G118: Goroutine uses context.Background/TODO while request-scoped context is available (gosec)
