Write live daemon heartbeat #276
Caution: Review failed. Pull request was closed or merged during review.
Walkthrough
This PR adds a live daemon heartbeat mechanism to FactoryLoop, enabling periodic status updates during live mode execution. The implementation includes heartbeat state tracking, interval computation, lifecycle integration into startup/stop flows, and comprehensive test coverage for refresh timing and shutdown ordering.
Changes
Live daemon heartbeat mechanism for FactoryLoop
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ Passed checks (5 passed)
Code Review
This pull request introduces a live heartbeat mechanism for the FactoryLoop daemon to track process liveness, along with corresponding unit tests. The feedback points out a significant performance bottleneck where the periodic heartbeat write triggers redundant process lookups and API calls by rewriting the in-flight registry. It is recommended to write the heartbeat file directly to avoid these unnecessary operations.
async #writeLiveHeartbeat(status: FactoryLoopHeartbeat['status']): Promise<void> {
  await this.#writeLoopHeartbeat(
    this.#config.loop.heartbeatPath,
    this.#config.loop.registryPath,
    status,
    0,
    0,
  )
}
Every time #writeLiveHeartbeat is called (which happens every 15 seconds by default via the live heartbeat refresh timer), it calls #writeLoopHeartbeat. #writeLoopHeartbeat in turn calls #writeInFlightRegistry, which performs process lookups (#processFinder) and potentially fleet API calls (resolveAgentPid) for every active agent.
This introduces a significant performance and efficiency bottleneck for the live daemon, causing redundant CPU usage and unnecessary API spam every 15 seconds. Since the in-flight registry is already written and kept in sync whenever agents are spawned, exited, or stopped, we can safely write the heartbeat file directly in #writeLiveHeartbeat without rewriting the registry.
async #writeLiveHeartbeat(status: FactoryLoopHeartbeat['status']): Promise<void> {
  const path = this.#config.loop.heartbeatPath
  const updatedAtMs = this.#clock.now()
  const heartbeat: FactoryLoopHeartbeat = {
    pid: process.pid,
    status,
    iteration: 0,
    maxIterations: 0,
    updatedAt: new Date(updatedAtMs).toISOString(),
    updatedAtMs,
    registryPath: this.#config.loop.registryPath,
  }
  await mkdir(dirname(path), { recursive: true })
  await writeFile(path, `${JSON.stringify(heartbeat, null, 2)}\n`, 'utf8')
}
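For context on the 15-second cadence discussed in this thread: the PR description says the refresh interval defaults to 15s and is capped below the configured heartbeatStaleMs. The sketch below shows one way such a cap could be computed, and how an external reaper might judge staleness from updatedAtMs. The function names, the half-window cap, and the 1 ms floor are assumptions for illustration; the PR's actual interval computation is not shown in this conversation.

```typescript
// Hypothetical helpers: names and the half-window policy are assumptions,
// not taken from the PR's implementation.
const DEFAULT_REFRESH_MS = 15_000

/**
 * Pick a refresh interval that stays below the staleness threshold
 * (for thresholds of 2 ms or more).
 */
function computeRefreshIntervalMs(
  heartbeatStaleMs: number,
  defaultMs: number = DEFAULT_REFRESH_MS,
): number {
  // Assumption: cap at half the stale window so that a single late tick
  // does not make a healthy daemon look crashed.
  const cap = Math.max(1, Math.floor(heartbeatStaleMs / 2))
  return Math.min(defaultMs, cap)
}

/**
 * Reaper-side check: the heartbeat is stale when its last update is
 * older than the configured threshold.
 */
function isHeartbeatStale(
  updatedAtMs: number,
  nowMs: number,
  heartbeatStaleMs: number,
): boolean {
  return nowMs - updatedAtMs > heartbeatStaleMs
}
```

With a 60s stale window this keeps the 15s default; with a 20s window it drops to 10s, so at least one refresh should land inside every stale window.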
Implemented the PR fix. Changed Added regression coverage in Addressed comments
Advisory Notes
None.
Local validation run:
I did not run |
Summary
Writes and refreshes the factory loop heartbeat while `fleet factory start --mode live` is running, so the scheduled external `reap-orphans` cron does not mistake a healthy live daemon for a stale crashed factory.
Details
- `start --mode live` writes a `running` heartbeat immediately before establishing the live subscription.
- Refreshes `loop.heartbeatPath` / `loop.registryPath` every 15s by default, with the interval capped below the configured `heartbeatStaleMs`.
- `stop()` cancels the refresh timer and writes `stopping` before releasing in-flight agents.
- If the daemon dies, the heartbeat goes stale after `heartbeatStaleMs`, and the external reaper can reap the orphaned pair trees.
Limitation
This heartbeat detects daemon process death for the crash reaper. It does not prove the Relayfile subscription is connected or making progress, because the current `MountClient.subscribe(...) -> Subscription` port does not expose connected/keepalive/health state. Subscription-wedge detection remains a separate watchdog concern.
Verification
- `npx vitest run packages/factory-sdk/src/orchestrator/factory.test.ts`
  - `running` heartbeat.
  - `updatedAtMs`.
  - `stopping`.
  - `stopping` before releasing live in-flight agents.
- `npm run typecheck:node`
- `npx vitest run packages/factory-sdk`
Live cert handoff
fv2 cert target: