
Write live daemon heartbeat #276

Merged
kjgbot merged 1 commit into main from factory-sdk-live-heartbeat-sb-impl3
Jun 13, 2026
@kjgbot kjgbot commented Jun 13, 2026

Contributor

Summary

Writes and refreshes the factory loop heartbeat while fleet factory start --mode live is running, so the scheduled external reap-orphans cron does not mistake a healthy live daemon for a stale crashed factory.

Details

  • start --mode live writes a running heartbeat immediately before establishing the live subscription.
  • The live daemon refreshes the configured loop.heartbeatPath / loop.registryPath every 15s by default, with the interval capped below the configured heartbeatStaleMs.
  • stop() cancels the refresh timer and writes stopping before releasing in-flight agents.
  • Crash-backstop semantics are preserved: if the daemon process dies, the timer stops, the heartbeat goes stale after heartbeatStaleMs, and the external reaper can reap the orphaned pair trees.
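
The interval cap described above can be sketched as follows. This is a minimal illustration, not the PR's `liveHeartbeatIntervalMs` implementation: the function name `cappedHeartbeatIntervalMs` and the "half the stale window" rule are assumptions chosen only to show how a 15s default can be kept below `heartbeatStaleMs`.

```typescript
// Hypothetical sketch: the name and the exact capping rule are assumptions,
// not the code merged in this PR.
const DEFAULT_INTERVAL_MS = 15_000

// Returns the default interval unless it would exceed half the stale window,
// in which case half the stale window (at least 1 ms) is used instead.
function cappedHeartbeatIntervalMs(
  staleMs: number,
  defaultMs: number = DEFAULT_INTERVAL_MS,
): number {
  const ceiling = Math.max(1, Math.floor(staleMs / 2))
  return Math.min(defaultMs, ceiling)
}

console.log(cappedHeartbeatIntervalMs(60_000)) // 15000
console.log(cappedHeartbeatIntervalMs(10_000)) // 5000
```

Capping at a fraction of the stale window, rather than at the window itself, leaves room for one delayed or skipped refresh before the reaper would consider the heartbeat stale.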

Limitation

This heartbeat detects daemon process death for the crash reaper. It does not prove the Relayfile subscription is connected or making progress, because the current MountClient.subscribe(...) -> Subscription port does not expose connected/keepalive/health state. Subscription-wedge detection remains a separate watchdog concern.

Verification

  • npx vitest run packages/factory-sdk/src/orchestrator/factory.test.ts
    • live start writes an immediate fresh running heartbeat.
    • live heartbeat refresh advances updatedAtMs.
    • live stop writes stopping.
    • stop marks stopping before releasing live in-flight agents.
  • npm run typecheck:node
  • npx vitest run packages/factory-sdk
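
The shutdown ordering these tests check (cancel the timer, write stopping, then release agents) can be illustrated with a small stand-in. `LiveLoopSketch` and its event log are hypothetical and synchronous for brevity; the real `FactoryLoop` writes the heartbeat asynchronously.

```typescript
// Hypothetical sketch: LiveLoopSketch is illustrative, not FactoryLoop's API.
type HeartbeatStatus = 'running' | 'stopping'

class LiveLoopSketch {
  private timer: ReturnType<typeof setInterval> | undefined
  private readonly intervalMs: number
  readonly events: string[] = []

  constructor(intervalMs: number) {
    this.intervalMs = intervalMs
  }

  start(): void {
    // Write once immediately, then refresh on a timer.
    this.write('running')
    this.timer = setInterval(() => this.write('running'), this.intervalMs)
  }

  stop(): void {
    // 1. Cancel the refresh so no 'running' write can land after 'stopping'.
    if (this.timer !== undefined) clearInterval(this.timer)
    this.timer = undefined
    // 2. Record 'stopping' before any agent is released.
    this.write('stopping')
    // 3. Only then release in-flight agents (represented by a log entry here).
    this.events.push('release-agents')
  }

  private write(status: HeartbeatStatus): void {
    this.events.push(`heartbeat:${status}`)
  }
}

const loop = new LiveLoopSketch(15_000)
loop.start()
loop.stop()
console.log(loop.events)
// [ 'heartbeat:running', 'heartbeat:stopping', 'release-agents' ]
```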

Live cert handoff

fv2 cert target:

  • fresh live heartbeat -> scheduled cron does not reap healthy live pairs.
  • SIGKILL daemon -> heartbeat goes stale -> scheduled cron reaps orphaned pairs.
  • a mutation that removes the live heartbeat write should turn the cert RED by false-reaping live pairs.
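
The reaper-side decision behind these three targets can be sketched as a staleness check. Only the `updatedAtMs` field and the `heartbeatStaleMs` threshold come from this PR; the function `isHeartbeatStale`, the `HeartbeatSnapshot` type, and the choice to treat a missing heartbeat as stale are assumptions made for illustration.

```typescript
// Hypothetical sketch of a reaper-side check; not the actual reap-orphans code.
interface HeartbeatSnapshot {
  status: string
  updatedAtMs: number
}

// A heartbeat is stale when it is missing (assumed rule) or when its last
// update is older than the configured stale window.
function isHeartbeatStale(
  heartbeat: HeartbeatSnapshot | undefined,
  nowMs: number,
  staleMs: number,
): boolean {
  if (heartbeat === undefined) return true
  return nowMs - heartbeat.updatedAtMs > staleMs
}

const fresh: HeartbeatSnapshot = { status: 'running', updatedAtMs: 1_000_000 }

// Healthy live daemon: last refresh was 15s ago, stale window is 60s.
console.log(isHeartbeatStale(fresh, 1_015_000, 60_000)) // false

// Daemon killed: no refresh for 61s, so the pairs become reapable.
console.log(isHeartbeatStale(fresh, 1_061_000, 60_000)) // true
```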

@kjgbot kjgbot added the no-agent-relay-review Disable agent-relay automated PR review/fixes label Jun 13, 2026
coderabbitai Bot commented Jun 13, 2026


Caution

Review failed

Pull request was closed or merged during review

Walkthrough

This PR adds a live daemon heartbeat mechanism to FactoryLoop, enabling periodic status updates during live mode execution. The implementation includes heartbeat state tracking, interval computation, lifecycle integration into startup/stop flows, and comprehensive test coverage for refresh timing and shutdown ordering.

Changes

Live daemon heartbeat mechanism for FactoryLoop

| Layer / File(s) | Summary |
|---|---|
| Heartbeat state and interval utilities (packages/factory-sdk/src/orchestrator/factory.ts) | DEFAULT_LIVE_HEARTBEAT_INTERVAL_MS constant and FactoryLoop heartbeat state fields (#liveHeartbeatTimer, active/in-flight flags, refresh promise) enable tracking. liveHeartbeatIntervalMs(staleMs) utility computes refresh intervals bounded from configured staleness. |
| Heartbeat lifecycle integration into FactoryLoop (packages/factory-sdk/src/orchestrator/factory.ts) | Heartbeat starts when live subscription is initialized; stops with stopping status on live startup failure and in the main stop() method before subscriptions are torn down. |
| Heartbeat scheduling, refresh gating, and state writing (packages/factory-sdk/src/orchestrator/factory.ts) | #startLiveHeartbeat() and #stopLiveHeartbeat(status) manage timer lifecycle; refresh scheduling triggers periodic writes at computed intervals; refresh gating prevents concurrent writes; #writeLiveHeartbeat(status) records state via existing loop heartbeat writer. |
| Live mode heartbeat tests (packages/factory-sdk/src/orchestrator/factory.test.ts) | One test verifies initial heartbeat on live startup, clock advance triggers updatedAtMs refresh, and stop() transitions to stopping with final timestamp. Another test confirms stop() marks heartbeat as stopping before in-flight agent releases during shutdown. |
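
The "refresh gating prevents concurrent writes" behavior summarized above can be sketched with an in-flight flag. `GatedRefresher` is a hypothetical stand-in: the walkthrough mentions active/in-flight flags and a refresh promise, but the merged code's exact mechanism is not shown here.

```typescript
// Hypothetical sketch: skips a timer tick while the previous write is pending.
class GatedRefresher {
  private inFlight = false
  private readonly write: () => Promise<void>
  skipped = 0

  constructor(write: () => Promise<void>) {
    this.write = write
  }

  async tick(): Promise<void> {
    if (this.inFlight) {
      // A write is still pending, so drop this tick instead of overlapping.
      this.skipped += 1
      return
    }
    this.inFlight = true
    try {
      await this.write()
    } finally {
      // Reopen the gate even if the write throws.
      this.inFlight = false
    }
  }
}

let writesStarted = 0
const refresher = new GatedRefresher(async () => {
  writesStarted += 1
  // Simulate a slow disk write.
  await new Promise<void>((resolve) => setTimeout(resolve, 10))
})

// Two ticks fire back to back; only the first starts a write.
void refresher.tick()
void refresher.tick()
console.log(writesStarted, refresher.skipped) // 1 1
```

Dropping the overlapping tick is one option; an alternative design would await the pending promise and write again afterwards, at the cost of queued work building up if writes are slow.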

Possibly related PRs

  • AgentWorkforce/pear#248: Modifies FactoryLoop heartbeat/liveness lifecycle and status transitions on stop() in the same file with aligned heartbeat state management.
  • AgentWorkforce/pear#253: Extends heartbeat writing path in FactoryLoop.stop() with registry sidecar updates and overlaps directly in heartbeat state transitions.
  • AgentWorkforce/pear#245: Modifies live subscription startup flow in FactoryLoop, directly connected to the live mode heartbeat initialization in this PR.

🎯 3 (Moderate) | ⏱️ ~25 minutes

🐰 A heartbeat so steady, a loop so alive,
Timestamps refreshing, the daemons thrive.
Status transitions smooth, from start to the cease,
Live mode now monitors, bringing factory peace. 🏭✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Write live daemon heartbeat' directly summarizes the main change: implementing heartbeat functionality for the live daemon mode. |
| Description check | ✅ Passed | The description is directly related to the changeset, providing context on why the heartbeat is needed, implementation details, limitations, and verification steps. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ESLint

If the error stems from missing dependencies, add them to the package.json file. For unrecoverable errors (e.g., due to private dependencies), disable the tool in the CodeRabbit configuration.

ESLint install failed due to a network error.



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a live heartbeat mechanism for the FactoryLoop daemon to track process liveness, along with corresponding unit tests. The feedback points out a significant performance bottleneck where the periodic heartbeat write triggers redundant process lookups and API calls by rewriting the in-flight registry. It is recommended to write the heartbeat file directly to avoid these unnecessary operations.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +364 to +372
  async #writeLiveHeartbeat(status: FactoryLoopHeartbeat['status']): Promise<void> {
    await this.#writeLoopHeartbeat(
      this.#config.loop.heartbeatPath,
      this.#config.loop.registryPath,
      status,
      0,
      0,
    )
  }


high

Every time #writeLiveHeartbeat is called (which happens every 15 seconds by default via the live heartbeat refresh timer), it calls #writeLoopHeartbeat. #writeLoopHeartbeat in turn calls #writeInFlightRegistry, which performs process lookups (#processFinder) and potentially fleet API calls (resolveAgentPid) for every active agent.

This introduces a significant performance and efficiency bottleneck for the live daemon, causing redundant CPU usage and unnecessary API spam every 15 seconds. Since the in-flight registry is already written and kept in sync whenever agents are spawned, exited, or stopped, we can safely write the heartbeat file directly in #writeLiveHeartbeat without rewriting the registry.

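  // Assumes mkdir and writeFile (node:fs/promises) and dirname (node:path) are imported in factory.ts.
  async #writeLiveHeartbeat(status: FactoryLoopHeartbeat['status']): Promise<void> {
    const path = this.#config.loop.heartbeatPath
    const updatedAtMs = this.#clock.now()
    // Build the heartbeat payload directly instead of going through #writeLoopHeartbeat,
    // so the in-flight registry is not rewritten on every refresh.
    const heartbeat: FactoryLoopHeartbeat = {
      pid: process.pid,
      status,
      iteration: 0,
      maxIterations: 0,
      updatedAt: new Date(updatedAtMs).toISOString(),
      updatedAtMs,
      registryPath: this.#config.loop.registryPath,
    }
    // Ensure the parent directory exists, then write the heartbeat file only.
    await mkdir(dirname(path), { recursive: true })
    await writeFile(path, `${JSON.stringify(heartbeat, null, 2)}\n`, 'utf8')
  }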

@kjgbot kjgbot merged commit de24cb3 into main Jun 13, 2026
4 of 5 checks passed
@kjgbot kjgbot deleted the factory-sdk-live-heartbeat-sb-impl3 branch June 13, 2026 11:42
@agent-relay-code

Contributor

Implemented the PR fix.

Changed packages/factory-sdk/src/orchestrator/factory.ts:364 so live heartbeat refresh writes only the heartbeat file and no longer rewrites the in-flight registry on every timer tick.

Added regression coverage in packages/factory-sdk/src/orchestrator/factory.test.ts:939 proving a live heartbeat refresh updates heartbeat state without triggering additional process lookups or mutating the registry.
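
The property this regression covers can be illustrated with counters. The sketch below is not the test in factory.test.ts: `RefreshCostSketch` and its cost model (one process lookup per active agent per registry rewrite) are assumptions used only to contrast the two refresh paths.

```typescript
// Hypothetical sketch: counts process lookups for the two refresh strategies.
class RefreshCostSketch {
  lookups = 0
  heartbeatWrites = 0

  // Assumed cost model: rewriting the registry looks up every active agent.
  private writeRegistry(activeAgents: string[]): void {
    this.lookups += activeAgents.length
  }

  // Original behavior: each refresh also rewrites the in-flight registry.
  refreshWithRegistry(activeAgents: string[]): void {
    this.writeRegistry(activeAgents)
    this.heartbeatWrites += 1
  }

  // Fixed behavior: each refresh touches only the heartbeat file.
  refreshHeartbeatOnly(): void {
    this.heartbeatWrites += 1
  }
}

const agents = ['agent-a', 'agent-b', 'agent-c']
const before = new RefreshCostSketch()
const after = new RefreshCostSketch()

// Four timer ticks, i.e. one minute at the 15s default interval.
for (let tick = 0; tick < 4; tick += 1) {
  before.refreshWithRegistry(agents)
  after.refreshHeartbeatOnly()
}

console.log(before.lookups, after.lookups) // 12 0
console.log(before.heartbeatWrites, after.heartbeatWrites) // 4 4
```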

Addressed comments

  • gemini-code-assist[bot]: live heartbeat refresh was rewriting the registry and causing repeated process/PID lookups; fixed in packages/factory-sdk/src/orchestrator/factory.ts:364 and covered in packages/factory-sdk/src/orchestrator/factory.test.ts:939.
  • gemini-code-assist[bot] review summary: same registry rewrite/performance concern; fixed in packages/factory-sdk/src/orchestrator/factory.ts:364.
  • coderabbitai[bot]: review failed / walkthrough only; no actionable code finding. Its ESLint tool failure was not a repo defect; local npm run lint completed successfully.

Advisory Notes

None.

Local validation run:

  • npm ci
  • npx vitest run packages/factory-sdk/src/orchestrator/factory.test.ts
  • npm run verify:mcp-resources-drift
  • npm run lint
  • npm run typecheck:web
  • npm run typecheck:node
  • npm test
  • npx vitest run
  • npm run build
  • npm run build:web
  • npx playwright install chromium
  • npx playwright test --config playwright.fidelity.config.ts
  • npx playwright test --config playwright.redraw.config.ts

I did not run dist:mac or verify:mcp-spawn; that CI job requires the macOS packaged .app artifact and is not reproducible in this Linux workspace. I also cannot assert GitHub mergeability or live check status from the checkout, so I’m not marking this READY.
