fix(cortex): use idle time instead of lifetime for worker timeout #332

Merged
jamiepine merged 2 commits into main from fix/supervisor-idle-timeout on Mar 5, 2026
Conversation

@jamiepine

Member

Summary

  • Workers were being killed after 300s total lifetime regardless of activity. The supervisor timeout now tracks last activity (tool calls, status updates) instead of spawn time, so only truly idle workers get cancelled.
  • Default worker_timeout_secs bumped from 300 to 600 (10 minutes of inactivity).
  • Kill logs now include both idle_secs and lifetime_secs for better diagnostics.

What changed

  • WorkerTracker gets a last_activity_at field, reset on WorkerStatus, ToolStarted, and ToolCompleted events from workers.
  • Timeout check uses last_activity_at instead of started_at.
  • Kill reason now reads "idle for Xs, exceeded Ys timeout (supervisor)" instead of "timed out after Xs (supervisor)".
  • Default timeout changed from 300s to 600s in config and docs.
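
A minimal sketch of the idle-versus-lifetime distinction described above. The struct is a pared-down, hypothetical stand-in for `WorkerTracker`: only `started_at` and `last_activity_at` come from this PR, while `record_activity`, `idle_kill_reason`, `lifetime_secs`, and the strict `>` comparison are assumptions made for illustration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, pared-down tracker holding only the two timestamps
/// discussed in this PR.
#[derive(Debug, Clone)]
struct WorkerTracker {
    started_at: Instant,
    last_activity_at: Instant,
}

impl WorkerTracker {
    fn new(now: Instant) -> Self {
        Self { started_at: now, last_activity_at: now }
    }

    /// Refresh the idle clock (assumed to run on WorkerStatus,
    /// ToolStarted, and ToolCompleted events).
    fn record_activity(&mut self, now: Instant) {
        self.last_activity_at = now;
    }

    /// Total time since spawn, reported for diagnostics only.
    fn lifetime_secs(&self, now: Instant) -> u64 {
        now.saturating_duration_since(self.started_at).as_secs()
    }

    /// Returns a kill reason only when the worker has been idle for
    /// longer than `timeout`; lifetime plays no part in the decision.
    fn idle_kill_reason(&self, now: Instant, timeout: Duration) -> Option<String> {
        let idle = now.saturating_duration_since(self.last_activity_at);
        if idle <= timeout {
            return None;
        }
        Some(format!(
            "idle for {}s, exceeded {}s timeout (supervisor)",
            idle.as_secs(),
            timeout.as_secs()
        ))
    }
}
```

Under these assumptions, a worker that has lived for 900s but reported activity at the 500s mark is not killed by a 600s timeout, whereas a lifetime-based check would have cancelled it.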

Race / terminal-state reasoning

track_worker_activity is a simple timestamp update inside the existing health_runtime_state write lock, same lock used by the timeout filter in run_health_tick. No new lock ordering or contention introduced. The ToolStarted/ToolCompleted match arms only fire for ProcessId::Worker, so channel/branch tool events are unaffected.
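
The locking argument can be sketched as follows. This is an illustration under stated assumptions, not the code in `src/agent/cortex.rs`: it uses `std::sync::RwLock` rather than whatever lock type the project uses, a `u64` in place of the real worker id, and the hypothetical helpers `on_activity`, `health_tick`, and `overdue_workers`.

```rust
use std::collections::HashMap;
use std::sync::RwLock;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the shared health state.
#[derive(Default)]
struct HealthRuntimeState {
    // worker id -> last activity (u64 stands in for the real id type)
    last_activity_at: HashMap<u64, Instant>,
}

impl HealthRuntimeState {
    /// Plain timestamp update; a no-op when the worker is unknown.
    fn track_worker_activity(&mut self, worker_id: u64, now: Instant) {
        if let Some(ts) = self.last_activity_at.get_mut(&worker_id) {
            *ts = now;
        }
    }

    /// Workers whose idle time exceeds `timeout`, sorted for stable output.
    fn overdue_workers(&self, now: Instant, timeout: Duration) -> Vec<u64> {
        let mut ids: Vec<u64> = self
            .last_activity_at
            .iter()
            .filter(|(_, ts)| now.saturating_duration_since(**ts) > timeout)
            .map(|(id, _)| *id)
            .collect();
        ids.sort_unstable();
        ids
    }
}

/// Event path: takes the write lock only for the timestamp update.
fn on_activity(shared: &RwLock<HealthRuntimeState>, worker_id: u64, now: Instant) {
    shared
        .write()
        .expect("lock poisoned")
        .track_worker_activity(worker_id, now);
}

/// Tick path: takes the same lock, so it sees either the old or the new
/// timestamp, never a partial update.
fn health_tick(
    shared: &RwLock<HealthRuntimeState>,
    now: Instant,
    timeout: Duration,
) -> Vec<u64> {
    shared
        .write()
        .expect("lock poisoned")
        .overdue_workers(now, timeout)
}
```

Because both paths acquire one lock and hold nothing else while doing so, the sketch adds no lock-ordering constraint, which mirrors the reasoning in the paragraph above.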

Verification

  • cargo fmt --all -- --check — clean
  • cargo clippy --all-targets — clean (no warnings)
  • cargo test --lib -- cortex — 45/45 pass, including two new tests:
    • worker_activity_resets_idle_clock
    • worker_activity_noop_for_unknown_worker

The supervisor was killing active workers after 300s because the timeout
was measured from spawn time, ignoring ongoing activity. Workers running
tools and reporting status were still getting killed.

Track last_activity_at on WorkerTracker, refreshed by WorkerStatus,
ToolStarted, and ToolCompleted events. Timeout now fires only after
10 minutes of inactivity (up from 5 minutes total lifetime).
@coderabbitai

coderabbitai Bot commented Mar 5, 2026

Contributor

Walkthrough

Updated Cortex worker timeout default from 300s to 600s and added per-worker activity tracking. Idle detection and kill logic now use each worker's last_activity_at (idle) instead of started_at (lifetime); related logging, helper names, observers, and tests were updated accordingly.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Configuration & Documentation: docs/content/docs/(configuration)/config.mdx, docs/content/docs/(core)/cortex.mdx | Bumped worker_timeout_secs default from 300 → 600 and reworded descriptions to emphasize "idle timeout" / "no activity". |
| Worker activity & timeout logic: src/agent/cortex.rs | Added last_activity_at to WorkerTracker; new track_worker_activity usage in health observers (WorkerStatus, ToolStarted, ToolCompleted); swapped worker idle checks to use last_activity_at; renamed helper to kill_target_last_activity; enhanced kill logging with idle_secs and lifetime_secs; updated tests and event loop handling. |
| Configuration schema: src/config/types.rs | Updated CortexConfig default worker_timeout_secs from 300 → 600. |
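
For orientation, the default change might look roughly like the sketch below. Only the field name `worker_timeout_secs` and the value 600 come from this PR; the real `CortexConfig` has other fields that are omitted here, and the `worker_timeout` helper is hypothetical.

```rust
use std::time::Duration;

/// Hypothetical, single-field view of the config type.
struct CortexConfig {
    /// Seconds without worker activity before the supervisor cancels it.
    worker_timeout_secs: u64,
}

impl Default for CortexConfig {
    fn default() -> Self {
        // Was 300 before this PR.
        Self { worker_timeout_secs: 600 }
    }
}

impl CortexConfig {
    /// Hypothetical convenience accessor returning the timeout as a Duration.
    fn worker_timeout(&self) -> Duration {
        Duration::from_secs(self.worker_timeout_secs)
    }
}
```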

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 42.86%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: switching from lifetime-based to idle-time-based worker timeout tracking. |
| Description check | ✅ Passed | The description is well-detailed and directly related to the changeset, explaining the motivation, implementation details, and verification steps. |


Comment thread src/agent/cortex.rs
    ProcessEvent::WorkerComplete {
        worker_id, success, ..
    } => state.track_worker_complete(*worker_id, *success, threshold),
    ProcessEvent::WorkerStatus { worker_id, .. } => {

Contributor

Nit: should ProcessEvent::StatusUpdate { process_id: ProcessId::Worker(..) } also reset the idle clock? SpacebotHook::send_status emits StatusUpdate, so those updates won’t count as activity here.

Suggested change
-    ProcessEvent::WorkerStatus { worker_id, .. } => {
+    ProcessEvent::StatusUpdate {
+        process_id: ProcessId::Worker(worker_id),
+        ..
+    } => state.track_worker_activity(*worker_id),
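
To make the reviewer's point concrete, here is a self-contained sketch of which events would reset the idle clock if `StatusUpdate` were treated as activity. The enums are simplified, hypothetical versions of `ProcessEvent` and `ProcessId` (the real variants carry more fields, `ProcessId::Channel(u64)` is invented for the example, and worker ids are `u64` here), and `activity_source` is an illustrative helper, not a function from the codebase.

```rust
/// Simplified, hypothetical process id.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProcessId {
    Worker(u64),
    Channel(u64),
}

/// Simplified, hypothetical event type with only the fields needed here.
enum ProcessEvent {
    WorkerStatus { worker_id: u64 },
    StatusUpdate { process_id: ProcessId },
    ToolStarted { process_id: ProcessId },
    ToolCompleted { process_id: ProcessId },
}

/// Returns the worker whose idle clock should be reset, if any.
/// Events from non-worker processes never count as worker activity.
fn activity_source(event: &ProcessEvent) -> Option<u64> {
    match event {
        ProcessEvent::WorkerStatus { worker_id } => Some(*worker_id),
        ProcessEvent::StatusUpdate { process_id: ProcessId::Worker(id) }
        | ProcessEvent::ToolStarted { process_id: ProcessId::Worker(id) }
        | ProcessEvent::ToolCompleted { process_id: ProcessId::Worker(id) } => Some(*id),
        _ => None,
    }
}
```

Without the `StatusUpdate` pattern, a worker that only emits status updates through that event would return `None` here and would eventually be treated as idle.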

Comment thread src/agent/cortex.rs
Comment on lines +4045 to +4064
fn worker_activity_resets_idle_clock() {
    let mut state = HealthRuntimeState::default();
    let worker_id = uuid::Uuid::new_v4();
    state.track_worker_start(worker_id, Some(Arc::from("ch")), "builtin".to_string());

    let tracker_before = state.worker_trackers.get(&worker_id).unwrap().clone();
    // Simulate time passing by checking that activity updates the timestamp.
    std::thread::sleep(std::time::Duration::from_millis(10));
    state.track_worker_activity(worker_id);

    let tracker_after = state.worker_trackers.get(&worker_id).unwrap();
    assert!(
        tracker_after.last_activity_at > tracker_before.last_activity_at,
        "last_activity_at should advance after track_worker_activity"
    );
    assert_eq!(
        tracker_after.started_at, tracker_before.started_at,
        "started_at should not change"
    );
}

Contributor

Unit test: avoiding thread::sleep here should make this less timing-sensitive/flaky on slow CI. You can force an older last_activity_at and assert it advances.

Suggested change
fn worker_activity_resets_idle_clock() {
    let mut state = HealthRuntimeState::default();
    let worker_id = uuid::Uuid::new_v4();
    state.track_worker_start(worker_id, Some(Arc::from("ch")), "builtin".to_string());
    let started_at = state.worker_trackers.get(&worker_id).unwrap().started_at;
    let previous_activity_at = {
        let tracker = state.worker_trackers.get_mut(&worker_id).unwrap();
        tracker.last_activity_at = tracker
            .last_activity_at
            .checked_sub(Duration::from_secs(1))
            .unwrap();
        tracker.last_activity_at
    };
    state.track_worker_activity(worker_id);
    let tracker_after = state.worker_trackers.get(&worker_id).unwrap();
    assert!(
        tracker_after.last_activity_at > previous_activity_at,
        "last_activity_at should advance after track_worker_activity"
    );
    assert_eq!(
        tracker_after.started_at, started_at,
        "started_at should not change"
    );
}

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/agent/cortex.rs`:
- Around line 895-897: The main cortex loop's tokio::select! needs to be made
deterministic to avoid a race where the tick branch (run_health_tick) can run
before queued events are processed; add the biased; marker to the select! at the
select that reads from event_rx so event branches are prioritized over the tick
branch that filters by last_activity_at and worker_timeout, ensuring activity
events update last_activity_at before run_health_tick() performs idle filtering.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5504af9b-3855-457d-bdaa-eb5fb750a1f5

📥 Commits

Reviewing files that changed from the base of the PR and between c949e0f and e24089b.

📒 Files selected for processing (4)
  • docs/content/docs/(configuration)/config.mdx
  • docs/content/docs/(core)/cortex.mdx
  • src/agent/cortex.rs
  • src/config/types.rs

Comment thread src/agent/cortex.rs
Prevents a race where run_health_tick() could filter on last_activity_at
before pending WorkerStatus/ToolStarted/ToolCompleted events update it.
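
The ordering concern can be illustrated without an async runtime. The sketch below is a synchronous analogy using `std::sync::mpsc`, not the `tokio::select!` change itself: it drains every queued activity event before running the idle filter, which is the effect that prioritising the event branch over the tick branch is meant to achieve. The `Activity` type and `tick_after_draining` function are hypothetical.

```rust
use std::collections::HashMap;
use std::sync::mpsc::Receiver;
use std::time::{Duration, Instant};

/// Hypothetical event: worker `worker_id` was active at `at`.
struct Activity {
    worker_id: u64,
    at: Instant,
}

/// Applies every already-queued event, then returns the overdue workers.
/// Running the filter first could flag a worker whose activity event was
/// sitting unread in the queue.
fn tick_after_draining(
    events: &Receiver<Activity>,
    last_activity_at: &mut HashMap<u64, Instant>,
    now: Instant,
    timeout: Duration,
) -> Vec<u64> {
    // try_recv returns Err when the queue is empty or disconnected;
    // either way there is nothing more to apply.
    while let Ok(event) = events.try_recv() {
        last_activity_at.insert(event.worker_id, event.at);
    }

    let mut overdue: Vec<u64> = last_activity_at
        .iter()
        .filter(|(_, ts)| now.saturating_duration_since(**ts) > timeout)
        .map(|(id, _)| *id)
        .collect();
    overdue.sort_unstable();
    overdue
}
```

In the analogy, a worker that last reported at 590s is not flagged by a tick at 601s with a 600s timeout, because its queued event is applied before the filter runs.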

@coderabbitai coderabbitai Bot left a comment

Contributor

♻️ Duplicate comments (1)
src/agent/cortex.rs (1)

831-850: ⚠️ Potential issue | 🟠 Major

Track worker StatusUpdate events as activity too.

Line 831 and surrounding match arms still miss ProcessEvent::StatusUpdate for ProcessId::Worker(_), so a worker can emit status heartbeats without resetting last_activity_at and be cancelled as idle.

Suggested fix
             ProcessEvent::WorkerStatus { worker_id, .. } => {
                 state.track_worker_activity(*worker_id);
             }
+            ProcessEvent::StatusUpdate {
+                process_id: ProcessId::Worker(worker_id),
+                ..
+            } => {
+                state.track_worker_activity(*worker_id);
+            }
             ProcessEvent::ToolStarted {
                 process_id: ProcessId::Worker(worker_id),
                 ..
             } => {
                 state.track_worker_activity(*worker_id);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/agent/cortex.rs` around lines 831 - 850, The match over ProcessEvent is
missing handling for ProcessEvent::StatusUpdate for worker processes, so add a
branch that matches ProcessEvent::StatusUpdate { process_id:
ProcessId::Worker(worker_id), .. } (or use if let ProcessId::Worker(...)
pattern) and call state.track_worker_activity(*worker_id) inside it; place this
new arm alongside the existing ProcessEvent::WorkerStatus,
ProcessEvent::ToolStarted and ProcessEvent::ToolCompleted arms so worker status
heartbeats update last_activity_at via state.track_worker_activity.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: e3a3073c-97fe-4194-a18a-150c579bfec0

📥 Commits

Reviewing files that changed from the base of the PR and between e24089b and c119f8b.

📒 Files selected for processing (1)
  • src/agent/cortex.rs

@jamiepine jamiepine merged commit 63bfd10 into main Mar 5, 2026
4 checks passed
jamiepine added a commit that referenced this pull request Mar 5, 2026
The supervisor was still killing idle OpenCode workers after 10 minutes
of inactivity because last_activity_at stops updating when a worker
enters idle state (waiting for follow-up input).

Add is_idle flag to WorkerTracker:
- Set when WorkerIdle event is observed
- Cleared by track_worker_activity (any WorkerStatus/ToolStarted/
  ToolCompleted event means the worker resumed)
- Overdue filter skips workers with is_idle=true

This complements the last_activity_at fix from #332 — that fix
prevents killing workers that are actively running tools, this fix
prevents killing workers that are legitimately waiting for follow-up.
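
A rough sketch of how the `is_idle` flag described in this follow-up commit could interact with the idle clock. Only `is_idle`, `last_activity_at`, and the three bullet-point behaviours come from the commit message; the method names `mark_idle`, `track_activity`, and `is_overdue` are assumptions for illustration.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, pared-down tracker with the follow-up's idle flag.
struct WorkerTracker {
    last_activity_at: Instant,
    is_idle: bool,
}

impl WorkerTracker {
    fn new(now: Instant) -> Self {
        Self { last_activity_at: now, is_idle: false }
    }

    /// Set when a WorkerIdle event is observed (waiting for follow-up input).
    fn mark_idle(&mut self) {
        self.is_idle = true;
    }

    /// Any activity event means the worker resumed: refresh the clock
    /// and clear the idle flag.
    fn track_activity(&mut self, now: Instant) {
        self.last_activity_at = now;
        self.is_idle = false;
    }

    /// The overdue filter skips workers that are legitimately idle.
    fn is_overdue(&self, now: Instant, timeout: Duration) -> bool {
        !self.is_idle && now.saturating_duration_since(self.last_activity_at) > timeout
    }
}
```

With this shape, a worker waiting for follow-up input is never overdue, and the idle timeout applies again as soon as it reports activity.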
rktmeister pushed a commit to rktmeister/spacebot that referenced this pull request Mar 11, 2026
…r-idle-timeout

fix(cortex): use idle time instead of lifetime for worker timeout
rktmeister pushed a commit to rktmeister/spacebot that referenced this pull request Mar 11, 2026
The supervisor was still killing idle OpenCode workers after 10 minutes
of inactivity because last_activity_at stops updating when a worker
enters idle state (waiting for follow-up input).

Add is_idle flag to WorkerTracker:
- Set when WorkerIdle event is observed
- Cleared by track_worker_activity (any WorkerStatus/ToolStarted/
  ToolCompleted event means the worker resumed)
- Overdue filter skips workers with is_idle=true

This complements the last_activity_at fix from spacedriveapp#332 — that fix
prevents killing workers that are actively running tools, this fix
prevents killing workers that are legitimately waiting for follow-up.