fix(cortex): use idle time instead of lifetime for worker timeout#332
Conversation
The supervisor was killing active workers after 300s because the timeout was measured from spawn time, ignoring ongoing activity. Workers running tools and reporting status were still getting killed. Track last_activity_at on WorkerTracker, refreshed by WorkerStatus, ToolStarted, and ToolCompleted events. Timeout now fires only after 10 minutes of inactivity (up from 5 minutes total lifetime).
WalkthroughUpdated Cortex worker timeout default from 300s to 600s and added per-worker activity tracking. Idle detection and kill logic now use each worker's last_activity_at (idle) instead of started_at (lifetime); related logging, helper names, observers, and tests were updated accordingly. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| ProcessEvent::WorkerComplete { | ||
| worker_id, success, .. | ||
| } => state.track_worker_complete(*worker_id, *success, threshold), | ||
| ProcessEvent::WorkerStatus { worker_id, .. } => { |
There was a problem hiding this comment.
Nit: should ProcessEvent::StatusUpdate { process_id: ProcessId::Worker(..) } also reset the idle clock? SpacebotHook::send_status emits StatusUpdate, so those updates won’t count as activity here.
| ProcessEvent::WorkerStatus { worker_id, .. } => { | |
| ProcessEvent::StatusUpdate { | |
| process_id: ProcessId::Worker(worker_id), | |
| .. | |
| } => state.track_worker_activity(*worker_id), |
| fn worker_activity_resets_idle_clock() { | ||
| let mut state = HealthRuntimeState::default(); | ||
| let worker_id = uuid::Uuid::new_v4(); | ||
| state.track_worker_start(worker_id, Some(Arc::from("ch")), "builtin".to_string()); | ||
|
|
||
| let tracker_before = state.worker_trackers.get(&worker_id).unwrap().clone(); | ||
| // Simulate time passing by checking that activity updates the timestamp. | ||
| std::thread::sleep(std::time::Duration::from_millis(10)); | ||
| state.track_worker_activity(worker_id); | ||
|
|
||
| let tracker_after = state.worker_trackers.get(&worker_id).unwrap(); | ||
| assert!( | ||
| tracker_after.last_activity_at > tracker_before.last_activity_at, | ||
| "last_activity_at should advance after track_worker_activity" | ||
| ); | ||
| assert_eq!( | ||
| tracker_after.started_at, tracker_before.started_at, | ||
| "started_at should not change" | ||
| ); | ||
| } |
There was a problem hiding this comment.
Unit test: avoiding thread::sleep here should make this less timing-sensitive/flaky on slow CI. You can force an older last_activity_at and assert it advances.
| fn worker_activity_resets_idle_clock() { | |
| let mut state = HealthRuntimeState::default(); | |
| let worker_id = uuid::Uuid::new_v4(); | |
| state.track_worker_start(worker_id, Some(Arc::from("ch")), "builtin".to_string()); | |
| let tracker_before = state.worker_trackers.get(&worker_id).unwrap().clone(); | |
| // Simulate time passing by checking that activity updates the timestamp. | |
| std::thread::sleep(std::time::Duration::from_millis(10)); | |
| state.track_worker_activity(worker_id); | |
| let tracker_after = state.worker_trackers.get(&worker_id).unwrap(); | |
| assert!( | |
| tracker_after.last_activity_at > tracker_before.last_activity_at, | |
| "last_activity_at should advance after track_worker_activity" | |
| ); | |
| assert_eq!( | |
| tracker_after.started_at, tracker_before.started_at, | |
| "started_at should not change" | |
| ); | |
| } | |
| fn worker_activity_resets_idle_clock() { | |
| let mut state = HealthRuntimeState::default(); | |
| let worker_id = uuid::Uuid::new_v4(); | |
| state.track_worker_start(worker_id, Some(Arc::from("ch")), "builtin".to_string()); | |
| let started_at = state.worker_trackers.get(&worker_id).unwrap().started_at; | |
| let previous_activity_at = { | |
| let tracker = state.worker_trackers.get_mut(&worker_id).unwrap(); | |
| tracker.last_activity_at = tracker | |
| .last_activity_at | |
| .checked_sub(Duration::from_secs(1)) | |
| .unwrap(); | |
| tracker.last_activity_at | |
| }; | |
| state.track_worker_activity(worker_id); | |
| let tracker_after = state.worker_trackers.get(&worker_id).unwrap(); | |
| assert!( | |
| tracker_after.last_activity_at > previous_activity_at, | |
| "last_activity_at should advance after track_worker_activity" | |
| ); | |
| assert_eq!( | |
| tracker_after.started_at, started_at, | |
| "started_at should not change" | |
| ); | |
| } |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/agent/cortex.rs`:
- Around line 895-897: The main cortex loop's tokio::select! needs to be made
deterministic to avoid a race where the tick branch (run_health_tick) can run
before queued events are processed; add the biased; marker to the select! at the
select that reads from event_rx so event branches are prioritized over the tick
branch that filters by last_activity_at and worker_timeout, ensuring activity
events update last_activity_at before run_health_tick() performs idle filtering.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 5504af9b-3855-457d-bdaa-eb5fb750a1f5
📒 Files selected for processing (4)
docs/content/docs/(configuration)/config.mdxdocs/content/docs/(core)/cortex.mdxsrc/agent/cortex.rssrc/config/types.rs
Prevents a race where run_health_tick() could filter on last_activity_at before pending WorkerStatus/ToolStarted/ToolCompleted events update it.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
src/agent/cortex.rs (1)
831-850:⚠️ Potential issue | 🟠 MajorTrack worker
StatusUpdateevents as activity too.Line 831 and surrounding match arms still miss
ProcessEvent::StatusUpdateforProcessId::Worker(_), so a worker can emit status heartbeats without resettinglast_activity_atand be cancelled as idle.Suggested fix
ProcessEvent::WorkerStatus { worker_id, .. } => { state.track_worker_activity(*worker_id); } + ProcessEvent::StatusUpdate { + process_id: ProcessId::Worker(worker_id), + .. + } => { + state.track_worker_activity(*worker_id); + } ProcessEvent::ToolStarted { process_id: ProcessId::Worker(worker_id), .. } => { state.track_worker_activity(*worker_id);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/agent/cortex.rs` around lines 831 - 850, The match over ProcessEvent is missing handling for ProcessEvent::StatusUpdate for worker processes, so add a branch that matches ProcessEvent::StatusUpdate { process_id: ProcessId::Worker(worker_id), .. } (or use if let ProcessId::Worker(...) pattern) and call state.track_worker_activity(*worker_id) inside it; place this new arm alongside the existing ProcessEvent::WorkerStatus, ProcessEvent::ToolStarted and ProcessEvent::ToolCompleted arms so worker status heartbeats update last_activity_at via state.track_worker_activity.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@src/agent/cortex.rs`:
- Around line 831-850: The match over ProcessEvent is missing handling for
ProcessEvent::StatusUpdate for worker processes, so add a branch that matches
ProcessEvent::StatusUpdate { process_id: ProcessId::Worker(worker_id), .. } (or
use if let ProcessId::Worker(...) pattern) and call
state.track_worker_activity(*worker_id) inside it; place this new arm alongside
the existing ProcessEvent::WorkerStatus, ProcessEvent::ToolStarted and
ProcessEvent::ToolCompleted arms so worker status heartbeats update
last_activity_at via state.track_worker_activity.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e3a3073c-97fe-4194-a18a-150c579bfec0
📒 Files selected for processing (1)
src/agent/cortex.rs
The supervisor was still killing idle OpenCode workers after 10 minutes of inactivity because last_activity_at stops updating when a worker enters idle state (waiting for follow-up input). Add is_idle flag to WorkerTracker: - Set when WorkerIdle event is observed - Cleared by track_worker_activity (any WorkerStatus/ToolStarted/ ToolCompleted event means the worker resumed) - Overdue filter skips workers with is_idle=true This complements the last_activity_at fix from #332 — that fix prevents killing workers that are actively running tools, this fix prevents killing workers that are legitimately waiting for follow-up.
…r-idle-timeout fix(cortex): use idle time instead of lifetime for worker timeout
The supervisor was still killing idle OpenCode workers after 10 minutes of inactivity because last_activity_at stops updating when a worker enters idle state (waiting for follow-up input). Add is_idle flag to WorkerTracker: - Set when WorkerIdle event is observed - Cleared by track_worker_activity (any WorkerStatus/ToolStarted/ ToolCompleted event means the worker resumed) - Overdue filter skips workers with is_idle=true This complements the last_activity_at fix from spacedriveapp#332 — that fix prevents killing workers that are actively running tools, this fix prevents killing workers that are legitimately waiting for follow-up.
Summary
worker_timeout_secsbumped from 300 to 600 (10 minutes of inactivity).idle_secsandlifetime_secsfor better diagnostics.What changed
WorkerTrackergets alast_activity_atfield, reset onWorkerStatus,ToolStarted, andToolCompletedevents from workers.last_activity_atinstead ofstarted_at."idle for Xs, exceeded Ys timeout (supervisor)"instead of"timed out after Xs (supervisor)".Race / terminal-state reasoning
track_worker_activityis a simple timestamp update inside the existinghealth_runtime_statewrite lock, same lock used by the timeout filter inrun_health_tick. No new lock ordering or contention introduced. TheToolStarted/ToolCompletedmatch arms only fire forProcessId::Worker, so channel/branch tool events are unaffected.Verification
cargo fmt --all -- --check— cleancargo clippy --all-targets— clean (no warnings)cargo test --lib -- cortex— 45/45 pass, including two new tests:worker_activity_resets_idle_clockworker_activity_noop_for_unknown_worker