Detect and replace dead worker processes #67

Merged
pattonw merged 1 commit into funkelab:master from aschampion:pr/detect-and-replace-dead-worker-processes on Mar 17, 2026

@aschampion (Contributor)

Discovered while diagnosing #66 and investigating an OOMing worker (hence noticing #65).

WorkerPool.reap_dead_workers checks process.is_alive() and removes exited
workers. TaskWorkerPools.check_worker_health calls it each event loop
iteration and spawns replacements. Without this, workers that die silently
(e.g., SIGKILL, SystemExit, OOM) leave the server hanging forever since
no error is queued and no blocks are returned. Includes a previously
failing test.
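
The reap-and-replace cycle described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SketchWorkerPool`, its `make_process` parameter, and the return values are hypothetical; only the method names `reap_dead_workers` and `check_worker_health` and the use of `process.is_alive()` come from the description. It assumes workers are `multiprocessing.Process` objects (or anything exposing `start()`, `is_alive()`, and `join()`).

```python
import multiprocessing


class SketchWorkerPool:
    """Hypothetical stand-in for a worker pool; not daisy's actual WorkerPool."""

    def __init__(self, spawn_function, make_process=multiprocessing.Process):
        # make_process is an assumption added here so the sketch can be
        # exercised without starting real OS processes.
        self.spawn_function = spawn_function
        self.make_process = make_process
        self.workers = []

    def start_worker(self):
        process = self.make_process(target=self.spawn_function)
        process.start()
        self.workers.append(process)

    def reap_dead_workers(self):
        """Drop workers whose process has exited; return how many were dropped."""
        alive, dead = [], []
        for process in self.workers:
            # is_alive() is False after any exit, including SIGKILL or an
            # OOM kill, where the worker never gets to report an error.
            (alive if process.is_alive() else dead).append(process)
        for process in dead:
            process.join()  # returns immediately for an exited process
        self.workers = alive
        return len(dead)

    def check_worker_health(self):
        """Reap dead workers and start one replacement for each."""
        num_dead = self.reap_dead_workers()
        for _ in range(num_dead):
            self.start_worker()
        return num_dead
```

In this sketch the caller would invoke `check_worker_health()` once per event loop iteration, so a worker that died silently is replaced on the next pass instead of leaving the server waiting for a block that never returns.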

LLM disclosure: I diagnosed and debugged the root issue myself; LLMs were used to exclude other causes and draft fixes. All code here has been human-reviewed or human-written.

@pattonw (Collaborator) left a comment

Good solution to a class of errors that daisy did not handle.

I think this will further exacerbate issue #64 since we don't have any checks on increment_workers to limit the number of workers we start. But that is a separate issue that should have its own pull request.

@pattonw merged commit 1b46a77 into funkelab:master on Mar 17, 2026
2 checks passed