Fix orphaned subprocesses and supervisor crash on heartbeat 409 #65738
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -690,6 +690,16 @@ def start( | |
|
|
||
| pid = os.fork() | ||
| if pid == 0: | ||
| # Put the task-runner into its own session so its PGID == its own | ||
| # PID. The supervisor can then deliver signals to the whole tree | ||
| # via os.killpg() in kill(), reaching every subprocess the | ||
| # task-runner spawned (e.g. venv children from | ||
| # PythonVirtualenvOperator). Without this, a SIGTERM from kill() | ||
| # only hits the task-runner and any Popen children are reparented | ||
| # to PID 1 and leak as orphans. See issue #65505. | ||
| with suppress(OSError): | ||
| os.setsid() | ||
|
|
||
| # Close and delete of the parent end of the sockets. | ||
| cls._close_unused_sockets(read_requests, read_stdout, read_stderr, read_logs) | ||
|
|
||
|
|
@@ -1007,7 +1017,18 @@ def kill( | |
|
|
||
| for sig in escalation_path: | ||
| try: | ||
| self._process.send_signal(sig) | ||
| # Signal the whole process group so subprocesses the | ||
| # task-runner spawned (venv children, Docker exec, bash | ||
| # shells, etc.) are also reached. Requires the task-runner to | ||
| # have been placed in its own session via os.setsid() at fork | ||
| # time (see start()). See issue #65505. | ||
| try: | ||
| os.killpg(os.getpgid(self._process.pid), sig) | ||
|
Member
In both cases Either set the group race-free from the parent too (
    pgid = os.getpgid(self._process.pid)
    if pgid == os.getpgid(0):
        self._process.send_signal(sig)
    else:
        os.killpg(pgid, sig)
Separately: this group-signal only runs on the
Member
Thinking about this more:
    if not IS_WINDOWS and process_group_id == os.getpgid(0):
        raise RuntimeError("I refuse to kill myself")
It also covers what this loop doesn't: SIGTERM -> wait -> SIGKILL escalation via It lives in airflow-core, and the supervisor keeps its |
||
| except (ProcessLookupError, PermissionError): | ||
| # Group vanished or we lack permission (e.g. task already | ||
| # reaped, or the child never reached setsid). Fall back | ||
| # to signalling the task-runner alone. | ||
| self._process.send_signal(sig) | ||
|
|
||
| start = time.monotonic() | ||
| end = start + escalation_delay | ||
|
|
||
This `setsid()` (and the `killpg` in `kill()`) lands in the base `WatchedSubprocess`, and `kill()` is defined only here with no subclass override, so it applies to every subprocess type (`DagFileProcessorProcess`, `TriggerRunnerSupervisor`, `CallbackSubprocess`), not just the `ActivitySubprocess` task-runner this PR describes.

The triggerer's long-lived async subprocess installs its own SIGINT/SIGTERM handlers and expects a graceful shutdown; making it a session leader and group-signalling it changes that. Detaching these children into their own session also means a terminal Ctrl-C (foreground-group SIGINT) no longer reaches them directly. The neighbouring `use_exec` handles exactly this by being opt-in per subclass (its docstring even notes the DAG processor and triggerer as a follow-up).

Suggest the same here: a `new_session: bool = False` param on `start()` set `True` only in `ActivitySubprocess`, with the `killpg` branch in `kill()` gated on it. That keeps the change scoped to the task-runner, and confines the self-signal risk flagged on the `killpg` line to the one path you've actually tested.
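To make the opt-in suggestion concrete, here is a rough sketch of the shape it could take. It is not the PR's code: `ToySupervisedProcess` is a hypothetical stand-in for `WatchedSubprocess`, and it uses `subprocess.Popen(start_new_session=...)` in place of the real `os.fork()` path purely so the example is self-contained. It assumes a POSIX system.

```python
import os
import signal
import subprocess
import sys


class ToySupervisedProcess:
    """Hypothetical stand-in for a supervised child; not Airflow's WatchedSubprocess."""

    def __init__(self, proc: subprocess.Popen, new_session: bool) -> None:
        self._process = proc
        self._new_session = new_session

    @classmethod
    def start(cls, argv, *, new_session: bool = False) -> "ToySupervisedProcess":
        # start_new_session=True makes the child call setsid() before exec,
        # so its PGID equals its own PID. With the default (False) the child
        # stays in the supervisor's process group, as before.
        proc = subprocess.Popen(argv, start_new_session=new_session)
        return cls(proc, new_session)

    def kill(self, sig: int = signal.SIGTERM) -> None:
        if self._new_session:
            try:
                # Group-signal only the children we deliberately detached.
                os.killpg(os.getpgid(self._process.pid), sig)
                return
            except (ProcessLookupError, PermissionError):
                pass  # Group already gone; fall back to the single process.
        self._process.send_signal(sig)


if __name__ == "__main__":
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    child = ToySupervisedProcess.start(sleeper, new_session=True)
    print("own group:", os.getpgid(child._process.pid) == child._process.pid)
    child.kill()
    print("exit code:", child._process.wait(timeout=10))
```

With the flag defaulting to `False`, subclasses that never opt in keep today's single-process signalling, so the triggerer and DAG processor would be unaffected under this sketch.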

While you're here: this `setsid()` satisfies the `# TODO: Make this process a session leader` at line 420 (top of `_fork_main`). That TODO is now stale and can be dropped.

On the scope point: the existing `set_new_process_group()` (`airflow.utils.process_utils`) does this with `os.setpgid(0, 0)` rather than `os.setsid()`. That still gives a killable process group (so `killpg` reaches the whole tree) but without creating a new session / detaching the controlling terminal, so it avoids the Ctrl-C / foreground-group change noted above. Its companion `reap_process_group()` already has the SIGTERM -> SIGKILL escalation and EPERM/ESRCH handling.

Same question as on the kill() thread: should we port/copy these over into task-sdk and use them here instead of a fresh `setsid`/`killpg` path? Reusing the existing `setpgid` group + self-group guard would address both the over-broad scope and the self-signal risk in one go.
(just thinking out loud -- we can surely do it in separate PR too but wanted to raise so I don't forget either :) )
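
For reference, a minimal sketch of the `setpgid` route discussed above, with the self-group guard and SIGTERM -> SIGKILL escalation folded in. This is not the existing `set_new_process_group()` / `reap_process_group()` code; `spawn_in_own_group` and `terminate_group` are hypothetical names written for illustration, and the sketch assumes a POSIX system with `sh` on the PATH. It only waits on the group leader, not on every descendant.

```python
import os
import signal
import subprocess
import sys


def spawn_in_own_group(argv):
    # os.setpgid(0, 0) runs in the child before exec: new process group,
    # same session, so the controlling terminal is not detached.
    # (preexec_fn is not safe in multi-threaded parents; on Python 3.11+
    # Popen(process_group=0) is an alternative.)
    return subprocess.Popen(argv, preexec_fn=lambda: os.setpgid(0, 0))


def terminate_group(proc, grace=5.0):
    """Hypothetical helper: SIGTERM the child's group, then SIGKILL after `grace` seconds."""
    pgid = os.getpgid(proc.pid)
    if pgid == os.getpgid(0):
        # Same guard idea as in the review: never signal our own group.
        raise RuntimeError("refusing to signal the caller's own process group")
    for sig in (signal.SIGTERM, signal.SIGKILL):
        try:
            os.killpg(pgid, sig)
        except ProcessLookupError:
            break  # Every member of the group has already exited.
        try:
            proc.wait(timeout=grace)
            break  # Leader exited within the grace period.
        except subprocess.TimeoutExpired:
            continue  # Escalate to the next signal.
    return proc.wait()


if __name__ == "__main__":
    # The shell becomes the group leader; the background sleep inherits the group.
    proc = spawn_in_own_group(["sh", "-c", "sleep 30 & wait"])
    print("own group:", os.getpgid(proc.pid) == proc.pid)
    print("same session:", os.getsid(proc.pid) == os.getsid(0))
    print("exit code:", terminate_group(proc))
```

The guard raises rather than silently falling back, which surfaces the case where the child never reached `setpgid`; whether to raise or fall back to `send_signal` is a design choice left open here.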