Requeue KubernetesExecutor tasks whose pod failed before execution started#69058
Requeue KubernetesExecutor tasks whose pod failed before execution started#69058seanmuth wants to merge 6 commits into
Conversation
|
Working on a test confirmation in Astro Hosted now |
| if state == TaskInstanceState.FAILED and self._is_pre_execution_failure( | ||
| state, | ||
| self._get_task_instance_state(key, session=session), | ||
| failure_details, | ||
| self.pod_launch_failure_excluded_container_reasons, | ||
| ): |
There was a problem hiding this comment.
| if state == TaskInstanceState.FAILED and self._is_pre_execution_failure( | |
| state, | |
| self._get_task_instance_state(key, session=session), | |
| failure_details, | |
| self.pod_launch_failure_excluded_container_reasons, | |
| ): | |
| if ( | |
| state == TaskInstanceState.FAILED | |
| and self.pod_launch_failure_max_retries != 0 | |
| and self._is_pre_execution_failure( | |
| state, | |
| self._get_task_instance_state(key, session=session), | |
| failure_details, | |
| self.pod_launch_failure_excluded_container_reasons, | |
| ) | |
| ): |
This saves some db queries when no retries are allowed. Not a big deal, I think? (It shouldn’t be too common to explicitly disallow retries.)
There was a problem hiding this comment.
Applied in 2352e1f453 — the pod_launch_failure_max_retries != 0 guard now short-circuits before the task-instance state lookup, so the DB query is skipped entirely when requeues are disabled.
Drafted-by: Claude Code (Opus 4.8); reviewed by @seanmuth before posting
| if state == ADOPTED: | ||
| # When the task pod is adopted by another executor, | ||
| # then remove the task from the current executor running queue. | ||
| self.last_known_jobs.pop(key, None) | ||
| try: | ||
| self.running.remove(key) | ||
| except KeyError: | ||
| self.log.debug("TI key not in running: %s", key) | ||
| return | ||
|
|
||
| if state == TaskInstanceState.RUNNING: | ||
| # The task process started, so any later failure is an execution failure that should | ||
| # not be requeued by the pre-execution path below. | ||
| self.last_known_jobs.pop(key, None) |
There was a problem hiding this comment.
These should probably also clean pod_launch_failure_max_retries?
I wonder if we should wrap the containers into an object so they are always handled together.
There was a problem hiding this comment.
Done in 2352e1f453 — the job spec, requeue count, and last-requeued pod are now wrapped in a single _PodLaunchAttempt per key (pod_launch_attempts), popped in one place across all exit paths. That also fixes the count leaking: the previous separate pod_launch_failure_attempts counter was never cleared on the ADOPTED path.
Drafted-by: Claude Code (Opus 4.8); reviewed by @seanmuth before posting
|
We should add a release note for this. Especially since this changes the default behavior—all pods are now retried once by default instead of directly fail. We should probably also call out somewhere that if you set |
arkadiuszbach
left a comment
There was a problem hiding this comment.
Adding my findings, as I was working on task failures in KubernetesExecutor due to SIGTERM handling and this PR might address the failures before the Supervisor gets to register its own SIGTERM handler, see #69034
| self.log.debug("TI key not in running: %s", key) | ||
| return | ||
|
|
||
| if state == TaskInstanceState.RUNNING: |
There was a problem hiding this comment.
Seems like this is currently a dead code, TaskInstanceState.RUNNING is never set in process_status
From what i can understand it will indicate POD running status, but POD may get Failed after being RUNNING and the task still being in QUEUED state, for example due to SIGTERM
There was a problem hiding this comment.
Right — removed the cleanup I'd added in the RUNNING branch, since the watcher never emits RUNNING. On the SIGTERM case you raised: a pod that reaches Running then Fails before the runner writes running stays QUEUED, so the pre-execution path requeues it — which is the intended outcome (the task never committed to running), and complements #69034.
Drafted-by: Claude Code (Opus 4.8); reviewed by @seanmuth before posting
| ) | ||
| # Leave the key in self.running and do not write to event_buffer: the scheduler | ||
| # never observes this failure, so no task-level retry is consumed. | ||
| self.task_queue.put(job) |
There was a problem hiding this comment.
Kubernetes may generate multiple events with pod.status.phase=Failed in a short period of time for a single pod. I was checking recently what events it would generate due to SIGTERM and noticed two MODIFIED followed by DELETED, all pod.status.phase=Failed, see below:
2026-06-29T15:26:17.051533Z [debug ] Event: test-kubernetes-0-l32duhyu had an event of type MODIFIED [airflow.providers.cncf.kubernetes.executors.kubernetes_executor_utils.KubernetesJobWatcher] loc=kubernetes_executor_utils.py:159
2026-06-29T15:26:17.051986Z [debug ] Creating task key for annotations {'dag_id': 'test', 'task_id': 'kubernetes_0', 'logical_date': None, 'run_id': 'manual__2026-06-25T14:26:59.504591+00:00', 'try_number': '3'} [airflow.providers.cncf.kubernetes.kubernetes_helper_functions] loc=kubernetes_helper_functions.py:159
2026-06-29T15:26:17.052418Z [warning ] Event: test-kubernetes-0-l32duhyu Failed, task: test.kubernetes_0.3, annotations: <omitted> [airflow.providers.cncf.kubernetes.executors.kubernetes_executor_utils.KubernetesJobWatcher] loc=kubernetes_executor_utils.py:308
2026-06-29T15:26:17.340492Z [debug ] Event: test-kubernetes-0-l32duhyu had an event of type MODIFIED [airflow.providers.cncf.kubernetes.executors.kubernetes_executor_utils.KubernetesJobWatcher] loc=kubernetes_executor_utils.py:159
2026-06-29T15:26:17.340786Z [debug ] Creating task key for annotations {'dag_id': 'test', 'task_id': 'kubernetes_0', 'logical_date': None, 'run_id': 'manual__2026-06-25T14:26:59.504591+00:00', 'try_number': '3'} [airflow.providers.cncf.kubernetes.kubernetes_helper_functions] loc=kubernetes_helper_functions.py:159
2026-06-29T15:26:17.340985Z [warning ] Event: test-kubernetes-0-l32duhyu Failed, task: test.kubernetes_0.3, annotations: <omitted> [airflow.providers.cncf.kubernetes.executors.kubernetes_executor_utils.KubernetesJobWatcher] loc=kubernetes_executor_utils.py:308
2026-06-29T15:26:17.372148Z [debug ] Running SchedulerJobRunner._create_dagruns_for_dags with retries. Try 1 of 50 [airflow.jobs.scheduler_job_runner.SchedulerJobRunner] loc=retries.py:100
2026-06-29T15:26:17.381234Z [debug ] Event: test-kubernetes-0-l32duhyu had an event of type DELETED [airflow.providers.cncf.kubernetes.executors.kubernetes_executor_utils.KubernetesJobWatcher] loc=kubernetes_executor_utils.py:159
2026-06-29T15:26:17.381554Z [debug ] Creating task key for annotations {'dag_id': 'test', 'task_id': 'kubernetes_0', 'logical_date': None, 'run_id': 'manual__2026-06-25T14:26:59.504591+00:00', 'try_number': '3'} [airflow.providers.cncf.kubernetes.kubernetes_helper_functions] loc=kubernetes_helper_functions.py:159
2026-06-29T15:26:17.381744Z [warning ] Event: test-kubernetes-0-l32duhyu Failed, task: test.kubernetes_0.3, annotations: <omitted> [airflow.providers.cncf.kubernetes.executors.kubernetes_executor_utils.KubernetesJobWatcher] loc=kubernetes_executor_utils.py:308
so maybe self.last_known_jobs.pop(key, None) should be placed right after self.task_queue.put(job)?
There was a problem hiding this comment.
Good catch — confirmed (MODIFIED×2 + DELETED on SIGTERM). I went a slightly different way than popping the job right after put: that would cap requeues at 1, because the requeue path re-queues the stored job directly rather than going back through execute_async to re-stash it. Instead the executor records the pod each requeue was issued for (_PodLaunchAttempt.requeued_for_pod) and ignores duplicate Failed events for that same pod, while a distinct (requeued) pod still requeues. Covered by test_change_state_pre_execution_failure_dedupes_repeated_events. Fixed in 2352e1f453.
Drafted-by: Claude Code (Opus 4.8); reviewed by @seanmuth before posting
|
Pushed
Full Drafted-by: Claude Code (Opus 4.8); reviewed by @seanmuth before posting |
…arted When a worker pod is destroyed before the task process starts (node drain, autoscaler scale-down, node boot race, transient image pull failure), the task instance is still queued and no task code has run. Reporting this to the scheduler as a normal failure consumes a user-configured retry and raises a misleading failure alert for work that never executed. The executor already has the signal to tell this apart from an execution failure, so it now transparently requeues the pod without consuming a task retry, bounded by pod_launch_failure_retries and excluding container reasons in pod_launch_failure_excluded_container_reasons (default Error).
The tests replaced executor.task_queue with a MagicMock, so the executor.end() teardown looped forever on get_nowait() instead of raising Empty, hitting the 60s CI timeout. Assert the requeue via observable executor state instead of mocking the queue, and pass a valid executor_config to the stash test so execute_async does not bail before recording the job.
Kubernetes can emit several Failed events for a single worker pod (for example two MODIFIED then DELETED on SIGTERM); each one previously triggered another requeue, creating duplicate pods and over-counting attempts. Track the job spec, requeue count, and the pod a requeue was last issued for together in one per-key object so duplicate events for the same pod are ignored and the requeue count no longer leaks when a pod is adopted. Skip the task-instance state lookup when requeues are disabled, and document the default-behavior change and the accumulation risk of setting the retry count to -1.
…oss scheduler restarts Clarify that the in-memory requeue state is intentionally not persisted and that the dead-scheduler adoption path is a safe no-op for it, so a future reader doesn't mistake the lack of recovery for a bug.
7bfd8f3 to
6bde46e
Compare
|
Rebased onto current What #68674 changed under this PR: it fixed the never-drained adoption set by converting How I reconciled it:
Why it composes correctly (not just a marker resolution):
Verified locally: Rebase performed by Claude (Opus 4.8), reviewed by me. |
…equeues' The pre-execution-failure check ran a metadata-db query for every failed pod, breaking the test_pod_failure_logging_* K8s system tests (no task_instance table) and needlessly querying for adopted/finalized pods. Gate the lookup on an existing pod_launch_attempts entry so only pods this executor launched and still tracks are considered. Also add the plural 'requeues' to the docs spelling wordlist (singular forms were already present).
|
Pushed 1. K8s system tests — Fix — and a genuine improvement: reorder the guard so the cheap in-memory # Only pods this executor launched and is still tracking can be requeued; checking the
# in-memory attempt first avoids a metadata-db lookup for adopted or already-finalized pods.
attempt = self.pod_launch_attempts.get(key)
if (
attempt is not None
and state == TaskInstanceState.FAILED
and self.pod_launch_failure_max_retries != 0
and self._is_pre_execution_failure(...)
):
...2. Docs spellcheck — Verified locally: Changes by Claude (Opus 4.8), reviewed by me. |
self.task_queue is typed Queue | None; guard the requeue put() with a TYPE_CHECKING assert (matching the other call sites in this file) so mypy narrows it without tripping ruff's S101 on a runtime assert.
1a6b59c to
2af7e8a
Compare
When a worker pod is destroyed before the task process starts — a node drain, autoscaler scale-down, node boot race, or transient image pull failure — the task instance is still in
queuedstate and no task code has run. Today the KubernetesExecutor reports this to the scheduler as a normalFAILED, which consumes a user-configured task retry and raises a misleading failure alert for work that never executed.This adds a transparent, executor-level requeue for that case. In
_change_state, a pod that reportsFAILEDwhile its task instance is stillQUEUEDis requeued onto the existingtask_queue(the same mechanismtask_publish_max_retriesalready uses for pod creation failures) without writing to the event buffer, so the scheduler never observes the failure and no task-level retry is consumed.Behavior is bounded and configurable:
pod_launch_failure_retries(default1,-1unlimited,0disables) — how many times a task is transparently requeued before failing normally.pod_launch_failure_excluded_container_reasons(defaultError) — container reasons that opt out of the requeue path and consume a normal retry instead. The default excludesError, which covers a container that started executing but whose worker process exited before writingrunningto the DB (most likely an Airflow-specific startup error rather than a transient infrastructure event).The
ti_state == QUEUEDcheck is the authoritative signal: a task that was actually executing would already have transitioned torunning, so OOM-kills and other mid-execution failures are unaffected. Deferrable-operator resume pods are covered for free — when the triggerer fires, the TI returns toqueued, so a resume pod killed beforeexecute_completestarts is requeued rather than discarding already-completed external work.closes: #69052
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (Opus 4.8) following the guidelines