Skip to content

Fix scheduler/triggerer deadlock on deferrable task instances#65920

Closed
shaealh wants to merge 6 commits into
apache:mainfrom
shaealh:shaealh/65818
Closed

Fix scheduler/triggerer deadlock on deferrable task instances#65920
shaealh wants to merge 6 commits into
apache:mainfrom
shaealh:shaealh/65818

Conversation

@shaealh

@shaealh shaealh commented Apr 27, 2026

Copy link
Copy Markdown
Contributor

Regarding Issue #65818
related: #65836

This changes the scheduler trigger-timeout path and trigger cleanup path to lock candidate task_instance rows in deterministic primary-key order before updating them.

On MySQL/InnoDB, the previous bulk updates could reach overlapping task_instance rows through different indexes, allowing the scheduler and triggerer to acquire row/gap locks in different orders. With HA schedulers and deferrable tasks, that can deadlock.

The new flow is:

  1. Select matching task_instance.id rows ordered by primary key.
  2. Lock those rows with FOR UPDATE SKIP LOCKED.
  3. Update only the selected IDs.
  4. Repeat in bounded batches.

This keeps the existing predicates and update values intact while making lock acquisition consistent across the scheduler and triggerer writers.

Tests added:

  • scheduler timeout path processes more than one batch
  • trigger cleanup path clears trigger IDs across more than one batch

Tests run:

  • ruff check airflow-core/src/airflow/jobs/scheduler_job_runner.py airflow-core/src/airflow/models/trigger.py airflow-core/tests/unit/jobs/test_scheduler_job.py airflow-core/tests/unit/models/test_trigger.py
  • python -m compileall -q airflow-core/src/airflow/jobs/scheduler_job_runner.py airflow-core/src/airflow/models/trigger.py airflow-core/tests/unit/jobs/test_scheduler_job.py airflow-core/tests/unit/models/test_trigger.py
  • AIRFLOW_HOME=/tmp/airflow-65818-test-home PATH=/tmp/airflow-test-bin:$PATH .venv/bin/python -m pytest airflow-core/tests/unit/models/test_trigger.py -k 'clean_unused' --with-db-init
  • AIRFLOW_HOME=/tmp/airflow-65818-test-home PATH=/tmp/airflow-test-bin:$PATH .venv/bin/python -m pytest airflow-core/tests/unit/jobs/test_scheduler_job.py -k 'timeout_triggers' --with-db-init

Important

🛠️ Maintainer triage note for @shaealh · by @potiuk · 2026-06-22 06:44 UTC

This PR is being closed to keep the review queue clean:

  • The ready-for-review label was stale: ~11 days with no author response since the last maintainer comment, and the branch now has merge conflicts.

The ball is in your court — reopen this PR (or open a fresh one) once you've rebased onto main and resolved the conflicts. No rush.

Automated triage — may be imperfect; a maintainer takes the next look.

@shaealh

shaealh commented Apr 27, 2026

Copy link
Copy Markdown
Contributor Author

Hi team, can I get an approval to merge? Thanks

Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py Outdated
Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py Outdated
Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py Outdated
Comment thread airflow-core/src/airflow/models/trigger.py
@potiuk potiuk removed the ready for maintainer review Set after triaging when all criteria pass. label May 19, 2026
@potiuk

potiuk commented May 26, 2026

Copy link
Copy Markdown
Member

@shaealh — There are 4 unresolved review thread(s) on this PR from uranusjr. Could you either push a fix or reply in each thread explaining why the feedback doesn't apply? When you believe the feedback is addressed, please mark the threads as resolved and ping the reviewer (uranusjr) for a final look. Thanks!


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.


Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

@shaealh

shaealh commented May 26, 2026

Copy link
Copy Markdown
Contributor Author

@uranusjr fix has been pushed

@potiuk potiuk added the ready for maintainer review Set after triaging when all criteria pass. label May 27, 2026
@eladkal

eladkal commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

@shaealh can you rebase and resolve conflicts?

@potiuk potiuk removed the ready for maintainer review Set after triaging when all criteria pass. label Jun 22, 2026
@potiuk potiuk closed this Jun 22, 2026
@fedemgp

fedemgp commented Jun 29, 2026

Copy link
Copy Markdown

Maybe this issue is solved within the context of this PR, WDYT?

@shaealh

shaealh commented Jun 30, 2026

Copy link
Copy Markdown
Contributor Author

Since this PR was closed and #65818 is still open, should I refresh the PR against latest main, or do maintainers consider #66412 sufficient to address this deadlock path as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler area:Triggerer

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants