Fix scheduler/triggerer deadlock on task_instance for deferrable tasks #65836

Draft
rapsealk wants to merge 2 commits into apache:main from rapsealk:fix-65818-trigger-id-deadlock
@rapsealk rapsealk commented Apr 25, 2026

Contributor

On HA scheduler deployments backed by MySQL, the scheduler's SchedulerJobRunner.check_trigger_timeouts() and the triggerer's Trigger.clean_unused() both issue bulk UPDATEs against task_instance rows but reach those rows via different indexes — (state, trigger_timeout) vs (trigger_id). InnoDB therefore acquires row + gap locks in different orders for the two queries, producing classic A→B / B→A deadlocks once a deferrable workload has overlapping rows in flight (especially with multiple scheduler replicas).

This change reshapes both writers to:

  1. SELECT id ... ORDER BY id LIMIT N FOR UPDATE SKIP LOCKED
  2. UPDATE task_instance ... WHERE id IN (...)

so that both paths take row locks in deterministic primary-key order, eliminating the cross-index deadlock. SKIP LOCKED additionally keeps concurrent scheduler replicas from blocking on each other. Behaviour is unchanged: same predicates, same SET clauses, just batched and locked.
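A minimal sketch of the two-step shape above, assuming a simplified `task_instance` table and a hypothetical helper name (`reschedule_timed_out_in_batches`); it is not the PR's actual code. It runs on SQLite so that it is self-contained, which means the `FOR UPDATE SKIP LOCKED` clause appears only as a comment, since SQLite does not support it. The `SET` clause and the batch size are likewise simplified stand-ins.

```python
import sqlite3

BATCH_SIZE = 2  # hypothetical value; small so the loop needs several passes


def make_demo_db():
    """Build an in-memory table with five timed-out rows and one that is not."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE task_instance ("
        "id INTEGER PRIMARY KEY, state TEXT, trigger_id INTEGER, trigger_timeout INTEGER)"
    )
    rows = [("deferred", i, 10) for i in range(5)] + [("deferred", 99, 1000)]
    conn.executemany(
        "INSERT INTO task_instance (state, trigger_id, trigger_timeout) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def reschedule_timed_out_in_batches(conn, now, batch_size=BATCH_SIZE):
    """Update matching rows in primary-key order, one bounded batch at a time."""
    total = 0
    while True:
        # Step 1: pick candidate ids in primary-key order.
        # On MySQL or PostgreSQL this SELECT would end with
        # FOR UPDATE SKIP LOCKED; SQLite has no such clause, so it is omitted.
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM task_instance "
                "WHERE state = 'deferred' AND trigger_timeout < ? "
                "ORDER BY id LIMIT ?",
                (now, batch_size),
            )
        ]
        if not ids:
            return total
        # Step 2: update exactly those rows, addressed by primary key.
        placeholders = ",".join("?" * len(ids))
        conn.execute(
            "UPDATE task_instance SET state = 'scheduled', trigger_id = NULL "
            f"WHERE id IN ({placeholders})",
            ids,
        )
        conn.commit()  # end the transaction so each batch holds locks only briefly
        total += len(ids)
```

The loop terminates because each updated row stops matching the `state = 'deferred'` predicate, so repeated passes drain the candidate set even when more rows match than fit in one batch.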

Two regression tests cover convergence when more rows match than fit in a single batch (the batch size is monkeypatched to 2 in the tests).

closes: #65818


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Opus 4.7 (1M context)

Generated-by: Claude Opus 4.7 (1M context) following the guidelines


Important

🛠️ Maintainer triage note for @rapsealk · by @potiuk · 2026-06-18 13:57 UTC

Paused pending your next update — this PR has been inactive for ~51 days, so it's been moved to draft to keep the review queue clear:

  • Rebase on the latest main, address any new failures, and mark it Ready for review when you pick it back up — no rush.
  • See the Pull Request quality criteria.

The ball is in your court — you've been assigned to this PR.

Automated triage — may be imperfect; a maintainer takes the next look.

@rapsealk rapsealk requested review from XD-DENG and ashb as code owners April 25, 2026 15:14
@boring-cyborg boring-cyborg Bot added the area:Scheduler (including HA (high availability) scheduler) and area:Triggerer labels Apr 25, 2026
The scheduler's check_trigger_timeouts() and the triggerer's Trigger.clean_unused()
both issued bulk UPDATEs against task_instance rows referenced via different index
paths (state+trigger_timeout vs trigger_id), letting InnoDB acquire row+gap locks
in different orders and producing classic A-B / B-A deadlocks under HA scheduler
deployments — especially with multiple scheduler replicas.

Both writers now select candidate ids in primary-key order with SELECT ... FOR
UPDATE SKIP LOCKED, then UPDATE ... WHERE id IN (...) in bounded batches. The
deterministic PK lock order eliminates the cross-index deadlock; SKIP LOCKED keeps
concurrent scheduler replicas from blocking each other.

closes: apache#65818

Assisted-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rapsealk rapsealk force-pushed the fix-65818-trigger-id-deadlock branch from e35be28 to aa571ff Compare April 25, 2026 15:15
Labels

area:Scheduler (including HA (high availability) scheduler), area:Triggerer, ready for maintainer review (set after triaging when all criteria pass)

Development

Successfully merging this pull request may close these issues.

Airflow 3.2.0 scheduler/triggerer deadlock on task_instance due to concurrent updates of deferrable tasks

2 participants