AIP-104: Task Iteration and Dynamic Task Batching #62922
Thanks for working on this. I'm excited to see DTI taking shape for Airflow 3.2. I've gone through the full diff and have feedback on the implementation: some items are bugs that would crash at runtime, others are design choices worth iterating on.
A few high-level things:
- No tests. ~700 lines of new production code with zero test coverage. We need tests for `IterableOperator`, `TaskExecutor`, `MappedTaskInstance`, `HybridExecutor`, `XComIterable`, `DecoratedDeferredAsyncOperator`, and the `iterate`/`iterate_kwargs` methods, covering success, failure, retry, deferral, and edge cases.
- Worker resilience. Since DTI runs N sub-tasks inside a single worker process, we need to think through what happens when that worker dies mid-execution: the scheduler has no record of which sub-tasks completed. Worth documenting the expected behavior and trade-offs here (and whether we want to add checkpointing later).
- Thread safety. Several shared mutable structures (the `context` dict, `os.environ`) are accessed concurrently from multiple threads without synchronization. This needs to be addressed before merge.
Inline comments below with specifics.
Thanks for pointing this out. As mentioned earlier on Slack, this PR is currently intended as an initial draft to demonstrate the concept and gather early architectural feedback. I agree that proper test coverage is essential before this can move forward. The plan is to add unit tests covering the components you mentioned (IterableOperator, TaskExecutor, MappedTaskInstance, HybridExecutor, XComIterable, DecoratedDeferredAsyncOperator, and the iterate/iterate_kwargs APIs), including scenarios for success, retries, failures, deferral, and edge cases. Once we converge on the architectural direction, I will add the corresponding test suite.
I agree this is an important architectural concern and worth discussing further. The goal of this prototype is to explore a trade-off between observability and scheduling overhead; @ashb and @potiuk raised the same point earlier. If we try to preserve the same visibility and lifecycle guarantees as Dynamic Task Mapping, we essentially end up re-implementing DTM semantics, which brings back the same scheduler overhead that this approach is trying to avoid. This proposal intentionally explores a different point in that trade-off space: executing iterations within a single task while allowing controlled parallelism. That does mean the scheduler has less visibility into the internal execution units, but also less load.
Good point: thread safety needs to be handled carefully here. Regarding the task context, my understanding is that operators already receive a per-task context instance, but you're right that when running iterations concurrently we should avoid sharing mutable structures across threads. One possible approach would be to create a shallow or deep copy of the context for each execution unit to ensure isolation. If you have concerns about specific structures (e.g., os.environ or others), I'm happy to address them and introduce appropriate synchronization or isolation mechanisms where needed.
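As a rough illustration of the per-unit copy idea, here is a minimal standard-library sketch. The `run_unit` and `run_isolated` helpers and the shape of the context dict are hypothetical and not taken from this PR. A deep copy is used because a shallow copy would still share nested objects such as `params`.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


def run_unit(context: dict, index: int) -> dict:
    # Hypothetical execution unit: it mutates only its own private copy.
    context["map_index"] = index
    context["params"]["seen"].append(index)
    return context


def run_isolated(shared_context: dict, count: int) -> list[dict]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The copy is made in the submitting thread, before any worker runs.
        futures = [
            pool.submit(run_unit, copy.deepcopy(shared_context), i)
            for i in range(count)
        ]
        return [future.result() for future in futures]


shared = {"params": {"seen": []}}
results = run_isolated(shared, 3)
```

Copying the context does nothing for `os.environ`, which is process-wide; that would need a lock or a design that avoids mutating it from worker threads.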
@uranusjr You should also review this PR since it touches several important modules :)
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
…hread safety instead of global list and call it from within TaskExecutor
…tion as this is a runtime property which we don't want to be serialized
# Conflicts: # task-sdk/src/airflow/sdk/execution_time/task_runner.py # task-sdk/tests/task_sdk/definitions/test_xcom_arg.py
…with TaskExecutor
…leTaskInstanceException
… reachable via map(timeout=self.timeout) (which raises TimeoutError from the iterator, not per-task), and sync tasks in the thread pool have no enforced timeout
…ffering XCom inputs
…te_literal_cross_product
Was generative AI tooling used to co-author this PR?
GitHub Copilot with Claude Opus 4.6 for some parts, such as setting up tests or improving documentation.
Description
This PR is the initial implementation of Dynamic Task Iteration (DTI), as discussed on the devlist, and builds upon the foundations of AIP-104.
For further context on the use cases and performance benefits of DTI, see this Medium Article.
The XCom Database Constraint Challenge
While porting our internal "monkey-patched" version of DTI (used since Airflow 2.x) to the core, I've identified a significant technical hurdle regarding XCom handling.
The Issue
Around Airflow 2.10/2.11, a change was introduced to the database constraints for the XCom table. Specifically:
The drawback is that XComs wouldn't automatically be removed from the database when a TaskInstance is deleted, which is the purpose of that constraint. So it would be a good solution for DTI, but not for DTM.
Current Workaround in this PR
To maintain functionality without immediate schema changes, I have implemented a new XComIterable class. This appends the index directly to the XCom key to bypass the constraint and manages the iteration logic internally.
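To illustrate the key-suffix idea only, here is a minimal sketch with an in-memory dict standing in for the XCom table. The names `IndexedXComStore` and `indexed_key`, and the `key_index` format, are hypothetical; the actual `XComIterable` in this PR may work differently.

```python
from collections.abc import Iterator
from typing import Any


def indexed_key(key: str, index: int) -> str:
    # Hypothetical key format: the iteration index becomes part of the key.
    return f"{key}_{index}"


class IndexedXComStore:
    """In-memory stand-in for an XCom backend keyed by (task_id, key)."""

    def __init__(self) -> None:
        self._rows: dict[tuple[str, str], Any] = {}

    def push(self, task_id: str, key: str, index: int, value: Any) -> None:
        # Each iteration result is stored under its own suffixed key.
        self._rows[(task_id, indexed_key(key, index))] = value

    def iterate(self, task_id: str, key: str, length: int) -> Iterator[Any]:
        # Lazily yield values in index order, one lookup per index.
        for index in range(length):
            yield self._rows[(task_id, indexed_key(key, index))]


store = IndexedXComStore()
for i, name in enumerate(["bulbasaur", "ivysaur", "venusaur"]):
    store.push("get_pokemon", "return_value", i, name)

values = list(store.iterate("get_pokemon", "return_value", 3))
```

The one-lookup-per-index loop in this sketch is the kind of chatter that a bulk retrieval route would reduce.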
I believe the cleanest path forward is to add a dedicated route (POST method) to the execution API that retrieves multiple XComs for a TaskInstance by multiple keys. That way, less interaction would be needed between the API server and the XComIterable in the Task SDK.
Examples
The examples below assume an HTTP connection named `pokeapi` pointing to `https://pokeapi.co`.
Task Iteration
This example fetches a list of Pokémon from the PokéAPI and then uses Dynamic Task Iteration (DTI) to retrieve the details of each Pokémon. A single task instance processes all Pokémon URLs.
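The idea of one task instance walking every input can be modelled in plain Python. This is a conceptual sketch, not the Airflow API and not the PR's example: the `iterate` helper is hypothetical, `get_pokemon` is a stand-in that performs no HTTP request, and the URLs are sample values in the PokéAPI style.

```python
from concurrent.futures import ThreadPoolExecutor


def get_pokemon(url: str) -> dict:
    # Stand-in for the HTTP call; a real task would request the URL.
    return {"url": url, "name": url.rstrip("/").rsplit("/", 1)[-1]}


def iterate(func, inputs: list[str], max_workers: int = 4) -> list[dict]:
    # One "task instance" processes every input.
    # Executor.map returns results in input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(func, inputs))


urls = [
    "https://pokeapi.co/api/v2/pokemon/bulbasaur/",
    "https://pokeapi.co/api/v2/pokemon/charmander/",
    "https://pokeapi.co/api/v2/pokemon/squirtle/",
]
details = iterate(get_pokemon, urls)
```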
Task Iteration with Dynamic Task Batching
This example performs the same work as above, but batches the workload into two concurrent task instances. Each task instance processes approximately half of the Pokémon URLs using Task Iteration.
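One way to picture the split into roughly equal halves is shown below. This sketch assumes that `size` is the number of batches, as the description above suggests; the `split_into_batches` helper is hypothetical, and how the PR actually partitions the inputs is not shown here.

```python
def split_into_batches(items: list, size: int) -> list[list]:
    # Spread the remainder over the first batches so that
    # batch lengths differ by at most one.
    if size < 1:
        raise ValueError("size must be at least 1")
    base, extra = divmod(len(items), size)
    batches, start = [], 0
    for i in range(size):
        end = start + base + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches


# Sample URLs in the PokéAPI style; no request is made.
urls = [f"https://pokeapi.co/api/v2/pokemon/{n}/" for n in range(1, 6)]
batches = split_into_batches(urls, 2)
```

Each batch would then be handled by its own task instance, which iterates over its share of the URLs.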
Comparison
- Dynamic Task Mapping: get_pokemon.expand(url=urls)
- Task Iteration: get_pokemon.iterate(url=urls)
- Task Iteration with Dynamic Task Batching: get_pokemon.batch(size=2).iterate(url=urls)
This demonstrates how Task Iteration can significantly reduce TaskInstance creation overhead while still allowing controlled parallelism through batching.