AIP-104: Task Iteration and Dynamic Task Batching #62922

Open
dabla wants to merge 92 commits into apache:main from dabla:feature/dynamic-task-iteration

@dabla

@dabla dabla commented Mar 5, 2026

Contributor

Was generative AI tooling used to co-author this PR?
  • [x] Yes (please specify the tool below)

GitHub Copilot with Claude Opus 4.6 for some parts, such as setting up tests or improving documentation.

Description

This PR is the initial implementation of Dynamic Task Iteration (DTI), as discussed in the devlist and building upon the foundations of AIP-104.

For further context on the use cases and performance benefits of DTI, see this Medium Article.

The XCom Database Constraint Challenge
While porting our internal "monkey-patched" version of DTI (used since Airflow 2.x) to the core, I've identified a significant technical hurdle regarding XCom handling.

The Issue

Around Airflow 2.10/2.11, a change was introduced to the database constraints for the XCom table. Specifically:

  • Current State: The DB prevents creating indexed XComs (map_index >= 0) unless a corresponding mapped TaskInstance exists in the task_instance table.
  • The Conflict: DTI is designed to process multiple indexed XComs within a single Task Instance. Because there is no 1-to-1 mapping of map_index to a physical TI row, the DB constraint blocks the insertion of these results.

The drawback of relaxing this constraint is that XComs would no longer be removed automatically from the database when a TaskInstance is deleted, which is the purpose of that constraint. Relaxing it would therefore be a good solution for DTI, but not for Dynamic Task Mapping (DTM).

Current Workaround in this PR

To maintain functionality without immediate schema changes, I have implemented a new XComIterable class. This appends the index directly to the XCom key to bypass the constraint and manages the iteration logic internally.
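The key-suffix idea can be pictured with a minimal sketch. The helper names `indexed_key` and `parse_indexed_key`, the `#` separator, and the in-memory `store` are all hypothetical and chosen only for illustration; the PR's actual XComIterable may encode keys differently.

```python
# Hypothetical illustration: fold the iteration index into the XCom key so
# every result is stored with map_index = -1 (the value used for unmapped tasks).
SEPARATOR = "#"  # assumed separator, not taken from the PR


def indexed_key(key: str, index: int) -> str:
    """Build a per-iteration key such as 'return_value#3'."""
    if index < 0:
        raise ValueError("index must be non-negative")
    return f"{key}{SEPARATOR}{index}"


def parse_indexed_key(stored_key: str) -> tuple[str, int]:
    """Recover the base key and the iteration index from a stored key."""
    base, _, raw_index = stored_key.rpartition(SEPARATOR)
    return base, int(raw_index)


# In-memory stand-in for the XCom table, keyed by (task_id, map_index, key).
store: dict[tuple[str, int, str], object] = {}
for i, value in enumerate(["bulbasaur", "ivysaur", "venusaur"]):
    store[("get_pokemon", -1, indexed_key("return_value", i))] = value

# Iterating the results means walking the suffixed keys in index order.
results = [
    store[("get_pokemon", -1, indexed_key("return_value", i))] for i in range(3)
]
```

Because no row uses a non-negative map_index, a constraint of the kind described above is never triggered in this model; the cost is that the index now lives in the key string and has to be parsed back out.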

I believe the cleanest path forward is to add a dedicated route (POST method) to the execution API that retrieves multiple XComs for a TaskInstance by passing multiple keys in one request. That would reduce the number of round trips between the API server and the XComIterable in the Task SDK.
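The round-trip argument can be made concrete with a toy model. `FakeExecutionApi`, `get_one`, and `get_many` are hypothetical stand-ins that only count requests; they are not the execution API's real interface, and the key format is assumed.

```python
class FakeExecutionApi:
    """Toy stand-in for the API server that counts requests (hypothetical)."""

    def __init__(self, xcoms: dict[str, object]) -> None:
        self._xcoms = xcoms
        self.requests = 0

    def get_one(self, key: str) -> object:
        # One request per key, as with per-key lookups.
        self.requests += 1
        return self._xcoms[key]

    def get_many(self, keys: list[str]) -> dict[str, object]:
        # One request for all keys, as a bulk route would allow.
        self.requests += 1
        return {key: self._xcoms[key] for key in keys}


keys = [f"return_value#{i}" for i in range(100)]  # assumed key format
data = {key: i for i, key in enumerate(keys)}

per_key_api = FakeExecutionApi(data)
per_key = {key: per_key_api.get_one(key) for key in keys}

bulk_api = FakeExecutionApi(data)
bulk = bulk_api.get_many(keys)
```

Both approaches return the same values, but the per-key variant issues 100 requests where the bulk variant issues one.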

Examples

The examples below assume an HTTP connection named pokeapi pointing to https://pokeapi.co.

Task Iteration

This example fetches a list of Pokémon from the PokéAPI and then uses Dynamic Task Iteration (DTI) to retrieve the details of each Pokémon. A single task instance processes all Pokémon URLs.

from airflow.sdk import dag, task
from airflow.providers.http.hooks.http import HttpHook, HttpAsyncHook

from pendulum import datetime


@dag(
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
)
def pokemon_iteration():
    @task
    def list_pokemon() -> list[str]:
        response = HttpHook(
            http_conn_id="pokeapi",
            method="GET",
        ).run(
            endpoint="api/v2/pokemon?limit=100",
        )

        return [
            pokemon["url"].replace("https://pokeapi.co/", "")
            for pokemon in response.json()["results"]
        ]

    @task(
        retries=3,
        task_concurrency=2,
        show_return_value_in_logs=False,
    )
    async def get_pokemon(url: str):
        async with HttpAsyncHook(
            http_conn_id="pokeapi",
            method="GET",
        ).session() as session:
            response = await session.run(endpoint=url)
            return await response.json()

    get_pokemon.iterate(
        url=list_pokemon(),
    )


pokemon_iteration()

Task Iteration with Dynamic Task Batching

This example performs the same work as above, but batches the workload into two concurrent task instances. Each task instance processes approximately half of the Pokémon URLs using Task Iteration.

from airflow.sdk import dag, task
from airflow.providers.http.hooks.http import HttpHook, HttpAsyncHook

from pendulum import datetime


@dag(
    start_date=datetime(2025, 1, 1),
    schedule=None,
    catchup=False,
)
def pokemon_batched_iteration():
    @task
    def list_pokemon() -> list[str]:
        response = HttpHook(
            http_conn_id="pokeapi",
            method="GET",
        ).run(
            endpoint="api/v2/pokemon?limit=100",
        )

        return [
            pokemon["url"].replace("https://pokeapi.co/", "")
            for pokemon in response.json()["results"]
        ]

    @task(
        retries=3,
        task_concurrency=2,
        show_return_value_in_logs=False,
    )
    async def get_pokemon(url: str):
        async with HttpAsyncHook(
            http_conn_id="pokeapi",
            method="GET",
        ).session() as session:
            response = await session.run(endpoint=url)
            return await response.json()

    get_pokemon.batch(size=2).iterate(
        url=list_pokemon(),
    )


pokemon_batched_iteration()

Comparison

| Pattern | Task Instances | Work Per Task |
| --- | --- | --- |
| get_pokemon.expand(url=urls) | 100 | 1 Pokémon |
| get_pokemon.iterate(url=urls) | 1 | 100 Pokémon |
| get_pokemon.batch(size=2).iterate(url=urls) | 2 | ~50 Pokémon each |

This demonstrates how Task Iteration can significantly reduce TaskInstance creation overhead while still allowing controlled parallelism through batching.

Note: batch(size=2) creates two batches and therefore two concurrent TaskInstances. Each TaskInstance then processes its assigned items using Task Iteration.
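
One way to picture the split is the sketch below. `split_into_batches` is a hypothetical helper that produces contiguous chunks of near-equal size; the PR's actual partitioning strategy is not shown here and may differ (for example, round-robin assignment).

```python
def split_into_batches(items: list[str], batches: int) -> list[list[str]]:
    """Split items into `batches` contiguous chunks of near-equal size."""
    if batches < 1:
        raise ValueError("batches must be at least 1")
    base, extra = divmod(len(items), batches)
    result, start = [], 0
    for i in range(batches):
        # The first `extra` chunks take one additional item.
        end = start + base + (1 if i < extra else 0)
        result.append(items[start:end])
        start = end
    return result


urls = [f"api/v2/pokemon/{n}/" for n in range(1, 101)]
chunks = split_into_batches(urls, 2)
print([len(chunk) for chunk in chunks])  # [50, 50]
```

With 100 URLs and two batches, each chunk holds 50 items, matching the "~50 Pokémon each" row in the comparison; an odd item count would leave the first chunk one item larger.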



@dabla dabla requested review from amoghrajesh, ashb and kaxil as code owners March 5, 2026 09:28
@dabla dabla marked this pull request as draft March 5, 2026 09:35
@dabla dabla force-pushed the feature/dynamic-task-iteration branch 3 times, most recently from d8a30b9 to edad5de Compare March 5, 2026 12:39
kaxil previously requested changes Mar 5, 2026

@kaxil kaxil left a comment

Member

Thanks for working on this; excited to see DTI taking shape for Airflow 3.2. I've gone through the full diff and have feedback on the implementation: some items are bugs that would crash at runtime, others are design choices worth iterating on.

A few high-level things:

  1. No tests. ~700 lines of new production code with zero test coverage. We need tests for IterableOperator, TaskExecutor, MappedTaskInstance, HybridExecutor, XComIterable, DecoratedDeferredAsyncOperator, and the iterate/iterate_kwargs methods — covering success, failure, retry, deferral, and edge cases.

  2. Worker resilience. Since DTI runs N sub-tasks inside a single worker process, we need to think through what happens when that worker dies mid-execution — the scheduler has no record of which sub-tasks completed. Worth documenting the expected behavior and trade-offs here (and whether we want to add checkpointing later).

  3. Thread safety. Several shared mutable structures (context dict, os.environ) are accessed concurrently from multiple threads without synchronization. This needs to be addressed before merge.

Inline comments below with specifics.
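
As an illustration of the checkpointing idea raised in point 2, here is a minimal sketch. Neither the review nor the PR specifies a checkpointing design; `run_with_checkpoint`, the JSON file format, and the simulated failure are assumptions made purely for illustration.

```python
import json
import tempfile
from pathlib import Path


def run_with_checkpoint(items, process, checkpoint: Path) -> list[int]:
    """Process items in order, skipping indexes already recorded as done."""
    done = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    processed_now = []
    for index, item in enumerate(items):
        if index in done:
            continue  # finished before the previous attempt died
        process(item)
        done.add(index)
        processed_now.append(index)
        # Persist after every item so a crash loses at most one unit of work.
        checkpoint.write_text(json.dumps(sorted(done)))
    return processed_now


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "checkpoint.json"

    def flaky(item: str) -> None:
        if item == "c":
            raise RuntimeError("simulated worker death")

    try:
        run_with_checkpoint(["a", "b", "c", "d"], flaky, path)
    except RuntimeError:
        pass

    # A second attempt resumes at index 2 instead of starting over.
    resumed = run_with_checkpoint(["a", "b", "c", "d"], lambda item: None, path)
```

The trade-off is extra writes per sub-task in exchange for not repeating completed work after a worker failure.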

Comment thread task-sdk/src/airflow/sdk/definitions/iterableoperator.py Outdated
Comment thread task-sdk/src/airflow/sdk/bases/operator.py
Comment thread task-sdk/src/airflow/sdk/definitions/mappedoperator.py
Comment thread task-sdk/src/airflow/sdk/definitions/_internal/expandinput.py
Comment thread task-sdk/src/airflow/sdk/definitions/_internal/expandinput.py
Comment thread task-sdk/src/airflow/sdk/execution_time/executor.py
Comment thread task-sdk/src/airflow/sdk/execution_time/lazy_sequence.py
Comment thread task-sdk/src/airflow/sdk/definitions/iterableoperator.py Outdated
Comment thread task-sdk/src/airflow/sdk/definitions/iterableoperator.py Outdated
Comment thread task-sdk/src/airflow/sdk/definitions/iterableoperator.py Outdated
@dabla

dabla commented Mar 5, 2026

Contributor Author

Thanks for working on this — DTI is an interesting concept and I can see the use case. I've gone through the full diff and have a number of concerns, some are bugs that would crash at runtime, others are architectural questions worth discussing before this goes further.

A few high-level things:

  1. No tests. ~700 lines of new production code with zero test coverage. We need tests for IterableOperator, TaskExecutor, MappedTaskInstance, HybridExecutor, XComIterable, DecoratedDeferredAsyncOperator, and the iterate/iterate_kwargs methods — covering success, failure, retry, deferral, and edge cases.

Thanks for pointing this out. As mentioned earlier on Slack, this PR is currently intended as an initial draft to demonstrate the concept and gather early architectural feedback.

I agree that proper test coverage is essential before this can move forward. The plan is to add unit tests covering the components you mentioned (IterableOperator, TaskExecutor, MappedTaskInstance, HybridExecutor, XComIterable, DecoratedDeferredAsyncOperator, and the iterate/iterate_kwargs APIs), including scenarios for success, retries, failures, deferral, and edge cases.

Once we converge on the architectural direction, I will add the corresponding test suite.

  2. Architectural concern. This builds a mini-executor inside an operator, running N tasks in threads with in-memory XCom, custom retry logic, and sleep()-based retry delays. The scheduler has no visibility into sub-task states, so if the worker dies mid-execution there's no record of which sub-tasks completed. This feels like it needs broader design discussion (probably an AIP) before merging, since it fundamentally changes how task execution works.

I agree this is an important architectural concern and worth discussing further.

The goal of this prototype is to explore a trade-off between observability and scheduling overhead; @ashb and @potiuk raised the same point earlier. If we try to preserve the same visibility and lifecycle guarantees as Dynamic Task Mapping, we essentially end up re-implementing DTM semantics, which brings back the same scheduler overhead that this approach is trying to avoid.

This proposal intentionally explores a different point in that trade-off space: executing iterations within a single task while allowing controlled parallelism. That does mean the scheduler has less visibility into the internal execution units, but also less load.

  3. Thread safety. Several shared mutable structures (context dict, os.environ) are accessed concurrently from multiple threads without synchronization.

Good point — thread safety needs to be handled carefully here.

Regarding the task context, my understanding is that operators already receive a per-task context instance, but you're right that when running iterations concurrently we should avoid sharing mutable structures across threads. One possible approach would be to create a shallow or deep copy of the context for each execution unit to ensure isolation.

If you have concerns about specific structures (e.g., os.environ or others), I'm happy to address them and introduce appropriate synchronization or isolation mechanisms where needed.
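
A minimal sketch of the per-execution-unit copy approach follows. The context keys (`task_id`, `params`, `map_index`) and the helper `run_iteration` are simplified placeholders, not the real Airflow context or anything taken from the PR.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


def run_iteration(context: dict, index: int, item: str) -> dict:
    """Run one iteration against its own private copy of the context."""
    local = copy.copy(context)  # shallow copy: top-level keys are isolated
    local["map_index"] = index  # this write is invisible to other threads
    # A shallow copy still shares nested objects, so build a fresh dict
    # rather than mutating the shared one.
    local["params"] = {**context["params"], "item": item}
    return local


shared_context = {"task_id": "get_pokemon", "params": {"retries": 3}}
items = ["bulbasaur", "ivysaur", "venusaur"]


def _run(pair: tuple[int, str]) -> dict:
    index, item = pair
    return run_iteration(shared_context, index, item)


with ThreadPoolExecutor(max_workers=2) as pool:
    contexts = list(pool.map(_run, enumerate(items)))
```

A deep copy would remove the need to rebuild nested values by hand, at a higher cost per iteration. Copying does not help with os.environ, which is process-wide; passing values explicitly instead of mutating it avoids that shared state.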

@dabla dabla changed the title from "refactor: Implemented Dynamic Task Iteration" to "Implemented Dynamic Task Iteration" Mar 5, 2026
@kaxil kaxil self-requested a review March 12, 2026 00:02
@kaxil kaxil dismissed their stale review March 12, 2026 00:02

Stale review

Comment thread task-sdk/src/airflow/sdk/bases/operator.py
Comment thread task-sdk/src/airflow/sdk/bases/operator.py Outdated
Comment thread task-sdk/src/airflow/sdk/bases/operator.py
Comment thread task-sdk/src/airflow/sdk/definitions/iterableoperator.py
Comment thread task-sdk/src/airflow/sdk/definitions/iterableoperator.py Outdated
Comment thread task-sdk/src/airflow/sdk/execution_time/executor.py
Comment thread task-sdk/src/airflow/sdk/definitions/_internal/expandinput.py
Comment thread task-sdk/tests/task_sdk/definitions/conftest.py Outdated
@dabla dabla force-pushed the feature/dynamic-task-iteration branch from 960438c to 765fcfb Compare March 18, 2026 23:10
@dabla dabla marked this pull request as ready for review March 19, 2026 17:05
@dabla dabla requested a review from kaxil March 19, 2026 20:27
@kaxil kaxil requested a review from uranusjr March 20, 2026 00:19
Comment thread task-sdk/src/airflow/sdk/definitions/iterableoperator.py Outdated
Comment thread task-sdk/src/airflow/sdk/bases/operator.py Outdated
Comment thread task-sdk/src/airflow/sdk/definitions/mappedoperator.py Outdated
Comment thread task-sdk/tests/task_sdk/definitions/conftest.py Outdated
@kaxil

kaxil commented Mar 20, 2026

Member

@uranusjr You should also review this PR since it touches several important modules :)

@dabla dabla force-pushed the feature/dynamic-task-iteration branch from b11f852 to 9f2c750 Compare March 20, 2026 08:35
@dabla dabla requested a review from kaxil March 20, 2026 16:01
@dabla dabla force-pushed the feature/dynamic-task-iteration branch from 16ec1fc to 3242037 Compare March 20, 2026 17:59
@dabla dabla changed the title from "Implemented Dynamic Task Iteration" to "AIP-98: Dynamic Task Iteration" Mar 21, 2026
@dabla

dabla commented Mar 21, 2026

Contributor Author

@uranusjr @kaxil In our patched Airflow installation I had to register the XComIterable manually with serde for serialization. How do I make sure it’s automatically registered with serde?

@dabla dabla marked this pull request as draft April 14, 2026 19:43
@dabla dabla changed the title from "AIP-98: Dynamic Task Iteration" to "AIP-104: Dynamic Task Iteration and Dynamic Task Partitioning" Apr 17, 2026
@dabla dabla force-pushed the feature/dynamic-task-iteration branch from 05fd601 to 5d7a705 Compare June 29, 2026 07:10
Comment thread task-sdk/src/airflow/sdk/definitions/_internal/expandinput.py Outdated
Comment thread task-sdk/src/airflow/sdk/definitions/_internal/expandinput.py Outdated
Comment thread task-sdk/src/airflow/sdk/execution_time/task_runner.py
Comment thread task-sdk/pyproject.toml Outdated
Comment thread .github/CODEOWNERS Outdated
@dabla dabla requested review from uranusjr and removed request for bolkedebruin July 2, 2026 15:47
# Conflicts:
#	task-sdk/src/airflow/sdk/execution_time/task_runner.py
#	task-sdk/tests/task_sdk/definitions/test_xcom_arg.py
@dabla dabla force-pushed the feature/dynamic-task-iteration branch from 8475b56 to cf0a71e Compare July 2, 2026 15:56
3 participants