
Introducing object store backend for task and asset store#68283

Merged
amoghrajesh merged 19 commits into
apache:mainfrom
astronomer:aip-103-object-storage-backend
Jun 30, 2026

Conversation

@amoghrajesh

@amoghrajesh amoghrajesh commented Jun 9, 2026

Contributor

Was generative AI tooling used to co-author this PR?
  • Yes: claude sonnet 4.6

Built atop: #68274 (only last commit is relevant)

What

AIP-103 introduced task and asset state management backed by the metadata database. For deployments that need to store large state values in a custom backend, this PR adds an object storage alternative via apache-airflow-providers-common-io, similar to the XCom object storage backend: https://airflow.apache.org/docs/apache-airflow-providers-common-io/stable/xcom_backend.html

Current behaviour

Task and asset state can only be stored in the Airflow metadata database (MetastoreStateBackend). Values can in principle be offloaded to S3, GCS, Azure Blob, or any other object storage through a custom backend, but no example backend exists that does so.

Proposed change

Adds StoreObjectStorageBackend to providers/common/io, which stores task and asset state on object storage using ObjectStoragePath. The backend supports:

  • Threshold-based offloading: state_store_objectstorage_threshold = 0 (default) offloads all values to the backend. Set a positive byte count to keep small values inline in the database and offload only large ones.
  • Optional compression: set state_store_objectstorage_compression = gzip (or any fsspec-supported codec).
  • Stable paths across retries: task state is keyed on (dag_id, run_id, task_id, map_index), making this backend suitable for operators using ResumableJobMixin.
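The threshold rule from the first bullet can be sketched as a small predicate. This is only an illustration, not the provider's code: `should_offload` is a hypothetical name, and keeping a value of exactly `threshold` bytes inline is an assumption read from the wording above.

```python
def should_offload(serialized: bytes, threshold: int = 0) -> bool:
    """Return True when a serialized value should go to object storage.

    Hypothetical helper. A threshold of 0 means "offload everything";
    treating negative values the same way is an assumption. Otherwise only
    values strictly larger than the threshold are offloaded (the strict
    comparison is also an assumption).
    """
    if threshold <= 0:
        return True
    return len(serialized) > threshold


print(should_offload(b"job-123"))                 # True
print(should_offload(b"job-123", threshold=50))   # False
print(should_offload(b"x" * 200, threshold=50))   # True
```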

Enable it by setting in airflow.cfg (or env vars using COMMON_IO as the section, e.g. AIRFLOW__COMMON_IO__STATE_STORE_OBJECTSTORAGE_PATH):

[state_store]
backend = airflow.providers.common.io.store.backend.StoreObjectStorageBackend

[common.io]
state_store_objectstorage_path = s3://conn_id@mybucket/task-state/
state_store_objectstorage_threshold = 0
state_store_objectstorage_compression = gzip  # optional
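For reference, Airflow derives the environment variable for a config option as AIRFLOW__{SECTION}__{KEY}, upper-cased, with the dot in a section name such as common.io becoming an underscore. A minimal sketch of that mapping (`airflow_env_var` is a hypothetical helper written for this comment, not an Airflow API):

```python
def airflow_env_var(section: str, key: str) -> str:
    """Build the environment variable name for an Airflow config option."""
    # Dots in the section name become underscores; everything is upper-cased.
    return f"AIRFLOW__{section.replace('.', '_').upper()}__{key.upper()}"


print(airflow_env_var("common.io", "state_store_objectstorage_path"))
# AIRFLOW__COMMON_IO__STATE_STORE_OBJECTSTORAGE_PATH
```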

Changes of Note

Values persist until explicitly deleted. Use your object storage provider's lifecycle policies (S3 lifecycle rules, GCS object lifecycle, etc.) to expire old state automatically.
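As an illustration of such a policy, here is a sketch that builds an S3 lifecycle configuration expiring objects under the state prefix. The rule ID and the 30-day retention are arbitrary assumptions, and the prefix simply mirrors the example path used in the config above.

```python
import json

# Hypothetical values: pick a prefix and retention that match your deployment.
lifecycle = {
    "Rules": [
        {
            "ID": "expire-task-state",
            "Filter": {"Prefix": "task-state/"},
            "Status": "Enabled",
            "Expiration": {"Days": 30},
        }
    ]
}

# Emit the document so it can be saved to a file such as lifecycle.json.
print(json.dumps(lifecycle, indent=2))
```

The resulting JSON can be applied with `aws s3api put-bucket-lifecycle-configuration --bucket mybucket --lifecycle-configuration file://lifecycle.json`; other object stores have their own equivalents.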

The backend requires airflow >= 3.3 (when BaseStoreBackend and the AssetScope/TaskScope types were introduced).

User implications / backcompat

New opt-in feature; no changes to existing deployments. Requires adding an Airflow connection with the object storage credentials (e.g. endpoint_url for MinIO).

Testing

Verified end-to-end using the example_task_store example Dag with MinIO running locally. Three scenarios were tested:

Setup

  1. Start MinIO: docker run -p 29000:9000 -p 29001:9001 -e MINIO_ROOT_USER=minioadmin -e MINIO_ROOT_PASSWORD=minioadmin quay.io/minio/minio server /data --console-address ":9001"
  2. Create bucket airflow-task-state via the MinIO console (http://localhost:29001).
image
  3. Add an Airflow connection minio with conn_type=aws, login=minioadmin, password=minioadmin, extra={"endpoint_url": "http://host.docker.internal:29000"} (use host.docker.internal when running inside Breeze/Docker).
  4. Set the base env vars and restart Airflow so the workers inherit them:
export AIRFLOW__WORKERS__STATE_STORE_BACKEND=airflow.providers.common.io.store.backend.StoreObjectStorageBackend
export AIRFLOW__COMMON_IO__STATE_STORE_OBJECTSTORAGE_PATH=s3://minio@airflow-task-state/task-state/
export AIRFLOW__COMMON_IO__STATE_STORE_OBJECTSTORAGE_THRESHOLD=0

export AIRFLOW_CONN_MINIO='{"conn_type": "aws", "login": "minioadmin", "password": "minioadmin", "extra": {"endpoint_url": "http://host.docker.internal:29000"}}'
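Hand-writing the JSON connection value is easy to get wrong because of the nested quoting. A small sketch that generates the same export line with the standard library (`minio_conn_json` is a hypothetical helper; the credentials are the local test values from the setup steps):

```python
import json


def minio_conn_json(login: str, password: str, endpoint_url: str) -> str:
    """Serialize an AWS-type connection that points at a MinIO endpoint."""
    return json.dumps(
        {
            "conn_type": "aws",
            "login": login,
            "password": password,
            "extra": {"endpoint_url": endpoint_url},
        }
    )


value = minio_conn_json("minioadmin", "minioadmin", "http://host.docker.internal:29000")
print(f"export AIRFLOW_CONN_MINIO='{value}'")
```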

Scenario 1 — All values offloaded (threshold=0)

AIRFLOW__COMMON_IO__STATE_STORE_OBJECTSTORAGE_THRESHOLD=0

Verified: files appear in MinIO under task-state/example_task_store/.../. On try 1, the job ID is written and the task intentionally fails. On try 2, the job ID is read back from MinIO and the task reattaches to the existing job.
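The write-fail-reattach flow in this scenario can be sketched without Airflow. Everything below is a stand-in for illustration: the in-memory dict replaces the real backend, and `TaskKey`, `run_task`, the run ID and the task ID are hypothetical. The point is that the key omits the try number, so try 2 finds what try 1 wrote.

```python
from typing import NamedTuple


class TaskKey(NamedTuple):
    """Identity that stays the same across retries (no try number)."""

    dag_id: str
    run_id: str
    task_id: str
    map_index: int


# In-memory stand-in for the state backend (illustration only).
state: dict[tuple[TaskKey, str], str] = {}


def run_task(key: TaskKey, try_number: int) -> str:
    job_id = state.get((key, "job_id"))
    if job_id is not None:
        # A previous try already submitted the job: reattach, do not resubmit.
        return f"reattached to {job_id}"
    job_id = f"job-{key.task_id}"  # pretend to submit an external job
    state[(key, "job_id")] = job_id
    if try_number == 1:
        raise RuntimeError("simulated failure after submitting the job")
    return f"submitted {job_id}"


key = TaskKey("example_task_store", "manual__1", "submit", -1)
try:
    run_task(key, try_number=1)
except RuntimeError:
    pass  # try 1 fails on purpose, but the job ID is already stored
print(run_task(key, try_number=2))  # reattached to job-submit
```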

Try 1:
image

Try 2:

image

Task store tab:

image

Minio:
image


Scenario 2 — Threshold: small values stay in DB, large ones go to MinIO

AIRFLOW__COMMON_IO__STATE_STORE_OBJECTSTORAGE_THRESHOLD=50

Result:

Of the 4 task store values, only result is larger than 50 bytes and is stored in the MinIO backend. The rest stay in the database.

Try 1:
image

Try 2:
image

image image

Scenario 3 — Compression with gzip

AIRFLOW__COMMON_IO__STATE_STORE_OBJECTSTORAGE_THRESHOLD=0
AIRFLOW__COMMON_IO__STATE_STORE_OBJECTSTORAGE_COMPRESSION=gzip

Verified: files appear in MinIO with a .gz suffix (e.g. task-state/example_task_store/.../job_id.gz). Decompression is inferred automatically on read (compression="infer"). Task completes successfully.
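To make the suffix-based behaviour concrete, here is a standard-library sketch that writes a gzip-compressed value with a .gz suffix and picks the codec from the suffix on read. It is a local-filesystem stand-in, not the backend's code (which goes through ObjectStoragePath and fsspec); `write_value` and `read_value` are hypothetical names.

```python
from __future__ import annotations

import gzip
import tempfile
from pathlib import Path


def write_value(base: Path, key: str, data: bytes, compression: str | None = None) -> Path:
    """Write a value, adding a .gz suffix when gzip compression is enabled."""
    if compression == "gzip":
        path = base / f"{key}.gz"
        path.write_bytes(gzip.compress(data))
    else:
        path = base / key
        path.write_bytes(data)
    return path


def read_value(path: Path) -> bytes:
    """Read a value back, inferring the codec from the file suffix."""
    raw = path.read_bytes()
    return gzip.decompress(raw) if path.suffix == ".gz" else raw


with tempfile.TemporaryDirectory() as tmp:
    stored = write_value(Path(tmp), "job_id", b"job-123", compression="gzip")
    print(stored.name)         # job_id.gz
    print(read_value(stored))  # b'job-123'
```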

Try 1:

image

Try 2:
image

Minio:
image

~/D/O/r/airflow ❯❯❯ cd ~/Downloads
~/Downloads ❯❯❯ open result.gz


@ashb

ashb commented Jun 9, 2026

Member
[common.io]
store_objectstorage_path = s3://conn_id@mybucket/task-state/

This feels subtly wrong. Why isn't this "just" using the existing configuration for the Object storage?

@amoghrajesh

Contributor Author

@ashb I can imagine a case where someone ONLY wants to use a custom backend for the task store + asset store and not for XComs. Both should be independently configurable, I think.

kaxil
kaxil previously requested changes Jun 9, 2026

@kaxil kaxil left a comment

Member

Let's actually rethink this, since on the other PR we are saying we want to limit "State Store" to certain MB anyway.

Going to defer this to after 3.3

@ianbuss ianbuss left a comment

Contributor

Optional clarification in doc

Comment thread providers/common/io/docs/state_store_backend.rst Outdated
@amoghrajesh

Contributor Author

Needs a few more updates due to #68438

@kaxil kaxil dismissed their stale review June 25, 2026 14:36

Outdated

Comment thread providers/common/io/src/airflow/providers/common/io/state_store/backend.py Outdated
Comment thread providers/common/io/docs/state_store_backend.rst Outdated
Comment thread providers/common/io/provider.yaml Outdated
@vatsrahul1001 vatsrahul1001 removed this from the Airflow 3.3.0 milestone Jun 29, 2026
@amoghrajesh

Contributor Author

Tested it end to end again, works as expected.

@uranusjr

Member

From what I can tell, the config in main is [workers] state_store_backend instead of [state_store] backend. Is this something you intend to change in a different PR? I didn’t find any mention of this discrepancy.

Comment thread providers/common/io/src/airflow/providers/common/io/version_compat.py Outdated
Comment thread providers/common/io/src/airflow/providers/common/io/state_store/backend.py Outdated
@amoghrajesh

amoghrajesh commented Jun 30, 2026

Contributor Author

From what I can tell, the config in main is [workers] state_store_backend instead of [state_store] backend. Is this something you intend to change in a different PR? I didn’t find any mention of this discrepancy.

Good catch, both sections exist but serve different roles. [state_store] backend is the server-side backend (controls how the API server persists state to the metadata DB). [workers] state_store_backend is the worker-side offload backend that StateStoreObjectStorageBackend plugs into; workers use it to write large values directly to object storage and store only a reference in the DB. The docs had the wrong section; fixed in 0461d81
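A rough sketch of the worker-side pattern described here, with dicts standing in for the metadata DB and the object store. The `ref:` marker, the `task-state/` path layout and the function names are all hypothetical; the actual reference format used by the backend is not shown in this thread.

```python
import json

# In-memory stand-ins for the metadata DB and the object store (illustration only).
db: dict[str, str] = {}
object_store: dict[str, bytes] = {}

REF_PREFIX = "ref:"  # hypothetical marker for "this row points at object storage"


def set_state(key: str, value: object, threshold: int = 0) -> None:
    payload = json.dumps(value).encode()
    if threshold > 0 and len(payload) <= threshold:
        db[key] = payload.decode()  # small value stays inline in the DB
        return
    path = f"task-state/{key}"
    object_store[path] = payload    # large value goes to object storage
    db[key] = REF_PREFIX + path     # the DB keeps only the reference


def get_state(key: str) -> object:
    stored = db[key]
    if stored.startswith(REF_PREFIX):
        # Follow the reference and load the payload from object storage.
        return json.loads(object_store[stored[len(REF_PREFIX):]])
    return json.loads(stored)


set_state("job_id", "job-123", threshold=50)
set_state("result", {"rows": list(range(100))}, threshold=50)
print(db["job_id"])  # "job-123"
print(db["result"])  # ref:task-state/result
```

Inline values are JSON-encoded, so they never start with the bare `ref:` marker; a real implementation would likely use a more explicit scheme.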

Comment thread generated/provider_dependencies.json.sha256sum Outdated
Comment thread providers/common/io/docs/state_store_backend.rst Outdated
Comment thread providers/common/io/docs/state_store_backend.rst Outdated

@Lee-W Lee-W left a comment

Member

Mostly nitpicks on tests and private methods

Comment thread providers/common/io/src/airflow/providers/common/io/state_store/backend.py Outdated
Comment thread providers/common/io/src/airflow/providers/common/io/state_store/backend.py Outdated
Comment thread providers/common/io/src/airflow/providers/common/io/state_store/backend.py Outdated
Comment thread providers/common/io/src/airflow/providers/common/io/state_store/backend.py Outdated
Comment thread providers/common/io/src/airflow/providers/common/io/state_store/backend.py Outdated
Comment thread providers/common/io/src/airflow/providers/common/io/state_store/backend.py Outdated
@amoghrajesh

Contributor Author

Mostly nitpicks on tests and private methods

Handled all of Wei's comments.

Comment thread providers/common/io/src/airflow/providers/common/io/state_store/backend.py Outdated

@kaxil kaxil left a comment

Member

Left 1 comment, feel free to merge after addressing

@amoghrajesh amoghrajesh merged commit 1667c6b into apache:main Jun 30, 2026
77 of 78 checks passed
@amoghrajesh amoghrajesh deleted the aip-103-object-storage-backend branch June 30, 2026 11:42
karenbraganz pushed a commit to karenbraganz/airflow that referenced this pull request Jun 30, 2026
…he#68283)

Introduces `StateStoreObjectStorageBackend` in common.io provider.
Workers can now offload task and asset state to any fsspec-supported object store
(S3, GCS, Azure, local FS) instead of routing everything through the metadata DB.

A threshold config lets operators keep small values in the DB and only offload
larger ones. Compression is optional. The backend degrades transparently on Airflow < 3.3.

7 participants