Introducing object store backend for task and asset store #68283
[common.io]
store_objectstorage_path = s3://conn_id@mybucket/task-state/
This feels subtly wrong. Why isn't this "just" using the existing configuration for the Object storage?
@ashb I can imagine a case where someone ONLY wants to use a custom backend for task store + asset store and not for XComs. Both should be independently configurable, I think.
kaxil
left a comment
Let's actually rethink this, since on the other PR we are saying we want to limit "State Store" to certain MB anyway.
Going to defer this to after 3.3
ianbuss
left a comment
Optional clarification in doc
Needs a little more update due to #68438
Tested it end to end again, works as expected.
From what I can tell, the config in main is
Good catch, both sections exist but serve different roles.
Lee-W
left a comment
Mostly nitpicks on tests and private methods
Handled all in comments from wei
kaxil
left a comment
Left 1 comment, feel free to merge after addressing
…he#68283) Introduces `StateStoreObjectStorageBackend` in common.io provider. Workers can now offload task and asset state to any fsspec-supported object store (S3, GCS, Azure, local FS) instead of routing everything through the metadata DB. A threshold config lets operators keep small values in the DB and only offload larger ones. Compression is optional. The backend degrades transparently on Airflow < 3.3.
Built atop: #68274 (only last commit is relevant)
What
AIP-103 introduced task and asset state management backed by the metadata database. For deployments that need to store large state values in a custom backend, this PR adds an object storage alternative via `apache-airflow-providers-common-io`, similar to the XCom backend documented at https://airflow.apache.org/docs/apache-airflow-providers-common-io/stable/xcom_backend.html
Current behaviour
Task and asset state can only be stored in the Airflow metadata database (`MetastoreStateBackend`). A custom backend could offload values to S3, GCS, Azure Blob, or any other object storage, but no ready-made backend exists to do so.
Proposed change
Adds `StoreObjectStorageBackend` to `providers/common/io`, which stores task and asset state on object storage using `ObjectStoragePath`. The backend supports:
- `store_objectstorage_threshold = 0` (default) offloads all values to the backend; set a positive byte count to keep small values inline in the database and only offload large ones.
- `store_objectstorage_compression = gzip` (or any fsspec-supported codec).
- Task state keyed by `(dag_id, run_id, task_id, map_index)`, making this backend suitable for operators using `ResumableJobMixin`.
Enable it by setting in `airflow.cfg` (or env vars using `COMMON_IO` as the section, e.g. `AIRFLOW__COMMON_IO__STATE_STORE_OBJECTSTORAGE_PATH`):
Changes of Note
Values persist until explicitly deleted. Use your object storage provider's lifecycle policies (S3 lifecycle rules, GCS object lifecycle, etc.) to expire old state automatically.
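As a rough illustration of the lifecycle approach for S3, the sketch below builds a lifecycle configuration that expires objects under a state prefix. The prefix `task-state/`, the bucket name `mybucket`, and the 30-day retention are assumptions for illustration only; the helper `build_expiry_rule` is hypothetical and not part of this PR.

```python
def build_expiry_rule(prefix: str, days: int) -> dict:
    """Build an S3 lifecycle configuration expiring objects under `prefix`.

    The returned dict has the shape accepted by boto3's
    put_bucket_lifecycle_configuration(LifecycleConfiguration=...).
    """
    return {
        "Rules": [
            {
                "ID": f"expire-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# Assumed values: adjust the prefix and retention to your deployment.
config = build_expiry_rule("task-state/", 30)

# Applying it needs boto3 and credentials, so it is left commented out:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="mybucket", LifecycleConfiguration=config
# )
```

Other object stores have their own equivalents, so the dict shape above applies to S3-compatible APIs only.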
The backend requires Airflow >= 3.3 (when `BaseStoreBackend` and the `AssetScope`/`TaskScope` types were introduced).
User implications / backcompat
New opt-in feature, no changes to existing deployments. Requires adding a connection in Airflow with the object storage credentials (e.g. `endpoint_url` for minio).
Testing
Verified end-to-end using the `example_task_store` example Dag with minio running locally. Three scenarios were tested:
Setup
- Start minio: `docker run -p 29000:9000 -p 29001:9001 -e MINIO_ROOT_USER=minioadmin -e MINIO_ROOT_PASSWORD=minioadmin quay.io/minio/minio server /data --console-address ":9001"`
- Create the bucket `airflow-task-state` via minio console (http://localhost:29001).
- Add a connection `minio` with `conn_type=aws`, `login=minioadmin`, `password=minioadmin`, `extra={"endpoint_url": "http://host.docker.internal:29000"}` (use `host.docker.internal` when running inside Breeze/Docker).
Scenario 1 — All values offloaded (threshold=0)
Verified: files appear in MinIO under `task-state/example_task_store/.../`. On try 1, the job ID is written and the task intentionally fails. On try 2, the job ID is read back from MinIO and the task reattaches to the existing job.
Try 1:

Try 2:
Task store tab:
Minio:

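The write-then-reattach flow in Scenario 1 can be sketched as below. This is an illustration only: `TaskKey`, `InMemoryStateStore`, and `run_try` are hypothetical stand-ins, not the backend's or `ResumableJobMixin`'s actual API. The point it shows is that the key omits the try number, so a second attempt sees what the first one wrote.

```python
from typing import NamedTuple, Optional


class TaskKey(NamedTuple):
    # Hypothetical key type; note there is no try number in it.
    dag_id: str
    run_id: str
    task_id: str
    map_index: int


class InMemoryStateStore:
    """Hypothetical stand-in for a state backend, backed by a dict."""

    def __init__(self) -> None:
        self._data: dict = {}

    def get(self, scope: TaskKey, name: str) -> Optional[str]:
        return self._data.get((scope, name))

    def set(self, scope: TaskKey, name: str, value: str) -> None:
        self._data[(scope, name)] = value


def run_try(store: InMemoryStateStore, scope: TaskKey) -> str:
    job_id = store.get(scope, "job_id")
    if job_id is None:
        # First attempt: record the job ID, then fail on purpose.
        job_id = f"job-for-{scope.task_id}"
        store.set(scope, "job_id", job_id)
        raise RuntimeError("simulated failure after submitting the job")
    # Later attempt: the stored job ID is found, so reattach.
    return f"reattached to {job_id}"


store = InMemoryStateStore()
scope = TaskKey("example_task_store", "manual__1", "submit", -1)
try:
    run_try(store, scope)  # try 1 writes the job ID and fails
except RuntimeError:
    pass
result = run_try(store, scope)  # try 2 reads it back
```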
Scenario 2 — Threshold: small values stay in DB, large ones go to MinIO
Result: Out of the 4 task store values, only `result` is greater than 50 bytes and is stored in the minio backend. The rest stay in the database.
Try 1:

Try 2:

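The threshold routing observed in Scenario 2 can be approximated with the sketch below. It assumes values are measured as JSON-encoded bytes and that values strictly smaller than the threshold stay in the database; the backend's actual serialization and boundary comparison may differ, and `choose_destination` is a hypothetical name.

```python
import json


def choose_destination(value, threshold: int) -> str:
    """Decide where a value would go under the assumed threshold rule."""
    # Assumption: size is the length of the JSON-encoded payload in bytes.
    payload = json.dumps(value).encode("utf-8")
    # threshold == 0 means "offload everything", per the PR description.
    if threshold > 0 and len(payload) < threshold:
        return "database"
    return "object_storage"


small = choose_destination("job-123", 50)
large = choose_destination("x" * 100, 50)
everything = choose_destination("job-123", 0)
```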
Scenario 3 — Compression with gzip
Verified: files appear in MinIO with a `.gz` suffix (e.g. `task-state/example_task_store/.../job_id.gz`). Decompression is inferred automatically on read (`compression="infer"`). Task completes successfully.
Try 1:
Try 2:

Minio:

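To illustrate the suffix-based inference from Scenario 3 without depending on fsspec, the sketch below mimics it with the standard library: the writer appends `.gz` when compressing, and the reader decides whether to decompress from the file suffix. The helper names, the JSON serialization, and the local temporary directory are assumptions for illustration; the backend itself relies on fsspec's `compression="infer"`.

```python
import gzip
import json
import tempfile
from pathlib import Path


def write_state(directory: Path, key: str, value, compress: bool) -> Path:
    """Write a value, adding a .gz suffix when compression is enabled."""
    payload = json.dumps(value).encode("utf-8")
    if compress:
        path = directory / f"{key}.gz"
        path.write_bytes(gzip.compress(payload))
    else:
        path = directory / key
        path.write_bytes(payload)
    return path


def read_state(path: Path):
    """Read a value, inferring gzip decompression from the file suffix."""
    raw = path.read_bytes()
    if path.suffix == ".gz":
        raw = gzip.decompress(raw)
    return json.loads(raw)


with tempfile.TemporaryDirectory() as tmp:
    written = write_state(Path(tmp), "job_id", "job-123", compress=True)
    written_name = written.name
    round_tripped = read_state(written)
```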