Fetch deadline callback context via Execution API at runtime by seanghaeli · Pull Request #66608 · apache/airflow

seanghaeli · 2026-05-08T22:02:58Z

Summary

Replace the simple context workaround from #55241 that stored serialized context in trigger kwargs (DB). Now that #55068 gives the triggerer API access, fetch the DagRun at execution time via the Execution API and build context fresh.

This avoids DB bloat from serialized context, provides fresh (not stale) context, and builds a richer context dict including logical_date, ds, ts, conf, data_interval_start/end, and the deadline info.

Changes

deadline.py: Remove get_simple_context(). Store only identifiers (dag_id, run_id, deadline_id, deadline_time) in callback kwargs.
callback.py: Add _build_context() that fetches DagRun via SUPERVISOR_COMMS.asend(GetDagRun(...)). Backward compat: old callbacks with "context" key still work.
triggerer_job_runner.py: Add GetDagRun to ToTriggerSupervisor union, DagRunResult to ToTriggerRunner union, handler in _handle_request.
callback_supervisor.py: Add GetDagRun to CallbackToSupervisor union + handler for executor callback path.
Tests: Updated deadline model tests, added context-fetching test, backward-compat test, GetDagRun handler test.
serialization/definitions/dag.py: Also fixes scheduled DAG runs silently getting no deadline. This is a pre-existing bug on main (resolving a VariableInterval deadline issued a commit mid-scheduler, so the deadline was dropped), fixed here because it lives in the same code block this PR rewrites.
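
The runtime fetch described in the callback.py entry can be sketched roughly as follows. This is a minimal standalone illustration, not the PR's code: `GetDagRun` and `DagRunResult` here are local stand-in dataclasses, `FakeComms` is a hypothetical replacement for `SUPERVISOR_COMMS`, and the `"id"` key inside the deadline dict is an assumption.

```python
import asyncio
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class GetDagRun:
    """Stand-in for the comms request message (hypothetical shape)."""
    dag_id: str
    run_id: str


@dataclass
class DagRunResult:
    """Stand-in for the response; only the fields named in the PR text."""
    dag_id: str
    run_id: str
    logical_date: datetime
    data_interval_start: datetime
    data_interval_end: datetime
    conf: dict[str, Any] = field(default_factory=dict)


class FakeComms:
    """Hypothetical replacement for SUPERVISOR_COMMS so the sketch runs alone."""

    async def asend(self, msg: GetDagRun) -> DagRunResult:
        day = datetime(2026, 5, 8, tzinfo=timezone.utc)
        return DagRunResult(msg.dag_id, msg.run_id, day, day, day, {"k": "v"})


async def build_context(kwargs: dict[str, Any], comms: FakeComms) -> dict[str, Any]:
    # Backward compat: callbacks queued before the change carry a stored context.
    if "context" in kwargs:
        return kwargs["context"]
    # New style: only identifiers are stored, so fetch the DagRun at execution time.
    dag_run = await comms.asend(GetDagRun(kwargs["dag_id"], kwargs["run_id"]))
    return {
        "dag_run": dag_run,
        "run_id": dag_run.run_id,
        "logical_date": dag_run.logical_date,
        "ds": dag_run.logical_date.strftime("%Y-%m-%d"),
        "ts": dag_run.logical_date.isoformat(),
        "conf": dag_run.conf,
        "data_interval_start": dag_run.data_interval_start,
        "data_interval_end": dag_run.data_interval_end,
        "deadline": {
            "id": kwargs["deadline_id"],
            "deadline_time": kwargs["deadline_time"],
        },
    }


new_style = {
    "dag_id": "example_dag",
    "run_id": "r1",
    "deadline_id": "d-1",
    "deadline_time": "2026-05-09T00:00:00+00:00",
}
fresh = asyncio.run(build_context(new_style, FakeComms()))
legacy = asyncio.run(build_context({"context": {"run_id": "old"}}, FakeComms()))
```

In this sketch the legacy path returns the stored dict untouched, while the new path never reads context from the stored kwargs at all.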

Testing

Ran in Breeze to verify the comms plumbing works e2e:

Confirmed GetDagRun round-trips through the triggerer's ToTriggerSupervisor → _handle_request → DagRunResult response path without breaking existing trigger handling
Verified SUPERVISOR_COMMS.asend() is the correct async calling pattern — uses TriggerCommsDecoder from init_comms() with async lock for coroutine safety in the trigger event loop
Verified the DagRun generated model has all fields accessed in _build_context: logical_date, data_interval_start, data_interval_end, conf
Backward compat confirmed: old callbacks with stored "context" key (queued before this change) still work

Motivation

Per @ramitkataria's feedback on #64984: context should not be stored in the DB. The triggerer now has API access (#55068), so fetch it at runtime like tasks do.

Related

Addresses feedback from "Add Jinja template rendering and richer context for async deadline callbacks" (#64984, closed).
Follows "Re-enable start_from_trigger feature with rendering of template fields" (#55068, triggerer API access) and reverts the context-storage approach from "Include simple context in triggerer async callback" (#55241).

Important

🛠️ Maintainer triage note for @seanghaeli · by @potiuk · 2026-06-22 06:31 UTC

Helpful heads-up from the maintainers — please address before this PR can be reviewed (see the Pull Request quality criteria):

❌ Merge conflicts — branch conflicts with main; rebase onto the latest main and push.

The ball is in your court — you've been assigned to this PR. Fix the above, then mark it Ready for review.

Automated triage: may be imperfect; a maintainer takes the next look.

seanghaeli · 2026-05-08T22:13:00Z

@ramitkataria incorporated your feedback from #64984
@ferruzzi

your reviews would be much appreciated!

ferruzzi

Just a quick question, otherwise LGTM.

ferruzzi

Approved pending CI passing

ramitkataria

Thanks for pivoting away from the previous approach. This is in the right direction but I think there's still work to be done. Removing the context from DB is good but like I said in #64984, we should follow the approach used in #55068. I also want to point out that the way context works in ExecutorCallback also needs to be updated because it was using the same "temporary solution" and will break if this PR is merged.

I did a deep dive to reduce the number of iterations we have to go through and here's what I recommend based on my findings:

Context and kwargs:

Let's use the standard Context TypedDict for the context parameter (dag_run, run_id, logical_date, etc., with task-specific fields absent)
For deadline-specific info (deadline_id, deadline_time), let's add those to kwargs, since that's what they defined when registering the callback.

handle_miss (deadline.py):

`{"deadline": {"id": ..., "time": ...}}` goes in callback.data["kwargs"]
Let's not put context or DagRun identifiers in kwargs.

Triggerer path:

In _create_workload (triggerer_job_runner.py), when trigger.task_instance is None but trigger.callback exists with dag_id/run_id in its data, fetch the DagRun and put it in dag_run_data on the workload (same field start_from_trigger uses).
In create_triggers, when the workload has dag_run_data but no ti, build a Context (dag_run, run_id, logical_date, etc.) and set it as an attribute on the trigger instance (e.g. trigger_instance.context = built_context), same pattern as trigger_instance.task_instance = ti.
CallbackTrigger.run() reads self.context instead of popping from kwargs.

Executor path:

Adding GetDagRun to CallbackToSupervisor is good so let's keep that. Use it from inside execute_callback (the subprocess function), not from inside the trigger. When execute_callback detects it needs context (identifiers present on callback.data), it sends GetDagRun via SUPERVISOR_COMMS, builds a Context from the response, and passes it to the user's callback as a separate context parameter.
This matches how tasks work: the subprocess asks for what it needs through comms.

This way, the implementation for context in tasks and callbacks would become similar which is the goal.
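
The executor-path recommendation above could look roughly like the sketch below. It assumes a `data` layout with `dag_id`/`run_id` next to a `kwargs` dict; `fetch_dag_run` is a hypothetical stand-in for sending GetDagRun through SUPERVISOR_COMMS, and the returned fields are made up for illustration.

```python
from typing import Any, Callable


def fetch_dag_run(dag_id: str, run_id: str) -> dict[str, Any]:
    """Hypothetical stand-in for a GetDagRun round-trip over comms."""
    return {
        "dag_id": dag_id,
        "run_id": run_id,
        "logical_date": "2026-05-08T00:00:00+00:00",
    }


def execute_callback(callback: Callable[..., Any], data: dict[str, Any]) -> Any:
    # User kwargs (including the deadline dict) are passed through as-is.
    kwargs = dict(data.get("kwargs", {}))
    # Identifiers on the callback data signal that a context is needed.
    if "dag_id" in data and "run_id" in data:
        dag_run = fetch_dag_run(data["dag_id"], data["run_id"])
        context = {
            "dag_run": dag_run,
            "run_id": dag_run["run_id"],
            "logical_date": dag_run["logical_date"],
        }
        # Context travels as its own parameter, never inside kwargs.
        return callback(context=context, **kwargs)
    return callback(**kwargs)


def notify(context: dict[str, Any], deadline: dict[str, Any], text: str) -> str:
    return f"{text}: run {context['run_id']} missed deadline {deadline['id']}"


message = execute_callback(
    notify,
    {
        "dag_id": "example_dag",
        "run_id": "r1",
        "kwargs": {"deadline": {"id": "d-1", "time": "2026-05-09"}, "text": "Alert"},
    },
)
```

Keeping the context out of kwargs means the stored callback data only ever holds what the user registered plus the deadline info.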

ferruzzi

Withdrawing my approval for now. Ramit has put a lot of thought and planning into this project already so I'll defer to his thoughts here. Sorry for the churn.

When a deadline callback fires, it now fetches DagRun context from the Execution API and passes it to the callback as context["deadline"] and context["dag_run"]. This enables Jinja template rendering ({{ dag_run }}, {{ deadline.deadline_time }}) and enriched notifications.

Key changes:

- Add GetDagRun comms message and handler in callback supervisor
- Wire dag_id/run_id/deadline_id/deadline_time through the workload path
- Build context dict from DagRun response via build_context_from_dag_run
- Fail callback (for retry) when context fetch fails instead of running degraded
- Restrict token:workload to read-only Execution API routes (GET only)
- Add hash verification in _ensure_bundle_module_registered to prevent wrong-bundle loading when multiple bundles share a filename
- Run bundle initialization off triggerer event loop (asyncio.to_thread)
- Extract _load_mangled_module to airflow._shared.module_loading as a shared helper for both triggerer and executor paths

Plain function callbacks can now use Jinja2 templates in their kwargs: callback_kwargs={"text": "DAG {{ dag_run.dag_id }} missed at {{ deadline.deadline_time }}"} String values containing `{{` are rendered against the callback context (which includes dag_run, deadline, run_id, ds, ts, etc.) before the callback is invoked. Non-string values, the context key itself, and strings without template markers pass through unchanged. Render failures fall back to the raw string with a warning. Notifier classes (BaseNotifier subclasses) are unaffected — they handle their own rendering via __call__.
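
The pass-through rules in this commit message can be illustrated with a small standalone sketch. Note the swap: a regex substitution handling only dotted lookups stands in for Jinja2 so the example needs no third-party packages. `render_kwargs_sketch` is a hypothetical name, not the shared helper added by the PR.

```python
import logging
import re
from typing import Any

log = logging.getLogger(__name__)

# Matches simple expressions such as "{{ dag_run.dag_id }}" (dotted names only).
_EXPR = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def _lookup(path: str, context: dict[str, Any]) -> Any:
    """Resolve a dotted path against nested dicts or object attributes."""
    value: Any = context
    for part in path.split("."):
        value = value[part] if isinstance(value, dict) else getattr(value, part)
    return value


def render_kwargs_sketch(kwargs: dict[str, Any], context: dict[str, Any]) -> dict[str, Any]:
    rendered: dict[str, Any] = {}
    for key, value in kwargs.items():
        # Non-strings, the context key, and marker-free strings pass through.
        if key == "context" or not isinstance(value, str) or "{{" not in value:
            rendered[key] = value
            continue
        try:
            rendered[key] = _EXPR.sub(lambda m: str(_lookup(m.group(1), context)), value)
        except (KeyError, AttributeError) as exc:
            # Render failure: warn and fall back to the raw string.
            log.warning("Could not render %r (%s); using the raw string", key, exc)
            rendered[key] = value
    return rendered


ctx = {"dag_run": {"dag_id": "etl"}, "deadline": {"deadline_time": "2026-05-09"}}
out = render_kwargs_sketch(
    {
        "text": "DAG {{ dag_run.dag_id }} missed at {{ deadline.deadline_time }}",
        "retries": 3,
        "plain": "no markers here",
        "bad": "{{ missing.field }}",
    },
    ctx,
)
```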

Add resilience fixes and regression tests across the deadline callback subsystem surfaced by deep QA:

- triggerer: isolate per-trigger workload build failures so one bad trigger cannot abort the whole batch
- scheduler: row-lock executor-callback pickup with skip_locked
- models: fix DeadlineAlert.__repr__ raising on JSON dict intervals; make handle_miss idempotent; harden handle_event against uncoercible event bodies
- task-sdk: render callback kwargs safely; guard context construction against missing/edge-case DagRun fields

Adds comprehensive adversarial and resilience test coverage for the callback state machine, scheduler queries, migration, and context build.

…er prohibit_commit)

Scheduled DagRuns silently got no deadline and no callback: deadline creation runs inside create_dagrun under the scheduler's prohibit_commit guard, and two code paths issued a commit on that guarded session, which the guard rejects (UNEXPECTED COMMIT) and the per-alert except block then swallowed. Only manually-triggered runs worked.

Two causes, both fixed:

- _process_dagrun_deadline_alerts wrapped each alert in session.begin_nested(); the SAVEPOINT release commits. Replace it with a plain try/except for per-alert isolation (the only DB mutation is the final session.add, so nothing partial needs rolling back; the caller's outer transaction persists the Deadline).
- VariableInterval.resolve() -> Variable.get() -> MetastoreBackend.get_variable (@provide_session) opens a scoped session that commits on exit (same thread-local session as the scheduler's). Read the Variable on the passed-in session instead and validate via the new VariableInterval.coerce_to_timedelta(), split out of resolve().

Tests: update the create_dagrun VariableInterval tests to seed real Variables (the scheduler path no longer calls Variable.get), add coerce_to_timedelta unit tests.
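
The per-alert isolation described here (a plain try/except with a single add at the end, no nested transaction) might be sketched as below. `Alert`, `Deadline`, and the `pending` list standing in for `session.add` are hypothetical simplifications.

```python
import logging
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any

log = logging.getLogger(__name__)


@dataclass
class Alert:
    name: str
    interval_seconds: Any  # may be malformed, e.g. a non-numeric string


@dataclass
class Deadline:
    alert_name: str
    deadline_time: datetime


def create_deadlines(alerts: list[Alert], reference: datetime, pending: list[Deadline]) -> None:
    """Build one Deadline per alert; `pending` stands in for session.add."""
    for alert in alerts:
        try:
            interval = timedelta(seconds=float(alert.interval_seconds))
            deadline = Deadline(alert.name, reference + interval)
        except (TypeError, ValueError):
            # One bad alert is logged and skipped; no commit or savepoint involved.
            log.exception("Could not create deadline for alert %s", alert.name)
            continue
        # The only mutation happens last, so there is nothing partial to undo.
        pending.append(deadline)


reference = datetime(2026, 5, 8, tzinfo=timezone.utc)
pending: list[Deadline] = []
create_deadlines(
    [Alert("a", 60), Alert("bad", "oops"), Alert("c", "120")], reference, pending
)
```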

The synchronous executor path renders Jinja templates in string-valued callback kwargs (e.g. {{ dag_run.run_id }}), but the asynchronous triggerer path (CallbackTrigger.run) passed callback_kwargs through verbatim — a silent sync/async divergence where the same DeadlineAlert rendered on one path and not the other. Extract the rendering into a shared helper (shared/template_rendering.render_callback_kwargs) and call it from CallbackTrigger.run so both paths render identically. airflow-core already depends on the template_rendering shared lib. Adds a trigger test asserting Jinja kwargs render and non-template / non-string kwargs pass through untouched.

- Point the executor path (callback_supervisor) at the shared render_callback_kwargs helper instead of its own local copy, so the sync and async callback paths render kwargs through one implementation and can't drift. Declare apache-airflow-shared-template-rendering in task-sdk's shared distributions (force-include + wheel packaging already covered it).
- Update the deadline-alerts docs: Jinja IS now rendered on string callback kwargs (both paths); the old 'Airflow does not run jinja-templating on the kwargs' note was stale after the rendering support landed.
- Repoint the supervisor render tests at the shared helper.

…m_dag_run

Both callback paths built context['deadline'] separately after calling the shared build_context_from_dag_run helper (executor set it directly; triggerer popped _deadline and assigned). Move the deadline dict into build_context_from_dag_run via an optional deadline= arg so the executor and triggerer paths produce identical context by construction. Thread the dict through _fetch_and_build_context (executor) and pass the popped _deadline (triggerer). Update tests to assert the deadline is passed into the helper, and add positive coverage that deadline= populates context['deadline'].

When an executor-run deadline callback cannot fetch its DagRun context (an API blip, network partition, or token expiry), it previously exited 1 and was marked terminally FAILED, permanently dropping the callback. The triggerer path retries such transient failures by re-evaluating on the next loop; the executor path now matches. The callback subprocess exits with a distinct code (EX_TEMPFAIL, 75) when the context fetch fails. supervise_callback maps that code to a new CallbackContextFetchError; the LocalExecutor worker reports the workload's retry_state (PENDING) for that error instead of failure_state, and the scheduler resets the callback to PENDING so the next loop re-picks and retries it. Any other non-zero exit still fails the callback terminally.
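
A rough sketch of the exit-code mapping described above, under simplifying assumptions: the state enum and helper names are hypothetical, and only the value 75 (EX_TEMPFAIL) and the PENDING/FAILED outcomes come from the commit message.

```python
from enum import Enum

# sysexits.h "temporary failure"; the same value as os.EX_TEMPFAIL on Unix.
EX_TEMPFAIL = 75


class CallbackState(str, Enum):
    SUCCESS = "success"  # assumed name for the happy path
    PENDING = "pending"
    FAILED = "failed"


class CallbackContextFetchError(Exception):
    """Local stand-in: the subprocess could not fetch its DagRun context."""


def check_exit_code(code: int) -> None:
    """Translate a subprocess exit code into an exception (or nothing)."""
    if code == 0:
        return
    if code == EX_TEMPFAIL:
        raise CallbackContextFetchError("context fetch failed; retryable")
    raise RuntimeError(f"callback exited with code {code}")


def state_after_run(exit_code: int) -> CallbackState:
    try:
        check_exit_code(exit_code)
    except CallbackContextFetchError:
        # Transient failure: hand the callback back for another attempt.
        return CallbackState.PENDING
    except RuntimeError:
        # Any other non-zero exit stays terminal.
        return CallbackState.FAILED
    return CallbackState.SUCCESS
```

Using a dedicated exit code keeps the retry decision out of the subprocess: it only reports what happened, and the caller decides what state follows.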

The test suite for this PR had grown to ~8.2k lines against ~1.3k lines of production code (≈6:1). Much of it tested pre-existing logic (Trigger assign_unassigned / SKIP LOCKED row-locking, FK cascade behavior), third-party/stdlib behavior (Pydantic/Fernet round-trips, strftime, json coercion), structural SQL string-matching, or asserted the absence of a behavior — none of which is changed by this PR. Pruned to the tests that fail without a production line this PR adds, per the repo standard of covering exactly what the change introduces. Removed two wholly out-of-scope files (callback concurrency, executor adversarial) and trimmed the rest; swapped a datetime.now() use for time_machine. Net ~-4.2k test lines; every retained test passes.

Two static-check failures on the prior pushes:

- local_executor.py imports CallbackContextFetchError from airflow.sdk inside the worker except-block (the same lazy worker-isolation pattern base_executor already uses). The check-sdk-imports hook flags it; mark it intentional with noqa: SDK001.
- coerce_to_timedelta took 'object', which makes int(value) fail mypy's call-overload check. Narrow the parameter to str | int | float | None (what a Variable value actually is); non-numeric input is still handled by the except.

Three PR-added tests in test_dagrun.py used datetime.now() for a far-future FIXED_DATETIME, which violates the repo's time_machine/no-now() test standard (and was flagged in review). They only need a date past the wall clock so the deadline never fires, so a fixed 2099-01-01 literal is both compliant and deterministic.

The fixed-date conversion used 2099-01-01, which overflows MySQL 8.0's TIMESTAMP range (max 2038-01-19) and failed the MySQL core serialization suite with 'Incorrect datetime value' on the deadline_time column. Use 2037-01-01 instead: still far enough ahead that the FIXED_DATETIME deadline never fires, but valid on all backends. Verified on MySQL.
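
The range problem can be checked with a few lines of Python. This is only an illustration of the 32-bit epoch limit behind MySQL's TIMESTAMP type; `fits_mysql_timestamp` is a hypothetical helper, not part of the PR.

```python
from datetime import datetime, timezone

# MySQL TIMESTAMP stores seconds since the Unix epoch in a signed 32-bit range.
MYSQL_TIMESTAMP_MIN = datetime.fromtimestamp(1, tz=timezone.utc)
MYSQL_TIMESTAMP_MAX = datetime.fromtimestamp(2**31 - 1, tz=timezone.utc)


def fits_mysql_timestamp(value: datetime) -> bool:
    """Return True if a timezone-aware datetime fits a MySQL TIMESTAMP column."""
    return MYSQL_TIMESTAMP_MIN <= value <= MYSQL_TIMESTAMP_MAX


too_far = datetime(2099, 1, 1, tzinfo=timezone.utc)
safe = datetime(2037, 1, 1, tzinfo=timezone.utc)
```

Here `fits_mysql_timestamp(too_far)` is False and `fits_mysql_timestamp(safe)` is True, which matches the switch from 2099-01-01 to 2037-01-01.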

This PR's callback comms only use GetDagRun/GetConnection/GetVariable, never GetXCom, so the token:workload scope changes on the XCom execution-API route (and their TestWorkloadTokenScopeEnforcement tests) had no consumer here — they are apache#66611's deliverable (callback XCom read access) and were riding along. Revert xcoms.py and test_xcoms.py to main so this PR stays scoped to callback context fetching; apache#66611 owns the XCom route + comms changes.

…standard

Cut ~1,400 added test lines (4,006 -> ~2,600, ~2:1 test:prod) per the repo standard: 100% of what the PR changes, no more; every test must fail without the change; no testing of pre-existing/stdlib/third-party logic; consolidate variations with parametrize.

- Delete three over-coverage 'wave' suites that tested pre-existing flow or re-implemented logic in mocks: test_callback_upgrade_seams.py (x2), test_callback_bundle_init_faults.py.
- test_callback_supervisor.py: drop tests of pre-existing supervisor plumbing (logging/upload/bundle/handle_request dispatch) and collapse the duplicate render-kwargs suites; keep the context-fetch, exit-code-mapping, timeout, and module-registration tests that defend this PR's lines.
- Trim resilience/reentrancy/adversarial files to the single test that defends each new production branch (begin_nested isolation, PENDING requeue, flag_modified persistence, handle_miss idempotency).
- Drop tests that exercised pydantic round-trip or re-implemented route scope logic in fixtures rather than the real code path.
- Collapse over-parametrized validation/equality cases.

Every retained test maps to a distinct production line changed by this PR.

…ider docs)

The scheduler-side VariableInterval resolver read only the variable table, bypassing AIRFLOW_VAR_* env vars and secrets backends. A Variable living there resolved to None, and the per-alert except silently dropped the deadline. Iterate the secrets backends (env -> secrets -> metastore) like Variable.get, passing the scheduler session to the metastore backend so the DB read stays commit-safe under prohibit_commit. Fail loudly (isolated per-alert) when nothing resolves, instead of a silent skip. Tests: replace the dead Variable.get mock with a real seeded Variable row, and add a regression test that a VariableInterval backed only by an AIRFLOW_VAR_* env var resolves and creates the Deadline.
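
The lookup order described here (env, then secrets backends, then metastore, failing loudly when nothing resolves) can be sketched as a simple chain. The backend classes below are hypothetical stand-ins; only the `AIRFLOW_VAR_` prefix with an upper-cased key mirrors Airflow's environment-variable convention.

```python
import os
from typing import Optional, Protocol


class VariableBackend(Protocol):
    def get_variable(self, key: str) -> Optional[str]: ...


class EnvBackend:
    """Reads AIRFLOW_VAR_<KEY> from the process environment."""

    def get_variable(self, key: str) -> Optional[str]:
        return os.environ.get(f"AIRFLOW_VAR_{key.upper()}")


class DictBackend:
    """Stand-in for a secrets backend or the metastore (hypothetical)."""

    def __init__(self, values: dict[str, str]) -> None:
        self._values = values

    def get_variable(self, key: str) -> Optional[str]:
        return self._values.get(key)


def resolve_variable(key: str, backends: list[VariableBackend]) -> str:
    # First backend that returns a value wins, so order defines precedence.
    for backend in backends:
        value = backend.get_variable(key)
        if value is not None:
            return value
    # Fail loudly instead of silently resolving to None.
    raise LookupError(f"Variable {key!r} not found in any backend")


os.environ["AIRFLOW_VAR_DEADLINE_MINUTES"] = "30"
chain = [EnvBackend(), DictBackend({}), DictBackend({"other_key": "5"})]
from_env = resolve_variable("deadline_minutes", chain)
from_store = resolve_variable("other_key", chain)
```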

…ue-key race, unrelated to this PR)

callback is a Trigger-model attr, not on BaseEventTrigger — read via getattr so asset-only triggers and spec'd test mocks don't AttributeError. Fixes watched_assets test failures.

The in-process Execution API test harness overrides auth dependencies (_jwt_bearer, has_*_access) with always_allow so dry-run clients with an empty token can call the routes. This PR added a router-level Security(require_auth, scopes=...) dependency to the variables/connections routes, but require_auth was not in the override map. With token='', the real require_auth path ran on the in-process event loop and never returned cleanly, so the lifespan could not close and test_processor / test_triggerer_job tests timed out (Error while closing in-process execution API lifespan -> TimeoutError). Add require_auth to the in-process override map, mirroring _jwt_bearer.

ashb · 2026-06-23T15:54:18Z

 @router.get(
    "/{connection_id}",
+    dependencies=[
+        Security(require_auth, scopes=["token:execution", "token:workload"]),


@seanghaeli I'm sorry I didn't catch this earlier, but this change is a security regression, so this part at least (if not the entire PR?) needs reverting.

token:workload is essentially used for long-lived tokens (~24hrs) while the TI is in the queued state: after the executor queues the task, a worker calls the TI /run endpoint to exchange it for a short-lived (5-10 mins) token that has more permissions.

The primary driver for this change was to make the tokens that are visible via workers (either in the Celery message bus, or in the KE pod spec itself) usable only once (i.e. they can't be replayed, which is handled by the TI state transition requirements) and for a single thing (just for calling the run endpoint).

What's the right fix here? Between this comment and Kaxil's, is there a way to make this work within the existing token types or do we need a new token type entirely, maybe, instead of trying to shoe-horn what he is trying to do into a system that wasn't built for it?

Also, I clearly need to brush up on the tokens and their intended uses. I didn't know about that intentional down-scoping.

Also, maybe this is a hint from the universe that we need tests around this to prevent accidental (or at least un-discussed) token scope changes? Let me see what I can come up with and I'll tag you in it for a review.

Do we need a new token for this kind of exchange @ashb? Also, the code/functionality that should invalidate the workload token after the first use (since we only intend it to be used once, for the exchange for the short-lived token) doesn't seem to be running; otherwise testing would have caught that here? Anyhow, I think we should circle back on this one and regroup a bit.

Shall we revert this one? @seanghaeli @ferruzzi?

Yes, we should revert to unblock 3.3.0b2, and we can figure out a solution for this in the meantime.

boring-cyborg Bot added area:deadline-alerts AIP-86 (former AIP-57) area:task-sdk area:Triggerer labels May 8, 2026

seanghaeli marked this pull request as ready for review May 9, 2026 03:24

seanghaeli requested review from XD-DENG, amoghrajesh, ashb, dstandish, hussein-awala and kaxil as code owners May 9, 2026 03:24

potiuk added the "ready for maintainer review" label (set after triaging when all criteria pass) May 11, 2026

ferruzzi reviewed May 12, 2026

Comment thread airflow-core/src/airflow/triggers/callback.py Outdated

ferruzzi approved these changes May 12, 2026

seanghaeli force-pushed the ghaeli/callback-context-execution-api branch from c93d733 to 82efc3e Compare May 13, 2026 19:27

ramitkataria reviewed May 15, 2026

Comment thread airflow-core/src/airflow/triggers/callback.py Outdated

Comment thread airflow-core/src/airflow/triggers/callback.py Outdated

Comment thread airflow-core/src/airflow/models/deadline.py Outdated

ferruzzi previously requested changes May 15, 2026

seanghaeli force-pushed the ghaeli/callback-context-execution-api branch 2 times, most recently from 023f30b to 8fddb2b Compare May 22, 2026 23:10

seanghaeli requested review from dheerajturaga, o-nikolas and pierrejeambrun as code owners May 22, 2026 23:10

potiuk removed the "ready for maintainer review" label (set after triaging when all criteria pass) May 24, 2026

seanghaeli mentioned this pull request May 25, 2026

Add XCom read access to callback supervisor comms channel #66611

Closed

1 task

seanghaeli requested review from bbovenzi, choo121600, guan404ming, ryanahamilton, shubhamraj-git and vatsrahul1001 as code owners May 26, 2026 07:15

Sean Ghaeli added 24 commits June 23, 2026 03:28

Retrigger CI

dbe8e66

Retrigger CI after infra timeouts (kind/gradle download 504)

a25a2e2

Reformat: collapse wrapped isinstance to satisfy ruff-format

e028ab9

Re-trigger CI (transient boto3 inventory 403 in unrelated amazon-provider docs)

7d96957

Re-trigger CI (flaky example_trigger_controller_dag e2e: dag_run unique-key race, unrelated to this PR)

18d7f67

Fix rebase fallout: guard trigger.callback access in _create_workload

c67b2c2

callback is a Trigger-model attr, not on BaseEventTrigger — read via getattr so asset-only triggers and spec'd test mocks don't AttributeError. Fixes watched_assets test failures.

Re-trigger CI

ae4941a

Re-trigger CI

c99bfb5

ashb reviewed Jun 23, 2026

This was referenced Jun 23, 2026

Revert "Fetch deadline callback context via Execution API at runtime (#66608)" #68909

Merged

Add API endpoint and UI for deadline callback log visibility #66610

Closed

Resolve VariableInterval deadlines safely at DagRun creation #68917

Open

ferruzzi mentioned this pull request Jun 23, 2026

Add token scope tests for Execution API routes. #68918

Merged

seanghaeli mentioned this pull request Jun 23, 2026

Make deadline reads and serialization robust to dynamic/malformed intervals #68919

Open

o-nikolas merged 24 commits into apache:main from aws-mwaa:ghaeli/callback-context-execution-api
