Skip to content

Fetch deadline callback context via Execution API at runtime#66608

Merged
o-nikolas merged 24 commits into
apache:mainfrom
aws-mwaa:ghaeli/callback-context-execution-api
Jun 23, 2026
Merged

Fetch deadline callback context via Execution API at runtime#66608
o-nikolas merged 24 commits into
apache:mainfrom
aws-mwaa:ghaeli/callback-context-execution-api

Conversation

@seanghaeli

@seanghaeli seanghaeli commented May 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Replace the simple context workaround from #55241 that stored serialized context in trigger kwargs (DB). Now that #55068 gives the triggerer API access, fetch the DagRun at execution time via the Execution API and build context fresh.

This avoids DB bloat from serialized context, provides fresh (not stale) context, and builds a richer context dict including logical_date, ds, ts, conf, data_interval_start/end, and the deadline info.

Changes

  • deadline.py: Remove get_simple_context(). Store only identifiers (dag_id, run_id, deadline_id, deadline_time) in callback kwargs.

  • callback.py: Add _build_context() that fetches DagRun via SUPERVISOR_COMMS.asend(GetDagRun(...)). Backward compat: old callbacks with "context" key still work.

  • triggerer_job_runner.py: Add GetDagRun to ToTriggerSupervisor union, DagRunResult to ToTriggerRunner union, handler in _handle_request.

  • callback_supervisor.py: Add GetDagRun to CallbackToSupervisor union + handler for executor callback path.

  • Tests: Updated deadline model tests, added context-fetching test, backward-compat test, GetDagRun handler test.

  • serialization/definitions/dag.py: Also fixes scheduled DAG runs silently getting no deadline — this includes a pre-existing main bug (a VariableInterval deadline committed mid-scheduler and got dropped), kept here because it's the same code block this PR rewrites.

Testing

Ran in Breeze to verify the comms plumbing works e2e:

  • Confirmed GetDagRun round-trips through the triggerer's ToTriggerSupervisor_handle_requestDagRunResult response path without breaking existing trigger handling
  • Verified SUPERVISOR_COMMS.asend() is the correct async calling pattern — uses TriggerCommsDecoder from init_comms() with async lock for coroutine safety in the trigger event loop
  • Verified the DagRun generated model has all fields accessed in _build_context: logical_date, data_interval_start, data_interval_end, conf
  • Backward compat confirmed: old callbacks with stored "context" key (queued before this change) still work

Motivation

Per @ramitkataria's feedback on #64984: context should not be stored in the DB. The triggerer now has API access (#55068), so fetch it at runtime like tasks do.

Related


Important

🛠️ Maintainer triage note for @seanghaeli · by @potiuk · 2026-06-22 06:31 UTC

Helpful heads-up from the maintainers — please address before this PR can be reviewed (see the Pull Request quality criteria):

  • Merge conflicts — branch conflicts with main; rebase onto the latest main and push.

The ball is in your court — you've been assigned to this PR. Fix the above, then mark it Ready for review.

Automated triage — may be imperfect; a maintainer takes the next look.

@seanghaeli

Copy link
Copy Markdown
Contributor Author

@ramitkataria incorporated your feedback from #64984
@ferruzzi

your reviews would be much appreciated!

@seanghaeli seanghaeli marked this pull request as ready for review May 9, 2026 03:24
@potiuk potiuk added the ready for maintainer review Set after triaging when all criteria pass. label May 11, 2026

@ferruzzi ferruzzi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick question, otherwise LGTM.

Comment thread airflow-core/src/airflow/triggers/callback.py Outdated

@ferruzzi ferruzzi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved pending CI passing

@seanghaeli seanghaeli force-pushed the ghaeli/callback-context-execution-api branch from c93d733 to 82efc3e Compare May 13, 2026 19:27

@ramitkataria ramitkataria left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pivoting away from the previous approach. This is in the right direction but I think there's still work to be done. Removing the context from DB is good but like I said in #64984, we should follow the approach used in #55068. I also want to point out that the way context works in ExecutorCallback also needs to be updated because it was using the same "temporary solution" and will break if this PR is merged.

I did a deep dive to reduce the number of iterations we have to go through and here's what I recommend based on my findings:

Context and kwargs:

  • Let's use the standard Context TypedDict for the context parameter (dag_run, run_id, logical_date, etc., with task-specific fields absent)
  • For deadline-specific info (deadline_id, deadline_time), let's add those to kwargs, since that's what they defined when registering the callback.

handle_miss (deadline.py):

  • `{"deadline": "id": ..., "time": ...} goes in callback.data["kwargs"]
  • Let's not put put context or DagRun identifiers in kwargs.

Triggerer path:

  • In _create_workload (triggerer_job_runner.py), when trigger.task_instance is None but trigger.callback exists with dag_id/run_id in its data, fetch the DagRun and put it in dag_run_data on the workload (same field start_from_trigger uses).
  • In create_triggers, when the workload has dag_run_data but no ti, build a Context (dag_run, run_id, logical_date, etc.) and set it as an attribute on the trigger instance (e.g. trigger_instance.context = built_context), same pattern as trigger_instance.task_instance = ti.
  • CallbackTrigger.run() reads self.context instead of popping from kwargs.

Executor path:

  • Adding GetDagRun to CallbackToSupervisor is good so let's keep that. Use it from inside execute_callback (the subprocess function), not from inside the trigger. When execute_callback detects it needs context (identifiers present on callback.data), it sends GetDagRun via SUPERVISOR_COMMS, builds a Context from the response, and passes it to the user's callback as a separate context parameter.
  • This matches how tasks work: the subprocess asks for what it needs through comms.

This way, the implementation for context in tasks and callbacks would become similar which is the goal.

Comment thread airflow-core/src/airflow/triggers/callback.py Outdated
Comment thread airflow-core/src/airflow/triggers/callback.py Outdated
Comment thread airflow-core/src/airflow/models/deadline.py Outdated

@ferruzzi ferruzzi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Withdrawing my approval for now. Ramit has put a lot of thought and planning into this project already so I'll defer to his thoughts here. Sorry for the churn.

@seanghaeli seanghaeli force-pushed the ghaeli/callback-context-execution-api branch 2 times, most recently from 023f30b to 8fddb2b Compare May 22, 2026 23:10
@potiuk potiuk removed the ready for maintainer review Set after triaging when all criteria pass. label May 24, 2026
Sean Ghaeli added 24 commits June 23, 2026 03:28
When a deadline callback fires, it now fetches DagRun context from the
Execution API and passes it to the callback as context["deadline"] and
context["dag_run"]. This enables Jinja template rendering ({{ dag_run }},
{{ deadline.deadline_time }}) and enriched notifications.

Key changes:
- Add GetDagRun comms message and handler in callback supervisor
- Wire dag_id/run_id/deadline_id/deadline_time through the workload path
- Build context dict from DagRun response via build_context_from_dag_run
- Fail callback (for retry) when context fetch fails instead of running
  degraded
- Restrict token:workload to read-only Execution API routes (GET only)
- Add hash verification in _ensure_bundle_module_registered to prevent
  wrong-bundle loading when multiple bundles share a filename
- Run bundle initialization off triggerer event loop (asyncio.to_thread)
- Extract _load_mangled_module to airflow._shared.module_loading as a
  shared helper for both triggerer and executor paths
Plain function callbacks can now use Jinja2 templates in their kwargs:

    callback_kwargs={"text": "DAG {{ dag_run.dag_id }} missed at {{ deadline.deadline_time }}"}

String values containing `{{` are rendered against the callback context
(which includes dag_run, deadline, run_id, ds, ts, etc.) before the
callback is invoked. Non-string values, the context key itself, and
strings without template markers pass through unchanged. Render failures
fall back to the raw string with a warning.

Notifier classes (BaseNotifier subclasses) are unaffected — they handle
their own rendering via __call__.
Add resilience fixes and regression tests across the deadline callback
subsystem surfaced by deep QA:

- triggerer: isolate per-trigger workload build failures so one bad
  trigger cannot abort the whole batch
- scheduler: row-lock executor-callback pickup with skip_locked
- models: fix DeadlineAlert.__repr__ raising on JSON dict intervals;
  make handle_miss idempotent; harden handle_event against
  uncoercible event bodies
- task-sdk: render callback kwargs safely; guard context construction
  against missing/edge-case DagRun fields

Adds comprehensive adversarial and resilience test coverage for the
callback state machine, scheduler queries, migration, and context build.
…er prohibit_commit)

Scheduled DagRuns silently got no deadline and no callback: deadline creation runs
inside create_dagrun under the scheduler's prohibit_commit guard, and two code paths
issued a commit on that guarded session, which the guard rejects (UNEXPECTED COMMIT)
and the per-alert except block then swallowed. Only manually-triggered runs worked.

Two causes, both fixed:
- _process_dagrun_deadline_alerts wrapped each alert in session.begin_nested(); the
  SAVEPOINT release commits. Replace it with a plain try/except for per-alert
  isolation (the only DB mutation is the final session.add, so nothing partial needs
  rolling back; the caller's outer transaction persists the Deadline).
- VariableInterval.resolve() -> Variable.get() -> MetastoreBackend.get_variable
  (@provide_session) opens a scoped session that commits on exit (same thread-local
  session as the scheduler's). Read the Variable on the passed-in session instead and
  validate via the new VariableInterval.coerce_to_timedelta(), split out of resolve().

Tests: update the create_dagrun VariableInterval tests to seed real Variables (the
scheduler path no longer calls Variable.get), add coerce_to_timedelta unit tests.
The synchronous executor path renders Jinja templates in string-valued callback
kwargs (e.g. {{ dag_run.run_id }}), but the asynchronous triggerer path
(CallbackTrigger.run) passed callback_kwargs through verbatim — a silent sync/async
divergence where the same DeadlineAlert rendered on one path and not the other.

Extract the rendering into a shared helper (shared/template_rendering.render_callback_kwargs)
and call it from CallbackTrigger.run so both paths render identically. airflow-core
already depends on the template_rendering shared lib.

Adds a trigger test asserting Jinja kwargs render and non-template / non-string
kwargs pass through untouched.
- Point the executor path (callback_supervisor) at the shared
  render_callback_kwargs helper instead of its own local copy, so the sync and
  async callback paths render kwargs through one implementation and can't drift.
  Declare apache-airflow-shared-template-rendering in task-sdk's shared
  distributions (force-include + wheel packaging already covered it).
- Update the deadline-alerts docs: Jinja IS now rendered on string callback
  kwargs (both paths); the old 'Airflow does not run jinja-templating on the
  kwargs' note was stale after the rendering support landed.
- Repoint the supervisor render tests at the shared helper.
…m_dag_run

Both callback paths built context['deadline'] separately after calling the shared
build_context_from_dag_run helper (executor set it directly; triggerer popped
_deadline and assigned). Move the deadline dict into build_context_from_dag_run via
an optional deadline= arg so the executor and triggerer paths produce identical
context by construction. Thread the dict through _fetch_and_build_context (executor)
and pass the popped _deadline (triggerer). Update tests to assert the deadline is
passed into the helper, and add positive coverage that deadline= populates
context['deadline'].
When an executor-run deadline callback cannot fetch its DagRun context (an
API blip, network partition, or token expiry), it previously exited 1 and was
marked terminally FAILED, permanently dropping the callback. The triggerer
path retries such transient failures by re-evaluating on the next loop; the
executor path now matches.

The callback subprocess exits with a distinct code (EX_TEMPFAIL, 75) when the
context fetch fails. supervise_callback maps that code to a new
CallbackContextFetchError; the LocalExecutor worker reports the workload's
retry_state (PENDING) for that error instead of failure_state, and the
scheduler resets the callback to PENDING so the next loop re-picks and retries
it. Any other non-zero exit still fails the callback terminally.
The test suite for this PR had grown to ~8.2k lines against ~1.3k lines of
production code (≈6:1). Much of it tested pre-existing logic (Trigger
assign_unassigned / SKIP LOCKED row-locking, FK cascade behavior),
third-party/stdlib behavior (Pydantic/Fernet round-trips, strftime, json
coercion), structural SQL string-matching, or asserted the absence of a
behavior — none of which is changed by this PR.

Pruned to the tests that fail without a production line this PR adds, per
the repo standard of covering exactly what the change introduces. Removed
two wholly out-of-scope files (callback concurrency, executor adversarial)
and trimmed the rest; swapped a datetime.now() use for time_machine. Net
~-4.2k test lines; every retained test passes.
Two static-check failures on the prior pushes:
- local_executor.py imports CallbackContextFetchError from airflow.sdk inside
  the worker except-block (the same lazy worker-isolation pattern base_executor
  already uses). The check-sdk-imports hook flags it; mark it intentional with
  noqa: SDK001.
- coerce_to_timedelta took 'object', which makes int(value) fail mypy's
  call-overload check. Narrow the parameter to str | int | float | None (what a
  Variable value actually is); non-numeric input is still handled by the except.
Three PR-added tests in test_dagrun.py used datetime.now() for a far-future
FIXED_DATETIME, which violates the repo's time_machine/no-now() test standard
(and was flagged in review). They only need a date past the wall clock so the
deadline never fires, so a fixed 2099-01-01 literal is both compliant and
deterministic.
The fixed-date conversion used 2099-01-01, which overflows MySQL 8.0's
TIMESTAMP range (max 2038-01-19) and failed the MySQL core serialization
suite with 'Incorrect datetime value' on the deadline_time column. Use
2037-01-01 instead: still far enough ahead that the FIXED_DATETIME deadline
never fires, but valid on all backends. Verified on MySQL.
This PR's callback comms only use GetDagRun/GetConnection/GetVariable, never
GetXCom, so the token:workload scope changes on the XCom execution-API route
(and their TestWorkloadTokenScopeEnforcement tests) had no consumer here — they
are apache#66611's deliverable (callback XCom read access) and were riding along.
Revert xcoms.py and test_xcoms.py to main so this PR stays scoped to callback
context fetching; apache#66611 owns the XCom route + comms changes.
…standard

Cut ~1,400 added test lines (4,006 -> ~2,600, ~2:1 test:prod) per the repo
standard: 100% of what the PR changes, no more; every test must fail without
the change; no testing of pre-existing/stdlib/third-party logic; consolidate
variations with parametrize.

- Delete three over-coverage 'wave' suites that tested pre-existing flow or
  re-implemented logic in mocks: test_callback_upgrade_seams.py (x2),
  test_callback_bundle_init_faults.py.
- test_callback_supervisor.py: drop tests of pre-existing supervisor plumbing
  (logging/upload/bundle/handle_request dispatch) and collapse the duplicate
  render-kwargs suites; keep the context-fetch, exit-code-mapping, timeout, and
  module-registration tests that defend this PR's lines.
- Trim resilience/reentrancy/adversarial files to the single test that defends
  each new production branch (begin_nested isolation, PENDING requeue,
  flag_modified persistence, handle_miss idempotency).
- Drop tests that exercised pydantic round-trip or re-implemented route scope
  logic in fixtures rather than the real code path.
- Collapse over-parametrized validation/equality cases.

Every retained test maps to a distinct production line changed by this PR.
The scheduler-side VariableInterval resolver read only the variable table, bypassing
AIRFLOW_VAR_* env vars and secrets backends. A Variable living there resolved to None,
and the per-alert except silently dropped the deadline.

Iterate the secrets backends (env -> secrets -> metastore) like Variable.get, passing
the scheduler session to the metastore backend so the DB read stays commit-safe under
prohibit_commit. Fail loudly (isolated per-alert) when nothing resolves, instead of a
silent skip.

Tests: replace the dead Variable.get mock with a real seeded Variable row, and add a
regression test that a VariableInterval backed only by an AIRFLOW_VAR_* env var resolves
and creates the Deadline.
callback is a Trigger-model attr, not on BaseEventTrigger — read via getattr so asset-only
triggers and spec'd test mocks don't AttributeError. Fixes watched_assets test failures.
The in-process Execution API test harness overrides auth dependencies
(_jwt_bearer, has_*_access) with always_allow so dry-run clients with an
empty token can call the routes. This PR added a router-level
Security(require_auth, scopes=...) dependency to the variables/connections
routes, but require_auth was not in the override map. With token='', the
real require_auth path ran on the in-process event loop and never returned
cleanly, so the lifespan could not close and test_processor /
test_triggerer_job tests timed out (Error while closing in-process
execution API lifespan -> TimeoutError).

Add require_auth to the in-process override map, mirroring _jwt_bearer.
@router.get(
"/{connection_id}",
dependencies=[
Security(require_auth, scopes=["token:execution", "token:workload"]),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seanghaeli I'm sorry I didn't catch this earlier, but this change is regression on security, so this part at least (if not the entire PR?) needs reverting.

token:workload is essentially used for long-lived tokens (~24hrs) when the TI is in queued state between the executor Queueing the task, where a worker calls the ti /run endpoint to exchange it for a short lived (5-10mins) token that has more permissions.

This primary driver for this change was to make the tokens that are visible via workers (either in the Celery message bus, or in the KE pod spec itself) only useable once (i.e. can't be replayed, handled by the TI state transiation requirements) and for a single thing (just for calling the run endpoint)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the right fix here? Between this comment and Kaxil's, is there a way to make this work within the existing token types or do we need a new token type entirely, maybe, instead of trying to shoe-horn what he is trying to do into a system that wasn't built for it?

Also, I clearly need to brush up on the tokens and their intended uses. I didn't know about that intentional down-scoping.

@ferruzzi ferruzzi Jun 23, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, maybe this is a hint from the universe that we need tests around this to prevent accidental (or at least un-discussed) token scope changes? Let me see what I can come up with and I'll tag you in it for a review.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a new token for this kind of exchange @ashb? Also the code/functionality that should invalidate the workload token after the first use (since we're only intending for it to be used once for the exchange for the short lived token) doesn't seem to be running, otherwise testing would have caught that here? Any who, I think we should circle back on this one and regroup a bit.

Shall we revert this one? @seanghaeli @ferruzzi?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we should revert to unblock 3.3.0b2 and we can figure out solution for this meanwhile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:deadline-alerts AIP-86 (former AIP-57) area:task-sdk area:Triggerer ready for maintainer review Set after triaging when all criteria pass.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants