Fetch deadline callback context via Execution API at runtime#66608
Conversation
|
@ramitkataria incorporated your feedback from #64984 your reviews would be much appreciated! |
ferruzzi
left a comment
There was a problem hiding this comment.
Just a quick question, otherwise LGTM.
ferruzzi
left a comment
There was a problem hiding this comment.
Approved pending CI passing
c93d733 to
82efc3e
Compare
There was a problem hiding this comment.
Thanks for pivoting away from the previous approach. This is in the right direction but I think there's still work to be done. Removing the context from DB is good but like I said in #64984, we should follow the approach used in #55068. I also want to point out that the way context works in ExecutorCallback also needs to be updated because it was using the same "temporary solution" and will break if this PR is merged.
I did a deep dive to reduce the number of iterations we have to go through and here's what I recommend based on my findings:
Context and kwargs:
- Let's use the standard Context TypedDict for the context parameter (dag_run, run_id, logical_date, etc., with task-specific fields absent)
- For deadline-specific info (deadline_id, deadline_time), let's add those to kwargs, since that's what they defined when registering the callback.
handle_miss (deadline.py):
- `{"deadline": "id": ..., "time": ...} goes in callback.data["kwargs"]
- Let's not put put context or DagRun identifiers in kwargs.
Triggerer path:
- In _create_workload (triggerer_job_runner.py), when trigger.task_instance is None but trigger.callback exists with dag_id/run_id in its data, fetch the DagRun and put it in dag_run_data on the workload (same field start_from_trigger uses).
- In create_triggers, when the workload has dag_run_data but no ti, build a
Context(dag_run, run_id, logical_date, etc.) and set it as an attribute on the trigger instance (e.g. trigger_instance.context = built_context), same pattern as trigger_instance.task_instance = ti. - CallbackTrigger.run() reads self.context instead of popping from kwargs.
Executor path:
- Adding GetDagRun to CallbackToSupervisor is good so let's keep that. Use it from inside execute_callback (the subprocess function), not from inside the trigger. When execute_callback detects it needs context (identifiers present on callback.data), it sends GetDagRun via SUPERVISOR_COMMS, builds a
Contextfrom the response, and passes it to the user's callback as a separate context parameter. - This matches how tasks work: the subprocess asks for what it needs through comms.
This way, the implementation for context in tasks and callbacks would become similar which is the goal.
ferruzzi
left a comment
There was a problem hiding this comment.
Withdrawing my approval for now. Ramit has put a lot of thought and planning into this project already so I'll defer to his thoughts here. Sorry for the churn.
023f30b to
8fddb2b
Compare
When a deadline callback fires, it now fetches DagRun context from the
Execution API and passes it to the callback as context["deadline"] and
context["dag_run"]. This enables Jinja template rendering ({{ dag_run }},
{{ deadline.deadline_time }}) and enriched notifications.
Key changes:
- Add GetDagRun comms message and handler in callback supervisor
- Wire dag_id/run_id/deadline_id/deadline_time through the workload path
- Build context dict from DagRun response via build_context_from_dag_run
- Fail callback (for retry) when context fetch fails instead of running
degraded
- Restrict token:workload to read-only Execution API routes (GET only)
- Add hash verification in _ensure_bundle_module_registered to prevent
wrong-bundle loading when multiple bundles share a filename
- Run bundle initialization off triggerer event loop (asyncio.to_thread)
- Extract _load_mangled_module to airflow._shared.module_loading as a
shared helper for both triggerer and executor paths
Plain function callbacks can now use Jinja2 templates in their kwargs:
callback_kwargs={"text": "DAG {{ dag_run.dag_id }} missed at {{ deadline.deadline_time }}"}
String values containing `{{` are rendered against the callback context
(which includes dag_run, deadline, run_id, ds, ts, etc.) before the
callback is invoked. Non-string values, the context key itself, and
strings without template markers pass through unchanged. Render failures
fall back to the raw string with a warning.
Notifier classes (BaseNotifier subclasses) are unaffected — they handle
their own rendering via __call__.
Add resilience fixes and regression tests across the deadline callback subsystem surfaced by deep QA: - triggerer: isolate per-trigger workload build failures so one bad trigger cannot abort the whole batch - scheduler: row-lock executor-callback pickup with skip_locked - models: fix DeadlineAlert.__repr__ raising on JSON dict intervals; make handle_miss idempotent; harden handle_event against uncoercible event bodies - task-sdk: render callback kwargs safely; guard context construction against missing/edge-case DagRun fields Adds comprehensive adversarial and resilience test coverage for the callback state machine, scheduler queries, migration, and context build.
…er prohibit_commit) Scheduled DagRuns silently got no deadline and no callback: deadline creation runs inside create_dagrun under the scheduler's prohibit_commit guard, and two code paths issued a commit on that guarded session, which the guard rejects (UNEXPECTED COMMIT) and the per-alert except block then swallowed. Only manually-triggered runs worked. Two causes, both fixed: - _process_dagrun_deadline_alerts wrapped each alert in session.begin_nested(); the SAVEPOINT release commits. Replace it with a plain try/except for per-alert isolation (the only DB mutation is the final session.add, so nothing partial needs rolling back; the caller's outer transaction persists the Deadline). - VariableInterval.resolve() -> Variable.get() -> MetastoreBackend.get_variable (@provide_session) opens a scoped session that commits on exit (same thread-local session as the scheduler's). Read the Variable on the passed-in session instead and validate via the new VariableInterval.coerce_to_timedelta(), split out of resolve(). Tests: update the create_dagrun VariableInterval tests to seed real Variables (the scheduler path no longer calls Variable.get), add coerce_to_timedelta unit tests.
The synchronous executor path renders Jinja templates in string-valued callback
kwargs (e.g. {{ dag_run.run_id }}), but the asynchronous triggerer path
(CallbackTrigger.run) passed callback_kwargs through verbatim — a silent sync/async
divergence where the same DeadlineAlert rendered on one path and not the other.
Extract the rendering into a shared helper (shared/template_rendering.render_callback_kwargs)
and call it from CallbackTrigger.run so both paths render identically. airflow-core
already depends on the template_rendering shared lib.
Adds a trigger test asserting Jinja kwargs render and non-template / non-string
kwargs pass through untouched.
- Point the executor path (callback_supervisor) at the shared render_callback_kwargs helper instead of its own local copy, so the sync and async callback paths render kwargs through one implementation and can't drift. Declare apache-airflow-shared-template-rendering in task-sdk's shared distributions (force-include + wheel packaging already covered it). - Update the deadline-alerts docs: Jinja IS now rendered on string callback kwargs (both paths); the old 'Airflow does not run jinja-templating on the kwargs' note was stale after the rendering support landed. - Repoint the supervisor render tests at the shared helper.
…m_dag_run Both callback paths built context['deadline'] separately after calling the shared build_context_from_dag_run helper (executor set it directly; triggerer popped _deadline and assigned). Move the deadline dict into build_context_from_dag_run via an optional deadline= arg so the executor and triggerer paths produce identical context by construction. Thread the dict through _fetch_and_build_context (executor) and pass the popped _deadline (triggerer). Update tests to assert the deadline is passed into the helper, and add positive coverage that deadline= populates context['deadline'].
When an executor-run deadline callback cannot fetch its DagRun context (an API blip, network partition, or token expiry), it previously exited 1 and was marked terminally FAILED, permanently dropping the callback. The triggerer path retries such transient failures by re-evaluating on the next loop; the executor path now matches. The callback subprocess exits with a distinct code (EX_TEMPFAIL, 75) when the context fetch fails. supervise_callback maps that code to a new CallbackContextFetchError; the LocalExecutor worker reports the workload's retry_state (PENDING) for that error instead of failure_state, and the scheduler resets the callback to PENDING so the next loop re-picks and retries it. Any other non-zero exit still fails the callback terminally.
The test suite for this PR had grown to ~8.2k lines against ~1.3k lines of production code (≈6:1). Much of it tested pre-existing logic (Trigger assign_unassigned / SKIP LOCKED row-locking, FK cascade behavior), third-party/stdlib behavior (Pydantic/Fernet round-trips, strftime, json coercion), structural SQL string-matching, or asserted the absence of a behavior — none of which is changed by this PR. Pruned to the tests that fail without a production line this PR adds, per the repo standard of covering exactly what the change introduces. Removed two wholly out-of-scope files (callback concurrency, executor adversarial) and trimmed the rest; swapped a datetime.now() use for time_machine. Net ~-4.2k test lines; every retained test passes.
Two static-check failures on the prior pushes: - local_executor.py imports CallbackContextFetchError from airflow.sdk inside the worker except-block (the same lazy worker-isolation pattern base_executor already uses). The check-sdk-imports hook flags it; mark it intentional with noqa: SDK001. - coerce_to_timedelta took 'object', which makes int(value) fail mypy's call-overload check. Narrow the parameter to str | int | float | None (what a Variable value actually is); non-numeric input is still handled by the except.
Three PR-added tests in test_dagrun.py used datetime.now() for a far-future FIXED_DATETIME, which violates the repo's time_machine/no-now() test standard (and was flagged in review). They only need a date past the wall clock so the deadline never fires, so a fixed 2099-01-01 literal is both compliant and deterministic.
The fixed-date conversion used 2099-01-01, which overflows MySQL 8.0's TIMESTAMP range (max 2038-01-19) and failed the MySQL core serialization suite with 'Incorrect datetime value' on the deadline_time column. Use 2037-01-01 instead: still far enough ahead that the FIXED_DATETIME deadline never fires, but valid on all backends. Verified on MySQL.
This PR's callback comms only use GetDagRun/GetConnection/GetVariable, never GetXCom, so the token:workload scope changes on the XCom execution-API route (and their TestWorkloadTokenScopeEnforcement tests) had no consumer here — they are apache#66611's deliverable (callback XCom read access) and were riding along. Revert xcoms.py and test_xcoms.py to main so this PR stays scoped to callback context fetching; apache#66611 owns the XCom route + comms changes.
…standard Cut ~1,400 added test lines (4,006 -> ~2,600, ~2:1 test:prod) per the repo standard: 100% of what the PR changes, no more; every test must fail without the change; no testing of pre-existing/stdlib/third-party logic; consolidate variations with parametrize. - Delete three over-coverage 'wave' suites that tested pre-existing flow or re-implemented logic in mocks: test_callback_upgrade_seams.py (x2), test_callback_bundle_init_faults.py. - test_callback_supervisor.py: drop tests of pre-existing supervisor plumbing (logging/upload/bundle/handle_request dispatch) and collapse the duplicate render-kwargs suites; keep the context-fetch, exit-code-mapping, timeout, and module-registration tests that defend this PR's lines. - Trim resilience/reentrancy/adversarial files to the single test that defends each new production branch (begin_nested isolation, PENDING requeue, flag_modified persistence, handle_miss idempotency). - Drop tests that exercised pydantic round-trip or re-implemented route scope logic in fixtures rather than the real code path. - Collapse over-parametrized validation/equality cases. Every retained test maps to a distinct production line changed by this PR.
The scheduler-side VariableInterval resolver read only the variable table, bypassing AIRFLOW_VAR_* env vars and secrets backends. A Variable living there resolved to None, and the per-alert except silently dropped the deadline. Iterate the secrets backends (env -> secrets -> metastore) like Variable.get, passing the scheduler session to the metastore backend so the DB read stays commit-safe under prohibit_commit. Fail loudly (isolated per-alert) when nothing resolves, instead of a silent skip. Tests: replace the dead Variable.get mock with a real seeded Variable row, and add a regression test that a VariableInterval backed only by an AIRFLOW_VAR_* env var resolves and creates the Deadline.
…ue-key race, unrelated to this PR)
callback is a Trigger-model attr, not on BaseEventTrigger — read via getattr so asset-only triggers and spec'd test mocks don't AttributeError. Fixes watched_assets test failures.
The in-process Execution API test harness overrides auth dependencies (_jwt_bearer, has_*_access) with always_allow so dry-run clients with an empty token can call the routes. This PR added a router-level Security(require_auth, scopes=...) dependency to the variables/connections routes, but require_auth was not in the override map. With token='', the real require_auth path ran on the in-process event loop and never returned cleanly, so the lifespan could not close and test_processor / test_triggerer_job tests timed out (Error while closing in-process execution API lifespan -> TimeoutError). Add require_auth to the in-process override map, mirroring _jwt_bearer.
| @router.get( | ||
| "/{connection_id}", | ||
| dependencies=[ | ||
| Security(require_auth, scopes=["token:execution", "token:workload"]), |
There was a problem hiding this comment.
@seanghaeli I'm sorry I didn't catch this earlier, but this change is regression on security, so this part at least (if not the entire PR?) needs reverting.
token:workload is essentially used for long-lived tokens (~24hrs) when the TI is in queued state between the executor Queueing the task, where a worker calls the ti /run endpoint to exchange it for a short lived (5-10mins) token that has more permissions.
This primary driver for this change was to make the tokens that are visible via workers (either in the Celery message bus, or in the KE pod spec itself) only useable once (i.e. can't be replayed, handled by the TI state transiation requirements) and for a single thing (just for calling the run endpoint)
There was a problem hiding this comment.
What's the right fix here? Between this comment and Kaxil's, is there a way to make this work within the existing token types or do we need a new token type entirely, maybe, instead of trying to shoe-horn what he is trying to do into a system that wasn't built for it?
Also, I clearly need to brush up on the tokens and their intended uses. I didn't know about that intentional down-scoping.
There was a problem hiding this comment.
Also, maybe this is a hint from the universe that we need tests around this to prevent accidental (or at least un-discussed) token scope changes? Let me see what I can come up with and I'll tag you in it for a review.
There was a problem hiding this comment.
Do we need a new token for this kind of exchange @ashb? Also the code/functionality that should invalidate the workload token after the first use (since we're only intending for it to be used once for the exchange for the short lived token) doesn't seem to be running, otherwise testing would have caught that here? Any who, I think we should circle back on this one and regroup a bit.
Shall we revert this one? @seanghaeli @ferruzzi?
There was a problem hiding this comment.
Yes we should revert to unblock 3.3.0b2 and we can figure out solution for this meanwhile
Summary
Replace the simple context workaround from #55241 that stored serialized context in trigger kwargs (DB). Now that #55068 gives the triggerer API access, fetch the DagRun at execution time via the Execution API and build context fresh.
This avoids DB bloat from serialized context, provides fresh (not stale) context, and builds a richer context dict including
logical_date,ds,ts,conf,data_interval_start/end, and the deadline info.Changes
deadline.py: Removeget_simple_context(). Store only identifiers (dag_id,run_id,deadline_id,deadline_time) in callback kwargs.callback.py: Add_build_context()that fetches DagRun viaSUPERVISOR_COMMS.asend(GetDagRun(...)). Backward compat: old callbacks with"context"key still work.triggerer_job_runner.py: AddGetDagRuntoToTriggerSupervisorunion,DagRunResulttoToTriggerRunnerunion, handler in_handle_request.callback_supervisor.py: AddGetDagRuntoCallbackToSupervisorunion + handler for executor callback path.Tests: Updated deadline model tests, added context-fetching test, backward-compat test,
GetDagRunhandler test.serialization/definitions/dag.py: Also fixes scheduled DAG runs silently getting no deadline — this includes a pre-existing main bug (aVariableIntervaldeadline committed mid-scheduler and got dropped), kept here because it's the same code block this PR rewrites.Testing
Ran in Breeze to verify the comms plumbing works e2e:
GetDagRunround-trips through the triggerer'sToTriggerSupervisor→_handle_request→DagRunResultresponse path without breaking existing trigger handlingSUPERVISOR_COMMS.asend()is the correct async calling pattern — usesTriggerCommsDecoderfrominit_comms()with async lock for coroutine safety in the trigger event loopDagRungenerated model has all fields accessed in_build_context:logical_date,data_interval_start,data_interval_end,conf"context"key (queued before this change) still workMotivation
Per @ramitkataria's feedback on #64984: context should not be stored in the DB. The triggerer now has API access (#55068), so fetch it at runtime like tasks do.
Related
Important
🛠️ Maintainer triage note for @seanghaeli · by
@potiuk· 2026-06-22 06:31 UTCHelpful heads-up from the maintainers — please address before this PR can be reviewed (see the Pull Request quality criteria):
main; rebase onto the latestmainand push.The ball is in your court — you've been assigned to this PR. Fix the above, then mark it Ready for review.
Automated triage — may be imperfect; a maintainer takes the next look.