Share source assets across products to dedupe unification#92
Merged
Conversation
Source assets were keyed per-product ([product_id, "sources", key]), so every product re-ran DIE unification for the same source. The four waterlevels-NM products (summary/timeseries/trends/recency) and the arsenic/nitrate/tds summary + MCL products each pulled identical upstream data 2-4x, amplifying API throttling and memory pressure. Key source assets product-independently by their unification signature: ["sources", parameter, mode, scope, source_key], where mode is summary|timeseries (the only distinction the backend makes) and scope encodes the spatial filter. Each (parameter, mode, scope, source) tuple becomes one shared asset, deduped across all products (124 -> 87 assets, 37 duplicate unifications eliminated). A dedicated sources_job materializes every shared source once on its own schedule (05:00, ahead of the product schedules). Each per-product job now selects only its combine + geoserver assets and loads the source inputs from the GCS IO manager instead of re-unifying them. The IO manager is subclassed to return an empty payload when a source pickle is absent, so a combine never hard-fails on a not-yet-materialized source (e.g. on a fresh deploy, before sources_job has run once). Source assets always write a payload when they run, so this cannot mask a real data loss. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Your pull request is automatically being deployed to Dagster Cloud.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Source assets were keyed per-product (
[product_id, "sources", key]), so every product re-ran DIE unification for the same source. The four waterlevels-NM products (summary/timeseries/trends/recency) and the arsenic/nitrate/tds summary + MCL products each pulled identical upstream data 2–4×, amplifying API throttling and memory pressure.Change
Key source assets product-independently by their unification signature:
["sources", parameter, mode, scope, source_key], wheremodeissummary|timeseries(the only distinction the backend makes) andscopeencodes the spatial filter. Each(parameter, mode, scope, source)tuple becomes one shared asset, deduped across all products — 124 → 87 assets, 37 duplicate unifications eliminated.A dedicated
sources_jobmaterializes every shared source once on its own schedule (05:00, ahead of the product schedules). Each per-product job now selects only its combine + geoserver assets and loads source inputs from the GCS IO manager instead of re-unifying them.The IO manager is subclassed to return an empty payload when a source pickle is absent, so a combine never hard-fails on a not-yet-materialized source (e.g. on a fresh deploy, before
sources_jobhas run once). Source assets always write a payload when they run, so this cannot mask a real data loss.Verification
dg check defs— cleansources_jobto all 87 shared sourcesDeploy note
On a fresh deploy, run
sources_jobonce before the product jobs (the schedule orders it ahead at 05:00, but it must materialize each source at least once; the tolerant IO manager degrades gracefully until then).🤖 Generated with Claude Code