Share source assets across products to dedupe unification #92

Merged
jirhiker merged 1 commit into main from feature/source-asset-dedup on Jun 28, 2026

@jirhiker

Member

Problem

Source assets were keyed per-product ([product_id, "sources", key]), so every product re-ran DIE unification for the same source. The four waterlevels-NM products (summary/timeseries/trends/recency) and the arsenic/nitrate/tds summary + MCL products each pulled identical upstream data 2–4×, amplifying API throttling and memory pressure.

Change

Key source assets product-independently by their unification signature: ["sources", parameter, mode, scope, source_key], where mode is summary|timeseries (the only distinction the backend makes) and scope encodes the spatial filter. Each (parameter, mode, scope, source) tuple becomes one shared asset, deduped across all products — 124 → 87 assets, 37 duplicate unifications eliminated.
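To illustrate why keying by unification signature collapses duplicates, here is a minimal plain-Python sketch. The product ids, the `nm` scope string, the `nwis` source key, and the assumption that the trends and recency products read summary-mode data are all hypothetical; only the two key shapes come from the description above.

```python
from typing import NamedTuple


class SourceRequest(NamedTuple):
    """One product's need for one unified source (all values are illustrative)."""
    product_id: str
    parameter: str
    mode: str        # "summary" or "timeseries"
    scope: str       # encoded spatial filter (hypothetical encoding)
    source_key: str


def per_product_key(req: SourceRequest) -> tuple[str, ...]:
    # Old scheme: the product id is part of the key, so nothing is shared.
    return (req.product_id, "sources", req.source_key)


def shared_key(req: SourceRequest) -> tuple[str, ...]:
    # New scheme: the key is the unification signature, independent of product.
    return ("sources", req.parameter, req.mode, req.scope, req.source_key)


# Hypothetical requests: four products that all read the same upstream source.
requests = [
    SourceRequest("waterlevels_nm_summary", "waterlevels", "summary", "nm", "nwis"),
    SourceRequest("waterlevels_nm_trends", "waterlevels", "summary", "nm", "nwis"),
    SourceRequest("waterlevels_nm_recency", "waterlevels", "summary", "nm", "nwis"),
    SourceRequest("waterlevels_nm_timeseries", "waterlevels", "timeseries", "nm", "nwis"),
]

old_keys = {per_product_key(r) for r in requests}
new_keys = {shared_key(r) for r in requests}
print(len(old_keys), "->", len(new_keys))  # 4 -> 2
```

Under these assumptions the three summary-mode products share one asset, and only the timeseries product needs a second one.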

A dedicated sources_job materializes every shared source once on its own schedule (05:00, ahead of the product schedules). Each per-product job now selects only its combine + geoserver assets and loads source inputs from the GCS IO manager instead of re-unifying them.
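The job split can be modeled in plain Python as two filters over the asset keys. This is a sketch of the selection logic only, not the Dagster API the repository uses; the `(product_id, "combine")` and `(product_id, "geoserver")` key layout and the helper names are assumptions made for illustration.

```python
AssetKey = tuple[str, ...]


def sources_selection(keys: list[AssetKey]) -> list[AssetKey]:
    # sources_job: every shared source, regardless of which product uses it.
    return sorted(k for k in keys if k[0] == "sources")


def product_selection(keys: list[AssetKey], product_id: str) -> list[AssetKey]:
    # Per-product job: only that product's own assets (combine + geoserver).
    # Source inputs are loaded from storage, not re-materialized.
    return sorted(k for k in keys if k[0] == product_id)


# Hypothetical asset graph: two shared sources and two products.
all_keys: list[AssetKey] = [
    ("sources", "waterlevels", "summary", "nm", "nwis"),
    ("sources", "waterlevels", "timeseries", "nm", "nwis"),
    ("waterlevels_nm_summary", "combine"),
    ("waterlevels_nm_summary", "geoserver"),
    ("waterlevels_nm_timeseries", "combine"),
    ("waterlevels_nm_timeseries", "geoserver"),
]

sources_job_keys = sources_selection(all_keys)
summary_job_keys = product_selection(all_keys, "waterlevels_nm_summary")
print(len(sources_job_keys), len(summary_job_keys))  # 2 2
```

The point of the split is that no source key appears in any product selection, so a product run can never trigger a re-unification.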

The IO manager is subclassed to return an empty payload when a source pickle is absent, so a combine never hard-fails on a source that has not been materialized yet (e.g. on a fresh deploy, before sources_job has run once). Source assets always write a payload when they run, so an empty payload can only mean "not yet materialized" and cannot mask real data loss.
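The tolerant-load behaviour can be sketched with a local pickle store standing in for the GCS IO manager. The class names, the on-disk layout, and the choice of an empty list as the "empty payload" are assumptions for illustration; the actual subclass and payload type in the repository may differ.

```python
import pickle
import tempfile
from pathlib import Path


class PickleStore:
    """Stand-in for the base IO manager: one pickle file per asset key."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def _path(self, key: tuple[str, ...]) -> Path:
        return self.root.joinpath(*key[:-1], key[-1] + ".pkl")

    def dump(self, key: tuple[str, ...], obj: object) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(pickle.dumps(obj))

    def load(self, key: tuple[str, ...]) -> object:
        # Raises FileNotFoundError if the asset was never materialized.
        return pickle.loads(self._path(key).read_bytes())


class TolerantPickleStore(PickleStore):
    """Returns an empty payload instead of failing on a missing pickle."""

    def load(self, key: tuple[str, ...]) -> object:
        try:
            return super().load(key)
        except FileNotFoundError:
            return []  # assumed shape of the empty payload


with tempfile.TemporaryDirectory() as tmp:
    store = TolerantPickleStore(Path(tmp))
    key = ("sources", "waterlevels", "summary", "nm", "nwis")
    missing = store.load(key)                        # nothing written yet
    store.dump(key, [{"site": "A", "value": 1.0}])   # hypothetical record
    present = store.load(key)

print(missing, present)
```

Catching only the missing-file case keeps other failures (a corrupt pickle, a permissions error) loud, which matches the stated intent of tolerating absence without hiding real problems.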

Verification

  • dg check defs — clean
  • Job subsetting confirmed: product jobs resolve to 2 assets (combine + geoserver); sources_job to all 87 shared sources
  • 277 tests pass

Deploy note

On a fresh deploy, run sources_job once before the product jobs. The schedule already orders it first (05:00), but each source must have been materialized at least once; until then, the tolerant IO manager returns empty payloads rather than failing.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 28, 2026

Your pull request is automatically being deployed to Dagster Cloud.

| Location | Status | Link | Updated |
| --- | --- | --- | --- |
| die-orchestration | | View in Cloud | Jun 28, 2026 at 10:30 PM (UTC) |

@jirhiker jirhiker merged commit cb8a9cd into main Jun 28, 2026
2 checks passed