Share source assets across products to dedupe unification by jirhiker · Pull Request #92 · DataIntegrationGroup/DataIntegrationEngine

jirhiker · 2026-06-28T22:25:56Z

Problem

Source assets were keyed per-product ([product_id, "sources", key]), so every product re-ran DIE unification for the same source. The four waterlevels-NM products (summary/timeseries/trends/recency) and the arsenic/nitrate/tds summary + MCL products each pulled identical upstream data 2–4×, amplifying API throttling and memory pressure.

Change

Key source assets product-independently by their unification signature: ["sources", parameter, mode, scope, source_key], where mode is summary|timeseries (the only distinction the backend makes) and scope encodes the spatial filter. Each (parameter, mode, scope, source) tuple becomes one shared asset, deduped across all products — 124 → 87 assets, 37 duplicate unifications eliminated.
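The keying change can be sketched in plain Python. This is a minimal illustration, not the PR's code: the `SourceRequest` type, the product names, the scope value `"nm"`, the source name `"source_a"`, and the assignment of each product to a mode are all hypothetical.

```python
from typing import NamedTuple


class SourceRequest(NamedTuple):
    """One product's need for one upstream source (hypothetical type)."""
    product_id: str
    parameter: str
    mode: str        # "summary" or "timeseries"
    scope: str       # encodes the spatial filter
    source_key: str


def per_product_key(req: SourceRequest) -> list[str]:
    # Old scheme: the product id is part of the key, so nothing is shared.
    return [req.product_id, "sources", req.source_key]


def shared_key(req: SourceRequest) -> list[str]:
    # New scheme: the key is the unification signature, with no product id.
    return ["sources", req.parameter, req.mode, req.scope, req.source_key]


# Hypothetical requests: two products share summary mode, one uses timeseries.
requests = [
    SourceRequest("waterlevels_summary", "waterlevels", "summary", "nm", "source_a"),
    SourceRequest("waterlevels_trends", "waterlevels", "summary", "nm", "source_a"),
    SourceRequest("waterlevels_timeseries", "waterlevels", "timeseries", "nm", "source_a"),
]

old_keys = {tuple(per_product_key(r)) for r in requests}
new_keys = {tuple(shared_key(r)) for r in requests}
print(len(old_keys), len(new_keys))
```

Under these assumptions the two summary-mode requests collapse onto one key, which is the same effect that reduces the asset count in the real graph.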

A dedicated sources_job materializes every shared source once on its own schedule (05:00, ahead of the product schedules). Each per-product job now selects only its combine + geoserver assets and loads source inputs from the GCS IO manager instead of re-unifying them.
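The split between the two kinds of job can be pictured as a partition of asset keys. The sketch below uses plain lists rather than Dagster selections, and the `[product_id, "combine"]` / `[product_id, "geoserver"]` key shapes, the product name, and the source names are assumptions made for illustration only.

```python
def select_sources(all_keys: list[list[str]]) -> list[list[str]]:
    """Keys the shared sources job would materialize (prefix "sources")."""
    return [k for k in all_keys if k[0] == "sources"]


def select_product(all_keys: list[list[str]], product_id: str) -> list[list[str]]:
    """Keys a per-product job would materialize: its own assets only."""
    return [k for k in all_keys if k[0] == product_id]


# Hypothetical asset graph: two shared sources feeding one product.
all_keys = [
    ["sources", "waterlevels", "summary", "nm", "source_a"],
    ["sources", "waterlevels", "summary", "nm", "source_b"],
    ["waterlevels_summary", "combine"],
    ["waterlevels_summary", "geoserver"],
]

sources_job_keys = select_sources(all_keys)
product_job_keys = select_product(all_keys, "waterlevels_summary")
print(len(sources_job_keys), len(product_job_keys))
```

The product job never includes a source key, so running it cannot trigger unification; it can only read whatever the sources job last wrote.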

The IO manager is subclassed to return an empty payload when a source pickle is absent, so a combine never hard-fails on a not-yet-materialized source (e.g. on a fresh deploy, before sources_job has run once). Source assets always write a payload when they run, so this cannot mask a real data loss.
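The tolerant-load behaviour can be sketched as follows. A local temporary directory stands in for the GCS bucket, and `PickleStore`, `TolerantPickleStore`, the empty-list payload, and the sample record are hypothetical stand-ins rather than the PR's actual classes or data.

```python
import pickle
import tempfile
from pathlib import Path


class PickleStore:
    """Stores one pickle per asset key under a root directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def _path(self, key: list[str]) -> Path:
        return self.root.joinpath(*key[:-1], key[-1] + ".pkl")

    def dump(self, key: list[str], obj: object) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as fh:
            pickle.dump(obj, fh)

    def load(self, key: list[str]) -> object:
        with self._path(key).open("rb") as fh:
            return pickle.load(fh)


class TolerantPickleStore(PickleStore):
    """Returns an empty payload when the pickle has never been written."""

    def load(self, key: list[str]) -> object:
        try:
            return super().load(key)
        except FileNotFoundError:
            # Assumption: an empty list is the "no records" payload.
            return []


with tempfile.TemporaryDirectory() as tmp:
    store = TolerantPickleStore(tmp)
    key = ["sources", "waterlevels", "summary", "nm", "source_a"]
    before = store.load(key)          # source not materialized yet
    store.dump(key, [{"id": 1}])      # the source job writes a payload
    after = store.load(key)
```

Only the missing-file case is caught in this sketch, so a corrupt or unreadable pickle would still raise rather than be silently treated as empty.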

Verification

dg check defs — clean
Job subsetting confirmed: product jobs resolve to 2 assets (combine + geoserver); sources_job to all 87 shared sources
277 tests pass

Deploy note

On a fresh deploy, run sources_job once before the product jobs. The schedule already orders it ahead at 05:00, but each source must be materialized at least once; until then, the tolerant IO manager returns an empty payload to the combines instead of failing.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-28T22:26:30Z

Your pull request is automatically being deployed to Dagster Cloud.

Location	Status	Link	Updated
`die-orchestration`		View in Cloud	Jun 28, 2026 at 10:30 PM (UTC)

jirhiker merged commit cb8a9cd into main Jun 28, 2026
2 checks passed

jirhiker mentioned this pull request Jun 28, 2026

Cohort jobs: full per-run lineage with source dedup #93

Merged

jirhiker merged 1 commit into main from feature/source-asset-dedup
