Cohort jobs: full per-run lineage with source dedup by jirhiker · Pull Request #93 · DataIntegrationGroup/DataIntegrationEngine

jirhiker · 2026-06-28T22:53:47Z

Reworks the job layout so each run shows the full sources → combine → geoserver lineage while still unifying each shared source only once.

Problem

The shared-source dedup (#92) split sources into their own sources_job and made product jobs publish-only (combine + geoserver). That deduped fetches but hid lineage: a product job's graph showed only <product> → geoserver, sources nowhere in the run.

Why full lineage + dedup needs grouping

Cross-product dedup is impossible across separate runs — if two products that share arsenic/summary/wqp run as two runs, each must materialize it → duplicated. So the sharing products must run together. That grouping is a cohort.

Change

Cohort = (group, mode, scope) — the products that can share source assets. One job per cohort materializes its whole graph in one run: each shared source unifies once (single asset key, selected once), every member combine reads it back through the GCS IO manager, geoserver publishes.
5 cohorts; the 87 source assets partition across them (no source materialized by two jobs → dedup preserved). Cohort cron = earliest member schedule.
Replaces sources_job + per-product publish jobs.

cohort job	sources	combines	sched
waterlevels_summary_state_NM	9	1 (summary)	06:00
waterlevels_timeseries_state_NM	9	3 (timeseries, trends, recency)	07:00
waterlevels_timeseries_county_Bernalillo	1	1 (bernco)	08:00
analytes_summary_state_NM	59	4 (arsenic, tds, major-chem, mcl)	09:00
analytes_timeseries_state_NM	9	2 (arsenic-trend, nitrate-trend)	13:00

API load unchanged: sources still fetched once each; publishes inline into the same run.

Also

Documents the remaining per-analyte fetch limitation (a source repeats per analyte because the backend unifies one parameter per pass — collapsing it needs a backend multi-analyte change, out of scope here).

Verification

dg check defs clean; 277 offline tests pass; 5 cohort jobs each show full lineage; source assets partition with no cross-job duplication.

🤖 Generated with Claude Code

The shared-source dedup collapses duplication across products but not across analytes: a source appears once per analyte because the backend unifies one parameter per pass. Note the future multi-analyte-unification optimization (a backend change) in the module docstring; out of scope for the orchestration graph. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-28T22:54:16Z

Your pull request is automatically being deployed to Dagster Cloud.

Location	Status	Link	Updated
`die-orchestration`		View in Cloud	Jun 28, 2026 at 11:06 PM (UTC)

The sources_job + publish-only product jobs preserved dedup but hid each product's source lineage (the product job showed only combine → geoserver). Replace that with one job per cohort, keyed (group, mode, scope): the products that can share source assets. A cohort job materializes its whole graph in one run — each shared source unifies once (one asset key, selected once), every member combine reads it back through the GCS IO manager, then geoserver publishes. Full sources → combine → geoserver lineage per run, no duplicated source fetch. 5 cohorts (waterlevels/analytes × summary/timeseries × scope); the 87 source assets partition across them so no source is materialized by two jobs. Cohort cron = earliest member schedule. The tolerant IO manager now only matters for ad-hoc combine-only materializations from the UI. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jirhiker changed the title ~~Document per-analyte source-fetch limitation~~ Cohort jobs: full per-run lineage with source dedup Jun 28, 2026

jirhiker merged commit 8c5a75c into main Jun 28, 2026
3 checks passed

jirhiker mentioned this pull request Jun 29, 2026

Scope bernco product statewide to dedupe its source fetch #94

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Cohort jobs: full per-run lineage with source dedup#93

Cohort jobs: full per-run lineage with source dedup#93
jirhiker merged 2 commits into
mainfrom
docs/analyte-fetch-limitation

jirhiker commented Jun 28, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 28, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

jirhiker commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Why full lineage + dedup needs grouping

Change

Also

Verification

Uh oh!

github-actions Bot commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jirhiker commented Jun 28, 2026 •

edited

Loading

github-actions Bot commented Jun 28, 2026 •

edited

Loading