Cohort jobs: full per-run lineage with source dedup #93
Merged
The shared-source dedup collapses duplication across products but not across analytes: a source appears once per analyte because the backend unifies one parameter per pass. Note the future multi-analyte-unification optimization (a backend change) in the module docstring; out of scope for the orchestration graph.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The sources_job + publish-only product jobs preserved dedup but hid each product's source lineage (the product job showed only combine → geoserver).

Replace that with one job per cohort, keyed (group, mode, scope): the products that can share source assets. A cohort job materializes its whole graph in one run: each shared source unifies once (one asset key, selected once), every member combine reads it back through the GCS IO manager, then geoserver publishes. Full sources → combine → geoserver lineage per run, no duplicated source fetch.

5 cohorts (waterlevels/analytes × summary/timeseries × scope); the 87 source assets partition across them so no source is materialized by two jobs. Cohort cron = earliest member schedule. The tolerant IO manager now only matters for ad-hoc combine-only materializations from the UI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
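For reviewers who want the grouping rule spelled out, here is a minimal sketch of keying products into cohorts and checking that source assets partition across them. It is plain Python, not the code in this PR: the `Product` dataclass, the helper names, the product names, the `national` scope label, and every source key other than arsenic/summary/wqp are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Product:
    """Hypothetical stand-in for a product definition."""
    name: str
    group: str    # e.g. "waterlevels" or "analytes"
    mode: str     # e.g. "summary" or "timeseries"
    scope: str    # hypothetical scope label
    sources: frozenset[str]  # source asset keys its combine step reads


def cohort_key(product: Product) -> tuple[str, str, str]:
    # Products with the same (group, mode, scope) can share source assets.
    return (product.group, product.mode, product.scope)


def build_cohorts(products: list[Product]) -> dict[tuple[str, str, str], list[Product]]:
    cohorts: dict[tuple[str, str, str], list[Product]] = defaultdict(list)
    for product in products:
        cohorts[cohort_key(product)].append(product)
    return dict(cohorts)


def cohort_sources(members: list[Product]) -> frozenset[str]:
    # Union of member sources: each key appears once, so it is selected once per run.
    return frozenset().union(*(p.sources for p in members))


def assert_sources_partition(cohorts: dict[tuple[str, str, str], list[Product]]) -> None:
    # No source asset may belong to two cohort jobs.
    for (key_a, a), (key_b, b) in combinations(cohorts.items(), 2):
        overlap = cohort_sources(a) & cohort_sources(b)
        if overlap:
            raise ValueError(f"{key_a} and {key_b} both own {sorted(overlap)}")


# Hypothetical products; only arsenic/summary/wqp is taken from this PR.
products = [
    Product("product_a", "analytes", "summary", "national",
            frozenset({"arsenic/summary/wqp", "arsenic/summary/other"})),
    Product("product_b", "analytes", "summary", "national",
            frozenset({"arsenic/summary/wqp"})),
    Product("product_c", "waterlevels", "timeseries", "national",
            frozenset({"waterlevels/timeseries/other"})),
]

cohorts = build_cohorts(products)
assert_sources_partition(cohorts)
```

In this sketch product_a and product_b land in the same cohort, so their shared source is a single key in that cohort's selection.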
Reworks the job layout so each run shows the full sources → combine → geoserver lineage while still unifying each shared source only once.
Problem
The shared-source dedup (#92) split sources into their own sources_job and made product jobs publish-only (combine + geoserver). That deduped fetches but hid lineage: a product job's graph showed only <product> → geoserver, with sources nowhere in the run.
Why full lineage + dedup needs grouping
Cross-product dedup is impossible across separate runs: if two products that share arsenic/summary/wqp run as two runs, each must materialize it → duplicated. So the sharing products must run together. That grouping is a cohort.
Change
sources_job + per-product publish jobs.
API load unchanged: sources still fetched once each; publishes inline into the same run.
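The "fetched once each" claim can be illustrated with a small counting sketch. This is an assumption-laden model rather than the orchestration code: a run is modelled as a list of per-product source sets, and `materializations_per_source` and the product variable names are hypothetical.

```python
from collections import Counter


def materializations_per_source(runs: list[list[set[str]]]) -> Counter:
    """Count how often each source key is materialized across runs.

    Each run is a list of per-product source sets. Within one run a
    source key is selected once, however many products read it.
    """
    counts: Counter = Counter()
    for run in runs:
        selected = set().union(*run)  # one asset key, selected once per run
        counts.update(selected)
    return counts


# Two hypothetical products that share one source.
product_a_sources = {"arsenic/summary/wqp"}
product_b_sources = {"arsenic/summary/wqp"}

# One run per product: the shared source is materialized in both runs.
separate_runs = materializations_per_source([[product_a_sources], [product_b_sources]])

# One cohort run containing both products: the shared source is materialized once.
cohort_run = materializations_per_source([[product_a_sources, product_b_sources]])

assert separate_runs["arsenic/summary/wqp"] == 2
assert cohort_run["arsenic/summary/wqp"] == 1
```

Under this model, running the sharing products together is what keeps the count at one per source.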
Also
Verification
dg check defs clean; 277 offline tests pass; 5 cohort jobs each show full lineage; source assets partition with no cross-job duplication.
🤖 Generated with Claude Code