Fetch each source once for both summary and timeseries unification by jirhiker · Pull Request #95 · DataIntegrationGroup/DataIntegrationEngine

jirhiker · 2026-06-29T04:51:30Z

What

A DIE source was unified separately per mode (summary vs timeseries). But every connector's get_records is mode-agnostic — both modes pull the same raw observations and differ only in how they're transformed. So a source needed by both a summary and a timeseries product hit the API twice for identical data.

This PR fetches each source once and unifies it for both modes.

How

Backend: unify_source_both(config, source_key) (backend/unifier.py):

- Enables an opt-in shared-fetch cache on the source (BaseSource._fetch_records / _sites_cache, off by default, so the CLI/API path is byte-identical).
- Runs _site_wrapper twice with config.output_summary toggled; the second pass reuses the first pass's cached site list and observations instead of re-querying.
- Output is identical to running unify_source twice; only the underlying fetch is shared.
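The shared-fetch idea above can be sketched in isolation. This is a minimal illustration, not the code in backend/unifier.py: `SketchSource`, `SketchConfig`, `sketch_unify`, and `sketch_unify_both` are hypothetical stand-ins, the summary transform (a per-site mean) is invented for the example, and clearing the cache after the second pass is an assumption rather than something the PR states.

```python
from dataclasses import dataclass, replace


class SketchSource:
    """Hypothetical stand-in for a connector; not the real BaseSource."""

    def __init__(self):
        self.fetch_count = 0        # counts simulated API calls
        self.cache_enabled = False  # opt-in: off by default
        self._records_cache = None

    def _query_api(self):
        # Simulated remote call returning (site, value) observations.
        self.fetch_count += 1
        return [("site-a", 1.0), ("site-a", 3.0), ("site-b", 2.0)]

    def get_records(self):
        # Mode-agnostic: the same raw observations serve both modes.
        if not self.cache_enabled:
            return self._query_api()
        if self._records_cache is None:
            self._records_cache = self._query_api()
        return self._records_cache


@dataclass(frozen=True)
class SketchConfig:
    output_summary: bool = True


def sketch_unify(source, config):
    """One pass: fetch, then transform according to the mode."""
    records = source.get_records()
    if config.output_summary:
        by_site = {}
        for site, value in records:
            by_site.setdefault(site, []).append(value)
        # Invented summary transform: mean value per site.
        return {site: sum(vals) / len(vals) for site, vals in by_site.items()}
    # Timeseries mode: return a copy so the cached list is never mutated.
    return list(records)


def sketch_unify_both(source, config):
    """Two passes over one fetch, toggling the mode on a fresh config copy."""
    source.cache_enabled = True
    try:
        summary = sketch_unify(source, replace(config, output_summary=True))
        timeseries = sketch_unify(source, replace(config, output_summary=False))
    finally:
        # Assumption: restore the default so later callers see no cache.
        source.cache_enabled = False
        source._records_cache = None
    return summary, timeseries
```

Calling `sketch_unify_both(SketchSource(), SketchConfig())` triggers a single simulated fetch, whereas two separate `sketch_unify` calls on a fresh source trigger two. Using `dataclasses.replace` keeps the caller's config untouched, which is one way to isolate the mode toggle to a per-call copy.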

Orchestration: drop mode from the shared source key and the cohort key:

- A source asset now calls unify_source_both and carries records (summary) + sites/timeseries together.
- A summary product and a timeseries product over the same (parameter, scope, source) share one asset and one fetch.
- They consequently share a cohort (cohorts keyed by group+scope), which is what lets them run together and dedupe the fetch.
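To make the keying change concrete, here is a small sketch of how dropping mode from a key collapses a summary product and a timeseries product onto one shared asset and one cohort. The `Product` tuple, the key functions, and the sample rows (including the source names `source_a` and `source_b`) are hypothetical and are not taken from the orchestration code.

```python
from typing import NamedTuple


class Product(NamedTuple):
    """Hypothetical product description used only for this sketch."""
    parameter: str
    scope: str
    source: str
    mode: str   # "summary" or "timeseries"
    group: str


def source_key_with_mode(p):
    # Old-style key: mode is part of the identity.
    return (p.parameter, p.scope, p.source, p.mode)


def source_key(p):
    # New-style key: mode dropped, so both modes map to one asset.
    return (p.parameter, p.scope, p.source)


def cohort_key_with_mode(p):
    return (p.group, p.scope, p.mode)


def cohort_key(p):
    # Cohorts keyed by group + scope only.
    return (p.group, p.scope)


def count_distinct(products, key_fn):
    """Number of distinct assets (or cohorts) the key function produces."""
    return len({key_fn(p) for p in products})


# Toy data: two parameters, each wanted in both modes.
products = [
    Product("waterlevels", "state_NM", "source_a", "summary", "waterlevels"),
    Product("waterlevels", "state_NM", "source_a", "timeseries", "waterlevels"),
    Product("arsenic", "state_NM", "source_b", "summary", "analytes"),
    Product("arsenic", "state_NM", "source_b", "timeseries", "analytes"),
]

assets_before = count_distinct(products, source_key_with_mode)
assets_after = count_distinct(products, source_key)
cohorts_before = count_distinct(products, cohort_key_with_mode)
cohorts_after = count_distinct(products, cohort_key)
```

On this toy data both counts halve once mode is removed from the key. The real reduction depends on how many sources are actually needed in both modes.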

Effect

Metric	Before	After
shared source assets	86	68 (−18 redundant fetches: waterlevels 9, arsenic 4, nitrate 5)
cohort jobs	4	2 (`waterlevels_state_NM`, `analytes_state_NM`)

Per-analyte fetch multiplication is unchanged and remains the documented next-step backend optimization.

Verification

- tests/test_unify_dual.py (new): proves unify_source_both fetches each source once and yields output identical to two separate unify_source runs, plus the fetch-cache invariants.
- dg check defs clean; 284 offline tests pass.
- Adversarial review: cache stays off for CLI, config mutation isolated to fresh per-call config, no SiteRecord mutation across passes.
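The two properties the new test is described as proving (one fetch, identical output) can be expressed as a small offline test. The sketch below is not the contents of tests/test_unify_dual.py: `transform`, `run_single`, and `run_both` are hypothetical stand-ins, and the fetch is a `unittest.mock.Mock` so no network access is involved.

```python
from unittest.mock import Mock


def make_fetch():
    # A mock fetch that records how many times it was called.
    return Mock(return_value=[("site-a", 1.0), ("site-b", 2.0)])


def transform(records, summary):
    # Invented transforms: summary lists the sites, timeseries keeps every row.
    if summary:
        return sorted({site for site, _ in records})
    return list(records)


def run_single(fetch, summary):
    """Stand-in for one per-mode run: one fetch per call."""
    return transform(fetch(), summary)


def run_both(fetch):
    """Stand-in for the dual run: fetch once, transform twice."""
    records = fetch()
    return transform(records, True), transform(records, False)


def test_fetch_once_and_identical_output():
    shared = make_fetch()
    both = run_both(shared)

    separate = make_fetch()
    expected = (run_single(separate, True), run_single(separate, False))

    assert shared.call_count == 1    # dual run fetched once
    assert separate.call_count == 2  # separate runs fetched twice
    assert both == expected          # outputs are identical
```

Comparing against two independent single-mode runs, rather than against hard-coded values, is what ties the test to the "output is identical" claim.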

🤖 Generated with Claude Code

A DIE source was unified separately per mode (summary vs timeseries), but every connector's get_records is mode-agnostic: both modes pull the same raw observations and differ only in how they are transformed. So a source needed by both a summary and a timeseries product was fetched from the API twice for identical data.

Backend: add unify_source_both(config, source_key), which fetches a source once and unifies it for both modes. It enables an opt-in shared-fetch cache on the source (BaseSource._fetch_records / _sites_cache, off by default so the CLI/API path is byte-identical) and runs _site_wrapper twice with config.output_summary toggled; the second pass reuses the first pass's cached site list and observations instead of re-querying. Output is identical to running unify_source twice; only the underlying fetch is shared.

Orchestration: drop `mode` from the shared source key and the cohort key. A source asset now calls unify_source_both and carries records (summary) + sites/timeseries together, so a summary product and a timeseries product over the same (parameter, scope, source) share one asset and one fetch. Summary and timeseries products consequently share a cohort (cohorts keyed by group+scope), which is what lets them run together and dedupe the fetch.

Effect: shared source assets 86 -> 68 (-18 redundant fetches: waterlevels 9, arsenic 4, nitrate 5); cohort jobs 4 -> 2 (waterlevels_state_NM, analytes_state_NM). Per-analyte fetch multiplication is unchanged and remains the documented next-step backend optimization.

Adds tests/test_unify_dual.py: proves unify_source_both fetches each source once and yields output identical to two separate unify_source runs, plus the fetch-cache invariants. dg check defs clean; 284 offline tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

github-actions · 2026-06-29T04:52:00Z

Your pull request is automatically being deployed to Dagster Cloud.

Location	Status	Link	Updated
`die-orchestration`		View in Cloud	Jun 29, 2026 at 04:55 AM (UTC)

jirhiker merged commit 38c0fa3 into main Jun 29, 2026
3 checks passed

jirhiker merged 1 commit into main from feature/single-fetch-dual-unify
