Fetch each source once for both summary and timeseries unification#95

Merged
jirhiker merged 1 commit into main from feature/single-fetch-dual-unify
Jun 29, 2026

Conversation

@jirhiker

Member

What

A DIE source was unified separately per mode (summary vs timeseries). But every connector's get_records is mode-agnostic — both modes pull the same raw observations and differ only in how they're transformed. So a source needed by both a summary and a timeseries product hit the API twice for identical data.

This makes a source get fetched once and unified for both modes.

How

Backend: unify_source_both(config, source_key) (backend/unifier.py):

  • Enables an opt-in shared-fetch cache on the source (BaseSource._fetch_records / _sites_cache, off by default → the CLI/API path is byte-identical).
  • Runs _site_wrapper twice with config.output_summary toggled; the second pass reuses the first pass's cached site list + observations instead of re-querying.
  • Output is identical to running unify_source twice — only the underlying fetch is shared.
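The shared-fetch idea above can be sketched in a few lines. This is a minimal illustration, not the code in backend/unifier.py: `FakeSource`, `fetch_records`, `share_fetch`, `unify`, and `unify_both` are hypothetical names, and the per-mode transforms are invented stand-ins. Only the shape (an opt-in cache that is off by default, and two passes over one fetch) follows the description.

```python
class FakeSource:
    """Hypothetical connector; only the caching pattern mirrors the PR text."""

    def __init__(self):
        self.share_fetch = False      # opt-in: off by default
        self._records_cache = None
        self.api_calls = 0

    def get_records(self):
        # Mode-agnostic fetch: both modes need the same raw observations.
        self.api_calls += 1
        return [("site-a", 10.0), ("site-a", 14.0), ("site-b", 7.0)]

    def fetch_records(self):
        # With the cache off, every call goes to the API, as before.
        if not self.share_fetch:
            return self.get_records()
        if self._records_cache is None:
            self._records_cache = self.get_records()
        return self._records_cache


def unify(source, output_summary):
    """One pass; the mode only changes the transform, not the fetch."""
    records = source.fetch_records()
    if output_summary:
        by_site = {}
        for site, value in records:
            by_site.setdefault(site, []).append(value)
        # Illustrative summary: mean value per site.
        return {site: sum(vals) / len(vals) for site, vals in by_site.items()}
    # Illustrative timeseries: the raw observations, copied.
    return list(records)


def unify_both(source):
    """Enable the cache, run both passes, then restore the default state."""
    source.share_fetch = True
    try:
        summary = unify(source, output_summary=True)
        timeseries = unify(source, output_summary=False)
    finally:
        source.share_fetch = False
        source._records_cache = None
    return summary, timeseries
```

Calling `unify_both(FakeSource())` in this sketch triggers a single `get_records` call, whereas calling `unify` twice on a fresh source (cache off) triggers two.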

Orchestration — drop mode from the shared source key and the cohort key:

  • A source asset now calls unify_source_both and carries records (summary) + sites/timeseries together.
  • A summary product and a timeseries product over the same (parameter, scope, source) share one asset and one fetch.
  • They consequently share a cohort (cohorts keyed by group+scope), which is what lets them run together and dedupe the fetch.
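To see why dropping mode from the keys merges assets and cohorts, consider the toy example below. The `Product` tuple, the product names, and the `source_x` source are all hypothetical; the real orchestration code may build its keys differently. The sketch only shows that keying on (parameter, scope, source) and (group, scope) collapses a summary/timeseries pair into one entry each.

```python
from typing import NamedTuple


class Product(NamedTuple):
    # Hypothetical product description, for illustration only.
    name: str
    mode: str        # "summary" or "timeseries"
    group: str
    parameter: str
    scope: str
    source: str


products = [
    Product("wl_summary", "summary", "waterlevels", "waterlevels", "state_NM", "source_x"),
    Product("wl_timeseries", "timeseries", "waterlevels", "waterlevels", "state_NM", "source_x"),
    Product("as_summary", "summary", "analytes", "arsenic", "state_NM", "source_x"),
]

# Old keys: mode is part of the identity, so the two waterlevels
# products get separate source assets and separate cohorts.
old_source_keys = {(p.mode, p.parameter, p.scope, p.source) for p in products}
old_cohort_keys = {(p.mode, p.group, p.scope) for p in products}

# New keys: mode is dropped, so products over the same
# (parameter, scope, source) share one asset and one cohort.
new_source_keys = {(p.parameter, p.scope, p.source) for p in products}
new_cohort_keys = {(p.group, p.scope) for p in products}

print(len(old_source_keys), len(new_source_keys))
print(len(old_cohort_keys), len(new_cohort_keys))
```

In this toy set the three mode-keyed assets reduce to two, and the same holds for cohorts.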

Effect

| | before | after |
|---|---|---|
| shared source assets | 86 | 68 (−18 redundant fetches: waterlevels 9, arsenic 4, nitrate 5) |
| cohort jobs | 4 | 2 (waterlevels_state_NM, analytes_state_NM) |

Per-analyte fetch multiplication is unchanged and remains the documented next-step backend optimization.

Verification

  • tests/test_unify_dual.py (new): proves unify_source_both fetches each source once and yields output identical to two separate unify_source runs, plus the fetch-cache invariants.
  • dg check defs clean; 284 offline tests pass.
  • Adversarial review: cache stays off for CLI, config mutation isolated to fresh per-call config, no SiteRecord mutation across passes.
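One way to get the "config mutation isolated to fresh per-call config" property is to give each pass its own copy of the config rather than toggling the caller's object. The sketch below assumes a dataclass-style config; `Config`, its fields, and `run_both_passes` are hypothetical and may not match how the real code isolates the toggle.

```python
from dataclasses import dataclass, replace


@dataclass
class Config:
    # Hypothetical stand-in for the real config; only output_summary
    # is named in the PR description.
    output_summary: bool = False
    parameter: str = "waterlevels"


def run_both_passes(config, run_pass):
    """Run the summary and timeseries passes without touching `config`."""
    results = {}
    for mode, flag in (("summary", True), ("timeseries", False)):
        # dataclasses.replace returns a new object, so the toggle never
        # leaks back into the caller's config.
        pass_config = replace(config, output_summary=flag)
        results[mode] = run_pass(pass_config)
    return results
```

A test in the spirit of the invariants listed above could pass in a `run_pass` callable that records the flag it saw, then assert that the caller's `output_summary` is unchanged afterwards.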

🤖 Generated with Claude Code

A DIE source was unified separately per mode (summary vs timeseries), but
every connector's get_records is mode-agnostic: both modes pull the same raw
observations and differ only in how they are transformed. So a source needed
by both a summary and a timeseries product was fetched from the API twice for
identical data.

Backend: add unify_source_both(config, source_key), which fetches a source
once and unifies it for both modes. It enables an opt-in shared-fetch cache on
the source (BaseSource._fetch_records / _sites_cache, off by default so the
CLI/API path is byte-identical) and runs _site_wrapper twice with
config.output_summary toggled — the second pass reuses the first pass's cached
site list and observations instead of re-querying. Output is identical to
running unify_source twice; only the underlying fetch is shared.

Orchestration: drop `mode` from the shared source key and the cohort key. A
source asset now calls unify_source_both and carries records (summary) +
sites/timeseries together, so a summary product and a timeseries product over
the same (parameter, scope, source) share one asset and one fetch. Summary and
timeseries products consequently share a cohort (cohorts keyed by group+scope),
which is what lets them run together and dedupe the fetch.

Effect: shared source assets 86 -> 68 (-18 redundant fetches: waterlevels 9,
arsenic 4, nitrate 5); cohort jobs 4 -> 2 (waterlevels_state_NM,
analytes_state_NM). Per-analyte fetch multiplication is unchanged and remains
the documented next-step backend optimization.

Adds tests/test_unify_dual.py: proves unify_source_both fetches each source
once and yields output identical to two separate unify_source runs, plus the
fetch-cache invariants. dg check defs clean; 284 offline tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

github-actions Bot commented Jun 29, 2026


Your pull request is automatically being deployed to Dagster Cloud.

Location Status Link Updated
die-orchestration View in Cloud Jun 29, 2026 at 04:55 AM (UTC)

@jirhiker jirhiker merged commit 38c0fa3 into main Jun 29, 2026
3 checks passed