Skip to content

Auto site chunking#870

Merged
ldecicco-USGS merged 14 commits into
DOI-USGS:developfrom
ldecicco-USGS:develop
Mar 17, 2026
Merged

Auto site chunking#870
ldecicco-USGS merged 14 commits into
DOI-USGS:developfrom
ldecicco-USGS:develop

Conversation

@ldecicco-USGS

Copy link
Copy Markdown
Collaborator

This now works:

ohio <- read_waterdata_monitoring_location(state_name = "Ohio", 
                                  site_type_code = "ST")

ohio_discharge <- read_waterdata_daily(monitoring_location_id = ohio$monitoring_location_id,
                                  parameter_code = "00060",
                                  time = "P7D")

where before that 2nd query would have produced:

! HTTP 403 Forbidden.

@mikemahoney218-usgs mikemahoney218-usgs left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nothing but nits. Think this will be a big help!

Comment thread R/get_ogc_data.R Outdated
Comment on lines +12 to +13
service,
split_into = 250){

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest:

Suggested change
service,
split_into = 250){
service,
...,
split_into = 250){

And then potentially call this from rlang to enforce empty dots:
https://rlang.r-lib.org/reference/check_dots_empty.html

Maybe doesn't matter for an internal function, but I like separating off optional parameters to make it clear what can usually be left alone

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also just a nit, but I feel like chunk_size might be more clear... split_into makes me think this might create 250 equal-size chunks.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh! That finally answers my question on why people started putting ... in functions and then testing that the dots are empty. I noticed the trend but never figured out why the heck people were doing that. I'll add it here but I can't promise I'll remember this trick in the future.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also nice because it lets you reorganize arguments after ..., as it intentionally breaks passing arguments by position -- so you can add arguments into the middle of your pile of optional args without breaking old code.

Comment thread R/get_ogc_data.R Outdated
get_ogc_data(args = args,
output_id = output_id,
service = service)})
rl_filtered <- rl[sapply(rl, function(x) dim(x)[1]) > 0]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rl <- list() 
rl[[1]] <- rl[[2]] <- rl[[3]] <- data.frame()
do.call(rbind, rl[sapply(rl, function(x) dim(x)[1]) > 0])
#> NULL

not sure if this is worth worrying about, but might be unexpected if all site IDs are bad

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also: sapply scares me, but I don't see a way for it to accidentally return a matrix here, so I'm less scared by this instance of it. In my own code I prefer vapply to make my expected return type explicit

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack, good call - I tend to use sapply first because of laziness, and then need to remember to go back and figure out how to get vapply to work (EVERY TIME).

Comment thread R/get_ogc_data.R
ml_splits <- split(args[["monitoring_location_id"]],
ceiling(seq_along(args[["monitoring_location_id"]])/split_into))

rl <- lapply(ml_splits, function(x) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd have a wishlist item to replace this with future_lapply (though I think the cool kids use mirai these days) to let folks parallelize the downloads. That said, I think "if you're comfortable enough to parallelize things, you're comfortable enough to chunk sites by yourself" makes sense too

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was the thought that finally switched me to "sure, let's do this". I was worried if we implemented this, power-users would gripe that we took away their ability to parallelize in every new way imaginable and we'd quickly have a million rabbit holes to fill. BUT, if a user wants to parallelize, they can still chunk any way they want.

Comment thread R/get_ogc_data.R
Comment on lines +74 to +78
if(!service %in% c("monitoring-locations",
"time-series-metadata",
"field-measurements-metadata",
"combined-metadata",
"parameter-codes")){

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you wanted, we could add a keyword to these endpoints -- "meaningful-id"? -- which you could read from /collections at package build time (or just via a script) to get a list of endpoints with IDs worth preserving? Thinking of this because probably most of the RLMS endpoints you'd want the ID field.
https://api.waterdata.usgs.gov/ogcapi/v0/collections?f=json

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, all the data coming back from the read_waterdata_metadata changes the "id" to a meaningful id from this named list here:
https://github.com/DOI-USGS/dataRetrieval/blob/main/R/AAA.R#L22
It is basically making them singular and subbing a _ for - (to keep with the arguments style).

@ldecicco-USGS ldecicco-USGS merged commit 6a5bcce into DOI-USGS:develop Mar 17, 2026
1 check passed
ldecicco-USGS added a commit to ldecicco-USGS/dataRetrieval that referenced this pull request Apr 29, 2026
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 22, 2026
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:

    from dataretrieval.waterdata import get_daily, get_monitoring_locations

    sites_df, _ = get_monitoring_locations(
        state_name="Ohio",
        site_type_code="ST",
        skip_geometry=True,
    )
    # Before: HTTP 414 once `sites_df` exceeded ~500 rows.
    # After: transparently chunked into multiple sub-requests, one
    # combined DataFrame returned.
    df, md = get_daily(
        monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
        parameter_code="00060",
        time="P7D",
    )

This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.

## Design

Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.

## Rate-limit gating

After the first sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the rest of the plan can't fit the
current per-key rate-limit window, it raises `RequestExceedsQuota`
reporting the deficit before burning more budget. Set
`API_USGS_LIMIT=0` to bypass the pre-emptive check.

## Mid-call interruption recovery

Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx. Both carry the partial result plus a resumable call handle
on `exc.call`:

    import time
    from dataretrieval.waterdata import get_daily
    from dataretrieval.waterdata.chunking import ChunkInterrupted

    try:
        df, md = get_daily(monitoring_location_id=long_list)
    except ChunkInterrupted as exc:
        time.sleep(exc.retry_after or 5 * 60)
        # Re-issues only the still-pending sub-requests; banked work
        # is preserved on `exc.call`.
        df, md = exc.call.resume()

## Connection-pool reuse

`ChunkedCall.resume` opens one `requests.Session` for the entire
fan-out and publishes it via a `ContextVar` so paginated-loop
helpers downstream (`_walk_pages`, `get_stats_data` via the new
`_paginate` helper) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk
fan-out against the live USGS API (1.78s shared vs 3.03s
per-sub-request).

## Metadata behavior

One behavior change for paginated/chunked calls:

- `BaseMetadata.url` still reflects the user's original query
  (unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request
  headers so downstream code that branches on
  `x-ratelimit-remaining` sees current state (was: first page's
  headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across
  every page/sub-request (was: first page's elapsed).

## Module layout

- New module `dataretrieval.waterdata.chunking`: joint planner,
  exception hierarchy (`_RetryableTransportError`, `RateLimited`,
  `ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
  `ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
  `ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
  shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
  into a `_paginate` strategy helper that `_walk_pages` and
  `get_stats_data` both delegate to; typed transport exceptions
  moved out to `chunking` so the layer direction is strictly
  `utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter
  and filter-chunkability detector kept as primitives the joint
  planner consumes.

## Tests

80 new unit tests in `tests/waterdata_chunking_test.py` covering
the planner, axis extraction, cartesian-product enumeration,
rate-limit gating, resume idempotency and equivalence, transient-
error classification, shared-session reuse, and a URL-construction
stress test against the real `_construct_api_requests` builder (not
a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting
every sub-request URL stays under 8000 bytes and the joint planner
beats the bail-floor worst case. Mid-pagination 429/5xx now also
covered for both the OGC and stats paginators.

## Related

Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.

Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`,
`instantanous` → `instantaneous`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 22, 2026
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:

    from dataretrieval.waterdata import get_daily, get_monitoring_locations

    sites_df, _ = get_monitoring_locations(
        state_name="Ohio",
        site_type_code="ST",
        skip_geometry=True,
    )
    # Before: HTTP 414 once `sites_df` exceeded ~500 rows.
    # After: transparently chunked into multiple sub-requests, one
    # combined DataFrame returned.
    df, md = get_daily(
        monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
        parameter_code="00060",
        time="P7D",
    )

This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.

## Design

Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.

## Rate-limit gating

After the first sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the rest of the plan can't fit the
current per-key rate-limit window, it raises `RequestExceedsQuota`
reporting the deficit before burning more budget. Set
`API_USGS_LIMIT=0` to bypass the pre-emptive check.

## Mid-call interruption recovery

Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx. Both carry the partial result plus a resumable call handle
on `exc.call`:

    import time
    from dataretrieval.waterdata import get_daily
    from dataretrieval.waterdata.chunking import ChunkInterrupted

    try:
        df, md = get_daily(monitoring_location_id=long_list)
    except ChunkInterrupted as exc:
        time.sleep(exc.retry_after or 5 * 60)
        # Re-issues only the still-pending sub-requests; banked work
        # is preserved on `exc.call`.
        df, md = exc.call.resume()

## Connection-pool reuse

`ChunkedCall.resume` opens one `requests.Session` for the entire
fan-out and publishes it via a `ContextVar` so paginated-loop
helpers downstream (`_walk_pages`, `get_stats_data` via the new
`_paginate` helper) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk
fan-out against the live USGS API (1.78s shared vs 3.03s
per-sub-request).

## Metadata behavior

One behavior change for paginated/chunked calls:

- `BaseMetadata.url` still reflects the user's original query
  (unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request
  headers so downstream code that branches on
  `x-ratelimit-remaining` sees current state (was: first page's
  headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across
  every page/sub-request (was: first page's elapsed).

## Module layout

- New module `dataretrieval.waterdata.chunking`: joint planner,
  exception hierarchy (`_RetryableTransportError`, `RateLimited`,
  `ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
  `ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
  `ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
  shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
  into a `_paginate` strategy helper that `_walk_pages` and
  `get_stats_data` both delegate to; typed transport exceptions
  moved out to `chunking` so the layer direction is strictly
  `utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter
  and filter-chunkability detector kept as primitives the joint
  planner consumes.

## Tests

80 new unit tests in `tests/waterdata_chunking_test.py` covering
the planner, axis extraction, cartesian-product enumeration,
rate-limit gating, resume idempotency and equivalence, transient-
error classification, shared-session reuse, and a URL-construction
stress test against the real `_construct_api_requests` builder (not
a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting
every sub-request URL stays under 8000 bytes and the joint planner
beats the bail-floor worst case. Mid-pagination 429/5xx now also
covered for both the OGC and stats paginators.

## Related

Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.

Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`,
`instantanous` → `instantaneous`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 22, 2026
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:

    from dataretrieval.waterdata import get_daily, get_monitoring_locations

    sites_df, _ = get_monitoring_locations(
        state_name="Ohio",
        site_type_code="ST",
        skip_geometry=True,
    )
    # Before: HTTP 414 once `sites_df` exceeded ~500 rows.
    # After: transparently chunked into multiple sub-requests, one
    # combined DataFrame returned.
    df, md = get_daily(
        monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
        parameter_code="00060",
        time="P7D",
    )

This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.

## Design

Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.

## Rate-limit gating

After the first sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the rest of the plan can't fit the
current per-key rate-limit window, it raises `RequestExceedsQuota`
reporting the deficit before burning more budget. Set
`API_USGS_LIMIT=0` to bypass the pre-emptive check.

## Mid-call interruption recovery

Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx. Both carry the partial result plus a resumable call handle
on `exc.call`:

    import time
    from dataretrieval.waterdata import get_daily
    from dataretrieval.waterdata.chunking import ChunkInterrupted

    try:
        df, md = get_daily(monitoring_location_id=long_list)
    except ChunkInterrupted as exc:
        time.sleep(exc.retry_after or 5 * 60)
        # Re-issues only the still-pending sub-requests; banked work
        # is preserved on `exc.call`.
        df, md = exc.call.resume()

## Connection-pool reuse

`ChunkedCall.resume` opens one `requests.Session` for the entire
fan-out and publishes it via a `ContextVar` so paginated-loop
helpers downstream (`_walk_pages`, `get_stats_data` via the new
`_paginate` helper) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk
fan-out against the live USGS API (1.78s shared vs 3.03s
per-sub-request).

## Metadata behavior

One behavior change for paginated/chunked calls:

- `BaseMetadata.url` still reflects the user's original query
  (unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request
  headers so downstream code that branches on
  `x-ratelimit-remaining` sees current state (was: first page's
  headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across
  every page/sub-request (was: first page's elapsed).

## Module layout

- New module `dataretrieval.waterdata.chunking`: joint planner,
  exception hierarchy (`_RetryableTransportError`, `RateLimited`,
  `ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
  `ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
  `ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
  shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
  into a `_paginate` strategy helper that `_walk_pages` and
  `get_stats_data` both delegate to; typed transport exceptions
  moved out to `chunking` so the layer direction is strictly
  `utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter
  and filter-chunkability detector kept as primitives the joint
  planner consumes.

## Tests

80 new unit tests in `tests/waterdata_chunking_test.py` covering
the planner, axis extraction, cartesian-product enumeration,
rate-limit gating, resume idempotency and equivalence, transient-
error classification, shared-session reuse, and a URL-construction
stress test against the real `_construct_api_requests` builder (not
a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting
every sub-request URL stays under 8000 bytes and the joint planner
beats the bail-floor worst case. Mid-pagination 429/5xx now also
covered for both the OGC and stats paginators.

## Related

Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.

Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`,
`instantanous` → `instantaneous`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 22, 2026
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:

    from dataretrieval.waterdata import get_daily, get_monitoring_locations

    sites_df, _ = get_monitoring_locations(
        state_name="Ohio",
        site_type_code="ST",
        skip_geometry=True,
    )
    # Before: HTTP 414 once `sites_df` exceeded ~500 rows.
    # After: transparently chunked into multiple sub-requests, one
    # combined DataFrame returned.
    df, md = get_daily(
        monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
        parameter_code="00060",
        time="P7D",
    )

This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.

## Design

Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.

## Rate-limit gating

After the first sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the rest of the plan can't fit the
current per-key rate-limit window, it raises `RequestExceedsQuota`
reporting the deficit before burning more budget. Set
`API_USGS_LIMIT=0` to bypass the pre-emptive check.

## Mid-call interruption recovery

Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx. Both carry the partial result plus a resumable call handle
on `exc.call`:

    import time
    from dataretrieval.waterdata import get_daily
    from dataretrieval.waterdata.chunking import ChunkInterrupted

    try:
        df, md = get_daily(monitoring_location_id=long_list)
    except ChunkInterrupted as exc:
        time.sleep(exc.retry_after or 5 * 60)
        # Re-issues only the still-pending sub-requests; banked work
        # is preserved on `exc.call`.
        df, md = exc.call.resume()

## Connection-pool reuse

`ChunkedCall.resume` opens one `requests.Session` for the entire
fan-out and publishes it via a `ContextVar` so paginated-loop
helpers downstream (`_walk_pages`, `get_stats_data` via the new
`_paginate` helper) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk
fan-out against the live USGS API (1.78s shared vs 3.03s
per-sub-request).

## Metadata behavior

One behavior change for paginated/chunked calls:

- `BaseMetadata.url` still reflects the user's original query
  (unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request
  headers so downstream code that branches on
  `x-ratelimit-remaining` sees current state (was: first page's
  headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across
  every page/sub-request (was: first page's elapsed).

## Module layout

- New module `dataretrieval.waterdata.chunking`: joint planner,
  exception hierarchy (`_RetryableTransportError`, `RateLimited`,
  `ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
  `ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
  `ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
  shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
  into a `_paginate` strategy helper that `_walk_pages` and
  `get_stats_data` both delegate to; typed transport exceptions
  moved out to `chunking` so the layer direction is strictly
  `utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter
  and filter-chunkability detector kept as primitives the joint
  planner consumes.

## Tests

80 new unit tests in `tests/waterdata_chunking_test.py` covering
the planner, axis extraction, cartesian-product enumeration,
rate-limit gating, resume idempotency and equivalence, transient-
error classification, shared-session reuse, and a URL-construction
stress test against the real `_construct_api_requests` builder (not
a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting
every sub-request URL stays under 8000 bytes and the joint planner
beats the bail-floor worst case. Mid-pagination 429/5xx now also
covered for both the OGC and stats paginators.

## Related

Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.

Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`,
`instantanous` → `instantaneous`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 22, 2026
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:

    from dataretrieval.waterdata import get_daily, get_monitoring_locations

    sites_df, _ = get_monitoring_locations(
        state_name="Ohio",
        site_type_code="ST",
        skip_geometry=True,
    )
    # Before: HTTP 414 once `sites_df` exceeded ~500 rows.
    # After: transparently chunked into multiple sub-requests, one
    # combined DataFrame returned.
    df, md = get_daily(
        monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
        parameter_code="00060",
        time="P7D",
    )

This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.

Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.

After the first sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the rest of the plan can't fit the
current per-key rate-limit window, it raises `RequestExceedsQuota`
reporting the deficit before burning more budget. Set
`API_USGS_LIMIT=0` to bypass the pre-emptive check.

Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx. Both carry the partial result plus a resumable call handle
on `exc.call`:

    import time
    from dataretrieval.waterdata import get_daily
    from dataretrieval.waterdata.chunking import ChunkInterrupted

    try:
        df, md = get_daily(monitoring_location_id=long_list)
    except ChunkInterrupted as exc:
        time.sleep(exc.retry_after or 5 * 60)
        # Re-issues only the still-pending sub-requests; banked work
        # is preserved on `exc.call`.
        df, md = exc.call.resume()

`ChunkedCall.resume` opens one `requests.Session` for the entire
fan-out and publishes it via a `ContextVar` so paginated-loop
helpers downstream (`_walk_pages`, `get_stats_data` via the new
`_paginate` helper) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk
fan-out against the live USGS API (1.78s shared vs 3.03s
per-sub-request).

One behavior change for paginated/chunked calls:

- `BaseMetadata.url` still reflects the user's original query
  (unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request
  headers so downstream code that branches on
  `x-ratelimit-remaining` sees current state (was: first page's
  headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across
  every page/sub-request (was: first page's elapsed).

- New module `dataretrieval.waterdata.chunking`: joint planner,
  exception hierarchy (`_RetryableTransportError`, `RateLimited`,
  `ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
  `ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
  `ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
  shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
  into a `_paginate` strategy helper that `_walk_pages` and
  `get_stats_data` both delegate to; typed transport exceptions
  moved out to `chunking` so the layer direction is strictly
  `utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter
  and filter-chunkability detector kept as primitives the joint
  planner consumes.

80 new unit tests in `tests/waterdata_chunking_test.py` covering
the planner, axis extraction, cartesian-product enumeration,
rate-limit gating, resume idempotency and equivalence, transient-
error classification, shared-session reuse, and a URL-construction
stress test against the real `_construct_api_requests` builder (not
a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting
every sub-request URL stays under 8000 bytes and the joint planner
beats the bail-floor worst case. Mid-pagination 429/5xx now also
covered for both the OGC and stats paginators.

Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.

Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`,
`instantanous` → `instantaneous`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 23, 2026
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:

    from dataretrieval.waterdata import get_daily, get_monitoring_locations

    sites_df, _ = get_monitoring_locations(
        state_name="Ohio",
        site_type="Stream",
    )
    # Before: HTTP 414 once `sites_df` exceeded ~500 rows.
    # After: transparently chunked, one combined DataFrame returned.
    df, md = get_daily(
        monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
        parameter_code="00060",
        time="P7D",
    )

This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.

Design
------
Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.

Quota
-----
After every non-final sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the remaining window can't fit the rest
of the plan it raises `RequestExceedsQuota` carrying the in-flight
`ChunkedCall` handle on `.call`, so already-fetched chunks are
recoverable via `exc.call.partial_frame`. Set `API_USGS_LIMIT=0` to
bypass the pre-emptive check.

Interruption
------------
Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx (and bare transport failures like `ConnectionError`). Both
carry the partial result plus a resumable call handle on `exc.call`:

    import time
    from dataretrieval.waterdata import get_daily
    from dataretrieval.waterdata.chunking import ChunkInterrupted

    try:
        df, md = get_daily(monitoring_location_id=long_site_list)
    except ChunkInterrupted as exc:
        while True:
            time.sleep(exc.retry_after or 5 * 60)
            try:
                df, md = exc.call.resume()
                break
            except ChunkInterrupted as next_exc:
                exc = next_exc

`Retry-After` (when the server sets it) is surfaced on the exception
as `.retry_after`.

A single `requests.Session` is opened once per chunked call and
published via a `ContextVar` so paginated-loop helpers downstream
(`_walk_pages`) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first.

Metadata
--------
One behavior change for paginated / chunked calls:

- `BaseMetadata.url` still reflects the user's original query.
- `BaseMetadata.header` now carries the LAST page / sub-request
  headers so downstream code that branches on `x-ratelimit-remaining`
  sees current state (was: first page's headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across every
  page / sub-request (was: first page's elapsed).

Module layout
-------------
- New module `dataretrieval.waterdata.chunking`: joint planner,
  exception hierarchy (`_RetryableTransportError`, `RateLimited`,
  `ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
  `ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
  `ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
  shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
  into a `_paginate(req, parse_response, follow_up, client)` strategy
  helper that `_walk_pages` and `get_stats_data` both delegate to;
  typed transport exceptions moved out to `chunking` so the layer
  direction is strictly `utils → chunking`.
- `dataretrieval.waterdata.filters`: slimmed to a CQL-parsing leaf
  (`_split_top_level_or`, `_check_numeric_filter_pitfall`,
  `_is_chunkable`, `FILTER_LANG`); URL-budget and filter-chunking
  logic moved into the joint planner.

Tests
-----
51 new unit tests in `tests/waterdata_chunking_test.py` cover the
planner, axis extraction, cartesian-product enumeration, rate-limit
gating, resume idempotency and equivalence, transient-error
classification, shared-session reuse, and a URL-construction stress
test against the real `_construct_api_requests` builder (not a fake)
— 500 USGS site IDs × 20 datetime OR-clauses, asserting every
sub-request URL stays under 8000 bytes. Plus regression tests for
the pre-merge code review's bug fixes (empty-frame GDF preservation,
single-frame aliasing, iter_sub_args passthrough copy, quota check
on resume, RequestExceedsQuota `.call` handle, missing-features
defensiveness). Mid-pagination 429 / 5xx is covered for both the OGC
and stats paginators.

Mirrors R `dataRetrieval`'s
[#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.

Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`, `instantanous` →
`instantaneous`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
thodson-usgs added a commit to thodson-usgs/dataretrieval-python that referenced this pull request May 23, 2026
The OGC `waterdata` getters previously failed with HTTP 414 when the
request URL exceeded the server's ~8 KB byte limit. A common pattern
— pulling a long site list from `get_monitoring_locations` and
feeding it into `get_daily` — was the main offender:

    sites_df, _ = get_monitoring_locations(state_name="Ohio")
    df, md = get_daily(
        monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
        parameter_code="00060",
        time="P7D",
    )

Introduces a joint chunker that models every multi-value list
parameter and the cql-text `filter` (split on top-level `OR`) as a
chunkable axis. Greedy halving splits the biggest chunk across all
axes until each sub-request URL fits; the chunker fans out under the
hood and returns one combined DataFrame. Callers see no API change.

Mid-stream 429 / 5xx surface as `ChunkInterrupted` subclasses
(`QuotaExhausted` / `ServiceInterrupted`) carrying the partial result
plus a `.call` resumable handle — `exc.call.resume()` continues only
the still-pending sub-requests. Pre-emptive `RequestExceedsQuota`
catches plans that won't fit the remaining rate-limit window;
`API_USGS_LIMIT=0` bypasses the check.

Behavior changes for paginated / chunked calls:

- `BaseMetadata.url` still reflects the user's original query.
- `BaseMetadata.header` now carries the LAST page's headers so
  `x-ratelimit-remaining` is current (was: first page's).
- `BaseMetadata.query_time` is now cumulative wall-clock across pages
  (was: first page's elapsed).

Mirrors R `dataRetrieval`'s
[#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants