Auto site chunking by ldecicco-USGS · Pull Request #870 · DOI-USGS/dataRetrieval

ldecicco-USGS · 2026-03-17T15:47:54Z

This now works:

ohio <- read_waterdata_monitoring_location(state_name = "Ohio", 
                                  site_type_code = "ST")

ohio_discharge <- read_waterdata_daily(monitoring_location_id = ohio$monitoring_location_id,
                                  parameter_code = "00060",
                                  time = "P7D")

where before that 2nd query would have produced:

! HTTP 403 Forbidden.

reupdate Status

Merge branch 'main' of github.com:DOI-USGS/dataRetrieval into develop # Conflicts: # vignettes/Status.Rmd

Getting ready for CRAN submission

mikemahoney218-usgs

Really nothing but nits. Think this will be a big help!

mikemahoney218-usgs · 2026-03-17T15:53:07Z

+                         service,
+                         split_into = 250){


I'd suggest:

Suggested change

service,

split_into = 250){

service,

...,

split_into = 250){

And then potentially call this from rlang to enforce empty dots:
https://rlang.r-lib.org/reference/check_dots_empty.html

Maybe doesn't matter for an internal function, but I like separating off optional parameters to make it clear what can usually be left alone

This is also just a nit, but I feel like chunk_size might be more clear... split_into makes me think this might create 250 equal-size chunks.

Ahh! That finally answers my question on why people started putting ... in functions and then testing that the dots are empty. I noticed the trend but never figured out why the heck people were doing that. I'll add it here but I can't promise I'll remember this trick in the future.

It's also nice because it lets you reorganize arguments after ..., as it intentionally breaks passing arguments by position -- so you can add arguments into the middle of your pile of optional args without breaking old code.

mikemahoney218-usgs · 2026-03-17T16:01:58Z

+      get_ogc_data(args = args,
+                   output_id = output_id, 
+                   service = service)})
+    rl_filtered <- rl[sapply(rl, function(x) dim(x)[1]) > 0]


rl <- list() rl[[1]] <- rl[[2]] <- rl[[3]] <- data.frame() do.call(rbind, rl[sapply(rl, function(x) dim(x)[1]) > 0]) #> NULL

not sure if this is worth worrying about, but might be unexpected if all site IDs are bad

also: sapply scares me, but I don't see a way for it to accidentally return a matrix here, so I'm less scared by this instance of it. In my own code I prefer vapply to make my expected return type explicit

ack, good call - I tend to use sapply first because of laziness, and then need to remember to go back and figure out how to get vapply to work (EVERY TIME).

mikemahoney218-usgs · 2026-03-17T16:03:53Z

+    ml_splits <- split(args[["monitoring_location_id"]], 
+                       ceiling(seq_along(args[["monitoring_location_id"]])/split_into))
+
+    rl <- lapply(ml_splits, function(x) {


I'd have a wishlist item to replace this with future_lapply (though I think the cool kids use mirai these days) to let folks parallelize the downloads. That said, I think "if you're comfortable enough to parallelize things, you're comfortable enough to chunk sites by yourself" makes sense too

that was the thought that finally switched me to "sure, let's do this". I was worried if we implemented this, power-users would gripe that we took away their ability to parallelize in every new way imaginable and we'd quickly have a million rabbit holes to fill. BUT, if a user wants to parallelize, they can still chunk any way they want.

mikemahoney218-usgs · 2026-03-17T16:08:19Z

+      if(!service %in% c("monitoring-locations",
+                         "time-series-metadata",
+                         "field-measurements-metadata",
+                         "combined-metadata",
+                         "parameter-codes")){


If you wanted, we could add a keyword to these endpoints -- "meaningful-id"? -- which you could read from /collections at package build time (or just via a script) to get a list of endpoints with IDs worth preserving? Thinking of this because probably most of the RLMS endpoints you'd want the ID field.
https://api.waterdata.usgs.gov/ogcapi/v0/collections?f=json

For now, all the data coming back from the read_waterdata_metadata changes the "id" to a meaningful id from this named list here:
https://github.com/DOI-USGS/dataRetrieval/blob/main/R/AAA.R#L22
It is basically making them singular and subbing a _ for - (to keep with the arguments style).

Auto site chunking

The OGC `waterdata` getters (`get_daily`, `get_continuous`, `get_field_measurements`, and the rest of the multi-value-capable functions) previously failed with HTTP 414 when the request URL exceeded the server's ~8 KB byte limit. The common chained-query pattern — pull a long site list from `get_monitoring_locations`, then feed it into `get_daily` — was the main offender: from dataretrieval.waterdata import get_daily, get_monitoring_locations sites_df, _ = get_monitoring_locations( state_name="Ohio", site_type_code="ST", skip_geometry=True, ) # Before: HTTP 414 once `sites_df` exceeded ~500 rows. # After: transparently chunked into multiple sub-requests, one # combined DataFrame returned. df, md = get_daily( monitoring_location_id=sites_df["monitoring_location_id"].tolist(), parameter_code="00060", time="P7D", ) This patch introduces a joint chunker that models every multi-value list parameter AND the cql-text `filter` (split on its top-level `OR` clauses) as a chunkable axis. Greedy halving splits the biggest chunk across all axes until each sub-request URL fits the limit; the chunker fans out into multiple HTTP requests under the hood and returns one combined DataFrame. Callers see no API change. ## Design Every axis (a list-shaped kwarg, or the filter split into its top-level `OR` clauses) is represented by an `_Axis` dataclass: the args key, the tuple of indivisible atoms (site IDs or clauses), and the joiner used to compose them back into URL text (`,` for list axes, ` OR ` for the filter axis). `ChunkPlan` extracts the chunkable axes for a request and runs greedy halving against the biggest chunk across all axes until the worst-case sub-request URL fits. `ChunkedCall` iterates the joint cartesian product of axis chunks and drives the sub-requests to completion. Requests that already fit get a trivial single-step plan — one code path either way. ## Rate-limit gating After the first sub-request, `ChunkedCall` reads `x-ratelimit-remaining`; if the rest of the plan can't fit the current per-key rate-limit window, it raises `RequestExceedsQuota` reporting the deficit before burning more budget. Set `API_USGS_LIMIT=0` to bypass the pre-emptive check. ## Mid-call interruption recovery Mid-stream transient failures surface as a `ChunkInterrupted` subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for HTTP 5xx. Both carry the partial result plus a resumable call handle on `exc.call`: import time from dataretrieval.waterdata import get_daily from dataretrieval.waterdata.chunking import ChunkInterrupted try: df, md = get_daily(monitoring_location_id=long_list) except ChunkInterrupted as exc: time.sleep(exc.retry_after or 5 * 60) # Re-issues only the still-pending sub-requests; banked work # is preserved on `exc.call`. df, md = exc.call.resume() ## Connection-pool reuse `ChunkedCall.resume` opens one `requests.Session` for the entire fan-out and publishes it via a `ContextVar` so paginated-loop helpers downstream (`_walk_pages`, `get_stats_data` via the new `_paginate` helper) reuse the same connection pool across every sub-request — saves one TCP/TLS handshake per sub-request after the first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk fan-out against the live USGS API (1.78s shared vs 3.03s per-sub-request). ## Metadata behavior One behavior change for paginated/chunked calls: - `BaseMetadata.url` still reflects the user's original query (unchanged). - `BaseMetadata.header` now carries the *last* page/sub-request headers so downstream code that branches on `x-ratelimit-remaining` sees current state (was: first page's headers). - `BaseMetadata.query_time` is now cumulative wall-clock across every page/sub-request (was: first page's elapsed). ## Module layout - New module `dataretrieval.waterdata.chunking`: joint planner, exception hierarchy (`_RetryableTransportError`, `RateLimited`, `ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`, `ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`), `ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator, shared-session ContextVar plumbing. - `dataretrieval.waterdata.utils`: paginated-loop body consolidated into a `_paginate` strategy helper that `_walk_pages` and `get_stats_data` both delegate to; typed transport exceptions moved out to `chunking` so the layer direction is strictly `utils → chunking` (no more lazy cross-module import). - `dataretrieval.waterdata.filters`: existing top-level-OR splitter and filter-chunkability detector kept as primitives the joint planner consumes. ## Tests 80 new unit tests in `tests/waterdata_chunking_test.py` covering the planner, axis extraction, cartesian-product enumeration, rate-limit gating, resume idempotency and equivalence, transient- error classification, shared-session reuse, and a URL-construction stress test against the real `_construct_api_requests` builder (not a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting every sub-request URL stays under 8000 bytes and the joint planner beats the bail-floor worst case. Mid-pagination 429/5xx now also covered for both the OGC and stats paginators. ## Related Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870), generalized from one filter axis to N joint axes. Also fixes a handful of pre-existing docstring typos in `waterdata/api.py` (`meaining` → `meaning`, `instantanous` → `instantaneous`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The OGC `waterdata` getters (`get_daily`, `get_continuous`, `get_field_measurements`, and the rest of the multi-value-capable functions) previously failed with HTTP 414 when the request URL exceeded the server's ~8 KB byte limit. The common chained-query pattern — pull a long site list from `get_monitoring_locations`, then feed it into `get_daily` — was the main offender: from dataretrieval.waterdata import get_daily, get_monitoring_locations sites_df, _ = get_monitoring_locations( state_name="Ohio", site_type_code="ST", skip_geometry=True, ) # Before: HTTP 414 once `sites_df` exceeded ~500 rows. # After: transparently chunked into multiple sub-requests, one # combined DataFrame returned. df, md = get_daily( monitoring_location_id=sites_df["monitoring_location_id"].tolist(), parameter_code="00060", time="P7D", ) This patch introduces a joint chunker that models every multi-value list parameter AND the cql-text `filter` (split on its top-level `OR` clauses) as a chunkable axis. Greedy halving splits the biggest chunk across all axes until each sub-request URL fits the limit; the chunker fans out into multiple HTTP requests under the hood and returns one combined DataFrame. Callers see no API change. Every axis (a list-shaped kwarg, or the filter split into its top-level `OR` clauses) is represented by an `_Axis` dataclass: the args key, the tuple of indivisible atoms (site IDs or clauses), and the joiner used to compose them back into URL text (`,` for list axes, ` OR ` for the filter axis). `ChunkPlan` extracts the chunkable axes for a request and runs greedy halving against the biggest chunk across all axes until the worst-case sub-request URL fits. `ChunkedCall` iterates the joint cartesian product of axis chunks and drives the sub-requests to completion. Requests that already fit get a trivial single-step plan — one code path either way. After the first sub-request, `ChunkedCall` reads `x-ratelimit-remaining`; if the rest of the plan can't fit the current per-key rate-limit window, it raises `RequestExceedsQuota` reporting the deficit before burning more budget. Set `API_USGS_LIMIT=0` to bypass the pre-emptive check. Mid-stream transient failures surface as a `ChunkInterrupted` subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for HTTP 5xx. Both carry the partial result plus a resumable call handle on `exc.call`: import time from dataretrieval.waterdata import get_daily from dataretrieval.waterdata.chunking import ChunkInterrupted try: df, md = get_daily(monitoring_location_id=long_list) except ChunkInterrupted as exc: time.sleep(exc.retry_after or 5 * 60) # Re-issues only the still-pending sub-requests; banked work # is preserved on `exc.call`. df, md = exc.call.resume() `ChunkedCall.resume` opens one `requests.Session` for the entire fan-out and publishes it via a `ContextVar` so paginated-loop helpers downstream (`_walk_pages`, `get_stats_data` via the new `_paginate` helper) reuse the same connection pool across every sub-request — saves one TCP/TLS handshake per sub-request after the first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk fan-out against the live USGS API (1.78s shared vs 3.03s per-sub-request). One behavior change for paginated/chunked calls: - `BaseMetadata.url` still reflects the user's original query (unchanged). - `BaseMetadata.header` now carries the *last* page/sub-request headers so downstream code that branches on `x-ratelimit-remaining` sees current state (was: first page's headers). - `BaseMetadata.query_time` is now cumulative wall-clock across every page/sub-request (was: first page's elapsed). - New module `dataretrieval.waterdata.chunking`: joint planner, exception hierarchy (`_RetryableTransportError`, `RateLimited`, `ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`, `ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`), `ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator, shared-session ContextVar plumbing. - `dataretrieval.waterdata.utils`: paginated-loop body consolidated into a `_paginate` strategy helper that `_walk_pages` and `get_stats_data` both delegate to; typed transport exceptions moved out to `chunking` so the layer direction is strictly `utils → chunking` (no more lazy cross-module import). - `dataretrieval.waterdata.filters`: existing top-level-OR splitter and filter-chunkability detector kept as primitives the joint planner consumes. 80 new unit tests in `tests/waterdata_chunking_test.py` covering the planner, axis extraction, cartesian-product enumeration, rate-limit gating, resume idempotency and equivalence, transient- error classification, shared-session reuse, and a URL-construction stress test against the real `_construct_api_requests` builder (not a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting every sub-request URL stays under 8000 bytes and the joint planner beats the bail-floor worst case. Mid-pagination 429/5xx now also covered for both the OGC and stats paginators. Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870), generalized from one filter axis to N joint axes. Also fixes a handful of pre-existing docstring typos in `waterdata/api.py` (`meaining` → `meaning`, `instantanous` → `instantaneous`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The OGC `waterdata` getters (`get_daily`, `get_continuous`, `get_field_measurements`, and the rest of the multi-value-capable functions) previously failed with HTTP 414 when the request URL exceeded the server's ~8 KB byte limit. The common chained-query pattern — pull a long site list from `get_monitoring_locations`, then feed it into `get_daily` — was the main offender: from dataretrieval.waterdata import get_daily, get_monitoring_locations sites_df, _ = get_monitoring_locations( state_name="Ohio", site_type="Stream", ) # Before: HTTP 414 once `sites_df` exceeded ~500 rows. # After: transparently chunked, one combined DataFrame returned. df, md = get_daily( monitoring_location_id=sites_df["monitoring_location_id"].tolist(), parameter_code="00060", time="P7D", ) This patch introduces a joint chunker that models every multi-value list parameter AND the cql-text `filter` (split on its top-level `OR` clauses) as a chunkable axis. Greedy halving splits the biggest chunk across all axes until each sub-request URL fits the limit; the chunker fans out into multiple HTTP requests under the hood and returns one combined DataFrame. Callers see no API change. Design ------ Every axis (a list-shaped kwarg, or the filter split into its top-level `OR` clauses) is represented by an `_Axis` dataclass: the args key, the tuple of indivisible atoms (site IDs or clauses), and the joiner used to compose them back into URL text (`,` for list axes, ` OR ` for the filter axis). `ChunkPlan` extracts the chunkable axes for a request and runs greedy halving against the biggest chunk across all axes until the worst-case sub-request URL fits. `ChunkedCall` iterates the joint cartesian product of axis chunks and drives the sub-requests to completion. Requests that already fit get a trivial single-step plan — one code path either way. Quota ----- After every non-final sub-request, `ChunkedCall` reads `x-ratelimit-remaining`; if the remaining window can't fit the rest of the plan it raises `RequestExceedsQuota` carrying the in-flight `ChunkedCall` handle on `.call`, so already-fetched chunks are recoverable via `exc.call.partial_frame`. Set `API_USGS_LIMIT=0` to bypass the pre-emptive check. Interruption ------------ Mid-stream transient failures surface as a `ChunkInterrupted` subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for HTTP 5xx (and bare transport failures like `ConnectionError`). Both carry the partial result plus a resumable call handle on `exc.call`: import time from dataretrieval.waterdata import get_daily from dataretrieval.waterdata.chunking import ChunkInterrupted try: df, md = get_daily(monitoring_location_id=long_site_list) except ChunkInterrupted as exc: while True: time.sleep(exc.retry_after or 5 * 60) try: df, md = exc.call.resume() break except ChunkInterrupted as next_exc: exc = next_exc `Retry-After` (when the server sets it) is surfaced on the exception as `.retry_after`. A single `requests.Session` is opened once per chunked call and published via a `ContextVar` so paginated-loop helpers downstream (`_walk_pages`) reuse the same connection pool across every sub-request — saves one TCP/TLS handshake per sub-request after the first. Metadata -------- One behavior change for paginated / chunked calls: - `BaseMetadata.url` still reflects the user's original query. - `BaseMetadata.header` now carries the LAST page / sub-request headers so downstream code that branches on `x-ratelimit-remaining` sees current state (was: first page's headers). - `BaseMetadata.query_time` is now cumulative wall-clock across every page / sub-request (was: first page's elapsed). Module layout ------------- - New module `dataretrieval.waterdata.chunking`: joint planner, exception hierarchy (`_RetryableTransportError`, `RateLimited`, `ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`, `ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`), `ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator, shared-session ContextVar plumbing. - `dataretrieval.waterdata.utils`: paginated-loop body consolidated into a `_paginate(req, parse_response, follow_up, client)` strategy helper that `_walk_pages` and `get_stats_data` both delegate to; typed transport exceptions moved out to `chunking` so the layer direction is strictly `utils → chunking`. - `dataretrieval.waterdata.filters`: slimmed to a CQL-parsing leaf (`_split_top_level_or`, `_check_numeric_filter_pitfall`, `_is_chunkable`, `FILTER_LANG`); URL-budget and filter-chunking logic moved into the joint planner. Tests ----- 51 new unit tests in `tests/waterdata_chunking_test.py` cover the planner, axis extraction, cartesian-product enumeration, rate-limit gating, resume idempotency and equivalence, transient-error classification, shared-session reuse, and a URL-construction stress test against the real `_construct_api_requests` builder (not a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting every sub-request URL stays under 8000 bytes. Plus regression tests for the pre-merge code review's bug fixes (empty-frame GDF preservation, single-frame aliasing, iter_sub_args passthrough copy, quota check on resume, RequestExceedsQuota `.call` handle, missing-features defensiveness). Mid-pagination 429 / 5xx is covered for both the OGC and stats paginators. Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870), generalized from one filter axis to N joint axes. Also fixes a handful of pre-existing docstring typos in `waterdata/api.py` (`meaining` → `meaning`, `instantanous` → `instantaneous`). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The OGC `waterdata` getters previously failed with HTTP 414 when the request URL exceeded the server's ~8 KB byte limit. A common pattern — pulling a long site list from `get_monitoring_locations` and feeding it into `get_daily` — was the main offender: sites_df, _ = get_monitoring_locations(state_name="Ohio") df, md = get_daily( monitoring_location_id=sites_df["monitoring_location_id"].tolist(), parameter_code="00060", time="P7D", ) Introduces a joint chunker that models every multi-value list parameter and the cql-text `filter` (split on top-level `OR`) as a chunkable axis. Greedy halving splits the biggest chunk across all axes until each sub-request URL fits; the chunker fans out under the hood and returns one combined DataFrame. Callers see no API change. Mid-stream 429 / 5xx surface as `ChunkInterrupted` subclasses (`QuotaExhausted` / `ServiceInterrupted`) carrying the partial result plus a `.call` resumable handle — `exc.call.resume()` continues only the still-pending sub-requests. Pre-emptive `RequestExceedsQuota` catches plans that won't fit the remaining rate-limit window; `API_USGS_LIMIT=0` bypasses the check. Behavior changes for paginated / chunked calls: - `BaseMetadata.url` still reflects the user's original query. - `BaseMetadata.header` now carries the LAST page's headers so `x-ratelimit-remaining` is current (was: first page's). - `BaseMetadata.query_time` is now cumulative wall-clock across pages (was: first page's elapsed). Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870), generalized from one filter axis to N joint axes. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

ldecicco-USGS added 13 commits February 9, 2026 12:32

reupdate Status

6e4a599

Merge pull request DOI-USGS#857 from ldecicco-USGS/main

cf9196c

reupdate Status

rerun doc.

432cdb6

Add a more basic introduction set of slides

2820a89

Forgot to add to banner

d5ab7c4

from upstream

2b80957

Merge branch 'main' of github.com:DOI-USGS/dataRetrieval into develop # Conflicts: # vignettes/Status.Rmd

Removed some links that CRAN-checks complained about

9aa8153

Merge pull request DOI-USGS#867 from ldecicco-USGS/develop

7c723c6

Getting ready for CRAN submission

Actually render slides

f348c75

Wrong limit

81e350c

Getting ready for more updates

e4f6f9b

new column

2fde691

Add automatic site chunking

dde01f8

ldecicco-USGS temporarily deployed to CI_config March 17, 2026 15:48 — with GitHub Actions Inactive

ldecicco-USGS requested a review from mikemahoney218-usgs March 17, 2026 15:48

mikemahoney218-usgs approved these changes Mar 17, 2026

View reviewed changes

forgot we are importing data.table now, rbindlist is way faster

a268ad8

ldecicco-USGS temporarily deployed to CI_config March 17, 2026 17:11 — with GitHub Actions Inactive

ldecicco-USGS merged commit 6a5bcce into DOI-USGS:develop Mar 17, 2026
1 check passed

ldecicco-USGS added a commit to ldecicco-USGS/dataRetrieval that referenced this pull request Apr 29, 2026

Merge pull request DOI-USGS#870 from ldecicco-USGS/develop

c93fd29

Auto site chunking

thodson-usgs mentioned this pull request May 22, 2026

feat(waterdata): Auto-chunk OGC requests over the URL byte limit DOI-USGS/dataretrieval-python#283

Merged

6 tasks

Uh oh!

Conversation

ldecicco-USGS commented Mar 17, 2026

Uh oh!

mikemahoney218-usgs left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants