Auto site chunking#870
Conversation
reupdate Status
Merge branch 'main' of github.com:DOI-USGS/dataRetrieval into develop # Conflicts: # vignettes/Status.Rmd
Getting ready for CRAN submission
mikemahoney218-usgs
left a comment
There was a problem hiding this comment.
Really nothing but nits. Think this will be a big help!
| service, | ||
| split_into = 250){ |
There was a problem hiding this comment.
I'd suggest:
| service, | |
| split_into = 250){ | |
| service, | |
| ..., | |
| split_into = 250){ |
And then potentially call this from rlang to enforce empty dots:
https://rlang.r-lib.org/reference/check_dots_empty.html
Maybe doesn't matter for an internal function, but I like separating off optional parameters to make it clear what can usually be left alone
There was a problem hiding this comment.
This is also just a nit, but I feel like chunk_size might be more clear... split_into makes me think this might create 250 equal-size chunks.
There was a problem hiding this comment.
Ahh! That finally answers my question on why people started putting ... in functions and then testing that the dots are empty. I noticed the trend but never figured out why the heck people were doing that. I'll add it here but I can't promise I'll remember this trick in the future.
There was a problem hiding this comment.
It's also nice because it lets you reorganize arguments after ..., as it intentionally breaks passing arguments by position -- so you can add arguments into the middle of your pile of optional args without breaking old code.
| get_ogc_data(args = args, | ||
| output_id = output_id, | ||
| service = service)}) | ||
| rl_filtered <- rl[sapply(rl, function(x) dim(x)[1]) > 0] |
There was a problem hiding this comment.
rl <- list()
rl[[1]] <- rl[[2]] <- rl[[3]] <- data.frame()
do.call(rbind, rl[sapply(rl, function(x) dim(x)[1]) > 0])
#> NULLnot sure if this is worth worrying about, but might be unexpected if all site IDs are bad
There was a problem hiding this comment.
also: sapply scares me, but I don't see a way for it to accidentally return a matrix here, so I'm less scared by this instance of it. In my own code I prefer vapply to make my expected return type explicit
There was a problem hiding this comment.
ack, good call - I tend to use sapply first because of laziness, and then need to remember to go back and figure out how to get vapply to work (EVERY TIME).
| ml_splits <- split(args[["monitoring_location_id"]], | ||
| ceiling(seq_along(args[["monitoring_location_id"]])/split_into)) | ||
|
|
||
| rl <- lapply(ml_splits, function(x) { |
There was a problem hiding this comment.
I'd have a wishlist item to replace this with future_lapply (though I think the cool kids use mirai these days) to let folks parallelize the downloads. That said, I think "if you're comfortable enough to parallelize things, you're comfortable enough to chunk sites by yourself" makes sense too
There was a problem hiding this comment.
that was the thought that finally switched me to "sure, let's do this". I was worried if we implemented this, power-users would gripe that we took away their ability to parallelize in every new way imaginable and we'd quickly have a million rabbit holes to fill. BUT, if a user wants to parallelize, they can still chunk any way they want.
| if(!service %in% c("monitoring-locations", | ||
| "time-series-metadata", | ||
| "field-measurements-metadata", | ||
| "combined-metadata", | ||
| "parameter-codes")){ |
There was a problem hiding this comment.
If you wanted, we could add a keyword to these endpoints -- "meaningful-id"? -- which you could read from /collections at package build time (or just via a script) to get a list of endpoints with IDs worth preserving? Thinking of this because probably most of the RLMS endpoints you'd want the ID field.
https://api.waterdata.usgs.gov/ogcapi/v0/collections?f=json
There was a problem hiding this comment.
For now, all the data coming back from the read_waterdata_metadata changes the "id" to a meaningful id from this named list here:
https://github.com/DOI-USGS/dataRetrieval/blob/main/R/AAA.R#L22
It is basically making them singular and subbing a _ for - (to keep with the arguments style).
Auto site chunking
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:
from dataretrieval.waterdata import get_daily, get_monitoring_locations
sites_df, _ = get_monitoring_locations(
state_name="Ohio",
site_type_code="ST",
skip_geometry=True,
)
# Before: HTTP 414 once `sites_df` exceeded ~500 rows.
# After: transparently chunked into multiple sub-requests, one
# combined DataFrame returned.
df, md = get_daily(
monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
parameter_code="00060",
time="P7D",
)
This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.
## Design
Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.
## Rate-limit gating
After the first sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the rest of the plan can't fit the
current per-key rate-limit window, it raises `RequestExceedsQuota`
reporting the deficit before burning more budget. Set
`API_USGS_LIMIT=0` to bypass the pre-emptive check.
## Mid-call interruption recovery
Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx. Both carry the partial result plus a resumable call handle
on `exc.call`:
import time
from dataretrieval.waterdata import get_daily
from dataretrieval.waterdata.chunking import ChunkInterrupted
try:
df, md = get_daily(monitoring_location_id=long_list)
except ChunkInterrupted as exc:
time.sleep(exc.retry_after or 5 * 60)
# Re-issues only the still-pending sub-requests; banked work
# is preserved on `exc.call`.
df, md = exc.call.resume()
## Connection-pool reuse
`ChunkedCall.resume` opens one `requests.Session` for the entire
fan-out and publishes it via a `ContextVar` so paginated-loop
helpers downstream (`_walk_pages`, `get_stats_data` via the new
`_paginate` helper) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk
fan-out against the live USGS API (1.78s shared vs 3.03s
per-sub-request).
## Metadata behavior
One behavior change for paginated/chunked calls:
- `BaseMetadata.url` still reflects the user's original query
(unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request
headers so downstream code that branches on
`x-ratelimit-remaining` sees current state (was: first page's
headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across
every page/sub-request (was: first page's elapsed).
## Module layout
- New module `dataretrieval.waterdata.chunking`: joint planner,
exception hierarchy (`_RetryableTransportError`, `RateLimited`,
`ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
`ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
`ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
into a `_paginate` strategy helper that `_walk_pages` and
`get_stats_data` both delegate to; typed transport exceptions
moved out to `chunking` so the layer direction is strictly
`utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter
and filter-chunkability detector kept as primitives the joint
planner consumes.
## Tests
80 new unit tests in `tests/waterdata_chunking_test.py` covering
the planner, axis extraction, cartesian-product enumeration,
rate-limit gating, resume idempotency and equivalence, transient-
error classification, shared-session reuse, and a URL-construction
stress test against the real `_construct_api_requests` builder (not
a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting
every sub-request URL stays under 8000 bytes and the joint planner
beats the bail-floor worst case. Mid-pagination 429/5xx now also
covered for both the OGC and stats paginators.
## Related
Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.
Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`,
`instantanous` → `instantaneous`).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:
from dataretrieval.waterdata import get_daily, get_monitoring_locations
sites_df, _ = get_monitoring_locations(
state_name="Ohio",
site_type_code="ST",
skip_geometry=True,
)
# Before: HTTP 414 once `sites_df` exceeded ~500 rows.
# After: transparently chunked into multiple sub-requests, one
# combined DataFrame returned.
df, md = get_daily(
monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
parameter_code="00060",
time="P7D",
)
This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.
## Design
Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.
## Rate-limit gating
After the first sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the rest of the plan can't fit the
current per-key rate-limit window, it raises `RequestExceedsQuota`
reporting the deficit before burning more budget. Set
`API_USGS_LIMIT=0` to bypass the pre-emptive check.
## Mid-call interruption recovery
Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx. Both carry the partial result plus a resumable call handle
on `exc.call`:
import time
from dataretrieval.waterdata import get_daily
from dataretrieval.waterdata.chunking import ChunkInterrupted
try:
df, md = get_daily(monitoring_location_id=long_list)
except ChunkInterrupted as exc:
time.sleep(exc.retry_after or 5 * 60)
# Re-issues only the still-pending sub-requests; banked work
# is preserved on `exc.call`.
df, md = exc.call.resume()
## Connection-pool reuse
`ChunkedCall.resume` opens one `requests.Session` for the entire
fan-out and publishes it via a `ContextVar` so paginated-loop
helpers downstream (`_walk_pages`, `get_stats_data` via the new
`_paginate` helper) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk
fan-out against the live USGS API (1.78s shared vs 3.03s
per-sub-request).
## Metadata behavior
One behavior change for paginated/chunked calls:
- `BaseMetadata.url` still reflects the user's original query
(unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request
headers so downstream code that branches on
`x-ratelimit-remaining` sees current state (was: first page's
headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across
every page/sub-request (was: first page's elapsed).
## Module layout
- New module `dataretrieval.waterdata.chunking`: joint planner,
exception hierarchy (`_RetryableTransportError`, `RateLimited`,
`ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
`ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
`ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
into a `_paginate` strategy helper that `_walk_pages` and
`get_stats_data` both delegate to; typed transport exceptions
moved out to `chunking` so the layer direction is strictly
`utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter
and filter-chunkability detector kept as primitives the joint
planner consumes.
## Tests
80 new unit tests in `tests/waterdata_chunking_test.py` covering
the planner, axis extraction, cartesian-product enumeration,
rate-limit gating, resume idempotency and equivalence, transient-
error classification, shared-session reuse, and a URL-construction
stress test against the real `_construct_api_requests` builder (not
a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting
every sub-request URL stays under 8000 bytes and the joint planner
beats the bail-floor worst case. Mid-pagination 429/5xx now also
covered for both the OGC and stats paginators.
## Related
Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.
Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`,
`instantanous` → `instantaneous`).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:
from dataretrieval.waterdata import get_daily, get_monitoring_locations
sites_df, _ = get_monitoring_locations(
state_name="Ohio",
site_type_code="ST",
skip_geometry=True,
)
# Before: HTTP 414 once `sites_df` exceeded ~500 rows.
# After: transparently chunked into multiple sub-requests, one
# combined DataFrame returned.
df, md = get_daily(
monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
parameter_code="00060",
time="P7D",
)
This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.
## Design
Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.
## Rate-limit gating
After the first sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the rest of the plan can't fit the
current per-key rate-limit window, it raises `RequestExceedsQuota`
reporting the deficit before burning more budget. Set
`API_USGS_LIMIT=0` to bypass the pre-emptive check.
## Mid-call interruption recovery
Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx. Both carry the partial result plus a resumable call handle
on `exc.call`:
import time
from dataretrieval.waterdata import get_daily
from dataretrieval.waterdata.chunking import ChunkInterrupted
try:
df, md = get_daily(monitoring_location_id=long_list)
except ChunkInterrupted as exc:
time.sleep(exc.retry_after or 5 * 60)
# Re-issues only the still-pending sub-requests; banked work
# is preserved on `exc.call`.
df, md = exc.call.resume()
## Connection-pool reuse
`ChunkedCall.resume` opens one `requests.Session` for the entire
fan-out and publishes it via a `ContextVar` so paginated-loop
helpers downstream (`_walk_pages`, `get_stats_data` via the new
`_paginate` helper) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk
fan-out against the live USGS API (1.78s shared vs 3.03s
per-sub-request).
## Metadata behavior
One behavior change for paginated/chunked calls:
- `BaseMetadata.url` still reflects the user's original query
(unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request
headers so downstream code that branches on
`x-ratelimit-remaining` sees current state (was: first page's
headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across
every page/sub-request (was: first page's elapsed).
## Module layout
- New module `dataretrieval.waterdata.chunking`: joint planner,
exception hierarchy (`_RetryableTransportError`, `RateLimited`,
`ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
`ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
`ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
into a `_paginate` strategy helper that `_walk_pages` and
`get_stats_data` both delegate to; typed transport exceptions
moved out to `chunking` so the layer direction is strictly
`utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter
and filter-chunkability detector kept as primitives the joint
planner consumes.
## Tests
80 new unit tests in `tests/waterdata_chunking_test.py` covering
the planner, axis extraction, cartesian-product enumeration,
rate-limit gating, resume idempotency and equivalence, transient-
error classification, shared-session reuse, and a URL-construction
stress test against the real `_construct_api_requests` builder (not
a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting
every sub-request URL stays under 8000 bytes and the joint planner
beats the bail-floor worst case. Mid-pagination 429/5xx now also
covered for both the OGC and stats paginators.
## Related
Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.
Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`,
`instantanous` → `instantaneous`).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:
from dataretrieval.waterdata import get_daily, get_monitoring_locations
sites_df, _ = get_monitoring_locations(
state_name="Ohio",
site_type_code="ST",
skip_geometry=True,
)
# Before: HTTP 414 once `sites_df` exceeded ~500 rows.
# After: transparently chunked into multiple sub-requests, one
# combined DataFrame returned.
df, md = get_daily(
monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
parameter_code="00060",
time="P7D",
)
This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.
## Design
Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.
## Rate-limit gating
After the first sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the rest of the plan can't fit the
current per-key rate-limit window, it raises `RequestExceedsQuota`
reporting the deficit before burning more budget. Set
`API_USGS_LIMIT=0` to bypass the pre-emptive check.
## Mid-call interruption recovery
Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx. Both carry the partial result plus a resumable call handle
on `exc.call`:
import time
from dataretrieval.waterdata import get_daily
from dataretrieval.waterdata.chunking import ChunkInterrupted
try:
df, md = get_daily(monitoring_location_id=long_list)
except ChunkInterrupted as exc:
time.sleep(exc.retry_after or 5 * 60)
# Re-issues only the still-pending sub-requests; banked work
# is preserved on `exc.call`.
df, md = exc.call.resume()
## Connection-pool reuse
`ChunkedCall.resume` opens one `requests.Session` for the entire
fan-out and publishes it via a `ContextVar` so paginated-loop
helpers downstream (`_walk_pages`, `get_stats_data` via the new
`_paginate` helper) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk
fan-out against the live USGS API (1.78s shared vs 3.03s
per-sub-request).
## Metadata behavior
One behavior change for paginated/chunked calls:
- `BaseMetadata.url` still reflects the user's original query
(unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request
headers so downstream code that branches on
`x-ratelimit-remaining` sees current state (was: first page's
headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across
every page/sub-request (was: first page's elapsed).
## Module layout
- New module `dataretrieval.waterdata.chunking`: joint planner,
exception hierarchy (`_RetryableTransportError`, `RateLimited`,
`ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
`ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
`ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
into a `_paginate` strategy helper that `_walk_pages` and
`get_stats_data` both delegate to; typed transport exceptions
moved out to `chunking` so the layer direction is strictly
`utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter
and filter-chunkability detector kept as primitives the joint
planner consumes.
## Tests
80 new unit tests in `tests/waterdata_chunking_test.py` covering
the planner, axis extraction, cartesian-product enumeration,
rate-limit gating, resume idempotency and equivalence, transient-
error classification, shared-session reuse, and a URL-construction
stress test against the real `_construct_api_requests` builder (not
a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting
every sub-request URL stays under 8000 bytes and the joint planner
beats the bail-floor worst case. Mid-pagination 429/5xx now also
covered for both the OGC and stats paginators.
## Related
Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.
Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`,
`instantanous` → `instantaneous`).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:
from dataretrieval.waterdata import get_daily, get_monitoring_locations
sites_df, _ = get_monitoring_locations(
state_name="Ohio",
site_type_code="ST",
skip_geometry=True,
)
# Before: HTTP 414 once `sites_df` exceeded ~500 rows.
# After: transparently chunked into multiple sub-requests, one
# combined DataFrame returned.
df, md = get_daily(
monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
parameter_code="00060",
time="P7D",
)
This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.
Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.
After the first sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the rest of the plan can't fit the
current per-key rate-limit window, it raises `RequestExceedsQuota`
reporting the deficit before burning more budget. Set
`API_USGS_LIMIT=0` to bypass the pre-emptive check.
Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx. Both carry the partial result plus a resumable call handle
on `exc.call`:
import time
from dataretrieval.waterdata import get_daily
from dataretrieval.waterdata.chunking import ChunkInterrupted
try:
df, md = get_daily(monitoring_location_id=long_list)
except ChunkInterrupted as exc:
time.sleep(exc.retry_after or 5 * 60)
# Re-issues only the still-pending sub-requests; banked work
# is preserved on `exc.call`.
df, md = exc.call.resume()
`ChunkedCall.resume` opens one `requests.Session` for the entire
fan-out and publishes it via a `ContextVar` so paginated-loop
helpers downstream (`_walk_pages`, `get_stats_data` via the new
`_paginate` helper) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first. Measured 41% wall-clock reduction on a 2000-site / 8-chunk
fan-out against the live USGS API (1.78s shared vs 3.03s
per-sub-request).
One behavior change for paginated/chunked calls:
- `BaseMetadata.url` still reflects the user's original query
(unchanged).
- `BaseMetadata.header` now carries the *last* page/sub-request
headers so downstream code that branches on
`x-ratelimit-remaining` sees current state (was: first page's
headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across
every page/sub-request (was: first page's elapsed).
- New module `dataretrieval.waterdata.chunking`: joint planner,
exception hierarchy (`_RetryableTransportError`, `RateLimited`,
`ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
`ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
`ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
into a `_paginate` strategy helper that `_walk_pages` and
`get_stats_data` both delegate to; typed transport exceptions
moved out to `chunking` so the layer direction is strictly
`utils → chunking` (no more lazy cross-module import).
- `dataretrieval.waterdata.filters`: existing top-level-OR splitter
and filter-chunkability detector kept as primitives the joint
planner consumes.
80 new unit tests in `tests/waterdata_chunking_test.py` covering
the planner, axis extraction, cartesian-product enumeration,
rate-limit gating, resume idempotency and equivalence, transient-
error classification, shared-session reuse, and a URL-construction
stress test against the real `_construct_api_requests` builder (not
a fake) — 500 USGS site IDs × 20 datetime OR-clauses, asserting
every sub-request URL stays under 8000 bytes and the joint planner
beats the bail-floor worst case. Mid-pagination 429/5xx now also
covered for both the OGC and stats paginators.
Mirrors R `dataRetrieval`'s [#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.
Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`,
`instantanous` → `instantaneous`).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The OGC `waterdata` getters (`get_daily`, `get_continuous`,
`get_field_measurements`, and the rest of the multi-value-capable
functions) previously failed with HTTP 414 when the request URL
exceeded the server's ~8 KB byte limit. The common chained-query
pattern — pull a long site list from `get_monitoring_locations`,
then feed it into `get_daily` — was the main offender:
from dataretrieval.waterdata import get_daily, get_monitoring_locations
sites_df, _ = get_monitoring_locations(
state_name="Ohio",
site_type="Stream",
)
# Before: HTTP 414 once `sites_df` exceeded ~500 rows.
# After: transparently chunked, one combined DataFrame returned.
df, md = get_daily(
monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
parameter_code="00060",
time="P7D",
)
This patch introduces a joint chunker that models every multi-value
list parameter AND the cql-text `filter` (split on its top-level
`OR` clauses) as a chunkable axis. Greedy halving splits the biggest
chunk across all axes until each sub-request URL fits the limit; the
chunker fans out into multiple HTTP requests under the hood and
returns one combined DataFrame. Callers see no API change.
Design
------
Every axis (a list-shaped kwarg, or the filter split into its
top-level `OR` clauses) is represented by an `_Axis` dataclass: the
args key, the tuple of indivisible atoms (site IDs or clauses), and
the joiner used to compose them back into URL text (`,` for list
axes, ` OR ` for the filter axis). `ChunkPlan` extracts the
chunkable axes for a request and runs greedy halving against the
biggest chunk across all axes until the worst-case sub-request URL
fits. `ChunkedCall` iterates the joint cartesian product of axis
chunks and drives the sub-requests to completion. Requests that
already fit get a trivial single-step plan — one code path either
way.
Quota
-----
After every non-final sub-request, `ChunkedCall` reads
`x-ratelimit-remaining`; if the remaining window can't fit the rest
of the plan it raises `RequestExceedsQuota` carrying the in-flight
`ChunkedCall` handle on `.call`, so already-fetched chunks are
recoverable via `exc.call.partial_frame`. Set `API_USGS_LIMIT=0` to
bypass the pre-emptive check.
Interruption
------------
Mid-stream transient failures surface as a `ChunkInterrupted`
subclass — `QuotaExhausted` for HTTP 429, `ServiceInterrupted` for
HTTP 5xx (and bare transport failures like `ConnectionError`). Both
carry the partial result plus a resumable call handle on `exc.call`:
import time
from dataretrieval.waterdata import get_daily
from dataretrieval.waterdata.chunking import ChunkInterrupted
try:
df, md = get_daily(monitoring_location_id=long_site_list)
except ChunkInterrupted as exc:
while True:
time.sleep(exc.retry_after or 5 * 60)
try:
df, md = exc.call.resume()
break
except ChunkInterrupted as next_exc:
exc = next_exc
`Retry-After` (when the server sets it) is surfaced on the exception
as `.retry_after`.
A single `requests.Session` is opened once per chunked call and
published via a `ContextVar` so paginated-loop helpers downstream
(`_walk_pages`) reuse the same connection pool across every
sub-request — saves one TCP/TLS handshake per sub-request after the
first.
Metadata
--------
One behavior change for paginated / chunked calls:
- `BaseMetadata.url` still reflects the user's original query.
- `BaseMetadata.header` now carries the LAST page / sub-request
headers so downstream code that branches on `x-ratelimit-remaining`
sees current state (was: first page's headers).
- `BaseMetadata.query_time` is now cumulative wall-clock across every
page / sub-request (was: first page's elapsed).
Module layout
-------------
- New module `dataretrieval.waterdata.chunking`: joint planner,
exception hierarchy (`_RetryableTransportError`, `RateLimited`,
`ServiceUnavailable`, `RequestTooLarge`, `RequestExceedsQuota`,
`ChunkInterrupted`, `QuotaExhausted`, `ServiceInterrupted`),
`ChunkPlan`, `ChunkedCall`, `multi_value_chunked` decorator,
shared-session ContextVar plumbing.
- `dataretrieval.waterdata.utils`: paginated-loop body consolidated
into a `_paginate(req, parse_response, follow_up, client)` strategy
helper that `_walk_pages` and `get_stats_data` both delegate to;
typed transport exceptions moved out to `chunking` so the layer
direction is strictly `utils → chunking`.
- `dataretrieval.waterdata.filters`: slimmed to a CQL-parsing leaf
(`_split_top_level_or`, `_check_numeric_filter_pitfall`,
`_is_chunkable`, `FILTER_LANG`); URL-budget and filter-chunking
logic moved into the joint planner.
Tests
-----
51 new unit tests in `tests/waterdata_chunking_test.py` cover the
planner, axis extraction, cartesian-product enumeration, rate-limit
gating, resume idempotency and equivalence, transient-error
classification, shared-session reuse, and a URL-construction stress
test against the real `_construct_api_requests` builder (not a fake)
— 500 USGS site IDs × 20 datetime OR-clauses, asserting every
sub-request URL stays under 8000 bytes. Plus regression tests for
the pre-merge code review's bug fixes (empty-frame GDF preservation,
single-frame aliasing, iter_sub_args passthrough copy, quota check
on resume, RequestExceedsQuota `.call` handle, missing-features
defensiveness). Mid-pagination 429 / 5xx is covered for both the OGC
and stats paginators.
Mirrors R `dataRetrieval`'s
[#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.
Also fixes a handful of pre-existing docstring typos in
`waterdata/api.py` (`meaining` → `meaning`, `instantanous` →
`instantaneous`).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The OGC `waterdata` getters previously failed with HTTP 414 when the
request URL exceeded the server's ~8 KB byte limit. A common pattern
— pulling a long site list from `get_monitoring_locations` and
feeding it into `get_daily` — was the main offender:
sites_df, _ = get_monitoring_locations(state_name="Ohio")
df, md = get_daily(
monitoring_location_id=sites_df["monitoring_location_id"].tolist(),
parameter_code="00060",
time="P7D",
)
Introduces a joint chunker that models every multi-value list
parameter and the cql-text `filter` (split on top-level `OR`) as a
chunkable axis. Greedy halving splits the biggest chunk across all
axes until each sub-request URL fits; the chunker fans out under the
hood and returns one combined DataFrame. Callers see no API change.
Mid-stream 429 / 5xx surface as `ChunkInterrupted` subclasses
(`QuotaExhausted` / `ServiceInterrupted`) carrying the partial result
plus a `.call` resumable handle — `exc.call.resume()` continues only
the still-pending sub-requests. Pre-emptive `RequestExceedsQuota`
catches plans that won't fit the remaining rate-limit window;
`API_USGS_LIMIT=0` bypasses the check.
Behavior changes for paginated / chunked calls:
- `BaseMetadata.url` still reflects the user's original query.
- `BaseMetadata.header` now carries the LAST page's headers so
`x-ratelimit-remaining` is current (was: first page's).
- `BaseMetadata.query_time` is now cumulative wall-clock across pages
(was: first page's elapsed).
Mirrors R `dataRetrieval`'s
[#870](DOI-USGS/dataRetrieval#870),
generalized from one filter axis to N joint axes.
Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
This now works:
where before that 2nd query would have produced: