feat(waterdata): add max_rows to the OGC data getters to cap total results #340

Merged
thodson-usgs merged 4 commits into DOI-USGS:main from thodson-usgs:fix/ogc-getters-row-cap
Jul 1, 2026

@thodson-usgs

Collaborator

Summary

limit= on the 11 typed OGC data getters (get_daily, get_continuous,
get_monitoring_locations, get_time_series_metadata,
get_combined_metadata, get_latest_continuous, get_latest_daily,
get_field_measurements, get_field_measurements_metadata, get_peaks,
get_channel) has always been a per-page size, not a total-result cap —
_paginate follows every next link regardless of limit, so a broad,
unfiltered call pages through the entire matching result.

Hit this live while testing #333: get_daily(parameter_code="00060", limit=10) with no narrowing filter hung for 2+ minutes paging through the
full nationwide, multi-year result 10 rows at a time, rather than stopping
at 10 rows as the parameter name suggests.
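The difference between a page size and a total cap can be sketched with a toy paginator. This is a minimal illustration only, not the library's `_paginate`: `fetch_page`, `paginate`, and the in-memory `ROWS` list are hypothetical stand-ins for the server and the client loop.

```python
from __future__ import annotations

# Hypothetical in-memory "server" holding 25 matching rows.
ROWS = list(range(25))


def fetch_page(offset: int, limit: int) -> tuple[list[int], int | None]:
    """Return at most `limit` rows plus the next offset (None on the last page)."""
    page = ROWS[offset : offset + limit]
    next_offset = offset + limit if offset + limit < len(ROWS) else None
    return page, next_offset


def paginate(limit: int, max_rows: int | None = None) -> tuple[list[int], int]:
    """Follow every next link; stop early only if max_rows is reached."""
    rows: list[int] = []
    requests_made = 0
    offset: int | None = 0
    while offset is not None:
        page, offset = fetch_page(offset, limit)
        requests_made += 1
        rows.extend(page)
        # limit only sized the page; max_rows is what ends the loop early.
        if max_rows is not None and len(rows) >= max_rows:
            rows = rows[:max_rows]
            break
    return rows, requests_made


uncapped_rows, uncapped_requests = paginate(limit=10)
capped_rows, capped_requests = paginate(limit=10, max_rows=10)
```

With `limit=10` alone the loop makes three requests and returns all 25 rows; adding `max_rows=10` stops after the first page.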

get_reference_table already had the fix for this exact problem:
get_ogc_data has threaded a max_rows kwarg through to the _row_cap
context var and _finalize_ogc's truncation since it was added. The 11
typed getters just never exposed it on their public signatures.
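One way a cap can travel from an entry point to a truncation step through a context variable is sketched below. This is an assumption-laden illustration, not the actual `_row_cap` or `_finalize_ogc` code: `row_cap`, `capped`, `finalize`, and `get_data` are hypothetical names.

```python
from __future__ import annotations

import contextvars
from contextlib import contextmanager
from typing import Iterator

# Hypothetical stand-in for a module-level cap; None means "no cap".
row_cap: contextvars.ContextVar[int | None] = contextvars.ContextVar(
    "row_cap", default=None
)


@contextmanager
def capped(max_rows: int | None) -> Iterator[None]:
    """Set the cap for the duration of one call, then restore the old value."""
    token = row_cap.set(max_rows)
    try:
        yield
    finally:
        row_cap.reset(token)


def finalize(rows: list[dict]) -> list[dict]:
    """Truncate to the active cap, if any."""
    cap = row_cap.get()
    return rows if cap is None else rows[:cap]


def get_data(rows: list[dict], max_rows: int | None = None) -> list[dict]:
    """Hypothetical entry point: the cap is visible to finalize() without being passed."""
    with capped(max_rows):
        return finalize(rows)


sample = [{"value": i} for i in range(5)]
truncated = get_data(sample, max_rows=3)
untouched = get_data(sample)
```

The context variable lets intermediate layers stay unaware of the cap, which is why a getter only needs to forward `max_rows` as a keyword.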

Changes

  • Add max_rows: int | None = None to each of the 11 getters, excluded from
    the OGC query args and passed through to get_ogc_data.
  • Each limit docstring now says explicitly that it's a page size, not a
    result cap, and points to max_rows.
  • get_cql (the raw-CQL escape hatch) builds its requests directly rather
    than going through get_ogc_data, so it isn't covered here — noted as a
    follow-up, not silently dropped.
  • NEWS.md entry.
  • New live test (test_get_daily_max_rows_caps_total_across_pages):
    limit=1 forces multiple pages, max_rows=3 confirms the combined result
    is truncated to exactly 3 rather than paging to completion.
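The per-getter wiring described in the list above might look roughly like the following. It is a sketch under assumptions: `split_args`, `fake_get_ogc_data`, and `get_peaks_sketch` are hypothetical, and the real `get_ogc_data` signature is not shown in this PR description.

```python
from __future__ import annotations

from typing import Any

# Keys handled client-side that must never be sent as OGC query parameters.
CLIENT_ONLY = {"max_rows"}


def split_args(kwargs: dict[str, Any]) -> tuple[dict[str, Any], int | None]:
    """Separate server-bound query args from the client-side row cap."""
    query = {
        key: value
        for key, value in kwargs.items()
        if value is not None and key not in CLIENT_ONLY
    }
    return query, kwargs.get("max_rows")


def fake_get_ogc_data(
    query_args: dict[str, Any], max_rows: int | None = None
) -> dict[str, Any]:
    """Stand-in for the engine: echoes what it received instead of calling the API."""
    return {"query_args": query_args, "max_rows": max_rows}


def get_peaks_sketch(
    *,
    parameter_code: str | None = None,
    limit: int | None = None,
    max_rows: int | None = None,
) -> dict[str, Any]:
    """Hypothetical getter: limit goes to the server, max_rows stays on the client."""
    query_args, cap = split_args(
        {"parameter_code": parameter_code, "limit": limit, "max_rows": max_rows}
    )
    return fake_get_ogc_data(query_args, max_rows=cap)


result = get_peaks_sketch(parameter_code="00060", limit=1, max_rows=3)
```

Here `limit` remains in the query args as the page size, while `max_rows` is forwarded only as a keyword to the engine.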

Verification

  • New test passes live; also spot-checked get_monitoring_locations
    (the _with_state-path getter) live with max_rows.
  • Reproduced the original hang: the same broad, unfiltered query
    (get_daily(parameter_code="00060", limit=10), previously 2+ minutes)
    now returns in under a second when bounded with max_rows=10.
  • Full offline waterdata suite passes (394 tests).
  • ruff check / ruff format clean; mypy --strict shows only
    pre-existing, unrelated errors in ogc/planning.py and wateruse.py
    (confirmed present on a clean main checkout, untouched by this diff).

🤖 Generated with Claude Code

thodson-usgs and others added 4 commits June 30, 2026 17:23
…sults

limit= on get_daily, get_continuous, get_monitoring_locations,
get_time_series_metadata, get_combined_metadata, get_latest_continuous,
get_latest_daily, get_field_measurements, get_field_measurements_metadata,
get_peaks, and get_channel has always been a per-page size, not a result
cap — _paginate follows every `next` link regardless, so a broad,
unfiltered call (e.g. get_daily(parameter_code="00060", limit=10)) pages
through the entire multi-year, nationwide result 10 rows at a time. Hit
this live: it hung for 2+ minutes.

get_ogc_data already threads a max_rows kwarg through to the _row_cap
context var and _finalize_ogc's truncation (get_reference_table has used
it since it was added); these 11 getters just never exposed it. Add
max_rows: int | None = None to each, exclude it from the OGC query args,
and pass it through to get_ogc_data. Each limit docstring now says
explicitly that it's a page size, not a cap, and points to max_rows.

get_cql builds its requests directly rather than through get_ogc_data, so
it isn't covered here.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>

The original test exercised max_rows live with limit=1, forcing ~3 serial
round-trips against the real USGS API (and, under the module's flaky_api
marker, retrying the whole 3-page call on any transient blip) just to
re-prove cap-across-pages behavior that is already covered without a network
hop by the engine's _row_cap / _finalize_ogc tests in
tests/waterdata_utils_test.py.

The only behavior this PR actually adds is the per-getter wiring: max_rows
must be excluded from the request args (it's a client-side pagination cap,
not an OGC query param the server understands) and forwarded to get_ogc_data
as a keyword. Pin exactly that with the file's existing
mock.patch("...api.get_ogc_data") pattern — no network, deterministic.
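A deterministic wiring test in that spirit could be sketched as follows. The `api` namespace, `get_daily_sketch`, and `check_max_rows_wiring` are hypothetical stand-ins; the real test patches the package's own `get_ogc_data` by dotted path, which is elided in the commit message and not reproduced here.

```python
from __future__ import annotations

import types
from typing import Any
from unittest import mock

# Hypothetical stand-in for the module that owns get_ogc_data.
api = types.SimpleNamespace()


def _network_get_ogc_data(args: dict[str, Any], service: str, max_rows=None):
    raise RuntimeError("would hit the network; tests should patch this")


api.get_ogc_data = _network_get_ogc_data


def get_daily_sketch(parameter_code=None, limit=None, max_rows=None):
    """Hypothetical getter: looks up api.get_ogc_data at call time so it can be patched."""
    args = {"parameter_code": parameter_code, "limit": limit}
    args = {key: value for key, value in args.items() if value is not None}
    return api.get_ogc_data(args, "daily", max_rows=max_rows)


def check_max_rows_wiring() -> tuple[dict[str, Any], dict[str, Any]]:
    """Patch the engine and return what the getter passed to it."""
    with mock.patch.object(api, "get_ogc_data") as fake:
        get_daily_sketch(parameter_code="00060", limit=1, max_rows=3)
    fake.assert_called_once()
    sent_args = fake.call_args.args[0]  # first positional arg: the query args
    return sent_args, fake.call_args.kwargs


sent_args, sent_kwargs = check_max_rows_wiring()
```

Because the engine is replaced by a mock, the check runs without a network hop and pins only the two properties the commit describes: `max_rows` is absent from the request args and present as a keyword.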

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Removing the changelog entry per request; the code/test/docstring changes
stand on their own.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…-cap

# Conflicts:
#	dataretrieval/waterdata/api.py
@thodson-usgs thodson-usgs marked this pull request as ready for review July 1, 2026 03:33
@thodson-usgs thodson-usgs merged commit 74c4856 into DOI-USGS:main Jul 1, 2026
9 checks passed
@thodson-usgs thodson-usgs deleted the fix/ogc-getters-row-cap branch July 1, 2026 03:33