feat(waterdata): add max_rows to the OGC data getters to cap total results#340
Merged
Merged
Conversation
…sults limit= on get_daily, get_continuous, get_monitoring_locations, get_time_series_metadata, get_combined_metadata, get_latest_continuous, get_latest_daily, get_field_measurements, get_field_measurements_metadata, get_peaks, and get_channel has always been a per-page size, not a result cap — _paginate follows every `next` link regardless, so a broad, unfiltered call (e.g. get_daily(parameter_code="00060", limit=10)) pages through the entire multi-year, nationwide result 10 rows at a time. Hit this live: it hung for 2+ minutes. get_ogc_data already threads a max_rows kwarg through to the _row_cap context var and _finalize_ogc's truncation (get_reference_table has used it since it was added); these 11 getters just never exposed it. Add max_rows: int | None = None to each, exclude it from the OGC query args, and pass it through to get_ogc_data. Each limit docstring now says explicitly that it's a page size, not a cap, and points to max_rows. get_cql builds its requests directly rather than through get_ogc_data, so it isn't covered here. Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
The original test exercised max_rows live with limit=1, forcing ~3 serial
round-trips against the real USGS API (and, under the module's flaky_api
marker, retrying the whole 3-page call on any transient blip) just to
re-prove cap-across-pages behavior that is already covered without a network
hop by the engine's _row_cap / _finalize_ogc tests in
tests/waterdata_utils_test.py.
The only behavior this PR actually adds is the per-getter wiring: max_rows
must be excluded from the request args (it's a client-side pagination cap,
not an OGC query param the server understands) and forwarded to get_ogc_data
as a keyword. Pin exactly that with the file's existing
mock.patch("...api.get_ogc_data") pattern — no network, deterministic.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Removing the changelog entry per request; the code/test/docstring changes stand on their own. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-cap # Conflicts: # dataretrieval/waterdata/api.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
limit=on the 11 typed OGC data getters (get_daily,get_continuous,get_monitoring_locations,get_time_series_metadata,get_combined_metadata,get_latest_continuous,get_latest_daily,get_field_measurements,get_field_measurements_metadata,get_peaks,get_channel) has always been a per-page size, not a total-result cap —_paginatefollows everynextlink regardless oflimit, so a broad,unfiltered call pages through the entire matching result.
Hit this live while testing #333:
get_daily(parameter_code="00060", limit=10)with no narrowing filter hung for 2+ minutes paging through thefull nationwide, multi-year result 10 rows at a time, rather than stopping
at 10 rows as the parameter name suggests.
get_reference_tablealready had the fix for this exact problem —get_ogc_datahas threaded amax_rowskwarg through to the_row_capcontext var and
_finalize_ogc's truncation since it was added — the 11typed getters just never exposed it on their public signature.
Changes
max_rows: int | None = Noneto each of the 11 getters, excluded fromthe OGC query args and passed through to
get_ogc_data.limitdocstring now says explicitly that it's a page size, not aresult cap, and points to
max_rows.get_cql(the raw-CQL escape hatch) builds its requests directly ratherthan going through
get_ogc_data, so it isn't covered here — noted as afollow-up, not silently dropped.
NEWS.mdentry.test_get_daily_max_rows_caps_total_across_pages):limit=1forces multiple pages,max_rows=3confirms the combined resultis truncated to exactly 3 rather than paging to completion.
Verification
get_monitoring_locations(the
_with_state-path getter) live withmax_rows.(
get_daily(parameter_code="00060", limit=10), previously 2+ minutes)now returns in under a second when bounded with
max_rows=10.waterdatasuite passes (394 tests).ruff check/ruff formatclean;mypy --strictshows onlypre-existing, unrelated errors in
ogc/planning.pyandwateruse.py(confirmed present on a clean
maincheckout, untouched by this diff).🤖 Generated with Claude Code