Conversation
… recall over-matching - set_cancensus_api_key()/set_cancensus_cache_path() with overwrite=TRUE, install=TRUE truncated the user's entire .Renviron when it did not yet contain the key: x[-integer(0)] selects nothing rather than everything. - get_statcan_wds_data() cached error response bodies as data: write_disk() persists the body on non-200 responses and the file was never removed, so every subsequent call (including refresh=TRUE, which was ignored for the data cache) parsed the cached error payload as census CSV. - list_recalled_cached_data() matched recalled vector IDs as unanchored regex substrings, so a recall of v_CA21_1 also flagged v_CA21_10 etc., and remove_recalled_cached_data() could delete never-recalled data. IDs are now escaped and anchored on the JSON quotes of the vectors field. - get_recalled_database() error handler assigned the problem flag with <- instead of <<-, so download errors fell through to read_csv on a nonexistent file. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ions - dataset_from_vector_list() used a scalar ifelse(), which returns only the first element; the mixed-dataset guard could never fire and hierarchy helpers silently dropped vectors from other datasets. - get_census() only checked for recalled data in an unreachable branch (readr is in Imports, so the non-readr fallback never runs) and on the fresh-download path where recall is impossible. The check now runs when reading tabular data from the local cache, mirroring the geo path. - add_unique_names_to_region_list() restored groups with length(gs)>1 (dropping single-variable grouping) and passed a list of symbols to all_of() (erroring on multi-variable grouping). Uses group_vars() now. - get_statcan_geographies() ignored its documented cache_path argument; the local argument shadowed the cache_path() helper, so user-supplied paths silently fell through to the default cache. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…oken
- find_census_vectors(query_type='exact') passed the query to grep() as an
unescaped regex: queries containing metacharacters like "income (" hard
errored and "income ($)" silently never matched. Queries are now escaped
and matched literally (new regex_escape() helper, also reused for the
recalled-data pattern).
- semantic_search() used seq_along(ncol(df)), which is always 1, so only
the full-phrase distance column was ever consulted while the match gate
used all columns; per-word matches passed the gate but grepped the wrong
n-gram. Now seq_len(ncol(df)).
- semantic_search() n-gram generation (PR #216) dropped suffix n-grams for
details shorter than the query, flipping former matches into "No close
matches found". The general loop already handles that case; the early
return is removed, restoring pre-0.6.0 n-grams.
- keyword_search() built its alternation regex from strsplit tokens without
dropping empty strings, so queries starting with a digit or punctuation
(e.g. "2015 income") produced an empty alternative that matched every
vector in the dataset.
- search_census_vectors()/search_census_regions() tested length() of a
one-column tibble (always 1), making the "No results found" stop()
unreachable and printing empty hint lists. Now nrow().
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ation - list_census_datasets()/list_census_regions(): operator precedence in `!quiet && is.null(x) || stale` left the staleness warning outside the quiet guard (and crashed on a NULL last_updated attribute with quiet=TRUE). Parenthesized so quiet=TRUE suppresses the warning. - visualize_vector_hierarchy(): character input of length > 1 was compared with == (element-wise recycling against the full vector column), erroring instead of taking the documented use-first-vector path. Now %in%. - visualize_vector_hierarchy(): nodes truncated by max_depth were labeled " (leaf)" even when they have children in the dataset. Leaf status is now checked against the full vector list; truncated nodes print " ..." instead. - retry_api_call(): HTTP 429 (rate limit) and 408 (request timeout) are now retried like 5xx, honoring a Retry-After header when present (capped at 60s). Previously the most common transient failures aborted immediately. - cache_path(): the cache directory is now created recursively and a clear error is raised when creation fails, instead of silently continuing and failing later with a cryptic write error. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Follow-up to the seq_len(ncol()) fix: which.min per column yields one best n-gram per query component, but grep() only uses the first pattern element. Combine the unique best n-grams into a single escaped alternation so vectors matching any query component's closest n-gram are returned. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
list_census_vectors()/list_census_regions() cached results as compressed .rda files in tempdir(), paying a disk read + decompress + unserialize of a multi-thousand-row tibble on every call - and hierarchy/search/viz helpers call them two to three times per invocation. A package-level environment now fronts the file cache: hits return in microseconds instead of tens of milliseconds. The tempdir() file cache is retained unchanged as the fallback, and use_cache=FALSE refreshes both layers. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- parent_census_vectors()/child_census_vectors(): replaced the per-level bind_rows() + distinct() + full-set re-filter loop with a BFS over plain character vectors (hash-based %in%/match), subsetting the full vector list once at the end. Output rows and ordering are identical (verified against the previous implementation across roots, max_level, leaves_only and keep_parent combinations on randomized hierarchies). - semantic_search(): Levenshtein distance is bounded below by string length difference, so n-grams more than 2 characters longer/shorter than a query term can never pass the distance-2 threshold; they are pruned before adist(), which dominates the function's runtime, and treated as infinitely distant. Best n-grams are only collected from query components that actually have a within-threshold match. Empty vector lists now fail with a clear error. - keyword_search(): the cleaned details corpus is built as a character vector via vapply() instead of a list, avoiding coercion in grep() and regmatches() on every search. - get_census(): removed the unreachable installed.packages()-based install prompts (sf presence is already enforced earlier; installed.packages() scans the whole library) and the dead agr column computation that scanned and copied the assembled result frame before being discarded. - get_statcan_wds_metadata(): Parent ID extraction uses vectorized xml_find_first() instead of an R-level lapply over one nodeset per code (~57k codes at DA level); removed duplicate unused description queries. - list_cancensus_cache(): file sizes are stat'ed in one vectorized file.info() call instead of per-file tibbles. - dataset_from_vector_list(): single vectorized sub() instead of per-element strsplit. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Bug fix details (by commit)
|
| # | Location | Bug | Trigger |
|---|---|---|---|
| 1 | R/user_settings.R (both setters) |
oldenv[-grep(key, oldenv)] evaluates to oldenv[integer(0)] (empty) when no line matches, truncating .Renviron to just the new key |
set_cancensus_api_key("...", overwrite=TRUE, install=TRUE) on an .Renviron with other env vars but no CM_API_KEY — everything else (GitHub PATs, other keys) silently destroyed |
| 2 | R/wds.R get_statcan_wds_data() |
httr::write_disk() writes the body even on non-200; the file wasn't deleted before stop(), and the cache-existence check ignored refresh |
Any transient StatCan 5xx: every subsequent call in the session, including refresh=TRUE, fed the cached XML/HTML error payload into read_csv() as census data |
| 3 | R/recalled_data.R list_recalled_cached_data() |
Recalled vector IDs pasted into a regex unescaped and unanchored | A recall of v_CA21_1 also flagged cached extracts for v_CA21_10, v_CA21_123, … and remove_recalled_cached_data() would delete them. Now escaped and anchored on the JSON quotes of the vectors metadata field |
| 4 | R/recalled_data.R get_recalled_database() |
error handler set the problem flag with <- (local) instead of <<- |
A download error (vs. warning) fell through to read_csv() on a nonexistent file |
27c7bcf — Silently wrong results
| # | Location | Bug | Trigger |
|---|---|---|---|
| 5 | R/helpers.R dataset_from_vector_list() |
scalar ifelse() returns only the first element, so the mixed-dataset guard could never fire |
parent_census_vectors(c("v_CA16_2519", "v_CA21_906")) silently inferred CA16 and dropped the CA21 vector instead of stopping with "Unable to determine dataset" |
| 6 | R/cancensus.R get_census() |
the tabular recalled-data check sat in the fresh-download path inside a !requireNamespace("readr") branch — unreachable since readr moved to Imports |
Cached tabular data that was later recalled produced no warning; the geometry path warned correctly. Check moved to the cache-read branch, mirroring geo |
| 7 | R/census_regions.R add_unique_names_to_region_list() |
length(gs)>1 off-by-one + all_of() on a list of symbols |
One grouping var → silently ungrouped output; two+ → error. Now group_vars() + length>0 |
| 8 | R/geographies.R get_statcan_geographies() |
documented cache_path argument never used (local arg shadowed by the cache_path() helper at call resolution) |
User-supplied cache paths silently ignored; data always went to the default cache |
96802e9 + eed5038 — Search correctness
| # | Location | Bug | Trigger |
|---|---|---|---|
| 9 | R/vector_discovery.R exact search |
query used as unescaped regex | find_census_vectors("income (", ...) → hard error; "income ($)" → silent "No exact matches found" despite literal matches existing |
| 10 | R/vector_discovery.R semantic search |
seq_along(ncol(df)) ≡ 1: only the full-phrase column was consulted while the match gate used all columns |
Queries whose only close match was via an individual word grepped the wrong n-gram. Follow-up (eed5038): the per-column best n-grams are combined into one escaped alternation |
| 11 | R/vector_discovery.R n-gram generation |
0.6.0 perf pass dropped suffix n-grams for details shorter than the query | 4-word query vs. a 2-word detail: the suffix n-gram that previously matched at distance 0 no longer existed → false "No close matches found" |
| 12 | R/vector_discovery.R keyword search |
empty strsplit tokens produced an empty regex alternative | find_census_vectors("2015 income", ..., query_type="keyword") matched every vector in the dataset |
| 13 | R/vector_discovery.R / R/census_regions.R |
length() of a 1-column tibble is always 1 |
"No results found" stop() unreachable; empty hint tibbles printed with a misleading warning. Now nrow() |
303028b — Quiet handling, viz, retry, cache dir
| # | Location | Bug | Trigger |
|---|---|---|---|
| 14 | R/cancensus.R / R/census_regions.R |
!quiet && is.null(x) || stale precedence: staleness check escaped the quiet guard |
list_census_regions("CA16", quiet=TRUE) with a >1-day-old session cache still warned; last_updated=NULL + quiet=TRUE crashed with "argument is of length zero" |
| 15 | R/vector_viz.R |
== instead of %in% for character input of length >1 (element-wise recycling against the full vector column) |
visualize_vector_hierarchy(c("v1","v2"), ...) errored instead of taking the documented use-first-vector path |
| 16 | R/vector_viz.R |
leaf check consulted only the max_depth-truncated tree |
Nodes with real children printed (leaf) — exactly the label users rely on for additive aggregation. Truncated nodes now print ... |
| 17 | R/helpers.R retry_api_call() |
only 5xx retried | HTTP 429 (rate limit) and 408 aborted immediately even with retry=3. Both now retried with backoff, honoring Retry-After (capped 60s) |
| 18 | R/helpers.R cache_path() |
dir.create non-recursive with warnings suppressed |
set_cancensus_cache_path("~/a/b/c") where ~/a/b doesn't exist failed silently, then get_census() died later with a cryptic write error. Now recursive + clear error |
🤖 Generated with Claude Code
Performance details (
|
|
@mountainMath in case you wanted to see what the latest gen model does :) |
|
Nice, will take a look later today and try to get this off to CRAN asap. |
Full 5-platform matrix on pull_request; single ubuntu-latest/release for push, schedule, and manual triggers. Adds concurrency group to cancel stale PR runs without stomping main-branch checks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
Patch release following up on 0.6.0. A systematic review of the package surfaced 18 bugs (three of them data-loss class) and several optimization opportunities missed by the 0.6.0 performance pass. This PR fixes all of them across 8 commits — bug fixes first, then performance work, with each commit message carrying the details. Per-commit write-ups and benchmark results follow in the comment thread.
Bug fixes (highlights)
Data loss / corruption:
set_cancensus_api_key()/set_cancensus_cache_path()withoverwrite=TRUE, install=TRUEtruncated the user's entire.Renvironwhen it didn't already contain the key (x[-integer(0)]selects nothing, not everything).get_statcan_wds_data()cached HTTP error bodies as data:write_disk()persists the body on non-200 and the file was never removed, so every later call — includingrefresh=TRUE, which was ignored — parsed the error payload as census CSV.remove_recalled_cached_data()matched recalled vector IDs as unanchored regex substrings, so a recall ofv_CA21_1also flagged (and could delete) cached data forv_CA21_10,v_CA21_123, …Silently wrong results:
parent_census_vectors()/child_census_vectors()with mixed-dataset input silently dropped vectors instead of erroring (scalarifelse()indataset_from_vector_list()).get_census()never warned about recalled data when reading tabular data from cache (the check sat in an unreachable branch); the geometry path warned correctly.seq_along(ncol(df))is always 1) and 0.6.0 dropped suffix n-grams for short details; keyword search matched every vector for queries starting with a digit; exact search treated queries as unescaped regex ("income ("errored,"income ($)"never matched).add_unique_names_to_region_list()dropped single-variable grouping and errored on multi-variable grouping.get_statcan_geographies()ignored its documentedcache_pathargument.Other: staleness warnings now respect
quiet=TRUE(and no longer crash on a missinglast_updatedattribute);visualize_vector_hierarchy()handles multi-vector character input and no longer labelsmax_depth-truncated nodes as(leaf); retry logic now covers HTTP 429/408 and honorsRetry-After; cache directories are created recursively with a clear error on failure.Performance improvements
child_census_vectors()(6-level tree, ~5.5k nodes)child_census_vectors(leaves_only=TRUE)parent_census_vectors()(deep leaf)list_census_vectors()cache hitMedians via
microbenchmark, mock data, R 4.5.0/macOS; script and methodology inbenchmarks/benchmark_v0.6.1.R. Key changes: an in-memory session cache in front of the tempdir.rdametadata cache (hierarchy/search/viz helpers were re-deserializing it 2-3x per call), a hash-based BFS rewrite of the hierarchy traversal, length-bound pruning beforeadist()in semantic search, removal of aninstalled.packages()scan from every spatialget_census()call, and vectorizedxml_find_first()parent-ID extraction in WDS metadata parsing (~57k codes at DA level).Verification
devtools::test(): 43 pass, 0 fail, 0 warningsR CMD check: 0 errors, 0 warnings, 1 NOTE (environment clock artifact: "unable to verify current time")max_levelin {NA,1,2,3,10} xleaves_onlyxkeep_parent🤖 Generated with Claude Code