0.6.1: bug fixes and performance improvements by dshkol · Pull Request #225 · mountainMath/cancensus

dshkol · 2026-06-11T15:34:16Z

Summary

Patch release following up on 0.6.0. A systematic review of the package surfaced 18 bugs (three of them data-loss class) and several optimization opportunities missed by the 0.6.0 performance pass. This PR fixes all of them across 8 commits — bug fixes first, then performance work, with each commit message carrying the details. Per-commit write-ups and benchmark results follow in the comment thread.

Bug fixes (highlights)

Data loss / corruption:

set_cancensus_api_key() / set_cancensus_cache_path() with overwrite=TRUE, install=TRUE truncated the user's entire .Renviron when it didn't already contain the key (x[-integer(0)] selects nothing, not everything).
get_statcan_wds_data() cached HTTP error bodies as data: write_disk() persists the body on non-200 and the file was never removed, so every later call — including refresh=TRUE, which was ignored — parsed the error payload as census CSV.
remove_recalled_cached_data() matched recalled vector IDs as unanchored regex substrings, so a recall of v_CA21_1 also flagged (and could delete) cached data for v_CA21_10, v_CA21_123, …

Silently wrong results:

parent_census_vectors() / child_census_vectors() with mixed-dataset input silently dropped vectors instead of erroring (scalar ifelse() in dataset_from_vector_list()).
get_census() never warned about recalled data when reading tabular data from cache (the check sat in an unreachable branch); the geometry path warned correctly.
Semantic search only ever consulted the full-phrase distance column (seq_along(ncol(df)) is always 1) and 0.6.0 dropped suffix n-grams for short details; keyword search matched every vector for queries starting with a digit; exact search treated queries as unescaped regex ("income (" errored, "income ($)" never matched).
add_unique_names_to_region_list() dropped single-variable grouping and errored on multi-variable grouping.
get_statcan_geographies() ignored its documented cache_path argument.

Other: staleness warnings now respect quiet=TRUE (and no longer crash on a missing last_updated attribute); visualize_vector_hierarchy() handles multi-vector character input and no longer labels max_depth-truncated nodes as (leaf); retry logic now covers HTTP 429/408 and honors Retry-After; cache directories are created recursively with a clear error on failure.

Performance improvements

benchmark	v0.6.0	v0.6.1	speedup
`child_census_vectors()` (6-level tree, ~5.5k nodes)	4.59 ms	0.78 ms	~5.9x
`child_census_vectors(leaves_only=TRUE)`	4.86 ms	1.20 ms	~4.1x
`parent_census_vectors()` (deep leaf)	3.13 ms	0.40 ms	~7.8x
semantic search (7.7k vectors)	291 ms	245 ms	~1.2x
`list_census_vectors()` cache hit	4.77 ms	0.003 ms	~1600x

Medians via microbenchmark, mock data, R 4.5.0/macOS; script and methodology in benchmarks/benchmark_v0.6.1.R. Key changes: an in-memory session cache in front of the tempdir .rda metadata cache (hierarchy/search/viz helpers were re-deserializing it 2-3x per call), a hash-based BFS rewrite of the hierarchy traversal, length-bound pruning before adist() in semantic search, removal of an installed.packages() scan from every spatial get_census() call, and vectorized xml_find_first() parent-ID extraction in WDS metadata parsing (~57k codes at DA level).

Verification

devtools::test(): 43 pass, 0 fail, 0 warnings
R CMD check: 0 errors, 0 warnings, 1 NOTE (environment clock artifact: "unable to verify current time")
BFS hierarchy rewrite verified to produce identical rows and ordering to the previous implementation across randomized 2,000-node hierarchies x max_level in {NA,1,2,3,10} x leaves_only x keep_parent

🤖 Generated with Claude Code

… recall over-matching - set_cancensus_api_key()/set_cancensus_cache_path() with overwrite=TRUE, install=TRUE truncated the user's entire .Renviron when it did not yet contain the key: x[-integer(0)] selects nothing rather than everything. - get_statcan_wds_data() cached error response bodies as data: write_disk() persists the body on non-200 responses and the file was never removed, so every subsequent call (including refresh=TRUE, which was ignored for the data cache) parsed the cached error payload as census CSV. - list_recalled_cached_data() matched recalled vector IDs as unanchored regex substrings, so a recall of v_CA21_1 also flagged v_CA21_10 etc., and remove_recalled_cached_data() could delete never-recalled data. IDs are now escaped and anchored on the JSON quotes of the vectors field. - get_recalled_database() error handler assigned the problem flag with <- instead of <<-, so download errors fell through to read_csv on a nonexistent file. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ions - dataset_from_vector_list() used a scalar ifelse(), which returns only the first element; the mixed-dataset guard could never fire and hierarchy helpers silently dropped vectors from other datasets. - get_census() only checked for recalled data in an unreachable branch (readr is in Imports, so the non-readr fallback never runs) and on the fresh-download path where recall is impossible. The check now runs when reading tabular data from the local cache, mirroring the geo path. - add_unique_names_to_region_list() restored groups with length(gs)>1 (dropping single-variable grouping) and passed a list of symbols to all_of() (erroring on multi-variable grouping). Uses group_vars() now. - get_statcan_geographies() ignored its documented cache_path argument; the local argument shadowed the cache_path() helper, so user-supplied paths silently fell through to the default cache. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…oken - find_census_vectors(query_type='exact') passed the query to grep() as an unescaped regex: queries containing metacharacters like "income (" hard errored and "income ($)" silently never matched. Queries are now escaped and matched literally (new regex_escape() helper, also reused for the recalled-data pattern). - semantic_search() used seq_along(ncol(df)), which is always 1, so only the full-phrase distance column was ever consulted while the match gate used all columns; per-word matches passed the gate but grepped the wrong n-gram. Now seq_len(ncol(df)). - semantic_search() n-gram generation (PR #216) dropped suffix n-grams for details shorter than the query, flipping former matches into "No close matches found". The general loop already handles that case; the early return is removed, restoring pre-0.6.0 n-grams. - keyword_search() built its alternation regex from strsplit tokens without dropping empty strings, so queries starting with a digit or punctuation (e.g. "2015 income") produced an empty alternative that matched every vector in the dataset. - search_census_vectors()/search_census_regions() tested length() of a one-column tibble (always 1), making the "No results found" stop() unreachable and printing empty hint lists. Now nrow(). Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ation - list_census_datasets()/list_census_regions(): operator precedence in `!quiet && is.null(x) || stale` left the staleness warning outside the quiet guard (and crashed on a NULL last_updated attribute with quiet=TRUE). Parenthesized so quiet=TRUE suppresses the warning. - visualize_vector_hierarchy(): character input of length > 1 was compared with == (element-wise recycling against the full vector column), erroring instead of taking the documented use-first-vector path. Now %in%. - visualize_vector_hierarchy(): nodes truncated by max_depth were labeled " (leaf)" even when they have children in the dataset. Leaf status is now checked against the full vector list; truncated nodes print " ..." instead. - retry_api_call(): HTTP 429 (rate limit) and 408 (request timeout) are now retried like 5xx, honoring a Retry-After header when present (capped at 60s). Previously the most common transient failures aborted immediately. - cache_path(): the cache directory is now created recursively and a clear error is raised when creation fails, instead of silently continuing and failing later with a cryptic write error. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Follow-up to the seq_len(ncol()) fix: which.min per column yields one best n-gram per query component, but grep() only uses the first pattern element. Combine the unique best n-grams into a single escaped alternation so vectors matching any query component's closest n-gram are returned. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

list_census_vectors()/list_census_regions() cached results as compressed .rda files in tempdir(), paying a disk read + decompress + unserialize of a multi-thousand-row tibble on every call - and hierarchy/search/viz helpers call them two to three times per invocation. A package-level environment now fronts the file cache: hits return in microseconds instead of tens of milliseconds. The tempdir() file cache is retained unchanged as the fallback, and use_cache=FALSE refreshes both layers. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

- parent_census_vectors()/child_census_vectors(): replaced the per-level bind_rows() + distinct() + full-set re-filter loop with a BFS over plain character vectors (hash-based %in%/match), subsetting the full vector list once at the end. Output rows and ordering are identical (verified against the previous implementation across roots, max_level, leaves_only and keep_parent combinations on randomized hierarchies). - semantic_search(): Levenshtein distance is bounded below by string length difference, so n-grams more than 2 characters longer/shorter than a query term can never pass the distance-2 threshold; they are pruned before adist(), which dominates the function's runtime, and treated as infinitely distant. Best n-grams are only collected from query components that actually have a within-threshold match. Empty vector lists now fail with a clear error. - keyword_search(): the cleaned details corpus is built as a character vector via vapply() instead of a list, avoiding coercion in grep() and regmatches() on every search. - get_census(): removed the unreachable installed.packages()-based install prompts (sf presence is already enforced earlier; installed.packages() scans the whole library) and the dead agr column computation that scanned and copied the assembled result frame before being discarded. - get_statcan_wds_metadata(): Parent ID extraction uses vectorized xml_find_first() instead of an R-level lapply over one nodeset per code (~57k codes at DA level); removed duplicate unused description queries. - list_cancensus_cache(): file sizes are stat'ed in one vectorized file.info() call instead of per-file tibbles. - dataset_from_vector_list(): single vectorized sub() instead of per-element strsplit. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

dshkol · 2026-06-11T15:34:52Z

Bug fix details (by commit)

`7fcc886` — Critical data-loss bugs

#	Location	Bug	Trigger
1	`R/user_settings.R` (both setters)	`oldenv[-grep(key, oldenv)]` evaluates to `oldenv[integer(0)]` (empty) when no line matches, truncating `.Renviron` to just the new key	`set_cancensus_api_key("...", overwrite=TRUE, install=TRUE)` on an `.Renviron` with other env vars but no `CM_API_KEY` — everything else (GitHub PATs, other keys) silently destroyed
2	`R/wds.R` `get_statcan_wds_data()`	`httr::write_disk()` writes the body even on non-200; the file wasn't deleted before `stop()`, and the cache-existence check ignored `refresh`	Any transient StatCan 5xx: every subsequent call in the session, including `refresh=TRUE`, fed the cached XML/HTML error payload into `read_csv()` as census data
3	`R/recalled_data.R` `list_recalled_cached_data()`	Recalled vector IDs pasted into a regex unescaped and unanchored	A recall of `v_CA21_1` also flagged cached extracts for `v_CA21_10`, `v_CA21_123`, … and `remove_recalled_cached_data()` would delete them. Now escaped and anchored on the JSON quotes of the `vectors` metadata field
4	`R/recalled_data.R` `get_recalled_database()`	error handler set the `problem` flag with `<-` (local) instead of `<<-`	A download error (vs. warning) fell through to `read_csv()` on a nonexistent file

`27c7bcf` — Silently wrong results

#	Location	Bug	Trigger
5	`R/helpers.R` `dataset_from_vector_list()`	scalar `ifelse()` returns only the first element, so the mixed-dataset guard could never fire	`parent_census_vectors(c("v_CA16_2519", "v_CA21_906"))` silently inferred CA16 and dropped the CA21 vector instead of stopping with "Unable to determine dataset"
6	`R/cancensus.R` `get_census()`	the tabular recalled-data check sat in the fresh-download path inside a `!requireNamespace("readr")` branch — unreachable since readr moved to Imports	Cached tabular data that was later recalled produced no warning; the geometry path warned correctly. Check moved to the cache-read branch, mirroring geo
7	`R/census_regions.R` `add_unique_names_to_region_list()`	`length(gs)>1` off-by-one + `all_of()` on a list of symbols	One grouping var → silently ungrouped output; two+ → error. Now `group_vars()` + `length>0`
8	`R/geographies.R` `get_statcan_geographies()`	documented `cache_path` argument never used (local arg shadowed by the `cache_path()` helper at call resolution)	User-supplied cache paths silently ignored; data always went to the default cache

`96802e9` + `eed5038` — Search correctness

#	Location	Bug	Trigger
9	`R/vector_discovery.R` exact search	query used as unescaped regex	`find_census_vectors("income (", ...)` → hard error; `"income ($)"` → silent "No exact matches found" despite literal matches existing
10	`R/vector_discovery.R` semantic search	`seq_along(ncol(df))` ≡ `1`: only the full-phrase column was consulted while the match gate used all columns	Queries whose only close match was via an individual word grepped the wrong n-gram. Follow-up (`eed5038`): the per-column best n-grams are combined into one escaped alternation
11	`R/vector_discovery.R` n-gram generation	0.6.0 perf pass dropped suffix n-grams for details shorter than the query	4-word query vs. a 2-word detail: the suffix n-gram that previously matched at distance 0 no longer existed → false "No close matches found"
12	`R/vector_discovery.R` keyword search	empty strsplit tokens produced an empty regex alternative	`find_census_vectors("2015 income", ..., query_type="keyword")` matched every vector in the dataset
13	`R/vector_discovery.R` / `R/census_regions.R`	`length()` of a 1-column tibble is always 1	"No results found" `stop()` unreachable; empty hint tibbles printed with a misleading warning. Now `nrow()`

`303028b` — Quiet handling, viz, retry, cache dir

#	Location	Bug	Trigger
14	`R/cancensus.R` / `R/census_regions.R`	`!quiet && is.null(x) \|\| stale` precedence: staleness check escaped the quiet guard	`list_census_regions("CA16", quiet=TRUE)` with a >1-day-old session cache still warned; `last_updated=NULL` + `quiet=TRUE` crashed with "argument is of length zero"
15	`R/vector_viz.R`	`==` instead of `%in%` for character input of length >1 (element-wise recycling against the full vector column)	`visualize_vector_hierarchy(c("v1","v2"), ...)` errored instead of taking the documented use-first-vector path
16	`R/vector_viz.R`	leaf check consulted only the `max_depth`-truncated tree	Nodes with real children printed `(leaf)` — exactly the label users rely on for additive aggregation. Truncated nodes now print `...`
17	`R/helpers.R` `retry_api_call()`	only 5xx retried	HTTP 429 (rate limit) and 408 aborted immediately even with `retry=3`. Both now retried with backoff, honoring `Retry-After` (capped 60s)
18	`R/helpers.R` `cache_path()`	`dir.create` non-recursive with warnings suppressed	`set_cancensus_cache_path("~/a/b/c")` where `~/a/b` doesn't exist failed silently, then `get_census()` died later with a cryptic write error. Now recursive + clear error

🤖 Generated with Claude Code

dshkol · 2026-06-11T15:35:27Z

Performance details (`86ae5b9`, `1309a19`)

Methodology

Benchmarked with microbenchmark on mock data (no API calls), R 4.5.0 / macOS, comparing a worktree at the v0.6.0 tag against this branch with the identical script — now committed as benchmarks/benchmark_v0.6.1.R. Hierarchy benchmarks use a 6-level, branching-factor-4 tree (5,461 nodes); search and cache benchmarks use a 7,700-row vector list sized like CA16.

Results (medians)

                        v0.6.0      v0.6.1     speedup
 child_full             4.59 ms     0.78 ms     ~5.9x
 child_leaves           4.86 ms     1.20 ms     ~4.1x
 parent_deep            3.13 ms     0.40 ms     ~7.8x
 semantic search         291 ms      245 ms     ~1.2x
 keyword search          131 ms      129 ms       ~1x
 list_vectors (cached)  4.77 ms    0.003 ms   ~1600x

What changed

In-memory session cache (86ae5b9). list_census_vectors()/list_census_regions() cached results as compressed .rda files in tempdir() and paid a disk read + decompress + unserialize on every call — and the hierarchy/search/viz helpers call them 2-3x per invocation (visualize_vector_hierarchy() alone loads the vector list three times). A package-level environment now fronts the file cache; the file layer is unchanged as the fallback and use_cache=FALSE refreshes both. This is the change that compounds: every search and hierarchy entry point starts with this call, so real-world find_census_vectors() / child_census_vectors() invocations gain this on top of their own speedups.

BFS hierarchy traversal (1309a19). The 0.6.0 loop still did bind_rows() + distinct() over the full accumulated set and re-filtered against all found vectors at every depth level — O(depth x total_found) with tibble copies per level. Now a BFS over plain character vectors (hash-based %in%/match) with a single tibble subset at the end. Equivalence verified: identical rows and ordering to the old implementation across randomized 2,000-node hierarchies x max_level in {NA, 1, 2, 3, 10} x leaves_only x keep_parent (164 combinations).

Semantic search adist pruning (1309a19). Levenshtein distance is bounded below by string-length difference, so n-grams more than 2 characters longer/shorter than a query term can never pass the distance-2 match threshold — they're pruned before adist() and treated as infinitely distant, which is exact with respect to both the match gate and the winner selection. Profiling after the change shows adist is no longer the bottleneck; remaining time is inherent n-gram generation (paste in the per-sentence loop), so the modest 1.2x here is the honest ceiling of this approach. End-to-end find_census_vectors() calls additionally gain the cache win above.

Misc (1309a19).

get_census(): removed an installed.packages() scan (R docs: "can be slow") from every spatial call — it guarded install-prompt branches that were unreachable because sf presence is already enforced earlier; also removed a dead agr computation that scanned and partially copied the full assembled result frame (50k+ rows at DA level) before being discarded.
get_statcan_wds_metadata(): parent-ID extraction switched from an R-level lapply over one nodeset per code (~57k codes for CL_GEO_DA) to vectorized xml_find_first(), which returns NA for missing nodes — identical semantics, C-speed. Two duplicate unused XPath queries deleted.
keyword_search(): cleaned corpus built as a character vector via vapply() instead of a list, removing as.character() coercion inside grep()/regmatches() on every search.
list_cancensus_cache(): one vectorized file.info() batch instead of per-file calls wrapped in one-row tibbles.

Verification on the final tree

devtools::test(): 43 pass / 0 fail / 0 warnings
R CMD check (0.6.1): 0 errors / 0 warnings / 1 NOTE ("unable to verify current time" — sandbox clock artifact, not package-related)

🤖 Generated with Claude Code

dshkol · 2026-06-11T15:42:59Z

@mountainMath in case you wanted to see what the latest gen model does :)

mountainMath · 2026-06-11T16:58:41Z

Nice, will take a look later today and try to get this off to CRAN asap.

Full 5-platform matrix on pull_request; single ubuntu-latest/release for push, schedule, and manual triggers. Adds concurrency group to cancel stale PR runs without stomping main-branch checks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

dshkol and others added 8 commits June 11, 2026 07:55

Bump version to 0.6.1, add NEWS entries and benchmark script

e951850

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

mountainMath and others added 11 commits June 11, 2026 15:14

add positai to ignore files

6bfc2b1

cran comments

fe19bbd

update readme for new version

94f8890

update github action to use v6

8414626

update github pages docs

c140ac7

update roxygen

bd2c8ec

update broken links to censusmapper

15a0d1e

regenerate github pages

cd976f7

update links to new censusmapper

36aa6e3

udpate gh pages

dc49fb3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

0.6.1: bug fixes and performance improvements#225

0.6.1: bug fixes and performance improvements#225
dshkol wants to merge 19 commits into
mainfrom
v0.6.1

dshkol commented Jun 11, 2026

Uh oh!

dshkol commented Jun 11, 2026

Uh oh!

dshkol commented Jun 11, 2026

Uh oh!

dshkol commented Jun 11, 2026

Uh oh!

mountainMath commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

dshkol commented Jun 11, 2026

Summary

Bug fixes (highlights)

Performance improvements

Verification

Uh oh!

dshkol commented Jun 11, 2026

Bug fix details (by commit)

7fcc886 — Critical data-loss bugs

27c7bcf — Silently wrong results

96802e9 + eed5038 — Search correctness

303028b — Quiet handling, viz, retry, cache dir

Uh oh!

dshkol commented Jun 11, 2026

Performance details (86ae5b9, 1309a19)

Methodology

Results (medians)

What changed

Verification on the final tree

Uh oh!

dshkol commented Jun 11, 2026

Uh oh!

mountainMath commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

`7fcc886` — Critical data-loss bugs

`27c7bcf` — Silently wrong results

`96802e9` + `eed5038` — Search correctness

`303028b` — Quiet handling, viz, retry, cache dir

Performance details (`86ae5b9`, `1309a19`)