Skip to content

0.6.1: bug fixes and performance improvements#225

Open
dshkol wants to merge 19 commits into
mainfrom
v0.6.1
Open

0.6.1: bug fixes and performance improvements#225
dshkol wants to merge 19 commits into
mainfrom
v0.6.1

Conversation

@dshkol

@dshkol dshkol commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator

Summary

Patch release following up on 0.6.0. A systematic review of the package surfaced 18 bugs (three of them data-loss class) and several optimization opportunities missed by the 0.6.0 performance pass. This PR fixes all of them across 8 commits — bug fixes first, then performance work, with each commit message carrying the details. Per-commit write-ups and benchmark results follow in the comment thread.

Bug fixes (highlights)

Data loss / corruption:

  • set_cancensus_api_key() / set_cancensus_cache_path() with overwrite=TRUE, install=TRUE truncated the user's entire .Renviron when it didn't already contain the key (x[-integer(0)] selects nothing, not everything).
  • get_statcan_wds_data() cached HTTP error bodies as data: write_disk() persists the body on non-200 and the file was never removed, so every later call — including refresh=TRUE, which was ignored — parsed the error payload as census CSV.
  • remove_recalled_cached_data() matched recalled vector IDs as unanchored regex substrings, so a recall of v_CA21_1 also flagged (and could delete) cached data for v_CA21_10, v_CA21_123, …

Silently wrong results:

  • parent_census_vectors() / child_census_vectors() with mixed-dataset input silently dropped vectors instead of erroring (scalar ifelse() in dataset_from_vector_list()).
  • get_census() never warned about recalled data when reading tabular data from cache (the check sat in an unreachable branch); the geometry path warned correctly.
  • Semantic search only ever consulted the full-phrase distance column (seq_along(ncol(df)) is always 1) and 0.6.0 dropped suffix n-grams for short details; keyword search matched every vector for queries starting with a digit; exact search treated queries as unescaped regex ("income (" errored, "income ($)" never matched).
  • add_unique_names_to_region_list() dropped single-variable grouping and errored on multi-variable grouping.
  • get_statcan_geographies() ignored its documented cache_path argument.

Other: staleness warnings now respect quiet=TRUE (and no longer crash on a missing last_updated attribute); visualize_vector_hierarchy() handles multi-vector character input and no longer labels max_depth-truncated nodes as (leaf); retry logic now covers HTTP 429/408 and honors Retry-After; cache directories are created recursively with a clear error on failure.

Performance improvements

benchmark v0.6.0 v0.6.1 speedup
child_census_vectors() (6-level tree, ~5.5k nodes) 4.59 ms 0.78 ms ~5.9x
child_census_vectors(leaves_only=TRUE) 4.86 ms 1.20 ms ~4.1x
parent_census_vectors() (deep leaf) 3.13 ms 0.40 ms ~7.8x
semantic search (7.7k vectors) 291 ms 245 ms ~1.2x
list_census_vectors() cache hit 4.77 ms 0.003 ms ~1600x

Medians via microbenchmark, mock data, R 4.5.0/macOS; script and methodology in benchmarks/benchmark_v0.6.1.R. Key changes: an in-memory session cache in front of the tempdir .rda metadata cache (hierarchy/search/viz helpers were re-deserializing it 2-3x per call), a hash-based BFS rewrite of the hierarchy traversal, length-bound pruning before adist() in semantic search, removal of an installed.packages() scan from every spatial get_census() call, and vectorized xml_find_first() parent-ID extraction in WDS metadata parsing (~57k codes at DA level).

Verification

  • devtools::test(): 43 pass, 0 fail, 0 warnings
  • R CMD check: 0 errors, 0 warnings, 1 NOTE (environment clock artifact: "unable to verify current time")
  • BFS hierarchy rewrite verified to produce identical rows and ordering to the previous implementation across randomized 2,000-node hierarchies x max_level in {NA,1,2,3,10} x leaves_only x keep_parent

🤖 Generated with Claude Code

dshkol and others added 8 commits June 11, 2026 07:55
… recall over-matching

- set_cancensus_api_key()/set_cancensus_cache_path() with overwrite=TRUE,
  install=TRUE truncated the user's entire .Renviron when it did not yet
  contain the key: x[-integer(0)] selects nothing rather than everything.
- get_statcan_wds_data() cached error response bodies as data: write_disk()
  persists the body on non-200 responses and the file was never removed, so
  every subsequent call (including refresh=TRUE, which was ignored for the
  data cache) parsed the cached error payload as census CSV.
- list_recalled_cached_data() matched recalled vector IDs as unanchored
  regex substrings, so a recall of v_CA21_1 also flagged v_CA21_10 etc.,
  and remove_recalled_cached_data() could delete never-recalled data. IDs
  are now escaped and anchored on the JSON quotes of the vectors field.
- get_recalled_database() error handler assigned the problem flag with <-
  instead of <<-, so download errors fell through to read_csv on a
  nonexistent file.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ions

- dataset_from_vector_list() used a scalar ifelse(), which returns only the
  first element; the mixed-dataset guard could never fire and hierarchy
  helpers silently dropped vectors from other datasets.
- get_census() only checked for recalled data in an unreachable branch
  (readr is in Imports, so the non-readr fallback never runs) and on the
  fresh-download path where recall is impossible. The check now runs when
  reading tabular data from the local cache, mirroring the geo path.
- add_unique_names_to_region_list() restored groups with length(gs)>1
  (dropping single-variable grouping) and passed a list of symbols to
  all_of() (erroring on multi-variable grouping). Uses group_vars() now.
- get_statcan_geographies() ignored its documented cache_path argument;
  the local argument shadowed the cache_path() helper, so user-supplied
  paths silently fell through to the default cache.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…oken

- find_census_vectors(query_type='exact') passed the query to grep() as an
  unescaped regex: queries containing metacharacters like "income (" hard
  errored and "income ($)" silently never matched. Queries are now escaped
  and matched literally (new regex_escape() helper, also reused for the
  recalled-data pattern).
- semantic_search() used seq_along(ncol(df)), which is always 1, so only
  the full-phrase distance column was ever consulted while the match gate
  used all columns; per-word matches passed the gate but grepped the wrong
  n-gram. Now seq_len(ncol(df)).
- semantic_search() n-gram generation (PR #216) dropped suffix n-grams for
  details shorter than the query, flipping former matches into "No close
  matches found". The general loop already handles that case; the early
  return is removed, restoring pre-0.6.0 n-grams.
- keyword_search() built its alternation regex from strsplit tokens without
  dropping empty strings, so queries starting with a digit or punctuation
  (e.g. "2015 income") produced an empty alternative that matched every
  vector in the dataset.
- search_census_vectors()/search_census_regions() tested length() of a
  one-column tibble (always 1), making the "No results found" stop()
  unreachable and printing empty hint lists. Now nrow().

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ation

- list_census_datasets()/list_census_regions(): operator precedence in
  `!quiet && is.null(x) || stale` left the staleness warning outside the
  quiet guard (and crashed on a NULL last_updated attribute with
  quiet=TRUE). Parenthesized so quiet=TRUE suppresses the warning.
- visualize_vector_hierarchy(): character input of length > 1 was compared
  with == (element-wise recycling against the full vector column), erroring
  instead of taking the documented use-first-vector path. Now %in%.
- visualize_vector_hierarchy(): nodes truncated by max_depth were labeled
  " (leaf)" even when they have children in the dataset. Leaf status is now
  checked against the full vector list; truncated nodes print " ..." instead.
- retry_api_call(): HTTP 429 (rate limit) and 408 (request timeout) are now
  retried like 5xx, honoring a Retry-After header when present (capped at
  60s). Previously the most common transient failures aborted immediately.
- cache_path(): the cache directory is now created recursively and a clear
  error is raised when creation fails, instead of silently continuing and
  failing later with a cryptic write error.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Follow-up to the seq_len(ncol()) fix: which.min per column yields one best
n-gram per query component, but grep() only uses the first pattern element.
Combine the unique best n-grams into a single escaped alternation so vectors
matching any query component's closest n-gram are returned.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
list_census_vectors()/list_census_regions() cached results as compressed
.rda files in tempdir(), paying a disk read + decompress + unserialize of a
multi-thousand-row tibble on every call - and hierarchy/search/viz helpers
call them two to three times per invocation. A package-level environment now
fronts the file cache: hits return in microseconds instead of tens of
milliseconds. The tempdir() file cache is retained unchanged as the fallback,
and use_cache=FALSE refreshes both layers.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- parent_census_vectors()/child_census_vectors(): replaced the per-level
  bind_rows() + distinct() + full-set re-filter loop with a BFS over plain
  character vectors (hash-based %in%/match), subsetting the full vector
  list once at the end. Output rows and ordering are identical (verified
  against the previous implementation across roots, max_level, leaves_only
  and keep_parent combinations on randomized hierarchies).
- semantic_search(): Levenshtein distance is bounded below by string length
  difference, so n-grams more than 2 characters longer/shorter than a query
  term can never pass the distance-2 threshold; they are pruned before
  adist(), which dominates the function's runtime, and treated as
  infinitely distant. Best n-grams are only collected from query components
  that actually have a within-threshold match. Empty vector lists now fail
  with a clear error.
- keyword_search(): the cleaned details corpus is built as a character
  vector via vapply() instead of a list, avoiding coercion in grep() and
  regmatches() on every search.
- get_census(): removed the unreachable installed.packages()-based install
  prompts (sf presence is already enforced earlier; installed.packages()
  scans the whole library) and the dead agr column computation that scanned
  and copied the assembled result frame before being discarded.
- get_statcan_wds_metadata(): Parent ID extraction uses vectorized
  xml_find_first() instead of an R-level lapply over one nodeset per code
  (~57k codes at DA level); removed duplicate unused description queries.
- list_cancensus_cache(): file sizes are stat'ed in one vectorized
  file.info() call instead of per-file tibbles.
- dataset_from_vector_list(): single vectorized sub() instead of per-element
  strsplit.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@dshkol

dshkol commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator Author

Bug fix details (by commit)

7fcc886 — Critical data-loss bugs

# Location Bug Trigger
1 R/user_settings.R (both setters) oldenv[-grep(key, oldenv)] evaluates to oldenv[integer(0)] (empty) when no line matches, truncating .Renviron to just the new key set_cancensus_api_key("...", overwrite=TRUE, install=TRUE) on an .Renviron with other env vars but no CM_API_KEY — everything else (GitHub PATs, other keys) silently destroyed
2 R/wds.R get_statcan_wds_data() httr::write_disk() writes the body even on non-200; the file wasn't deleted before stop(), and the cache-existence check ignored refresh Any transient StatCan 5xx: every subsequent call in the session, including refresh=TRUE, fed the cached XML/HTML error payload into read_csv() as census data
3 R/recalled_data.R list_recalled_cached_data() Recalled vector IDs pasted into a regex unescaped and unanchored A recall of v_CA21_1 also flagged cached extracts for v_CA21_10, v_CA21_123, … and remove_recalled_cached_data() would delete them. Now escaped and anchored on the JSON quotes of the vectors metadata field
4 R/recalled_data.R get_recalled_database() error handler set the problem flag with <- (local) instead of <<- A download error (vs. warning) fell through to read_csv() on a nonexistent file

27c7bcf — Silently wrong results

# Location Bug Trigger
5 R/helpers.R dataset_from_vector_list() scalar ifelse() returns only the first element, so the mixed-dataset guard could never fire parent_census_vectors(c("v_CA16_2519", "v_CA21_906")) silently inferred CA16 and dropped the CA21 vector instead of stopping with "Unable to determine dataset"
6 R/cancensus.R get_census() the tabular recalled-data check sat in the fresh-download path inside a !requireNamespace("readr") branch — unreachable since readr moved to Imports Cached tabular data that was later recalled produced no warning; the geometry path warned correctly. Check moved to the cache-read branch, mirroring geo
7 R/census_regions.R add_unique_names_to_region_list() length(gs)>1 off-by-one + all_of() on a list of symbols One grouping var → silently ungrouped output; two+ → error. Now group_vars() + length>0
8 R/geographies.R get_statcan_geographies() documented cache_path argument never used (local arg shadowed by the cache_path() helper at call resolution) User-supplied cache paths silently ignored; data always went to the default cache

96802e9 + eed5038 — Search correctness

# Location Bug Trigger
9 R/vector_discovery.R exact search query used as unescaped regex find_census_vectors("income (", ...) → hard error; "income ($)" → silent "No exact matches found" despite literal matches existing
10 R/vector_discovery.R semantic search seq_along(ncol(df))1: only the full-phrase column was consulted while the match gate used all columns Queries whose only close match was via an individual word grepped the wrong n-gram. Follow-up (eed5038): the per-column best n-grams are combined into one escaped alternation
11 R/vector_discovery.R n-gram generation 0.6.0 perf pass dropped suffix n-grams for details shorter than the query 4-word query vs. a 2-word detail: the suffix n-gram that previously matched at distance 0 no longer existed → false "No close matches found"
12 R/vector_discovery.R keyword search empty strsplit tokens produced an empty regex alternative find_census_vectors("2015 income", ..., query_type="keyword") matched every vector in the dataset
13 R/vector_discovery.R / R/census_regions.R length() of a 1-column tibble is always 1 "No results found" stop() unreachable; empty hint tibbles printed with a misleading warning. Now nrow()

303028b — Quiet handling, viz, retry, cache dir

# Location Bug Trigger
14 R/cancensus.R / R/census_regions.R !quiet && is.null(x) || stale precedence: staleness check escaped the quiet guard list_census_regions("CA16", quiet=TRUE) with a >1-day-old session cache still warned; last_updated=NULL + quiet=TRUE crashed with "argument is of length zero"
15 R/vector_viz.R == instead of %in% for character input of length >1 (element-wise recycling against the full vector column) visualize_vector_hierarchy(c("v1","v2"), ...) errored instead of taking the documented use-first-vector path
16 R/vector_viz.R leaf check consulted only the max_depth-truncated tree Nodes with real children printed (leaf) — exactly the label users rely on for additive aggregation. Truncated nodes now print ...
17 R/helpers.R retry_api_call() only 5xx retried HTTP 429 (rate limit) and 408 aborted immediately even with retry=3. Both now retried with backoff, honoring Retry-After (capped 60s)
18 R/helpers.R cache_path() dir.create non-recursive with warnings suppressed set_cancensus_cache_path("~/a/b/c") where ~/a/b doesn't exist failed silently, then get_census() died later with a cryptic write error. Now recursive + clear error

🤖 Generated with Claude Code

@dshkol

dshkol commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator Author

Performance details (86ae5b9, 1309a19)

Methodology

Benchmarked with microbenchmark on mock data (no API calls), R 4.5.0 / macOS, comparing a worktree at the v0.6.0 tag against this branch with the identical script — now committed as benchmarks/benchmark_v0.6.1.R. Hierarchy benchmarks use a 6-level, branching-factor-4 tree (5,461 nodes); search and cache benchmarks use a 7,700-row vector list sized like CA16.

Results (medians)

                        v0.6.0      v0.6.1     speedup
 child_full             4.59 ms     0.78 ms     ~5.9x
 child_leaves           4.86 ms     1.20 ms     ~4.1x
 parent_deep            3.13 ms     0.40 ms     ~7.8x
 semantic search         291 ms      245 ms     ~1.2x
 keyword search          131 ms      129 ms       ~1x
 list_vectors (cached)  4.77 ms    0.003 ms   ~1600x

What changed

In-memory session cache (86ae5b9). list_census_vectors()/list_census_regions() cached results as compressed .rda files in tempdir() and paid a disk read + decompress + unserialize on every call — and the hierarchy/search/viz helpers call them 2-3x per invocation (visualize_vector_hierarchy() alone loads the vector list three times). A package-level environment now fronts the file cache; the file layer is unchanged as the fallback and use_cache=FALSE refreshes both. This is the change that compounds: every search and hierarchy entry point starts with this call, so real-world find_census_vectors() / child_census_vectors() invocations gain this on top of their own speedups.

BFS hierarchy traversal (1309a19). The 0.6.0 loop still did bind_rows() + distinct() over the full accumulated set and re-filtered against all found vectors at every depth level — O(depth x total_found) with tibble copies per level. Now a BFS over plain character vectors (hash-based %in%/match) with a single tibble subset at the end. Equivalence verified: identical rows and ordering to the old implementation across randomized 2,000-node hierarchies x max_level in {NA, 1, 2, 3, 10} x leaves_only x keep_parent (164 combinations).

Semantic search adist pruning (1309a19). Levenshtein distance is bounded below by string-length difference, so n-grams more than 2 characters longer/shorter than a query term can never pass the distance-2 match threshold — they're pruned before adist() and treated as infinitely distant, which is exact with respect to both the match gate and the winner selection. Profiling after the change shows adist is no longer the bottleneck; remaining time is inherent n-gram generation (paste in the per-sentence loop), so the modest 1.2x here is the honest ceiling of this approach. End-to-end find_census_vectors() calls additionally gain the cache win above.

Misc (1309a19).

  • get_census(): removed an installed.packages() scan (R docs: "can be slow") from every spatial call — it guarded install-prompt branches that were unreachable because sf presence is already enforced earlier; also removed a dead agr computation that scanned and partially copied the full assembled result frame (50k+ rows at DA level) before being discarded.
  • get_statcan_wds_metadata(): parent-ID extraction switched from an R-level lapply over one nodeset per code (~57k codes for CL_GEO_DA) to vectorized xml_find_first(), which returns NA for missing nodes — identical semantics, C-speed. Two duplicate unused XPath queries deleted.
  • keyword_search(): cleaned corpus built as a character vector via vapply() instead of a list, removing as.character() coercion inside grep()/regmatches() on every search.
  • list_cancensus_cache(): one vectorized file.info() batch instead of per-file calls wrapped in one-row tibbles.

Verification on the final tree

  • devtools::test(): 43 pass / 0 fail / 0 warnings
  • R CMD check (0.6.1): 0 errors / 0 warnings / 1 NOTE ("unable to verify current time" — sandbox clock artifact, not package-related)

🤖 Generated with Claude Code

@dshkol

dshkol commented Jun 11, 2026

Copy link
Copy Markdown
Collaborator Author

@mountainMath in case you wanted to see what the latest gen model does :)

@mountainMath

Copy link
Copy Markdown
Owner

Nice, will take a look later today and try to get this off to CRAN asap.

mountainMath and others added 11 commits June 11, 2026 15:14
Full 5-platform matrix on pull_request; single ubuntu-latest/release for
push, schedule, and manual triggers. Adds concurrency group to cancel
stale PR runs without stomping main-branch checks.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants