bench: gate CodSpeed-unstable canonicalization microbenchmarks from simulation by connortsui20 · Pull Request #8519 · vortex-data/vortex

connortsui20 · 2026-06-20T07:07:35Z

Summary

A small, recurring set of CodSpeed Simulation microbenchmarks report false-positive regressions in nearly every PR, regardless of what the PR changes. This PR gates the provably-unfixable offenders out of CodSpeed simulation (keeping them runnable via local cargo bench), and documents the rest with a keep/remove decision for each.

Note

Draft for discussion. The analysis below justifies every benchmark I touched and explains the ones I deliberately left alone.

How I found them

I read CodSpeed's bot comment on 9 recent PRs spanning unrelated areas (scalar ops, SIMD take, GPU/CUDA, session refactors, stats rules, random-access splitting, etc.): #8345, #8454, #8470, #8478, #8483, #8489, #8496, #8500, #8505, #8511.

A benchmark that shows a large change in a PR that doesn't touch its code is, by definition, noise. The tell is when the same unchanged code reports very different numbers depending only on which commit it was built against. Every one of these comparisons also carried CodSpeed's own ⚠️ "Different runtime environments detected" warning — and this PR itself reproduced it (see Self-validation below).

The flaky set (each row = identical code, different measured value)

Benchmark	File	Observed range (unchanged code)	PRs	Verdict
`decompress_rd[*]`	`encodings/alp/benches/alp_compress.rs`	`f64,(100k)` 842 ↔ 1025 µs; `(10k)` 108 ↔ 139 µs	7/9	Gate
`chunked_varbinview_*` (×4)	`vortex-array/benches/chunk_array_builder.rs`	`(1000,10)` 161 ↔ 198; `(100,100)` 223 ↔ 308 µs	6/9	Gate
`chunked_bool_canonical_into`	`vortex-array/benches/chunk_array_builder.rs`	16 ↔ 35 µs (~2×)	4/9	Gate
`take_10k_`, `patched_take_10k_` (the 4 flaky ones)	`encodings/fastlanes/benches/bitpacking_take.rs`	bimodal 195 ↔ 255 µs	4/9 + this PR	Gate
`baseline_eq/lt[*]`	`encodings/fastlanes/benches/bitpack_compare.rs`	243 ↔ 185, 303 ↔ 245 µs	this PR	Keep (this PR)
`varbinview_large`	`vortex-array/benches/listview_rebuild.rs`	112 ↔ 131 µs	3/9	Keep (this PR)
`bitwise_not_vortex_buffer_mut[128]`	`vortex-buffer/benches/vortex_bitbuffer.rs`	186 ↔ 244 ns	2/9	Keep (this PR)
`sum_i32_nullable_all_valid`, `chunked_dict_`, `encode_varbin`, `null_count_run_end`	various	±34–95%	2/9, only vs base `679e2c5`	Keep (lower confidence)

Root cause

CodSpeed Simulation estimates CPU cycles from an instruction trace (Cachegrind-style). That trace is deterministic for a fixed binary in a fixed environment, but it includes instructions executed inside glibc — and memcpy/memmove/memset are resolved at runtime via ifunc to a SIMD variant chosen by the runner's CPU/glibc. When the develop baseline and the PR run on different runner images (which CodSpeed explicitly flags here as "different runtime environments"), any benchmark whose hot path is heap allocation + byte copying (or a tight SIMD loop sensitive to code layout) changes instruction count even though the Vortex code is byte-identical. See CodSpeed's own write-up, "Why glibc is faster on some GitHub Actions Runners".

This predicts exactly which benchmarks flake, and the data confirms it:

decompress_rd flakes; compress_rd never does. Decode materializes a fresh canonical output buffer (alloc + tight SIMD/memcpy); encode is compute-bound (sampling, dictionary build). Same file, same data — only the copy-bound one is noisy.
chunked_varbinview_* flakes — canonicalizing chunked strings is concatenating variable-length bytes into one buffer (memcpy-dominated).
chunked_bool_canonical_into flakes hardest in relative terms — it's also just too small (~16–35 µs).
take_10k_* is bimodal — 10k gather + canonicalize: output alloc/memcpy plus the SIMD bit-unpack's code-layout sensitivity.

The data movement is the thing being measured, so these cannot be made stable under Simulation by tweaking inputs. docs/developer-guide/benchmarking.md already prescribes the remedy: "Use #[cfg(not(codspeed))] for benchmarks that are incompatible with CodSpeed."

Changes (every gated benchmark justified)

All gated benches stay fully available via local cargo bench — only CodSpeed CI stops tracking them.

encodings/alp/benches/alp_compress.rs

decompress_rd → gated. The Fix build after move #1 offender: moved in 7/9 sampled PRs, bidirectionally, 842↔1025 µs for identical code. Decodes to a canonical array → output-buffer alloc + memcpy-dominated. compress_rd (encode, compute-bound, never flagged) is kept.

vortex-array/benches/chunk_array_builder.rs

chunked_varbinview_{canonical_into,into_canonical,opt_canonical_into,opt_into_canonical} → gated. All four flake across 6/9 PRs; all are memcpy-bound string canonicalization.
chunked_bool_canonical_into → gated. Worst relative swings (~2×, 16↔35 µs) and below the Simulation noise floor.
Kept: chunked_opt_bool_*, chunked_constant_* — compute-bound, never flagged.

encodings/fastlanes/benches/bitpacking_take.rs

take_10k_random, take_10k_contiguous, patched_take_10k_random, patched_take_10k_contiguous_patches → gated. Bimodal ~195↔255 µs (±23%); fired in 4/9 sampled PRs and again on this PR. The other take variants in the file have not shown this instability and are kept (they remain useful for the sparse/bitpacked-take threshold tuning the in-file comment describes).

Self-validation (this PR's own CodSpeed run)

The first push proved the thesis on itself: CodSpeed reported "will not alter performance", correctly marked the gated benches as skipped, and the only flagged moves were pre-existing flaky benches in files this PR never touched — take_10k_* (±20–23%), baseline_eq/lt (±23–31%), varbinview_large (±17%). The take_10k gating in this PR was added in response to that run; the others are documented below.

Deliberately kept (flaky ≠ delete)

bitpack_compare.rs baseline_eq/lt[*] and listview_rebuild.rs varbinview_large — same cross-environment root cause, surfaced on this PR. Left for a focused follow-up to keep this PR's scope contained; varbinview_large was only seen against a couple of bases.
bitwise_not_vortex_buffer_mut[128] — sub-µs noise, but its INPUT_SIZE is shared by ~25 valid vortex-vs-arrow comparison benches; not worth restructuring for a ±30 ns wobble.
CUDA walltime benches (e.g. cuda/bitpacked_u8/unpack/3bw[100M]) — these use the WallTime instrument on non-macro GPU runners (you can't simulate a GPU); a separate, intentional trade-off and out of scope.

Verification

Run locally with the pinned nightly + cargo-codspeed, valgrind present, for all three suites:

✅ cargo codspeed build (Simulation) succeeds for vortex-alp, vortex-array (--features _test-harness), and vortex-fastlanes — no warnings, no orphaned imports/dead code.
✅ cargo codspeed run (Measurement mode: Simulation) executes the suites to completion — gated benches absent at runtime; the kept benches (compress_rd, chunked_opt_bool_*, chunked_constant_*, the other take variants) still measure.
✅ Binary symbol check confirms the gated benches are compiled out under --cfg codspeed and present without it.
✅ Local cargo bench build path, cargo +nightly fmt --check, and cargo clippy --no-deps on all changed bench targets are clean. (Full-workspace clippy hits an unrelated pre-existing lint in vortex-buffer under this environment's toolchain; not touched here.)

Follow-ups for maintainers

Archive the gated benches on CodSpeed so they drop off the dashboard instead of showing as "skipped".
The broader class (canonicalization/builder/decode-to-canonical/copy-bound) shares this root cause. The durable fix is infrastructure-level: ensure the develop baseline and PR runs use the same pinned runner image (so "different runtime environments" stops firing), or move these to CodSpeed walltime/macro runners. If that lands, several "kept" benches above could return to Simulation.

A small set of microbenchmarks report false-positive regressions in nearly every PR. Their CodSpeed CPU-simulation instruction count is dominated by output-buffer allocation and glibc `memcpy`/`memmove` (whose `ifunc`-selected implementation varies across runner images) rather than by Vortex compute, so they move bidirectionally by 10-90% for unchanged code and CodSpeed flags "different runtime environments" on the comparisons. They cannot be stabilized under simulation, so per `docs/developer-guide/benchmarking.md` they are gated with `#[cfg(not(codspeed))]` and remain available via local `cargo bench`. Gated from CodSpeed (kept for local runs): - alp_compress.rs: `decompress_rd` (decode-to-canonical; moved in 7/9 sampled PRs, 842-1025 us for identical code). `compress_rd` (encode, compute-bound, never flaky) is kept. - chunk_array_builder.rs: `chunked_varbinview_*` (string canonicalization, memcpy-bound; flaky in 6/9 PRs) and `chunked_bool_canonical_into` (also below the ~16-35 us noise floor, ~2x swings). The compute-bound `chunked_opt_bool_*` and `chunked_constant_*` benches are kept. Verified: both suites build and run under `cargo codspeed` (Simulation mode), the gated benches are excluded while the kept benches still execute, and the local `cargo bench` path, `cargo fmt`, and `cargo clippy` are clean. Signed-off-by: Claude <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01GXdjWYp7AbSKwn2bw6GYsf

`take_10k_random`, `take_10k_contiguous`, `patched_take_10k_random`, and `patched_take_10k_contiguous_patches` gather 10k elements and canonicalize the result, so their CodSpeed-simulation instruction count is bimodal (~195 us vs ~255 us, +-23% for unchanged code) from output-buffer allocation + `memcpy` and the SIMD bit-unpack's code-layout sensitivity across runner images. They fired in 4+ recent unrelated PRs and on this PR itself, so they are gated with `#[cfg(not(codspeed))]` and remain available via local `cargo bench`. The other take variants in this file have not exhibited the instability and are kept. Verified: `cargo codspeed build` is clean, the four benches are excluded from the built suite while the other take variants remain, and the local `cargo bench` build, `cargo fmt`, and `cargo clippy` pass. Signed-off-by: Claude <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01GXdjWYp7AbSKwn2bw6GYsf

Make the `#[cfg(not(codspeed))]` exclusions self-documenting and durable: - Every gate now carries a comment stating exactly why the benchmark (or its helper/import) is excluded, focused on the root cause (output-buffer allocation + glibc `memcpy`/`memmove` ifunc variance and SIMD code-layout sensitivity across runner images) rather than per-PR flake counts, which go stale as the codebase moves. Each links to the analysis in PR #8519. - Drop time-relative phrasing (e.g. "last N PRs", "fired in M PRs") so the comments stay accurate over time. - Move `Canonical`/`IntoArray` in alp_compress.rs into `decompress_rd`, the only gated consumer, removing two per-import cfg attributes. In chunk_array_builder.rs the gated imports serve several gated benches, so they stay at module scope with a clear note; the two gated helper fns now explain their gating too. Verified: `cargo build`, `RUSTFLAGS="--cfg codspeed" cargo build`, and `cargo +nightly fmt --check` are clean for vortex-alp, vortex-array, and vortex-fastlanes bench targets; `cargo clippy -p vortex-alp --benches` is clean. Signed-off-by: Claude <connor@spiraldb.com> Claude-Session: https://claude.ai/code/session_01HGegegFqsRfGJuA8X9aHT7

codspeed-hq · 2026-06-20T17:53:56Z

Merging this PR will improve performance by 13.57%

⚠️

Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚠️

Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

⚡ 3 improved benchmarks
✅ 1547 untouched benchmarks
⏩ 31 skipped benchmarks¹

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
⚡	Simulation	`varbinview_large`	130.8 µs	112.1 µs	+16.65%
⚡	Simulation	`bitwise_not_vortex_buffer_mut[128]`	244.4 ns	215.3 ns	+13.55%
⚡	Simulation	`bitwise_not_vortex_buffer_mut[1024]`	304.7 ns	275.6 ns	+10.58%

Tip

Curious why this is faster? Comment @codspeedbot explain why this is faster on this PR, or directly use the CodSpeed MCP with your agent.

_{Comparing claude/sharp-planck-i4ifv8 (17ae8e0) with develop (5df0f94)}

31 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

connortsui20 added the changelog/ci label Jun 20, 2026 — with Claude

claude and others added 2 commits June 20, 2026 14:32

vortex-data deleted a comment from codspeed-hq Bot Jun 20, 2026

connortsui20 marked this pull request as ready for review June 20, 2026 17:52

connortsui20 requested a review from a team June 20, 2026 17:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bench: gate CodSpeed-unstable canonicalization microbenchmarks from simulation#8519

bench: gate CodSpeed-unstable canonicalization microbenchmarks from simulation#8519
connortsui20 wants to merge 3 commits into
developfrom
claude/sharp-planck-i4ifv8

connortsui20 commented Jun 20, 2026 •

edited

Loading

Uh oh!

codspeed-hq Bot commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

connortsui20 commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

How I found them

The flaky set (each row = identical code, different measured value)

Root cause

Changes (every gated benchmark justified)

Self-validation (this PR's own CodSpeed run)

Deliberately kept (flaky ≠ delete)

Verification

Follow-ups for maintainers

Uh oh!

codspeed-hq Bot commented Jun 20, 2026

Merging this PR will improve performance by 13.57%

Performance Changes

Footnotes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

connortsui20 commented Jun 20, 2026 •

edited

Loading