Generalize SIMD take operations to Copy#8496
Conversation
An AVX2 gather just moves bytes, so dispatch the take kernel on size_of::<V>() rather than NativePType: 4-byte values gather through a u32 lane, 8-byte through u64, others fall back to scalar. This halves the gather impls (i32/u32/f32 share one, i64/u64/f64 the other) and lets any POD Copy type use the path. exec_take casts only raw pointers, keeping align-1 types like [u8; 4] sound; take_primitive_scalar is relaxed to T: Copy. Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
robert3005
left a comment
There was a problem hiding this comment.
You need to audit the callers to make sure that T passed into this function is NativeValue(T) and not T
Polar Signals Profiling ResultsLatest Run
Powered by Polar Signals Cloud |
Benchmarks: PolarSignals ProfilingVortex (geomean): 0.993x ➖ How to read Verdict and Engines
datafusion / vortex-file-compressed (0.993x ➖, 0↑ 1↓)
No file size changes detected. |
Benchmarks: TPC-H SF=1 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.998x ➖, 1↑ 0↓)
datafusion / vortex-compact (1.040x ➖, 0↑ 1↓)
datafusion / parquet (0.995x ➖, 1↑ 2↓)
datafusion / arrow (0.848x ✅, 15↑ 0↓)
duckdb / vortex-file-compressed (0.969x ➖, 2↑ 0↓)
duckdb / vortex-compact (0.947x ➖, 4↑ 0↓)
duckdb / parquet (0.958x ➖, 2↑ 1↓)
duckdb / duckdb (0.984x ➖, 1↑ 0↓)
File Size Changes (9 files changed, +0.2% overall, 6↑ 3↓)
Totals:
|
Benchmarks: FineWeb NVMeVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.913x ➖, 4↑ 0↓)
datafusion / vortex-compact (0.892x ✅, 7↑ 0↓)
datafusion / parquet (0.885x ✅, 5↑ 0↓)
duckdb / vortex-file-compressed (0.883x ✅, 5↑ 0↓)
duckdb / vortex-compact (0.908x ➖, 6↑ 0↓)
duckdb / parquet (0.873x ✅, 9↑ 0↓)
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Totals:
|
Merging this PR will improve performance by 10.62%
|
| Mode | Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|---|
| ❌ | Simulation | chunked_varbinview_canonical_into[(100, 100)] |
223.8 µs | 259.3 µs | -13.66% |
| ❌ | Simulation | chunked_varbinview_opt_canonical_into[(100, 100)] |
304.7 µs | 341 µs | -10.63% |
| ⚡ | Simulation | decompress_rd[f64, (100000, 0.01)] |
1,020.9 µs | 842.5 µs | +21.18% |
| ⚡ | Simulation | decompress_rd[f64, (100000, 0.1)] |
1,020.9 µs | 842.5 µs | +21.18% |
| ⚡ | Simulation | decompress_rd[f32, (100000, 0.1)] |
583 µs | 495.1 µs | +17.74% |
| ⚡ | Simulation | decompress_rd[f32, (100000, 0.01)] |
583 µs | 495.1 µs | +17.74% |
| ⚡ | Simulation | take_filter_primitive_large_random_mask_random_indices[(2500, 25000)] |
331.5 µs | 283.1 µs | +17.12% |
| ⚡ | Simulation | bitwise_not_vortex_buffer_mut[128] |
215.3 ns | 186.1 ns | +15.67% |
| ⚡ | Simulation | take_filter_primitive_large_random_mask_random_indices[(12500, 25000)] |
400.7 µs | 352.2 µs | +13.78% |
| ⚡ | Simulation | extend_from_array_zctl[(1000, 64)] |
1.2 ms | 1 ms | +12.03% |
| ⚡ | Simulation | bitwise_not_vortex_buffer_mut[1024] |
275.6 ns | 246.4 ns | +11.84% |
Tip
Investigate this regression by commenting @codspeedbot fix this regression on this PR, or directly use the CodSpeed MCP with your agent.
Comparing ct/simd-copy (d5f63d5) with develop (ed69077)
Benchmarks: TPC-DS SF=1 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.972x ➖, 3↑ 0↓)
datafusion / vortex-compact (0.977x ➖, 2↑ 2↓)
datafusion / parquet (0.972x ➖, 5↑ 0↓)
duckdb / vortex-file-compressed (0.970x ➖, 2↑ 0↓)
duckdb / vortex-compact (0.975x ➖, 4↑ 1↓)
duckdb / parquet (0.990x ➖, 0↑ 0↓)
duckdb / duckdb (0.991x ➖, 1↑ 2↓)
File Size Changes (7 files changed, +0.0% overall, 4↑ 3↓)
Totals:
|
Benchmarks: Statistical and Population GeneticsVerdict: No clear signal (low confidence) How to read Verdict and Engines
duckdb / vortex-file-compressed (1.011x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.007x ➖, 0↑ 0↓)
duckdb / parquet (0.989x ➖, 0↑ 0↓)
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Totals:
|
Benchmarks: FineWeb S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.014x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.129x ➖, 0↑ 2↓)
datafusion / parquet (0.906x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.918x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.919x ➖, 1↑ 0↓)
duckdb / parquet (0.940x ➖, 0↑ 0↓)
|
|
The AVX2 changes seem fine to me |
Benchmarks: Random AccessVortex (geomean): 0.838x ✅ How to read Verdict and Engines
unknown / unknown (0.880x ✅, 25↑ 0↓)
|
Benchmarks: TPC-H SF=10 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.979x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.937x ➖, 4↑ 0↓)
datafusion / parquet (0.965x ➖, 2↑ 0↓)
datafusion / arrow (0.948x ➖, 3↑ 1↓)
duckdb / vortex-file-compressed (1.004x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.004x ➖, 0↑ 0↓)
duckdb / parquet (0.995x ➖, 1↑ 0↓)
duckdb / duckdb (1.003x ➖, 0↑ 0↓)
File Size Changes (27 files changed, +0.1% overall, 16↑ 11↓)
Totals:
|
Benchmarks: Clickbench on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.988x ➖, 1↑ 0↓)
datafusion / parquet (0.997x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.004x ➖, 4↑ 7↓)
duckdb / parquet (1.003x ➖, 0↑ 0↓)
duckdb / duckdb (0.996x ➖, 0↑ 0↓)
File Size Changes (107 files changed, -0.0% overall, 49↑ 58↓)
Totals:
|
Benchmarks: TPC-H SF=1 on S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.950x ➖, 2↑ 2↓)
datafusion / vortex-compact (0.973x ➖, 2↑ 3↓)
datafusion / parquet (0.918x ➖, 5↑ 4↓)
duckdb / vortex-file-compressed (0.899x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.952x ➖, 0↑ 1↓)
duckdb / parquet (0.968x ➖, 0↑ 0↓)
|
Benchmarks: CompressionVortex (geomean): 0.995x ➖ How to read Verdict and Engines
unknown / unknown (0.996x ➖, 0↑ 0↓)
|
Benchmarks: Appian on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.916x ➖, 3↑ 0↓)
datafusion / parquet (0.901x ➖, 4↑ 0↓)
duckdb / vortex-file-compressed (0.931x ➖, 0↑ 0↓)
duckdb / parquet (0.945x ➖, 0↑ 0↓)
duckdb / duckdb (0.945x ➖, 0↑ 0↓)
File Size Changes (4 files changed, -0.0% overall, 2↑ 2↓)
Totals:
|
Benchmarks: TPC-H SF=10 on S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.889x ➖, 1↑ 2↓)
datafusion / vortex-compact (0.949x ➖, 0↑ 0↓)
datafusion / parquet (0.866x ➖, 2↑ 0↓)
duckdb / vortex-file-compressed (1.042x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.097x ➖, 0↑ 0↓)
duckdb / parquet (0.874x ➖, 0↑ 0↓)
|
2955142 to
c3087ba
Compare
|
not sure which regressions/speedups are real, there are some take speedups maybe because of less hit to the instruction cache? |
c3087ba to
d5f63d5
Compare
Summary
Closes: #5677
Changes the
NativePTypebound toCopy, which allows us to deduplicate a bunch of code. Imo this is also easier to read.Also a small
is_constantchange.Testing
Existing tests should suffice