Document why the Arrow exporter keeps Decimal128 as the default decimal width#8197
Document why the Arrow exporter keeps Decimal128 as the default decimal width#8197joseph-isaacs wants to merge 6 commits into
Conversation
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
|
this used to break some benchmarks in some annoying ways |
Polar Signals Profiling ResultsLatest Run
Previous Runs (2)
Powered by Polar Signals Cloud |
Cherry-pick of #8197 (Joe Isaacs). Enables narrow Arrow decimal types in the shared Vortex-to-Arrow mapping now that DataFusion 51+ and arrow-rs 56+ support them. Signed-off-by: Alexander Droste <alex@spiraldb.com> Co-Authored-By: Joe Isaacs <joe.isaacs@live.co.uk> Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨Benchmark |
Benchmarks: PolarSignals ProfilingVortex (geomean): 1.017x ➖ How to read Verdict and Engines
datafusion / vortex-file-compressed (1.017x ➖, 0↑ 2↓)
File Size Changes (1 files changed, -0.0% overall, 0↑ 1↓)
Totals:
|
Benchmarks: TPC-H SF=1 on NVMEVerdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.006x ➖, 1↑ 3↓)
datafusion / vortex-compact (1.023x ➖, 1↑ 2↓)
datafusion / parquet (1.030x ➖, 0↑ 4↓)
datafusion / arrow (0.991x ➖, 5↑ 3↓)
duckdb / vortex-file-compressed (1.013x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.019x ➖, 0↑ 0↓)
duckdb / parquet (1.008x ➖, 1↑ 0↓)
duckdb / duckdb (1.014x ➖, 0↑ 0↓)
File Size Changes (18 files changed, +5.5% overall, 9↑ 9↓)
Totals:
Full attributed analysis
|
Benchmarks: FineWeb NVMeVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.908x ➖, 3↑ 1↓)
datafusion / vortex-compact (1.015x ➖, 0↑ 0↓)
datafusion / parquet (1.023x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.896x ✅, 3↑ 0↓)
duckdb / vortex-compact (1.016x ➖, 0↑ 1↓)
duckdb / parquet (1.005x ➖, 0↑ 0↓)
File Size Changes (2 files changed, +13.7% overall, 1↑ 1↓)
Totals:
Full attributed analysis
|
Benchmarks: FineWeb S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.080x ➖, 1↑ 1↓)
datafusion / vortex-compact (1.042x ➖, 0↑ 0↓)
datafusion / parquet (1.077x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (1.157x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.017x ➖, 0↑ 0↓)
duckdb / parquet (1.052x ➖, 0↑ 0↓)
Full attributed analysis
|
Benchmarks: Statistical and Population GeneticsVerdict: No clear signal (low confidence) How to read Verdict and Engines
duckdb / vortex-file-compressed (1.025x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.063x ➖, 0↑ 0↓)
duckdb / parquet (1.009x ➖, 0↑ 0↓)
File Size Changes (2 files changed, +0.2% overall, 1↑ 1↓)
Totals:
Full attributed analysis
|
Benchmarks: TPC-H SF=10 on NVMEVerdict: No clear signal (low confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.876x ✅, 17↑ 1↓)
datafusion / vortex-compact (0.908x ➖, 13↑ 1↓)
datafusion / parquet (0.971x ➖, 6↑ 2↓)
datafusion / arrow (0.927x ➖, 8↑ 5↓)
duckdb / vortex-file-compressed (0.964x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.973x ➖, 0↑ 0↓)
duckdb / parquet (0.993x ➖, 0↑ 0↓)
duckdb / duckdb (1.001x ➖, 0↑ 0↓)
File Size Changes (48 files changed, +5.5% overall, 27↑ 21↓)
Totals:
Full attributed analysis
|
Benchmarks: Clickbench on NVMEVerdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.807x ✅, 26↑ 1↓)
datafusion / parquet (0.782x ✅, 27↑ 0↓)
duckdb / vortex-file-compressed (0.931x ➖, 9↑ 1↓)
duckdb / parquet (0.986x ➖, 0↑ 0↓)
duckdb / duckdb (0.970x ➖, 4↑ 2↓)
File Size Changes (201 files changed, +16.5% overall, 100↑ 101↓)
Totals:
Full attributed analysis
|
Benchmarks: TPC-H SF=1 on S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.987x ➖, 1↑ 1↓)
datafusion / vortex-compact (0.954x ➖, 1↑ 1↓)
datafusion / parquet (0.990x ➖, 3↑ 2↓)
duckdb / vortex-file-compressed (1.022x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.964x ➖, 0↑ 0↓)
duckdb / parquet (1.007x ➖, 0↑ 0↓)
Full attributed analysis
|
Benchmarks: Appian on NVMEVerdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (1.001x ➖, 2↑ 1↓)
datafusion / parquet (1.009x ➖, 2↑ 1↓)
duckdb / vortex-file-compressed (0.992x ➖, 0↑ 0↓)
duckdb / parquet (0.993x ➖, 0↑ 0↓)
duckdb / duckdb (0.989x ➖, 0↑ 0↓)
File Size Changes (19 files changed, +1.8% overall, 3↑ 16↓)
Totals:
Full attributed analysis
|
Cherry-pick of #8197 (Joe Isaacs). Enables narrow Arrow decimal types in the shared Vortex-to-Arrow mapping now that DataFusion 51+ and arrow-rs 56+ support them. Signed-off-by: Alexander Droste <alex@spiraldb.com> Co-Authored-By: Joe Isaacs <joe.isaacs@live.co.uk> Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Benchmarks: CompressionVortex (geomean): 0.994x ➖ How to read Verdict and Engines
unknown / unknown (0.984x ➖, 2↑ 0↓)
|
Benchmarks: Random AccessVortex (geomean): 1.009x ➖ How to read Verdict and Engines
unknown / unknown (1.010x ➖, 4↑ 6↓)
|
Benchmarks: TPC-H SF=10 on S3Verdict: No clear signal (environment too noisy confidence) How to read Verdict and Engines
datafusion / vortex-file-compressed (0.954x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.897x ➖, 1↑ 1↓)
datafusion / parquet (0.964x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.899x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.904x ➖, 0↑ 0↓)
duckdb / parquet (0.894x ➖, 0↑ 0↓)
Full attributed analysis
|
|
@AdamGS do you remember where this broke previously? |
|
TPC-DS |
|
I remembered Sum and and Avg, but I'll open a DF issue. |
|
another fix for now will be to have schemas for tpc-ds/tpc-h, if we declare them wide enough we'll avoid the overflow. |
Merging this PR will not alter performance
|
Rebased onto the DataFusion 54 upgrade branch (adamg/df-54) to evaluate narrow decimal export against DataFusion 54. Maps decimals to the smallest Arrow width that fits the precision: Decimal32 for <=9, Decimal64 for <=18, Decimal128 for <=38, else Decimal256. The Arrow executor already handles all four widths. Note: DataFusion 54's decimal SUM/AVG accumulators are unchanged from 53 (branch-54 keeps SUM(Decimal32) -> Decimal32(min(9, p+10), s) in i32), so aggregating inferred narrow decimals is still expected to overflow (e.g. TPC-DS Q1). This branch is for evaluating that behavior on 54. Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
6174c77 to
6603fcf
Compare
|
I remember now that arrow-java doesn't support them yet so for jni we would need a way to force only i128/i256 |
|
needs to gate these too if < DF 0.54.0 |
|
DF issue: apache/datafusion#22713 |
Cherry-pick of #8197 (Joe Isaacs). Enables narrow Arrow decimal types in the shared Vortex-to-Arrow mapping now that DataFusion 51+ and arrow-rs 56+ support them. Signed-off-by: Alexander Droste <alex@spiraldb.com> Co-Authored-By: Joe Isaacs <joe.isaacs@live.co.uk> Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
Cherry-pick of #8197 (Joe Isaacs). Enables narrow Arrow decimal types in the shared Vortex-to-Arrow mapping now that DataFusion 51+ and arrow-rs 56+ support them. Signed-off-by: Alexander Droste <alex@spiraldb.com> Co-Authored-By: Joe Isaacs <joe.isaacs@live.co.uk> Signed-off-by: Alexander Droste <alexander.droste@protonmail.com>
|
This PR has been marked as stale because it has been open for 14 days with no activity. Please comment or remove the stale label if you wish to keep it active, otherwise it will be closed in 7 days |
|
This PR was closed because it has been inactive for 7 days since being marked as stale. |
Summary
Originally this PR enabled emitting narrower Arrow decimals (
Decimal32/Decimal64) by default fromto_data_type_naive. E2E verification againstdatafusion-benchwith real generated data showed this is not safe on the versions Vortex ships, so the behavior change has been reverted; the net diff is now an explanatory comment documenting why the narrower widths stay gated.What the verification found
Decimal32/Decimal64types (added in arrow-rs ~v56 and DataFusion 51.0.0), and Vortex's Arrow executor already produces all four widths on request. So the types exist.SUM/AVGwithin the same physical width family —Decimal32 → i32,Decimal64 → i64— instead of widening the accumulator (datafusion-functions-aggregatesum.rs/average.rs). ADecimal32SUM/AVGis capped at precision 9 /i32, so it overflows once the running total exceeds ~1e9, regardless of row count.avg(sum(sr_return_amt)),sr_return_amt=decimal(7,2), inferred schema →Decimal32) fails withArithmetic Overflow in AvgAccumulator. Tracked upstream in apache/datafusion#17489.Decimal128(15,2)schemas, so its plan never sees the narrow type.SparkToArrowSchemahard-codesbitWidth = 128; the JNITestMinimalwritesDecimal128(9,2)and asserts an exact round-trip), so any narrowing breaks the round-trip — confirmed by theTestMinimal.testFullScanfailure.There is no narrowing tier that's safe across all current consumers:
Decimal32breaks DataFusion aggregation + JNI;Decimal64would break the Sparkdecimal(10,2)round-trip too. The original guard comment was correct.Change
Revert to the
Decimal128default and replace the terse// commented out until DataFusion improves...note with a comment recording the concrete reasons (accumulator overflow + 128-bit binding assumption, with the upstream issue link). The Arrow executor still emitsDecimal32/Decimal64when a consumer explicitly requests that target type — only the inferred default staysDecimal128.Follow-up option
If we want the storage/bandwidth win for narrow decimals without breaking these consumers, the path is to make narrowing opt-in (default stays
Decimal128, enabled via a session/config flag). Happy to do that as a separate change.Checks
cargo test -p vortex-array --lib dtype::arrow✅datafusion-bench tpch --formats parquet,vortex --queries 1,6(SF=1, real data) ✅ — row counts validateddevelop(comment-only diff), restoring the green baseline for the TPC-DS bench and JNI jobs.https://claude.ai/code/session_01RctLpues7aLnsxJ86XH9pC