[SYSTEMDS-3949] Add native Delta Lake matrix read/write via Delta Kernel #2511
Open
Baunsgaard wants to merge 2 commits into
Codecov Report
❌ Patch coverage is
@@             Coverage Diff              @@
##               main    #2511     +/-   ##
============================================
+ Coverage     71.56%   71.60%   +0.04%
- Complexity    49110    49228     +118
============================================
  Files          1575     1579       +4
  Lines        189793   190162     +369
  Branches      37235    37297      +62
============================================
+ Hits         135816   136157     +341
- Misses        43480    43503      +23
- Partials      10497    10502       +5
06deac3 to d15de22
Native Delta read performance vs binary reader
Benchmark (
Workload: 5,000,000 × 16 dense FP64 matrix (~640 MB in memory, sparsity 1.0).
Takeaways
For reference, the write path is encode-bound: Delta write ~7.0 s vs binary ~0.15 s.
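As a quick sanity check on the workload figures quoted above, the in-memory size and the write-path ratio can be reproduced with a few lines of arithmetic. This is only a sketch: `DenseFootprint` and `denseBytes` are hypothetical names introduced here, not code from the PR, and the timings are just the approximate values reported in the comment.

```java
// Hypothetical helper for checking the benchmark arithmetic; not part of the PR.
class DenseFootprint {

    // Bytes needed for a dense rows x cols matrix of FP64 values (8 bytes each).
    static long denseBytes(long rows, long cols) {
        return Math.multiplyExact(Math.multiplyExact(rows, cols), (long) Double.BYTES);
    }

    public static void main(String[] args) {
        long bytes = denseBytes(5_000_000L, 16L);
        // Reported in decimal megabytes, matching the "~640 MB" figure.
        System.out.println(bytes / 1_000_000L + " MB");
        // Ratio of the reported write timings (~7.0 s Delta vs ~0.15 s binary).
        System.out.printf("write slowdown ~%.0fx%n", 7.0 / 0.15);
    }
}
```

The overflow-checked multiplication is a deliberate choice: row and column counts for large matrices can exceed the `int` range once multiplied.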
92fc3c8 to d5b6104
Introduce a DELTA file format that reads and writes Delta Lake tables natively through the Spark-free Delta Kernel library, for matrices on the single-node CP path. DML read/write with format="delta" now operates directly on Delta tables without a Spark DataFrame round-trip.

- Add FileFormat.DELTA and exclude it from the text formats
- Accept format="delta" with unknown dimensions in the parser and set blocksize -1 for the columnar format
- Wire DELTA into the matrix reader and writer factories
- Add DeltaKernelUtils plus serial and parallel native Delta readers and WriterDelta with column-at-a-time, boxing-free data transfer
- Expose Delta reader batch size and writer target file size via DMLConfig
- Refresh cached matrix metadata after a Delta read (discovered dimensions)
- Add a parquet.version property and pin delta-kernel 3.3.2
- Run Delta component IO tests in CI and broaden matrix coverage

Append/overwrite table semantics, distributed execution, frames, and time travel are out of scope.
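To illustrate what "column-at-a-time, boxing-free data transfer" could look like, here is a minimal sketch that copies one columnar batch into a row-major dense array using only primitive `double` access. Everything in it is an assumption for illustration: `DoubleColumn` is a hypothetical stand-in for a columnar vector type, and `copyBatch` is not the PR's actual reader code.

```java
// Hypothetical sketch of a column-at-a-time copy; names are illustrative only.
class ColumnAtATimeCopy {

    // Stand-in for a columnar vector exposing primitive access (no Double boxing).
    interface DoubleColumn {
        int size();
        double getDouble(int rowId);
    }

    // Copies a batch of columns into a row-major dense array, one column at a time.
    // rowOffset is the first destination row of this batch; ncol is the matrix width.
    static void copyBatch(DoubleColumn[] cols, double[] dense, int rowOffset, int ncol) {
        for (int c = 0; c < cols.length; c++) {
            DoubleColumn col = cols[c];
            int n = col.size();
            for (int r = 0; r < n; r++)
                dense[(rowOffset + r) * ncol + c] = col.getDouble(r);
        }
    }

    // Convenience wrapper turning a primitive array into a DoubleColumn.
    static DoubleColumn of(double... values) {
        return new DoubleColumn() {
            public int size() { return values.length; }
            public double getDouble(int rowId) { return values[rowId]; }
        };
    }

    public static void main(String[] args) {
        double[] dense = new double[4];
        copyBatch(new DoubleColumn[] { of(1, 3), of(2, 4) }, dense, 0, 2);
        System.out.println(java.util.Arrays.toString(dense));
    }
}
```

Iterating the outer loop over columns keeps each source vector hot while it is drained, at the cost of strided writes into the row-major target; whether that trade-off matches the PR's implementation is not confirmed by the commit message.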
d5b6104 to b41b6db
- Bound each per-file decode task in the direct parallel read path to its numRecords-derived row slice, so a Delta file that decodes more rows than its statistic claims fails with a clear error instead of overflowing into the next file's region (concurrent overlapping writes) or off the array.
- Use the parallel recomputeNonZeros(_numThreads) in the buffered read path to match the direct path; the buffered fallback handles the largest matrices.
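The row-slice bounding described in the first bullet can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the PR's code: `FileChunk`, `readAll`, and `decodeInto` are hypothetical names, and a decoded file is modelled as an in-memory array of rows rather than a real Delta data file. Each file gets a disjoint row range from a prefix sum over the claimed `numRecords`, and a task that decodes past its range fails instead of writing into a neighbour's region.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of bounding parallel per-file decode tasks to row slices.
class BoundedSliceDecode {

    // Stand-in for one data file: the row count its statistics claim,
    // and the rows that decoding actually yields.
    static final class FileChunk {
        final long numRecords;
        final double[][] decodedRows;
        FileChunk(long numRecords, double[][] decodedRows) {
            this.numRecords = numRecords;
            this.decodedRows = decodedRows;
        }
    }

    static double[] readAll(List<FileChunk> files, int ncol, int numThreads) {
        // Prefix sum over claimed row counts: file i owns rows [start[i], start[i+1]).
        long[] start = new long[files.size() + 1];
        for (int i = 0; i < files.size(); i++)
            start[i + 1] = start[i] + files.get(i).numRecords;
        double[] out = new double[Math.toIntExact(start[files.size()] * ncol)];

        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<?>> tasks = new ArrayList<>();
            for (int i = 0; i < files.size(); i++) {
                final int f = i;
                tasks.add(pool.submit(() ->
                    decodeInto(files.get(f), f, start[f], start[f + 1], ncol, out)));
            }
            for (Future<?> t : tasks)
                t.get(); // surfaces any task failure
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause().getMessage(), e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
        return out;
    }

    // Writes the file's rows into its own slice; refuses to cross rowEnd.
    static void decodeInto(FileChunk file, int fileIndex, long rowStart, long rowEnd,
                           int ncol, double[] out) {
        long row = rowStart;
        for (double[] r : file.decodedRows) {
            if (row >= rowEnd)
                throw new IllegalStateException("file " + fileIndex
                    + " decoded more rows than its numRecords=" + file.numRecords);
            System.arraycopy(r, 0, out, Math.toIntExact(row * ncol), ncol);
            row++;
        }
    }
}
```

For example, calling `readAll` with a chunk that claims one record but carries two rows raises the `IllegalStateException` before the extra row is copied, which is the failure mode the bullet describes. The sketch does not handle a file that decodes fewer rows than claimed, since the commit message does not say how that case is treated.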