[SYSTEMDS-3949] Add native Delta Lake frame read/write via Delta Kernel #2515

Open
Baunsgaard wants to merge 2 commits into apache:main from Baunsgaard:delta-frame-io

@Baunsgaard

Contributor

Extend the native Delta Lake support (#2511) from matrices to frames, reading and writing Delta Lake tables through the Spark-free Delta Kernel library on the single-node CP path. DML read/write with format="delta" now works for frames, discovering schema, column names, and dimensions directly from the table.

Stacked on #2511 and should merge after it. Append/overwrite semantics, distributed execution, and time travel remain out of scope.

@codecov

codecov Bot commented Jun 25, 2026

Codecov Report

❌ Patch coverage is 78.41191% with 174 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.60%. Comparing base (d4e7def) to head (cc33e9e).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...g/apache/sysds/runtime/io/ReaderDeltaParallel.java | 47.56% | 40 Missing and 3 partials ⚠️ |
| .../org/apache/sysds/runtime/io/DeltaKernelUtils.java | 75.00% | 19 Missing and 17 partials ⚠️ |
| ...che/sysds/runtime/io/FrameReaderDeltaParallel.java | 81.60% | 18 Missing and 14 partials ⚠️ |
| .../java/org/apache/sysds/runtime/io/ReaderDelta.java | 70.73% | 16 Missing and 8 partials ⚠️ |
| .../org/apache/sysds/runtime/io/FrameReaderDelta.java | 90.62% | 0 Missing and 15 partials ⚠️ |
| .../java/org/apache/sysds/runtime/io/WriterDelta.java | 81.53% | 7 Missing and 5 partials ⚠️ |
| .../org/apache/sysds/runtime/io/FrameWriterDelta.java | 89.18% | 4 Missing and 4 partials ⚠️ |
| ...g/apache/sysds/runtime/io/MatrixReaderFactory.java | 25.00% | 2 Missing and 1 partial ⚠️ |
| ...n/java/org/apache/sysds/parser/DataExpression.java | 83.33% | 0 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2515      +/-   ##
============================================
+ Coverage     71.56%   71.60%   +0.03%     
- Complexity    49125    49310     +185     
============================================
  Files          1575     1582       +7     
  Lines        189784   190583     +799     
  Branches      37232    37395     +163     
============================================
+ Hits         135823   136461     +638     
- Misses        43470    43565      +95     
- Partials      10491    10557      +66     

Introduce a DELTA file format that reads and writes Delta Lake tables
natively through the Spark-free Delta Kernel library, for matrices on the
single-node CP path. DML read/write with format="delta" now operates
directly on Delta tables without a Spark DataFrame round-trip.

- Add FileFormat.DELTA and exclude it from the text formats
- Accept format="delta" with unknown dimensions in the parser and set
  blocksize -1 for the columnar format
- Wire DELTA into the matrix reader and writer factories
- Add DeltaKernelUtils plus serial and parallel native Delta readers and
  WriterDelta with column-at-a-time, boxing-free data transfer
- Expose Delta reader batch size and writer target file size via DMLConfig
- Refresh cached matrix metadata after a Delta read (discovered dimensions)
- Add a parquet.version property and pin delta-kernel 3.3.2
- Run Delta component IO tests in CI and broaden matrix coverage
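
The "column-at-a-time, boxing-free data transfer" item above can be pictured with a small sketch. This is not the code from this change: `ColumnTransferSketch`, `DoubleColumn`, `of`, and `toRowMajor` are hypothetical names, and the interface is only a stand-in for a columnar batch column with primitive accessors. It assumes the target is a dense row-major `double[]`, which may differ from what the actual readers do.

```java
import java.util.Arrays;

/** Hypothetical illustration only; names and layout are assumptions. */
public class ColumnTransferSketch {

    /** Stand-in for a columnar batch column that exposes primitive access. */
    interface DoubleColumn {
        int size();
        double getDouble(int row);
    }

    /** Wraps a primitive array as a column (for demonstration only). */
    static DoubleColumn of(double... values) {
        return new DoubleColumn() {
            public int size() { return values.length; }
            public double getDouble(int row) { return values[row]; }
        };
    }

    /**
     * Copies the columns into a dense row-major array, one column at a time.
     * Only primitive doubles are moved, so no Double objects are allocated.
     */
    static double[] toRowMajor(DoubleColumn[] cols, int rows) {
        int ncol = cols.length;
        double[] dense = new double[rows * ncol];
        for (int c = 0; c < ncol; c++) {
            DoubleColumn col = cols[c];
            for (int r = 0; r < rows; r++)
                dense[r * ncol + c] = col.getDouble(r);
        }
        return dense;
    }

    public static void main(String[] args) {
        DoubleColumn[] cols = { of(1, 2, 3), of(10, 20, 30) };
        // prints [1.0, 10.0, 2.0, 20.0, 3.0, 30.0]
        System.out.println(Arrays.toString(toRowMajor(cols, 3)));
    }
}
```

Iterating per column keeps the type dispatch outside the inner loop, and reading through a primitive accessor avoids allocating one boxed object per cell.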

Append/overwrite table semantics, distributed execution, frames, and time
travel are out of scope.

Extend the native Delta Lake support from matrices to frames, reading and
writing Delta Lake tables through the Spark-free Delta Kernel library on the
single-node CP path. DML read/write with format="delta" now works for
frames, discovering schema, column names, and dimensions directly from the
table.

- Add FrameReaderDelta, FrameReaderDeltaParallel and FrameWriterDelta
- Wire DELTA into the frame reader and writer factories
- Refresh cached frame metadata and schema after a Delta read
- Broaden Delta frame component IO coverage

Stacked on the matrix Delta support; append/overwrite semantics,
distributed execution, and time travel remain out of scope.