[BWARE] Add removeEmpty support to compressed column groups #2504

Merged
Baunsgaard merged 6 commits into apache:main from Baunsgaard:split/compressedRemoveEmpty
Jun 24, 2026

@Baunsgaard

Contributor

Implement removeEmptyRows and removeEmptyCols for compressed matrices without full decompression. Add CLALibRemoveEmpty driver and LibMatrixReorg helpers (rmemptyEarlyAbort/rmemptyUnsafe), per-column-group removeEmptyRows/removeEmptyColsSubset, dictionary sliceColumns, and offset/mapping support for index-only row removal.
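
To illustrate why row removal does not need full decompression for a dictionary-encoded group, here is a minimal sketch. `DdcSketch`, its fields, and its method names are hypothetical stand-ins, not the SystemDS classes; it assumes a single-column group whose per-row mapping points into a small dictionary. Only the mapping is filtered by the selection vector, while the dictionary is reused as-is.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a dictionary-encoded column group; not the SystemDS class. */
final class DdcSketch {
    final double[] dict; // one value per dictionary entry (single column for brevity)
    final int[] map;     // map[r] = dictionary index of row r

    DdcSketch(double[] dict, int[] map) {
        this.dict = dict;
        this.map = map;
    }

    /** Keep only rows r where select[r] is true; the dictionary is shared, not copied. */
    DdcSketch removeEmptyRows(boolean[] select) {
        if (select.length != map.length)
            throw new IllegalArgumentException("select length must equal row count");
        int kept = 0;
        for (boolean s : select)
            if (s)
                kept++;
        int[] out = new int[kept];
        int o = 0;
        for (int r = 0; r < map.length; r++)
            if (select[r])
                out[o++] = map[r];
        return new DdcSketch(dict, out);
    }

    /** Materialize the column; used here only to inspect the result. */
    double[] decompress() {
        double[] v = new double[map.length];
        for (int r = 0; r < map.length; r++)
            v[r] = dict[map[r]];
        return v;
    }

    public static void main(String[] args) {
        DdcSketch g = new DdcSketch(new double[] {0.0, 7.0, 3.0}, new int[] {1, 0, 2, 0, 1});
        boolean[] select = {true, false, true, false, true};
        System.out.println(Arrays.toString(g.removeEmptyRows(select).decompress()));
    }
}
```

The work is proportional to the number of rows in the mapping, and the dictionary values are never touched, which is the property the index-only row removal in this PR relies on.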

github-project-automation (bot) moved this to In Progress in SystemDS PR Queue on Jun 23, 2026
Baunsgaard changed the title from "Add removeEmpty support to compressed column groups" to "[BWARE] Add removeEmpty support to compressed column groups" on Jun 23, 2026
…ide effect

- Add Apache license header to CLALibRemoveEmpty (fixes RAT license check)
- Remove dead partition/swap helpers and unused imports in Dictionary/QDictionary
  (fixes checkstyle unused-import errors)
- Make MatrixBlockDictionary.sliceColumns read-through instead of calling
  sparseToDense() in place, so it no longer mutates the shared dictionary block
- Validate selected-row count up front in AMapToData.removeEmpty instead of
  catching ArrayIndexOutOfBoundsException for control flow
- Lower the decompress-fallback log in CLALibRemoveEmpty to guarded debug
- Document the null-return contract on removeEmptyCols and rmemptyEarlyAbort
- Remove leftover commented-out debug code in AOffset and ColGroupSDCSingleZeros
- Use Assume.assumeTrue for size-gated removeEmpty tests instead of silent skips
- Revert unrelated CompressForce and EncodeSample test changes out of scope
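
The up-front validation item can be sketched as follows. This is an illustrative assumption of the pattern, not the actual `AMapToData.removeEmpty` code: the class name, the static helper, and the `expectedKept` parameter are hypothetical. The point is that the selected-row count is checked before the output is allocated and filled, so a mismatch surfaces as an explicit argument error rather than as a caught `ArrayIndexOutOfBoundsException`.

```java
/** Hypothetical sketch of validating a selection before filtering a mapping. */
final class SelectValidationSketch {

    /**
     * Filter map by select, where expectedKept is the caller's claimed number
     * of selected rows (an assumed parameter for this sketch).
     */
    static int[] removeEmpty(int[] map, boolean[] select, int expectedKept) {
        if (select.length != map.length)
            throw new IllegalArgumentException(
                "select length " + select.length + " != row count " + map.length);

        // Count first, so an inconsistent claim fails before any write happens.
        int kept = 0;
        for (boolean s : select)
            if (s)
                kept++;
        if (kept != expectedKept)
            throw new IllegalArgumentException(
                "expected " + expectedKept + " selected rows but found " + kept);

        int[] out = new int[kept];
        int o = 0;
        for (int r = 0; r < map.length; r++)
            if (select[r])
                out[o++] = map[r];
        return out;
    }

    public static void main(String[] args) {
        int[] out = removeEmpty(new int[] {4, 5, 6}, new boolean[] {true, false, true}, 2);
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

The extra counting pass costs one scan of the selection vector; in exchange the failure mode is deterministic and carries a message that names the mismatch.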
@codecov

codecov Bot commented Jun 23, 2026


Codecov Report

❌ Patch coverage is 85.82090% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.54%. Comparing base (3871809) to head (a260a01).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ache/sysds/runtime/compress/colgroup/ASDCZero.java | 0.00% | 9 Missing ⚠️ |
| ...ysds/runtime/compress/colgroup/ColGroupDDCLZW.java | 0.00% | 4 Missing ⚠️ |
| ...untime/compress/colgroup/ColGroupUncompressed.java | 63.63% | 4 Missing ⚠️ |
| ...ess/colgroup/dictionary/MatrixBlockDictionary.java | 84.61% | 2 Missing and 2 partials ⚠️ |
| .../sysds/runtime/compress/lib/CLALibRemoveEmpty.java | 92.45% | 2 Missing and 2 partials ⚠️ |
| ...che/sysds/runtime/compress/colgroup/AColGroup.java | 92.00% | 2 Missing ⚠️ |
| ...me/compress/colgroup/ColGroupLinearFunctional.java | 0.00% | 2 Missing ⚠️ |
| ...e/compress/colgroup/ColGroupUncompressedArray.java | 0.00% | 2 Missing ⚠️ |
| .../runtime/compress/colgroup/mapping/AMapToData.java | 88.88% | 1 Missing and 1 partial ⚠️ |
| .../compress/colgroup/dictionary/DeltaDictionary.java | 0.00% | 1 Missing ⚠️ |
... and 4 more
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2504      +/-   ##
============================================
+ Coverage     71.45%   71.54%   +0.09%     
- Complexity    48864    49043     +179     
============================================
  Files          1573     1574       +1     
  Lines        189239   189565     +326     
  Branches      37128    37188      +60     
============================================
+ Hits         135215   135632     +417     
+ Misses        43576    43457     -119     
- Partials      10448    10476      +28     

Route CompressedMatrixBlock.removeEmptyOperations through
CLALibRemoveEmpty.rmempty instead of always decompressing and delegating.
This makes the compressed removeEmpty implementation (per-column-group
removeEmptyRows/removeEmptyColsSubset, dictionary sliceColumns, and
offset/mapping handling) actually reachable; it was previously dead code
with 0% test coverage. The existing removeEmptyOperations tests in
CompressedMatrixTest now exercise the compressed path, while the
null-select case still falls back to full decompression.

Extend CompressedMatrixTest with additional removeEmptyOperations cases
that exercise branches the existing tests missed: emptyReturn=true with a
selection vector, a denser row selection, select-all (nothing removed),
and the all-rows/all-cols-removed paths (rOut==0 / cOut==0). Each case
compares the compressed result against the uncompressed reference across
the full compression-config matrix.

CompressedMatrixBlock.removeEmptyOperations was wired to CLALibRemoveEmpty,
but the select-based fast path iterated column groups and called
removeEmptyRows/removeEmptyColsSubset with no fallback. OLE, RLE,
LinearFunctional, and UncompressedArray groups throw NotImplementedException
for these operations, so a compressed matrix containing one of them would
hard-fail where the previous always-decompress implementation succeeded.

Wrap the per-group loops in rmEmptyRows/rmEmptyCols and route to the existing
decompression fallback on NotImplementedException, restoring the prior
"works for any encoding" contract. Add CompressedRemoveEmptyForcedTest, which
forces OLE/RLE encodings and asserts removeEmptyOperations matches the
uncompressed reference instead of throwing.
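
The fallback described above can be sketched roughly as below. Everything here is a hypothetical stand-in: `Group`, `Supported`, `Unsupported`, and this `rmEmptyRows` are illustrative only, and `UnsupportedOperationException` substitutes for the `NotImplementedException` named in the commit. The sketch also returns plain column arrays for easy checking, whereas the real fast path would keep the result compressed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: try the per-group fast path, fall back to decompression. */
final class RemoveEmptyFallbackSketch {

    interface Group {
        double[] decompress();
        Group removeEmptyRows(boolean[] select);
    }

    /** Plain row filter, used by the supported group and by the fallback. */
    static double[] filter(double[] col, boolean[] select) {
        int kept = 0;
        for (boolean s : select)
            if (s)
                kept++;
        double[] out = new double[kept];
        int o = 0;
        for (int r = 0; r < col.length; r++)
            if (select[r])
                out[o++] = col[r];
        return out;
    }

    static final class Supported implements Group {
        private final double[] values;
        Supported(double[] values) { this.values = values; }
        public double[] decompress() { return values.clone(); }
        public Group removeEmptyRows(boolean[] select) { return new Supported(filter(values, select)); }
    }

    /** Stands in for an encoding without an index-only implementation. */
    static final class Unsupported implements Group {
        private final double[] values;
        Unsupported(double[] values) { this.values = values; }
        public double[] decompress() { return values.clone(); }
        public Group removeEmptyRows(boolean[] select) {
            throw new UnsupportedOperationException("removeEmptyRows not implemented");
        }
    }

    static double[][] rmEmptyRows(List<Group> groups, boolean[] select) {
        try {
            // Fast path: every group handles the selection itself.
            List<Group> reduced = new ArrayList<>();
            for (Group g : groups)
                reduced.add(g.removeEmptyRows(select));
            double[][] out = new double[reduced.size()][];
            for (int i = 0; i < out.length; i++)
                out[i] = reduced.get(i).decompress();
            return out;
        }
        catch (UnsupportedOperationException e) {
            // Fallback: discard partial results, decompress everything, then filter.
            double[][] out = new double[groups.size()][];
            for (int i = 0; i < out.length; i++)
                out[i] = filter(groups.get(i).decompress(), select);
            return out;
        }
    }

    public static void main(String[] args) {
        List<Group> groups = Arrays.asList(
            new Supported(new double[] {1, 0, 2}), new Unsupported(new double[] {3, 0, 4}));
        System.out.println(Arrays.deepToString(rmEmptyRows(groups, new boolean[] {true, false, true})));
    }
}
```

Wrapping the whole loop, rather than each group, means one unsupported encoding sends the entire matrix down the slow path. That is simpler than mixing compressed and decompressed groups and matches the "works for any encoding" contract, at the cost of losing the fast path for the supported groups in that matrix.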

Cover the per-encoding removeEmptyColsSubset dictionary-slicing paths that
were previously untested by selecting a strict subset of a multi-column
column group:

- SDC, SDC-single, and SDCFOR (via sparsifyFOR) reference/default-tuple
  slicing branches.
- DDC subset removal.
- RLE/OLE fallback to decompression when index-only column removal is
  unsupported (NotImplementedException path).
- Selection vectors with unknown (-1) non-zero counts, exercising the
  recompute branches in CLALibRemoveEmpty for both rows and columns.

Identical columns are used so co-coding merges them into one multi-column
group, guaranteeing the subset path is reached instead of the all-selected
copyAndSet shortcut.
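
As a rough illustration of what slicing a dictionary for a column subset involves, the sketch below assumes a dense, row-major dictionary stored as a flat `double[]` with `nCol` values per entry. The class and the `sliceColumns` signature are hypothetical and are not the SystemDS dictionary API. The input array is only read, in the spirit of the read-through behavior described for `MatrixBlockDictionary.sliceColumns`.

```java
import java.util.Arrays;

/** Hypothetical sketch of slicing a column subset out of a row-major dictionary. */
final class DictSliceSketch {

    /**
     * dict holds nCol values per dictionary entry; keep lists the column
     * positions (within the group) to retain, in output order.
     */
    static double[] sliceColumns(double[] dict, int nCol, int[] keep) {
        if (nCol <= 0 || dict.length % nCol != 0)
            throw new IllegalArgumentException("dictionary length must be a multiple of nCol");
        for (int c : keep)
            if (c < 0 || c >= nCol)
                throw new IllegalArgumentException("column index out of range: " + c);

        int nEntries = dict.length / nCol;
        double[] out = new double[nEntries * keep.length];
        for (int i = 0; i < nEntries; i++)
            for (int j = 0; j < keep.length; j++)
                out[i * keep.length + j] = dict[i * nCol + keep[j]]; // read-only access to dict
        return out;
    }

    public static void main(String[] args) {
        // Two dictionary entries over a three-column group; keep columns 0 and 2.
        double[] dict = {1, 2, 3, 4, 5, 6};
        System.out.println(Arrays.toString(sliceColumns(dict, 3, new int[] {0, 2})));
    }
}
```

Because the row mapping of the group is unchanged by column removal, a subset selection only needs a new, narrower dictionary like this one; the all-selected case can skip the copy entirely, which is why the tests above deliberately select a strict subset.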
Baunsgaard merged commit e295b40 into apache:main on Jun 24, 2026
50 checks passed
github-project-automation (bot) moved this from In Progress to Done in SystemDS PR Queue on Jun 24, 2026