[BWARE] Add sort support to compressed column groups #2507
Merged
Implement single-column sort for compressed matrices via a new AColGroup.sort that reorders the dictionary and remaps indexes. Add CLALibSort driver, IDictionary/Dictionary sort with shared index permutation, and per-column-group sort implementations.
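The idea of reordering the dictionary and remapping the row indexes can be illustrated with a minimal sketch. It assumes a DDC-like layout (one dictionary entry id per row); the class and method names below are hypothetical and are not the actual AColGroup or IDictionary API.

```java
import java.util.Arrays;

// Hypothetical sketch of a dictionary-based sort; not the SystemDS implementation.
final class DictionarySortSketch {

    /** Returns a permutation p such that dict[p[0]] <= dict[p[1]] <= ... */
    static int[] sortPermutation(double[] dict) {
        Integer[] order = new Integer[dict.length];
        for (int i = 0; i < order.length; i++)
            order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(dict[a], dict[b]));
        int[] perm = new int[order.length];
        for (int i = 0; i < perm.length; i++)
            perm[i] = order[i];
        return perm;
    }

    /** Applies the permutation to the dictionary values. */
    static double[] reorder(double[] dict, int[] perm) {
        double[] sorted = new double[dict.length];
        for (int i = 0; i < perm.length; i++)
            sorted[i] = dict[perm[i]];
        return sorted;
    }

    /** Builds the sorted row-to-entry mapping by counting rows per entry. */
    static int[] sortedRowIndexes(int[] rowToEntry, int[] perm) {
        int[] counts = new int[perm.length];
        for (int id : rowToEntry)
            counts[id]++;
        int[] out = new int[rowToEntry.length];
        int row = 0;
        // Entries are visited in ascending value order; each one fills a contiguous run of rows.
        for (int newId = 0; newId < perm.length; newId++)
            for (int c = 0; c < counts[perm[newId]]; c++)
                out[row++] = newId;
        return out;
    }

    /** Expands dictionary and mapping back into a plain column. */
    static double[] decompress(double[] dict, int[] rowToEntry) {
        double[] out = new double[rowToEntry.length];
        for (int i = 0; i < out.length; i++)
            out[i] = dict[rowToEntry[i]];
        return out;
    }
}
```

In this sketch only the distinct values are compared; the rows are then laid out by counting, so the cost of the comparison sort depends on the dictionary size rather than the number of rows.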
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## main #2507 +/- ##
============================================
- Coverage 71.56% 71.55% -0.02%
- Complexity 49052 49113 +61
============================================
Files 1574 1575 +1
Lines 189565 189784 +219
Branches 37188 37232 +44
============================================
+ Hits 135658 135791 +133
- Misses 43422 43496 +74
- Partials 10485 10497 +12
Add CompressedSortTest covering the single-column sort of compressed column groups (DDC, SDC, SDCSingle, SDCSingleZeros, CONST) by comparing the decompressed result against an ascending reference sort.

Fix ColGroupSDCSingleZeros.sort and ColGroupSDCSingle.sort, which both indexed the per-value counts array beyond the dictionary sort length (throwing ArrayIndexOutOfBoundsException) and never advanced the offset cursor. These encodings hold a single non-default value, so the sorted column is one contiguous block of that value placed before or after the default values, depending on its sign/ordering relative to the default.
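A minimal sketch of the contiguous-block placement for a single non-default value, assuming an ascending sort. The names are hypothetical and the layout is simplified to a plain offsets array rather than the actual SDC offset structure.

```java
import java.util.Arrays;

// Hypothetical sketch of sorting an SDCSingle-style column; not the SystemDS code.
final class SdcSingleSortSketch {

    /**
     * Returns the row offsets of the non-default value after an ascending sort.
     * The value occupies rows [0, count) if it is smaller than the default,
     * and rows [nRows - count, nRows) otherwise.
     */
    static int[] sortedOffsets(double value, double defaultValue, int count, int nRows) {
        int start = value < defaultValue ? 0 : nRows - count;
        int[] offsets = new int[count];
        for (int i = 0; i < count; i++)
            offsets[i] = start + i; // cursor advances once per occurrence
        return offsets;
    }

    /** Expands the encoding into a plain column for comparison with a reference sort. */
    static double[] decompress(double value, double defaultValue, int[] offsets, int nRows) {
        double[] out = new double[nRows];
        Arrays.fill(out, defaultValue);
        for (int o : offsets)
            out[o] = value;
        return out;
    }
}
```

For the SDCSingleZeros case the default is 0, so the comparison reduces to the sign of the single value.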
Route the order (SortIndex) reorg through CLALibSort so a single column held in a single column group is sorted ascending while staying compressed. Multiple columns, multiple groups, descending order, index return, or encodings without a sort implementation fall back to a decompressed reorg via a shared fallback in CLALibReorg.

Rewrite CLALibSort to expose a SortIndex-based entry that returns null when the compressed fast-path does not apply, instead of the previous unused, semantically inconsistent sortOperations-style helper.

Fix ColGroupUncompressed.sort, which built a quantile value/weight table via sortOperations instead of ordering the column; it now reorders the rows ascending.

Expand CompressedSortTest to drive the order operation end to end through reorgOperations, covering compressed sorting (DDC, SDC variants, CONST, uncompressed column group) and the decompression fallbacks (descending and multi-column).
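The "return null when the fast path does not apply" contract can be sketched as follows. This is an illustrative stand-in only: a plain double[] plays the role of the sort column, the parameters are hypothetical, and only some of the fallback conditions listed above (column count, group count, descending order, missing sort implementation) are modeled.

```java
import java.util.Arrays;

// Hypothetical sketch of a fast path with a null-signalled fallback; not the CLALibSort API.
final class SortFallbackSketch {

    /** Returns the ascending sort, or null if the compressed fast path does not apply. */
    static double[] trySortCompressed(double[] column, int nCols, int nGroups,
            boolean descending, boolean groupHasSort) {
        if (nCols != 1 || nGroups != 1 || descending || !groupHasSort)
            return null; // caller must fall back
        double[] out = column.clone();
        Arrays.sort(out);
        return out;
    }

    /** Stand-in for the decompressed reorg: handles both sort directions. */
    static double[] sortDecompressed(double[] column, boolean descending) {
        double[] out = column.clone();
        Arrays.sort(out);
        if (descending) {
            for (int i = 0, j = out.length - 1; i < j; i++, j--) {
                double tmp = out[i];
                out[i] = out[j];
                out[j] = tmp;
            }
        }
        return out;
    }

    /** Tries the fast path first and falls back when it returns null. */
    static double[] order(double[] column, int nCols, int nGroups,
            boolean descending, boolean groupHasSort) {
        double[] fast = trySortCompressed(column, nCols, nGroups, descending, groupHasSort);
        return fast != null ? fast : sortDecompressed(column, descending);
    }
}
```

Keeping the applicability check inside the fast-path entry means callers need only one null check, and a single shared fallback serves every unsupported case.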
… table

Restore the CompressedMatrixBlock.sortOperations(weights, result, k) override so the qsort/median/quantile path (SortKeys lop) runs through CLALibSort instead of always decompressing.

For the unweighted single-column single-group case, CLALibSort now sorts the few distinct values via the column-group sort and builds the exact (1 + nnz) x 2 value/weight table that MatrixBlock.sortOperations produces: one row per non-zero value (weight 1) plus a single collapsed zero row, ordered ascending. This keeps downstream pickValue/median/IQM results bit-identical (their averaging logic depends on the per-element table layout) while avoiding a full-length sort. Weighted, multi-column, multi-group, or unsupported encodings fall back to a decompressed sort.

Add median/quantile coverage to CompressedSortTest comparing the compressed value/weight table and the resulting median/quantile picks against the uncompressed reference.
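The (1 + nnz) x 2 table layout can be sketched from distinct values and their counts. This is a hedged illustration with hypothetical names; in particular, it assumes that the collapsed zero row carries the number of zero cells as its weight, which the description above does not state explicitly.

```java
import java.util.Arrays;

// Hypothetical sketch of the value/weight table layout; not the CLALibSort code.
final class ValueWeightTableSketch {

    /**
     * Builds a (1 + nnz) x 2 table: one {value, 1} row per non-zero cell plus
     * one collapsed {0, zeroCount} row, ordered ascending by value.
     * ASSUMPTION: the zero row's weight is the number of zero cells.
     */
    static double[][] build(double[] distinct, int[] counts, int nRows) {
        int nnz = 0;
        for (int i = 0; i < distinct.length; i++)
            if (distinct[i] != 0)
                nnz += counts[i];
        double[][] table = new double[nnz + 1][2];
        int r = 0;
        for (int i = 0; i < distinct.length; i++) {
            if (distinct[i] == 0)
                continue; // zeros are collapsed into a single row below
            for (int c = 0; c < counts[i]; c++) {
                table[r][0] = distinct[i];
                table[r][1] = 1;
                r++;
            }
        }
        table[r][0] = 0;
        table[r][1] = nRows - nnz;
        // Order rows ascending by value, which moves the zero row into place.
        Arrays.sort(table, (a, b) -> Double.compare(a[0], b[0]));
        return table;
    }
}
```

Only the distinct values and counts are read here, so building the table does not require materializing the full column first.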
Emit the compressed single-column quantile/median value-weight table through the same reorg used by MatrixBlock.sortOperations instead of building it and calling recomputeNonZeros. The uncompressed reference leaves the result's non-zero count unmaintained (0); recomputing it on the compressed side made the two paths asymmetric and broke CompressedVectorTest.testSortOperations, which relies on both sides reporting the same empty/non-empty state. Routing through reorg makes the produced table bit-for-bit identical to the uncompressed path, including its metadata.
Add a testSort case mirroring testSortOperations so the new compressed order() reorg path (CLALibSort) is exercised across the full single-column parameter matrix (sparsity, value type, value range, DDC/SDC/UNCOMPRESSED), comparing the compressed result against the uncompressed reference. This cheaply hits many more encoding variations than the dedicated CompressedSortTest.
- Add a single-column sort() test to DictionaryTests covering MatrixBlockDictionary.sort() and Dictionary.sort(), validating the returned permutation yields a non-decreasing sequence.
- Add CLALibSort fallback tests: index-return order, unsupported (OLE) encoding for both the order and quantile paths, and a dense all-negative quantile table.
- Simplify CLALibSort.sortTableSingleColumn: the value/weight table is re-sorted by the reorg used for metadata parity, so the explicit negative/zero/positive ordering and zeroWritten tracking were redundant. Emit one weight-1 row per non-zero plus one collapsed zero row in any order, matching MatrixBlock.sortOperations.
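The permutation check described in the first item could look roughly like the following. This is a sketch with hypothetical helper names, independent of the actual DictionaryTests code and of any test framework.

```java
// Hypothetical sketch of validating a sort permutation; not the DictionaryTests code.
final class PermutationCheckSketch {

    /** True if perm contains every index in [0, n) exactly once. */
    static boolean isPermutation(int[] perm, int n) {
        if (perm.length != n)
            return false;
        boolean[] seen = new boolean[n];
        for (int p : perm) {
            if (p < 0 || p >= n || seen[p])
                return false;
            seen[p] = true;
        }
        return true;
    }

    /** True if reading values in perm order gives a non-decreasing sequence. */
    static boolean yieldsNonDecreasing(double[] values, int[] perm) {
        if (!isPermutation(perm, values.length))
            return false;
        for (int i = 1; i < perm.length; i++)
            if (values[perm[i - 1]] > values[perm[i]])
                return false;
        return true;
    }
}
```

Checking the order property rather than an exact expected permutation keeps such a test valid when duplicate values allow more than one correct answer.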