[BWARE] Handle hash columns in transform decoders and tighten decode metadata by Baunsgaard · Pull Request #2479 · apache/systemds

Baunsgaard · 2026-06-08T13:00:25Z

Reworks the transform decoders so feature-hashed columns survive the inverse-transform path and decode metadata is initialized consistently. The dummycode hash domain size K is now recovered from the meta cell instead of numDistinct, and each decoder builds its output->source column mapping in initMetaData via a shared helper that accounts for dummycode expansion. DecoderRecode skips hash columns (their bucket codes pass through unchanged), applies recode-on-output when dummycoding is present, and reports a bounded error on malformed recode entries.

The parallel row-block decode is hoisted into the base Decoder so all decoders share it (DecoderComposite drops its own override); DecoderDummycode gains separate dense/sparse paths; DecoderBin serialization is fixed to persist the column mappings and the dead _numBins field is removed; DecoderFactory routes the hash/dummycode/bin/passthrough column sets accordingly. StringArray.getAsDouble also gets a behavior-equivalent boolean-token fast path (drive-by perf cleanup).

Testing: a new TransformDecodeRoundTripTest covers exact-inverse round trips (dense/sparse/parallel), hash + dummycode + bin combinations, federated sub-range decoding, and error/edge paths (corrupt recode meta, parallel-decode interruption, base sub-range rejection); TransformDecodeTest adds dense/sparse hash + dummycode consistency cases.

codecov · 2026-06-08T22:39:49Z

Codecov Report

❌ Patch coverage is 93.75000% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.49%. Comparing base (d71d469) to head (1203139).

Files with missing lines	Patch %	Lines
...apache/sysds/runtime/transform/decode/Decoder.java	92.50%	0 Missing and 3 partials ⚠️
...che/sysds/runtime/transform/decode/DecoderBin.java	92.59%	0 Missing and 2 partials ⚠️
...sds/runtime/transform/decode/DecoderDummycode.java	95.23%	0 Missing and 2 partials ⚠️
...s/runtime/transform/decode/DecoderPassThrough.java	83.33%	0 Missing and 2 partials ⚠️
...sysds/runtime/transform/decode/DecoderFactory.java	94.44%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #2479      +/-   ##
============================================
+ Coverage     71.46%   71.49%   +0.02%     
- Complexity    48904    48932      +28     
============================================
  Files          1573     1573              
  Lines        189266   189334      +68     
  Branches      37137    37149      +12     
============================================
+ Hits         135266   135371     +105     
+ Misses        43546    43488      -58     
- Partials      10454    10475      +21

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

… flow Reworks transform decoders so that hash-encoded columns survive the inverse-transform path, and tightens how decoder metadata (column indices, value mappings) is propagated and initialized. - Decoder: pass column-id arrays through decode/decodeFromMap so each decoder knows its own output column range - DecoderRecode: skip recode for hash columns, keep encoded ints passthrough; init metadata from frame consistently - DecoderDummycode: handle hash columns when expanding categorical bits; parallel decode path; sparse-friendly init - DecoderPassThrough / DecoderBin / DecoderComposite / DecoderFactory: consume the new column-id arrays from the dispatch layer - ColumnEncoderFeatureHash: align hash bookkeeping with the decode-side changes - Frame columns (HashMapToInt, StringArray): small support changes consumed by the decoder path above

Fix two regressions in the transform decode rewrite that broke encode/decode roundtrips on dummycoded/recoded frames: - DecoderDummycode.decodeSparse compared 0-based sparse column indexes against the 1-based _clPos/_cuPos bounds used by the dense path (in.get(i, k-1)). This shifted every lookup by one column, so the first category was never matched (decoded as null) and all others decoded one code too low. Shift the sparse bounds and index to be 0-based, matching the dense path. - Restore the public no-arg constructors on DecoderComposite, DecoderBin, DecoderPassThrough, and DecoderRecode. Decoder is Externalizable, and Spark broadcasts the top-level decoder via Java serialization, which requires a public no-arg constructor; without it deserialization fails with InvalidClassException on executors. Restores passing of TransformFrameEncodeColmapTest, TransformFrameEncodeDecodeTest, TransformCSVFrameEncodeDecodeTest, and TransformFrameEncodeDecodeTokenTest in single-node and Spark modes.

Cover the decoder paths touched by the hash-column and decode-metadata changes: parallel block decode equals serial decode, the sparse and dense dummycode decode paths agree, feature-hash columns decode through dummycode via the magic domain-size metadata, and bin columns whose source position is shifted by dummycoding of another column. Add exact inverse round-trip checks for recode and dummycode to validate the sparse binary-search decode against ground truth.

Replace the toLowerCase plus equals chain in the parse fallback with a length-based dispatch: a single char compare for the 1-char "t"/"f" tokens and compareToIgnoreCase for "true"/"false", matching the idiom already used in DoubleArray.parseDouble. This avoids allocating a lower-cased copy and rejects non-boolean strings immediately. Restore throwing DMLRuntimeException on unparseable input. The previous re-throw of the raw NumberFormatException changed the exception type and broke callers such as Array.extractDouble that expect DMLRuntimeException; the throw path is the genuinely-exceptional case, so the wrapping cost is irrelevant there.

DecoderBin gained derived decode state (_srcCols/_dcCols) that was not captured by writeExternal/readExternal, so a decoder broadcast to Spark executors (which do not re-run initMetaData) would NPE on every binned transformdecode. Serialize _srcCols/_dcCols alongside the bin boundaries, mirroring DecoderPassThrough. Register DecoderBin in DecoderFactory.getDecoderType/createInstance so a composite decoder containing a bin decoder can be serialized at all; previously it threw "Unsupported decoder type" before reaching DecoderBin.writeExternal. Remove the _numBins field: it was allocated but never populated, so writeExternal wrote a zero length for every column and dropped all bin boundaries on deserialization. The per-column bin count is recovered from the boundary array length instead, and readExternal now allocates the boundary arrays it reads into. Add serialization round-trip tests (plain bin and bin-with-dummycode) plus tests for the bin source-column offset mapping, the key==0 bin branch, and the StringArray boolean-token getAsDouble fallback.

For feature-hashed columns the hash domain size K is stored in the single transform meta cell, not materialized as numDistinct rows like recode/bin. Stop overloading numDistinct=K on the hash meta column and instead pass the set of dummycoded hash columns from DecoderFactory to the dummycode/bin/ passthrough decoders, which read K from the cell when sizing the dummycode expansion. This keeps numDistinct semantically the meta column's own cardinality and avoids any sentinel in the cell value. Also trim verbose comments introduced in the transform decoders and remove the dead commented-out _rcMapsDirect block (and its unused max accumulator) in DecoderRecode.

- Remove leftover debug try/catch with LOG.error dumps from DecoderBin.decode so the per-row bin decode loop no longer materializes a full column slice and logs raw values on failure - Replace no-op continue with break in DecoderDummycode.decodeDense so the one-hot scan stops at the first set bit, matching the sparse decode path - Rewrite transform decode test comments to describe the real mechanism: the hash domain size K is stored as a plain integer in the single meta cell and read instead of numDistinct (drops the obsolete "magic K" wording)

Document why hash columns are excluded from the recode set: feature hashing is a lossy, one-way transform with no inverse recode map, so hash columns are never recode-decoded. Also note the recode-on-output flag is set when dummycoding is present, since recode then runs after the categorical columns are rebuilt.

- Add tests covering the parallel decode error wrapper, recode meta reinitialization (corrupt-entry rejection and end-of-map break), and the dummycode sub-range decoder and index-range shift used by federated transform-decode - Remove the unused 3-arg DecoderBin constructor; the factory only constructs the 4-arg form with hash columns These exercise the decode paths added when handling hash columns in transform decoders, closing the patch coverage gaps on the new logic.

- Bound the recode reinit error to the offending column id and meta row instead of dumping the entire recode-map column, which can be large and may contain sensitive category values - Propagate parallel-decode worker failures as DMLRuntimeException and restore the interrupt flag on InterruptedException - Extract the duplicated dummycode source-column offset walk from DecoderBin and DecoderPassThrough into a shared Decoder.buildSrcCols helper, which also guards against a null dummycode column array - Strengthen the recode and parallel-decode error-path tests to assert the specific failure surface and preserved cause

Add a deterministic test using a minimal in-test decoder that interrupts the calling thread mid parallel-decode, verifying the wrapper restores the interrupt flag (which Future.get clears on throw) and surfaces the failure as a DMLRuntimeException. The same decoder exercises the base subRangeDecoder rejection path, closing the remaining uncovered lines in Decoder.

github-project-automation Bot added this to SystemDS PR Queue Jun 8, 2026

github-project-automation Bot moved this to In Progress in SystemDS PR Queue Jun 8, 2026

Baunsgaard force-pushed the split/decoderHash branch 2 times, most recently from d0083b1 to 7e357a5 Compare June 16, 2026 12:18

Baunsgaard added 11 commits June 23, 2026 20:53

Baunsgaard force-pushed the split/decoderHash branch from 04df63b to 1203139 Compare June 23, 2026 22:05

Baunsgaard merged commit 2687bbe into apache:main Jun 23, 2026
50 checks passed

github-project-automation Bot moved this from In Progress to Done in SystemDS PR Queue Jun 23, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BWARE] Handle hash columns in transform decoders and tighten decode metadata#2479

[BWARE] Handle hash columns in transform decoders and tighten decode metadata#2479
Baunsgaard merged 11 commits into
apache:mainfrom
Baunsgaard:split/decoderHash

Baunsgaard commented Jun 8, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 8, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

Baunsgaard commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov Bot commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Baunsgaard commented Jun 8, 2026 •

edited

Loading

codecov Bot commented Jun 8, 2026 •

edited

Loading