Fix dictionary/output buffer aliasing corruption#1
Merged
handlerug merged 1 commit intoJun 18, 2026
Merged
Conversation
This was referenced Jun 18, 2026
Owner
|
Thanks! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Builds on the dictionary-decompression support added in this branch (101arrowz#18).
The bug
When a dictionary is loaded,
rdic()replacesst.wwith the dictionary history:In
decompress()(no output buffer provided), the fast path then reusesst.wdirectly as the output buffer when its length equals the frame's decompressed size:So output is written into the same buffer that holds the dictionary history, while sequence back-references read from it. As soon as a back-reference points into the dictionary region that has already been overwritten by output, the result is silently corrupted (no error thrown). It only triggers when the dictionary-content length happens to equal the decompressed size, so it slips through typical tests.
The fix
Skip the fast path when a dictionary is present, so a separate output buffer is used and the dictionary history stays read-only:
Reproduction
A regression test covering exactly this case is included in
tests/simple_cases_test.ts(generates the dictionary, embeds the 22-byte compressed frame, asserts the decode equals the dictionary content).Context
Found while integrating this branch into a read-only Icechunk reader (icechunk-js), which uses zstd dictionary compression for virtual chunk locations. Verified the fix against the reference
zstdCLI and confirmed it does not regress non-dictionary decompression.