Unify compression codec representation with parameterized codec support #18187
Open
xiangfu0 wants to merge 5 commits into apache:master
Introduce internal parsing of compressionCodec strings so Pinot can interpret codec names with optional parameters like ZSTD(3), while keeping the external config field unchanged as a plain string.
- Add CodecSpec model and CodecSpecParser in pinot-spi for structured codec name + parameter map representation
- Thread optional compression level through the forward index write path: FieldConfig -> ForwardIndexConfig -> creators -> writers -> ChunkCompressorFactory -> compressor instances
- Add level-aware constructors to ZstandardCompressor, GzipCompressor, LZ4Compressor, and LZ4WithLengthCompressor
- Preserve all existing behavior when no parameter is provided
- No enum-based codec modeling; string-based design supports future plugin codecs and codec pipelines

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
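To make the "optional level, unchanged default" idea in this commit concrete, here is a minimal sketch using only the JDK's `java.util.zip.Deflater`. It is not Pinot's `GzipCompressor`: the class name `LevelAwareDeflateSketch`, the nullable `Integer level` parameter, and the use of zlib-wrapped DEFLATE rather than GZIP framing are all assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical illustration only; this is not Pinot's GzipCompressor. */
final class LevelAwareDeflateSketch {
  private LevelAwareDeflateSketch() {
  }

  /** Compresses with the given level, or with the JDK default when level is null. */
  static byte[] compress(byte[] input, Integer level) {
    // A null level keeps the pre-existing behavior: the library default.
    Deflater deflater = new Deflater(level == null ? Deflater.DEFAULT_COMPRESSION : level);
    try {
      deflater.setInput(input);
      deflater.finish();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[4096];
      while (!deflater.finished()) {
        out.write(buffer, 0, deflater.deflate(buffer));
      }
      return out.toByteArray();
    } finally {
      deflater.end();
    }
  }

  /** Decompression does not need the level; it is not stored by this sketch. */
  static byte[] decompress(byte[] compressed) {
    Inflater inflater = new Inflater();
    try {
      inflater.setInput(compressed);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[4096];
      while (!inflater.finished()) {
        int n = inflater.inflate(buffer);
        if (n == 0 && inflater.needsInput()) {
          throw new IllegalStateException("Truncated input");
        }
        out.write(buffer, 0, n);
      }
      return out.toByteArray();
    } catch (DataFormatException e) {
      throw new IllegalStateException(e);
    } finally {
      inflater.end();
    }
  }
}
```

The point of the sketch is that the level only affects the write side, so data written at any level (or at the default) is read back by the same decompression path.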
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #18187 +/- ##
============================================
- Coverage 63.29% 55.28% -8.01%
+ Complexity 1627 838 -789
============================================
Files 3226 2536 -690
Lines 196636 145722 -50914
Branches 30401 23437 -6964
============================================
- Hits 124466 80567 -43899
+ Misses 62192 58219 -3973
+ Partials 9978 6936 -3042
…decSpec

Replace three overlapping codec representations with a single unified CompressionCodec class:
- Delete ChunkCompressionType enum (pinot-segment-spi)
- Delete FieldConfig.CompressionCodec nested enum (pinot-spi)
- Delete CodecSpec and CodecSpecParser (merged into CompressionCodec)
- Create top-level CompressionCodec class with static constants, string-based parsing, wire format IDs for segment headers, and optional compression level support

The new CompressionCodec is an immutable name+params object (not an enum) that supports parameterized syntax like ZSTD(3) while preserving backward compatibility with existing segment files and JSON configs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix 6 checkstyle line-length violations in test files (wrap long CompressionCodec array initializers)
- Fix spotless import ordering in OfflineClusterIntegrationTest
- Make CompressionCodec.register() public for plugin extensions
- Add CompressionCodec.registerAlias() for custom alias registration
- Use ConcurrentHashMap for thread-safe runtime registration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
xiangfu0 commented Apr 14, 2026
    /**
     * Can be overridden to force the compression codec.
     */
    private static final CompressionCodec[] RAW_INDEX_CODECS = {
Move this to the CompressionCodec class.
Delete the single-value DictIdCompressionType enum and use CompressionCodec.MV_ENTRY_DICT directly. Replace the nullable getDictIdCompressionType() getter with a boolean isDictIdCompression() method on ForwardIndexConfig and ForwardIndexReader. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move the raw-index-applicable codec array from BaseStarTreeV2Test into CompressionCodec.RAW_CODECS so it is reusable across the codebase. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
Replace three overlapping compression codec types with a single unified `CompressionCodec` class, and add support for parameterized compression codecs like `ZSTD(3)`.

Before: Three separate types for the same concept:
- `FieldConfig.CompressionCodec` enum (config layer)
- `ChunkCompressionType` enum (segment layer)

After: One `CompressionCodec` class used everywhere (config, SPI, compressors, writers, readers), with built-in support for optional parameters like compression level.

User Manual
Compression codec syntax
The `compressionCodec` field in table config now accepts an optional integer parameter in parentheses:

- `ZSTANDARD`
- `ZSTANDARD(9)`
- `ZSTD`
- `ZSTD(3)`
- `GZIP`
- `GZIP(1)`
- `GZIP(9)`
- `LZ4`
- `SNAPPY`
- `PASS_THROUGH`

Parsing is case-insensitive: `zstd(3)`, `Zstd(3)`, and `ZSTD(3)` are equivalent and all normalize to `ZSTANDARD(3)`.

Supported level ranges

- `ZSTANDARD` / `ZSTD`
- `GZIP`
- `LZ4`
- `SNAPPY`
- `PASS_THROUGH`

Config examples
Legacy format (unchanged, still works):

    {
      "fieldConfigList": [
        {
          "name": "message",
          "encodingType": "RAW",
          "compressionCodec": "ZSTANDARD"
        }
      ]
    }

Parameterized codec (new):

    {
      "fieldConfigList": [
        {
          "name": "message",
          "encodingType": "RAW",
          "compressionCodec": "ZSTD(9)"
        },
        {
          "name": "payload",
          "encodingType": "RAW",
          "compressionCodec": "GZIP(1)"
        }
      ]
    }

Forward index config under `indexes` (also supports parameterized syntax):

    {
      "fieldConfigList": [
        {
          "name": "message",
          "indexes": {
            "forward": {
              "compressionCodec": "ZSTANDARD(9)",
              "deriveNumDocsPerChunk": true
            }
          }
        }
      ]
    }

Malformed values (rejected with clear error)

- `ZSTD()`
- `ZSTD(a)`
- `ZSTD(1,2)`

Plugin extension

Plugins can register custom codecs at runtime:
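As a rough, hypothetical illustration of how the syntax rules and runtime registration described in this section could fit together: `CodecSketch` and the signatures of its `register`, `registerAlias`, and `parse` methods are assumptions for this sketch, not the PR's actual `CompressionCodec` API. Level-range validation is deliberately left out, and only non-negative integer levels are accepted here.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a string-based codec model; all names are assumptions. */
final class CodecSketch {
  // NAME or NAME(<digits>); "ZSTD()", "ZSTD(a)", and "ZSTD(1,2)" do not match.
  private static final Pattern SYNTAX = Pattern.compile("([A-Z0-9_]+)(?:\\((\\d+)\\))?");

  // Concurrent collections so plugins can register from any thread at runtime.
  private static final Set<String> CODECS = ConcurrentHashMap.newKeySet();
  private static final Map<String, String> ALIASES = new ConcurrentHashMap<>();

  static {
    for (String builtIn : new String[] {"ZSTANDARD", "GZIP", "LZ4", "SNAPPY", "PASS_THROUGH"}) {
      register(builtIn);
    }
    registerAlias("ZSTD", "ZSTANDARD");
  }

  final String name;
  final Integer level; // null when no parameter was given

  private CodecSketch(String name, Integer level) {
    this.name = name;
    this.level = level;
  }

  /** Registers a canonical codec name (stored upper-case). */
  static void register(String name) {
    CODECS.add(name.toUpperCase(Locale.ROOT));
  }

  /** Registers an alternative spelling that resolves to a canonical name. */
  static void registerAlias(String alias, String canonical) {
    ALIASES.put(alias.toUpperCase(Locale.ROOT), canonical.toUpperCase(Locale.ROOT));
  }

  /** Parses case-insensitively, resolves aliases, and rejects malformed or unknown input. */
  static CodecSketch parse(String text) {
    Matcher m = SYNTAX.matcher(text.trim().toUpperCase(Locale.ROOT));
    if (!m.matches()) {
      throw new IllegalArgumentException("Malformed compression codec: " + text);
    }
    String name = ALIASES.getOrDefault(m.group(1), m.group(1));
    if (!CODECS.contains(name)) {
      throw new IllegalArgumentException("Unknown compression codec: " + text);
    }
    Integer level = m.group(2) == null ? null : Integer.valueOf(m.group(2));
    return new CodecSketch(name, level);
  }

  /** Canonical form used when serializing back to a config string. */
  @Override
  public String toString() {
    return level == null ? name : name + "(" + level + ")";
  }
}
```

With this sketch, a plugin would call `CodecSketch.register("MY_CODEC")` once at startup, after which `CodecSketch.parse("my_codec(2)")` succeeds; `CodecSketch.parse("zstd(3)").toString()` yields `ZSTANDARD(3)`, mirroring the alias normalization described above.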
Backward Compatibility
Existing table configs
All existing table configs continue to work without any changes:

- `"compressionCodec": "ZSTANDARD"`
- `"compressionCodec": "SNAPPY"`
- `"compressionCodec": "LZ4"`
- `"compressionCodec": "GZIP"`
- `"compressionCodec": "PASS_THROUGH"`
- `"compressionCodec": "MV_ENTRY_DICT"`
- `"compressionCodec": "CLP"`
- `"compressionCodec": "DELTA"`
- `"chunkCompressionType": "SNAPPY"`

Existing segments

JSON round-trip

- `"ZSTANDARD"` serializes back as `"ZSTANDARD"` (unchanged)
- `"ZSTD(3)"` serializes back as `"ZSTANDARD(3)"` (alias resolved, level preserved)
- `"ZSTD"` serializes back as `"ZSTANDARD"` (alias resolved)

API changes (SPI-breaking)
This PR removes three enum types and replaces them with a single class:

| Before | After |
| --- | --- |
| `FieldConfig.CompressionCodec` enum | `CompressionCodec` class (top-level) |
| `ChunkCompressionType` enum | `CompressionCodec` class (uses `toWireId()` / `fromWireId()` for headers) |
| `DictIdCompressionType` enum | `CompressionCodec.MV_ENTRY_DICT` constant + `isDictIdCompression()` |
| `ChunkCompressor.compressionType()` | `ChunkCompressor.compressionCodec()` |
| `ForwardIndexConfig.getChunkCompressionType()` | `ForwardIndexConfig.getCompressionCodec()` |
| `ForwardIndexReader.getDictIdCompressionType()` | `ForwardIndexReader.isDictIdCompression()` |

Plugin authors implementing `ChunkCompressor` or `ForwardIndexReader` will need to update their implementations.

Validation
🤖 Generated with Claude Code