Unify compression codec representation with parameterized codec support #18187

Open
xiangfu0 wants to merge 5 commits into apache:master from xiangfu0:claude/musing-bell

@xiangfu0 xiangfu0 commented Apr 13, 2026

Summary

Replace three overlapping compression codec types with a single unified CompressionCodec class, and add support for parameterized compression codecs like ZSTD(3).

Before: Three separate types for the same concept, with manual mapping between them and no support for codec parameters:

  • FieldConfig.CompressionCodec enum (config layer)
  • ChunkCompressionType enum (segment layer)
  • DictIdCompressionType enum (single-value enum, now covered by CompressionCodec.MV_ENTRY_DICT)

After: One CompressionCodec class used everywhere — config, SPI, compressors, writers, readers — with built-in support for optional parameters like compression level.

User Manual

Compression codec syntax

The compressionCodec field in table config now accepts an optional integer parameter in parentheses:

| Syntax | Meaning |
|---|---|
| ZSTANDARD | Zstandard with default level (3) |
| ZSTANDARD(9) | Zstandard with explicit level 9 |
| ZSTD | Alias for ZSTANDARD (default level) |
| ZSTD(3) | Alias for ZSTANDARD with level 3 |
| GZIP | GZIP with default level (6) |
| GZIP(1) | GZIP with level 1 (fastest) |
| GZIP(9) | GZIP with level 9 (best compression) |
| LZ4 | LZ4 fast compressor (default) |
| SNAPPY | Snappy (no level parameter) |
| PASS_THROUGH | No compression |

Parsing is case-insensitive: zstd(3), Zstd(3), and ZSTD(3) are equivalent and all normalize to ZSTANDARD(3).
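
To make the normalization rule concrete, here is a minimal sketch of how a `NAME` or `NAME(level)` string could be canonicalized. `CodecSyntaxSketch`, its regex, and its alias table are hypothetical and only mirror the syntax table above; this is not the `CompressionCodec` implementation from this PR.

```java
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the CompressionCodec class added by this PR. */
final class CodecSyntaxSketch {
  // NAME or NAME(<integer>); the input is upper-cased before matching.
  private static final Pattern SPEC = Pattern.compile("([A-Z0-9_]+)(?:\\((-?\\d+)\\))?");
  // Alias table assumed from the syntax table in the description.
  private static final Map<String, String> ALIASES = Map.of("ZSTD", "ZSTANDARD");

  private CodecSyntaxSketch() {
  }

  /** Returns the canonical spelling, resolving aliases and keeping an explicit level. */
  static String normalize(String spec) {
    String upper = spec.trim().toUpperCase(Locale.ROOT);
    Matcher matcher = SPEC.matcher(upper);
    if (!matcher.matches()) {
      throw new IllegalArgumentException("Invalid compression codec: " + spec);
    }
    String name = ALIASES.getOrDefault(matcher.group(1), matcher.group(1));
    String level = matcher.group(2);
    return level == null ? name : name + "(" + Integer.parseInt(level) + ")";
  }

  public static void main(String[] args) {
    System.out.println(normalize("zstd(3)"));  // ZSTANDARD(3)
    System.out.println(normalize("Zstd(3)"));  // ZSTANDARD(3)
    System.out.println(normalize("ZSTD"));     // ZSTANDARD
    System.out.println(normalize("GZIP(1)"));  // GZIP(1)
  }
}
```

Upper-casing with `Locale.ROOT` keeps the result independent of the JVM's default locale.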

Supported level ranges

| Codec | Level range | Notes |
|---|---|---|
| ZSTANDARD / ZSTD | 1–22 | Default is 3 (library default) |
| GZIP | 0–9 | Default is 6 (JDK Deflater default) |
| LZ4 | N/A | No level parameter; uses fast compressor |
| SNAPPY | N/A | No level parameter |
| PASS_THROUGH | N/A | No compression |
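
The ranges above could be enforced along the following lines. This is a sketch under assumptions: `LevelRangeSketch` is a hypothetical class, and rejecting a level on codecs that take none (for example `SNAPPY(3)`) is a guess at reasonable behavior, not something the description states.

```java
import java.util.Map;

/** Hypothetical validator; ranges and defaults are copied from the table above. */
final class LevelRangeSketch {
  // Each entry is {min, max, default}.
  private static final Map<String, int[]> RANGES = Map.of(
      "ZSTANDARD", new int[] {1, 22, 3},
      "GZIP", new int[] {0, 9, 6});

  private LevelRangeSketch() {
  }

  /** Returns the effective level for a canonical codec name, or null if the codec has no level. */
  static Integer resolveLevel(String codec, Integer requested) {
    int[] range = RANGES.get(codec);
    if (range == null) {
      // Assumption: codecs without a level parameter reject an explicit level.
      if (requested != null) {
        throw new IllegalArgumentException(codec + " does not take a level parameter");
      }
      return null;
    }
    if (requested == null) {
      return range[2];
    }
    if (requested < range[0] || requested > range[1]) {
      throw new IllegalArgumentException(
          codec + " level " + requested + " is outside " + range[0] + ".." + range[1]);
    }
    return requested;
  }

  public static void main(String[] args) {
    System.out.println(resolveLevel("ZSTANDARD", null)); // default level
    System.out.println(resolveLevel("GZIP", 9));         // explicit level
    System.out.println(resolveLevel("SNAPPY", null));    // no level
    try {
      resolveLevel("ZSTANDARD", 23);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```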

Config examples

Legacy format (unchanged, still works):

{
  "fieldConfigList": [
    {
      "name": "message",
      "encodingType": "RAW",
      "compressionCodec": "ZSTANDARD"
    }
  ]
}

Parameterized codec (new):

{
  "fieldConfigList": [
    {
      "name": "message",
      "encodingType": "RAW",
      "compressionCodec": "ZSTD(9)"
    },
    {
      "name": "payload",
      "encodingType": "RAW",
      "compressionCodec": "GZIP(1)"
    }
  ]
}

Forward index config under indexes (also supports parameterized syntax):

{
  "fieldConfigList": [
    {
      "name": "message",
      "indexes": {
        "forward": {
          "compressionCodec": "ZSTANDARD(9)",
          "deriveNumDocsPerChunk": true
        }
      }
    }
  ]
}

Malformed values (rejected with clear error)

| Input | Error |
|---|---|
| ZSTD() | Empty parentheses |
| ZSTD(a) | Non-integer level |
| ZSTD(1,2) | Multiple arguments not supported |
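
One way to produce a distinct error for each malformed case is to inspect the text between the parentheses before converting it. The sketch below is illustrative only: `MalformedCodecSketch` is a hypothetical class, and the message wording is borrowed from the table, not from the PR's code.

```java
/** Hypothetical level extractor that reports each malformed case separately. */
final class MalformedCodecSketch {
  private MalformedCodecSketch() {
  }

  /** Returns the level inside the parentheses, or null when the spec has no parameter. */
  static Integer parseLevel(String spec) {
    int open = spec.indexOf('(');
    if (open < 0) {
      return null;
    }
    if (!spec.endsWith(")")) {
      throw new IllegalArgumentException("Missing closing parenthesis: " + spec);
    }
    String arg = spec.substring(open + 1, spec.length() - 1).trim();
    if (arg.isEmpty()) {
      throw new IllegalArgumentException("Empty parentheses: " + spec);
    }
    if (arg.contains(",")) {
      throw new IllegalArgumentException("Multiple arguments not supported: " + spec);
    }
    try {
      return Integer.valueOf(arg);
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("Non-integer level: " + spec, e);
    }
  }

  public static void main(String[] args) {
    for (String spec : new String[] {"ZSTD(9)", "ZSTD", "ZSTD()", "ZSTD(a)", "ZSTD(1,2)"}) {
      try {
        System.out.println(spec + " -> " + parseLevel(spec));
      } catch (IllegalArgumentException e) {
        System.out.println(spec + " -> " + e.getMessage());
      }
    }
  }
}
```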

Plugin extension

Plugins can register custom codecs at runtime:

// Register a custom codec
CompressionCodec MY_CODEC = CompressionCodec.register("MY_CODEC", true, false);

// Register an alias
CompressionCodec.registerAlias("MYCODEC", "MY_CODEC");
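
For readers curious how runtime registration with aliases can stay thread-safe, here is a minimal sketch built on `ConcurrentHashMap`, which a later commit in this PR says backs the registry. Everything else is an assumption: `CodecRegistrySketch` is a hypothetical class, failing on duplicate names is a guess, and the two boolean arguments of the real `register(...)` are left out because the description does not say what they mean.

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical registry; not the CompressionCodec registration code from this PR. */
final class CodecRegistrySketch {
  // Maps every accepted spelling (canonical names and aliases) to its canonical name.
  private final ConcurrentMap<String, String> _canonicalByName = new ConcurrentHashMap<>();

  /** Registers a canonical codec name; assumed to fail if the name is already taken. */
  String register(String name) {
    String key = key(name);
    if (_canonicalByName.putIfAbsent(key, key) != null) {
      throw new IllegalStateException("Name already registered: " + key);
    }
    return key;
  }

  /** Registers an alias for an already registered codec. */
  void registerAlias(String alias, String target) {
    String canonical = resolve(target);
    String previous = _canonicalByName.putIfAbsent(key(alias), canonical);
    if (previous != null && !previous.equals(canonical)) {
      throw new IllegalStateException("Alias already points to " + previous + ": " + alias);
    }
  }

  /** Resolves a name or alias, case-insensitively, to the canonical codec name. */
  String resolve(String name) {
    String canonical = _canonicalByName.get(key(name));
    if (canonical == null) {
      throw new IllegalArgumentException("Unknown compression codec: " + name);
    }
    return canonical;
  }

  private static String key(String name) {
    return name.trim().toUpperCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    CodecRegistrySketch registry = new CodecRegistrySketch();
    registry.register("MY_CODEC");
    registry.registerAlias("MYCODEC", "MY_CODEC");
    System.out.println(registry.resolve("mycodec")); // MY_CODEC
  }
}
```

`putIfAbsent` makes each registration atomic, so two plugins racing to claim the same name cannot both succeed.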

Backward Compatibility

Existing table configs

All existing table configs continue to work without any changes:

| Existing config value | Status |
|---|---|
| "compressionCodec": "ZSTANDARD" | ✅ Works unchanged |
| "compressionCodec": "SNAPPY" | ✅ Works unchanged |
| "compressionCodec": "LZ4" | ✅ Works unchanged |
| "compressionCodec": "GZIP" | ✅ Works unchanged |
| "compressionCodec": "PASS_THROUGH" | ✅ Works unchanged |
| "compressionCodec": "MV_ENTRY_DICT" | ✅ Works unchanged |
| "compressionCodec": "CLP" | ✅ Works unchanged |
| "compressionCodec": "DELTA" | ✅ Works unchanged |
| Forward index config with "chunkCompressionType": "SNAPPY" | ✅ Deprecated field still accepted |

Existing segments

  • No segment format changes. Wire format integer IDs (0–7) in forward index file headers are preserved exactly.
  • Existing segments remain readable without migration.
  • Compression level is a write-time-only concern; decompression does not require the level.
  • Mixed segments (some written with default level, some with explicit level of the same codec) are fully compatible.
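
The write-time-only property can be seen with the JDK's own `Deflater` and `Inflater`, used here as a stand-in for Pinot's compressors (this is not the PR's `GzipCompressor`, and `WriteTimeLevelSketch` is a hypothetical class). The same decompression routine, which takes no level, reads data written at level 1 and at level 9.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Illustration with java.util.zip only; not Pinot's compressor classes. */
final class WriteTimeLevelSketch {
  private WriteTimeLevelSketch() {
  }

  /** Compresses with an explicit level (0-9), the only place the level is used. */
  static byte[] compress(byte[] input, int level) {
    Deflater deflater = new Deflater(level);
    try {
      deflater.setInput(input);
      deflater.finish();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[4096];
      while (!deflater.finished()) {
        out.write(buffer, 0, deflater.deflate(buffer));
      }
      return out.toByteArray();
    } finally {
      deflater.end();
    }
  }

  /** Decompresses without knowing which level produced the bytes. */
  static byte[] decompress(byte[] compressed) {
    Inflater inflater = new Inflater();
    try {
      inflater.setInput(compressed);
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buffer = new byte[4096];
      while (!inflater.finished()) {
        int n = inflater.inflate(buffer);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          throw new IllegalStateException("Truncated or unsupported stream");
        }
        out.write(buffer, 0, n);
      }
      return out.toByteArray();
    } catch (DataFormatException e) {
      throw new IllegalStateException(e);
    } finally {
      inflater.end();
    }
  }

  public static void main(String[] args) {
    byte[] data = "pinot forward index chunk ".repeat(200).getBytes(StandardCharsets.UTF_8);
    System.out.println(Arrays.equals(data, decompress(compress(data, 1))));
    System.out.println(Arrays.equals(data, decompress(compress(data, 9))));
  }
}
```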

JSON round-trip

  • "ZSTANDARD" serializes back as "ZSTANDARD" (unchanged)
  • "ZSTD(3)" serializes back as "ZSTANDARD(3)" (alias resolved, level preserved)
  • "ZSTD" serializes back as "ZSTANDARD" (alias resolved)

API changes (SPI-breaking)

This PR removes three enum types and replaces them with a single class:

| Removed | Replacement |
|---|---|
| FieldConfig.CompressionCodec enum | CompressionCodec class (top-level) |
| ChunkCompressionType enum | CompressionCodec class (uses toWireId()/fromWireId() for headers) |
| DictIdCompressionType enum | CompressionCodec.MV_ENTRY_DICT constant + isDictIdCompression() |
| ChunkCompressor.compressionType() | ChunkCompressor.compressionCodec() |
| ForwardIndexConfig.getChunkCompressionType() | ForwardIndexConfig.getCompressionCodec() |
| ForwardIndexReader.getDictIdCompressionType() | ForwardIndexReader.isDictIdCompression() |

Plugin authors implementing ChunkCompressor or ForwardIndexReader will need to update their implementations.
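
As a rough mental model for the `toWireId()`/`fromWireId()` pair mentioned in the table, the sketch below maps a codec name to a header ID and back while ignoring the level, which would explain why segments written with different levels stay mutually readable. `WireIdSketch` is hypothetical, its method signatures are not the PR's, and the ID assignments are illustrative placeholders, not Pinot's actual values.

```java
import java.util.List;

/** Hypothetical mapping between codec names and header IDs; the IDs are illustrative only. */
final class WireIdSketch {
  // Position in the list is the wire ID. Placeholder ordering, not Pinot's real assignment.
  private static final List<String> NAMES_BY_ID = List.of("PASS_THROUGH", "SNAPPY", "ZSTANDARD", "LZ4");

  private WireIdSketch() {
  }

  /** Returns the header ID for a codec; the level is deliberately not part of the ID. */
  static int toWireId(String name, Integer level) {
    int id = NAMES_BY_ID.indexOf(name);
    if (id < 0) {
      throw new IllegalArgumentException("Codec has no wire ID: " + name);
    }
    return id;
  }

  /** Returns the codec name for a header ID read from a segment file. */
  static String fromWireId(int id) {
    if (id < 0 || id >= NAMES_BY_ID.size()) {
      throw new IllegalArgumentException("Unknown wire ID: " + id);
    }
    return NAMES_BY_ID.get(id);
  }

  public static void main(String[] args) {
    // The same ID is produced with and without an explicit level.
    System.out.println(toWireId("ZSTANDARD", null) == toWireId("ZSTANDARD", 9));
    System.out.println(fromWireId(toWireId("ZSTANDARD", 9)));
  }
}
```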

Validation

# Full codebase compilation
./mvnw compile -DskipTests -T4                    # BUILD SUCCESS

# Core tests
./mvnw test -pl pinot-segment-local -Dtest=TestCompression           # 19/19 pass
./mvnw test -pl pinot-segment-spi -Dtest=ForwardIndexConfigTest      # 6/6 pass
./mvnw test -pl pinot-common -Dtest=TableConfigSerDeUtilsTest        # 1/1 pass (backward compat)

# Pre-commit checks
./mvnw spotless:check checkstyle:check license:check -pl pinot-spi,pinot-segment-spi,pinot-segment-local

🤖 Generated with Claude Code

Introduce internal parsing of compressionCodec strings so Pinot can
interpret codec names with optional parameters like ZSTD(3), while
keeping the external config field unchanged as a plain string.

- Add CodecSpec model and CodecSpecParser in pinot-spi for structured
  codec name + parameter map representation
- Thread optional compression level through the forward index write
  path: FieldConfig -> ForwardIndexConfig -> creators -> writers ->
  ChunkCompressorFactory -> compressor instances
- Add level-aware constructors to ZstandardCompressor, GzipCompressor,
  LZ4Compressor, and LZ4WithLengthCompressor
- Preserve all existing behavior when no parameter is provided
- No enum-based codec modeling; string-based design supports future
  plugin codecs and codec pipelines

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov-commenter

codecov-commenter commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 47.79412% with 142 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.28%. Comparing base (31eac83) to head (c33780e).
⚠️ Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...pache/pinot/spi/config/table/CompressionCodec.java | 51.48% | 41 Missing and 8 partials ⚠️ |
| ...t/local/io/compression/ChunkCompressorFactory.java | 20.00% | 11 Missing and 5 partials ⚠️ |
| ...he/pinot/segment/spi/index/ForwardIndexConfig.java | 21.42% | 9 Missing and 2 partials ⚠️ |
| ...ent/creator/impl/fwd/CLPForwardIndexCreatorV2.java | 0.00% | 7 Missing ⚠️ |
| ...ment/index/forward/ForwardIndexCreatorFactory.java | 56.25% | 2 Missing and 5 partials ⚠️ |
| ...ment/local/io/compression/ZstandardCompressor.java | 50.00% | 4 Missing and 1 partial ⚠️ |
| .../local/io/compression/LZ4WithLengthCompressor.java | 20.00% | 4 Missing ⚠️ |
| ...ocal/segment/index/loader/ForwardIndexHandler.java | 63.63% | 1 Missing and 3 partials ⚠️ |
| ...he/pinot/segment/local/utils/TableConfigUtils.java | 20.00% | 3 Missing and 1 partial ⚠️ |
| ...t/segment/local/io/compression/GzipCompressor.java | 25.00% | 3 Missing ⚠️ |

... and 20 more

❗ There is a different number of reports uploaded between BASE (31eac83) and HEAD (c33780e).

HEAD has 7 uploads less than BASE

| Flag | BASE (31eac83) | HEAD (c33780e) |
|---|---|---|
| unittests | 4 | 2 |
| temurin | 10 | 9 |
| java-11 | 5 | 4 |
| unittests2 | 2 | 0 |
| custom-integration1 | 2 | 1 |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18187      +/-   ##
============================================
- Coverage     63.29%   55.28%   -8.01%     
+ Complexity     1627      838     -789     
============================================
  Files          3226     2536     -690     
  Lines        196636   145722   -50914     
  Branches      30401    23437    -6964     
============================================
- Hits         124466    80567   -43899     
+ Misses        62192    58219    -3973     
+ Partials       9978     6936    -3042     

| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 55.26% <47.79%> (-8.01%) ⬇️ |
| java-21 | 55.23% <47.79%> (-8.03%) ⬇️ |
| temurin | 55.28% <47.79%> (-8.01%) ⬇️ |
| unittests | 55.28% <47.79%> (-8.01%) ⬇️ |
| unittests1 | 55.28% <47.79%> (+0.02%) ⬆️ |
| unittests2 | ? |


xiangfu0 and others added 2 commits April 14, 2026 02:24
…decSpec

Replace three overlapping codec representations with a single unified
CompressionCodec class:

- Delete ChunkCompressionType enum (pinot-segment-spi)
- Delete FieldConfig.CompressionCodec nested enum (pinot-spi)
- Delete CodecSpec and CodecSpecParser (merged into CompressionCodec)
- Create top-level CompressionCodec class with static constants,
  string-based parsing, wire format IDs for segment headers, and
  optional compression level support

The new CompressionCodec is an immutable name+params object (not an
enum) that supports parameterized syntax like ZSTD(3) while preserving
backward compatibility with existing segment files and JSON configs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix 6 checkstyle line-length violations in test files (wrap long
  CompressionCodec array initializers)
- Fix spotless import ordering in OfflineClusterIntegrationTest
- Make CompressionCodec.register() public for plugin extensions
- Add CompressionCodec.registerAlias() for custom alias registration
- Use ConcurrentHashMap for thread-safe runtime registration

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/**
* Can be overridden to force the compression codec.
*/
private static final CompressionCodec[] RAW_INDEX_CODECS = {

Move this to the CompressionCodec class.

xiangfu0 and others added 2 commits April 14, 2026 04:26
Delete the single-value DictIdCompressionType enum and use
CompressionCodec.MV_ENTRY_DICT directly. Replace the nullable
getDictIdCompressionType() getter with a boolean isDictIdCompression()
method on ForwardIndexConfig and ForwardIndexReader.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move the raw-index-applicable codec array from BaseStarTreeV2Test
into CompressionCodec.RAW_CODECS so it is reusable across the
codebase.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@xiangfu0 changed the title from "Add parameterized compression codec support for forward indexes" to "Unify compression codec representation with parameterized codec support" on Apr 14, 2026

2 participants