Skip to content

Add parameterized forward index compression codecs#18097

Open
xiangfu0 wants to merge 1 commit intoapache:masterfrom
xiangfu0:codex/parameterized-compression-codec
Open

Add parameterized forward index compression codecs#18097
xiangfu0 wants to merge 1 commit intoapache:masterfrom
xiangfu0:codex/parameterized-compression-codec

Conversation

@xiangfu0
Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 commented Apr 3, 2026

Summary

  • add a structured CompressionCodecSpec parser/model for compressionCodec, including CODEC and CODEC(level) syntax plus ZSTD -> ZSTANDARD canonicalization
  • thread optional compression levels through raw/forward index writer creation for codecs that support them, while preserving legacy behavior for plain codec names
  • persist the explicit canonical forward-index compression spec in segment metadata so level-only config changes can trigger rewrite detection during reload
  • add parser, config serde, compression factory, and forward-index handler tests covering aliases, validation errors, and level-aware rewrites

User manual

This change keeps the external config field name as compressionCodec and extends its accepted values.

Accepted syntax:

  • CODEC
  • CODEC(level)

Parsing rules:

  • input is case-insensitive
  • ZSTD is accepted anywhere ZSTANDARD is accepted
  • canonical internal form is uppercase, with ZSTD normalized to ZSTANDARD
  • malformed values are rejected during Pinot config validation instead of being passed downstream

Examples of accepted values:

  • ZSTANDARD
  • ZSTD
  • ZSTANDARD(3)
  • ZSTD(3)
  • GZIP(6)
  • LZ4(12)

Supported level ranges:

  • ZSTANDARD / ZSTD: 1..22
  • GZIP: 0..9
  • LZ4: 1..17

Codecs that do not expose a meaningful level in Pinot continue to reject parameterized syntax. For example:

  • SNAPPY(3)
  • PASS_THROUGH(1)
  • MV_ENTRY_DICT(2)

Malformed values are also rejected. For example:

  • ZSTD()
  • ZSTD(foo)
  • UNKNOWN(3)

Sample configs

Legacy raw-index field config remains valid, now with parameterized syntax:

{
  "fieldConfigList": [
    {
      "name": "message",
      "encodingType": "RAW",
      "compressionCodec": "ZSTD(3)"
    },
    {
      "name": "payload",
      "encodingType": "RAW",
      "compressionCodec": "GZIP(6)"
    }
  ]
}

The newer forward-index config under indexes.forward also accepts the same compressionCodec syntax:

{
  "fieldConfigList": [
    {
      "name": "message",
      "indexes": {
        "forward": {
          "compressionCodec": "ZSTANDARD(3)",
          "deriveNumDocsPerChunk": true,
          "rawIndexWriterVersion": 10
        }
      }
    },
    {
      "name": "blob",
      "indexes": {
        "forward": {
          "compressionCodec": "LZ4(12)"
        }
      }
    }
  ]
}

Alias handling is equivalent, so these are treated the same:

{ "compressionCodec": "ZSTD" }
{ "compressionCodec": "ZSTANDARD" }
{ "compressionCodec": "ZSTD(3)" }
{ "compressionCodec": "ZSTANDARD(3)" }

Backward compatibility

This change is intentionally conservative.

Config compatibility:

  • existing configs using plain codec names continue to work unchanged
  • the config field name stays compressionCodec
  • existing enum-like values such as SNAPPY, PASS_THROUGH, ZSTANDARD, and LZ4 continue to deserialize and behave as before
  • ZSTD is newly accepted as an alias and normalized internally to ZSTANDARD

Write-path compatibility:

  • plain codec strings preserve existing behavior
  • plain LZ4 still uses Pinot's existing fast-compressor path
  • only explicit LZ4(level) switches to HC compression with the requested level
  • if no level is provided, Pinot keeps the prior library/default write behavior instead of introducing a new Pinot-defined default level

Read-path and segment compatibility:

  • no forward-index header/read-path format changes are introduced
  • readers continue to rely on codec identity only; compression level is still a write-time concern
  • existing segments remain readable unchanged
  • mixed segments written with different levels of the same codec are query-safe because decode does not require the level

Reload/rewrite behavior:

  • the canonical explicit spec string is persisted in segment metadata for forward indexes
  • this is used to detect level-only config changes during segment reload/rebuild
  • metadata persistence is for rewrite detection and observability, not for decode correctness

Validation

  • ./mvnw -pl pinot-segment-spi,pinot-segment-local -am -Dcheckstyle.skip=true -Dtest=ForwardIndexConfigTest,ForwardIndexTypeTest,IndexLoadingConfigTest,TestCompression,ForwardIndexHandlerTest -Dsurefire.failIfNoSpecifiedTests=false test
  • ./mvnw -pl pinot-spi,pinot-common -am -Dcheckstyle.skip=true -Dtest=CompressionCodecSpecTest,TableConfigSerDeUtilsTest -Dsurefire.failIfNoSpecifiedTests=false test
  • ./mvnw -pl pinot-spi,pinot-segment-spi,pinot-common,pinot-segment-local spotless:apply -DskipTests
  • ./mvnw -pl pinot-spi,pinot-segment-spi,pinot-common,pinot-segment-local checkstyle:check -DskipTests
  • ./mvnw -pl pinot-spi,pinot-segment-spi,pinot-common,pinot-segment-local license:format -DskipTests
  • ./mvnw -pl pinot-spi,pinot-segment-spi,pinot-common,pinot-segment-local license:check -DskipTests

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 3, 2026

Codecov Report

❌ Patch coverage is 82.98611% with 49 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.33%. Comparing base (31eac83) to head (3ad95c2).

Files with missing lines Patch % Lines
...he/pinot/segment/spi/index/ForwardIndexConfig.java 56.81% 13 Missing and 6 partials ⚠️
...t/spi/config/table/CompressionCodecSpecParser.java 83.33% 3 Missing and 3 partials ⚠️
...t/local/io/compression/ChunkCompressorFactory.java 76.47% 2 Missing and 2 partials ⚠️
...e/pinot/spi/config/table/CompressionCodecSpec.java 80.95% 2 Missing and 2 partials ⚠️
...ocal/segment/index/loader/ForwardIndexHandler.java 92.68% 1 Missing and 2 partials ⚠️
.../pinot/segment/spi/utils/SegmentMetadataUtils.java 0.00% 3 Missing ⚠️
...ot/segment/local/io/compression/LZ4Compressor.java 66.66% 2 Missing ⚠️
...ment/index/forward/ForwardIndexCreatorFactory.java 83.33% 2 Missing ⚠️
...org/apache/pinot/spi/config/table/FieldConfig.java 87.50% 1 Missing and 1 partial ⚠️
...local/segment/creator/impl/BaseSegmentCreator.java 93.33% 0 Missing and 1 partial ⚠️
... and 3 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18097      +/-   ##
============================================
+ Coverage     63.29%   63.33%   +0.03%     
- Complexity     1627     1646      +19     
============================================
  Files          3226     3229       +3     
  Lines        196636   196842     +206     
  Branches      30401    30442      +41     
============================================
+ Hits         124466   124671     +205     
+ Misses        62192    62171      -21     
- Partials       9978    10000      +22     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.30% <82.98%> (+0.03%) ⬆️
java-21 63.30% <82.98%> (+0.04%) ⬆️
temurin 63.33% <82.98%> (+0.03%) ⬆️
unittests 63.33% <82.98%> (+0.03%) ⬆️
unittests1 55.29% <67.01%> (+0.03%) ⬆️
unittests2 34.98% <62.84%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@xiangfu0 xiangfu0 force-pushed the codex/parameterized-compression-codec branch from 39e9169 to 938028b Compare April 3, 2026 13:33
@xiangfu0 xiangfu0 requested review from Jackie-Jiang and Copilot April 3, 2026 20:59
@xiangfu0 xiangfu0 added configuration Config changes (addition/deletion/change in behavior) feature New functionality labels Apr 3, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a structured, canonicalized compressionCodec specification (including optional compression level) and threads it through forward-index creation and segment metadata so level-only changes can trigger rewrites on reload.

Changes:

  • Introduce CompressionCodecSpec + parser/capabilities to support CODEC and CODEC(level) syntax with alias canonicalization (e.g., ZSTDZSTANDARD).
  • Plumb optional compression levels into raw/forward-index writers and chunk compressor construction while preserving legacy behavior for non-parameterized codecs.
  • Persist explicit forward-index compression spec in segment metadata and enhance reload logic to detect level-only changes; add/extend unit and integration tests.

Reviewed changes

Copilot reviewed 38 out of 38 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pinot-spi/src/test/java/org/apache/pinot/spi/config/table/CompressionCodecSpecTest.java Adds parser/serde tests for canonicalization, level ranges, and invalid specs.
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/FieldConfig.java Switches compressionCodec to CompressionCodecSpec while retaining enum-based compatibility paths.
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/CompressionCodecSpecParser.java Implements parsing/validation for CODEC and CODEC(level) syntax.
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/CompressionCodecSpec.java Adds canonical spec model with Jackson (de)serialization via config string.
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/CompressionCodecCapabilities.java Centralizes alias canonicalization and per-codec supported level ranges.
pinot-segment-spi/src/test/java/org/apache/pinot/segment/spi/index/ForwardIndexConfigTest.java Adds coverage for parameterized compression codec parsing in forward-index config.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/V1Constants.java Adds a metadata key for persisting forward-index compression spec.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/ForwardIndexConfig.java Adds CompressionCodecSpec support, legacy interop, and equality/hash updates.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/TableConfigUtilsTest.java Updates FieldConfig construction to match new compressionCodec type.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/loader/IndexLoadingConfigTest.java Verifies parameterized codec is propagated through index loading config.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/loader/ForwardIndexHandlerTest.java Tests rewrite detection and metadata persistence for level-only codec changes.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/forward/ForwardIndexTypeTest.java Ensures legacy field-config path supports parameterized codec specs.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/io/compression/TestCompression.java Adds compressor roundtrip tests with explicit levels and rejection for unsupported codecs.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/InvertedIndexAndDictionaryBasedForwardIndexCreator.java Persists forward-index compression spec into metadata for raw forward indexes.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/loader/ForwardIndexHandler.java Uses persisted spec to detect level-only changes and updates metadata on rewrite.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/forward/ForwardIndexType.java Switches FieldConfig-to-ForwardIndexConfig mapping to use CompressionCodecSpec.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/forward/ForwardIndexCreatorFactory.java Threads optional compression level into raw index creator selection.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/fwd/SingleValueVarByteRawIndexCreator.java Adds compression-level plumbing into var-byte raw forward index writers.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/fwd/SingleValueFixedByteRawIndexCreator.java Adds compression-level plumbing into fixed-byte raw forward index writers.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/fwd/MultiValueVarByteRawIndexCreator.java Adds compression-level plumbing into MV var-byte raw forward index writers.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/fwd/MultiValueFixedByteRawIndexCreator.java Adds compression-level plumbing into MV fixed-byte raw forward index writers.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/creator/impl/BaseSegmentCreator.java Writes canonical explicit forward-index compression spec to segment metadata.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV6.java Adds constructors to pass optional compression level to base writer.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV5.java Adds constructors to pass optional compression level to base writer.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriterV4.java Uses level-aware compressor factory when building chunk compressor.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/VarByteChunkForwardIndexWriter.java Adds compression-level support in base var-byte writer constructor path.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/FixedByteChunkForwardIndexWriter.java Adds compression-level support in fixed-byte writer constructor path.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/writer/impl/BaseChunkForwardIndexWriter.java Passes optional level to compressor factory for chunk writers.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/compression/ZstandardCompressor.java Adds configurable Zstd compression level.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/compression/LZ4WithLengthCompressor.java Adds configurable LZ4 HC level for length-prefixed LZ4 compressor.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/compression/LZ4Compressor.java Adds configurable LZ4 HC level and avoids recreating compressor on each call.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/compression/GzipCompressor.java Adds configurable Deflater compression level.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/io/compression/ChunkCompressorFactory.java Adds factory overload to construct compressors with optional explicit level.
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/TextIndicesTest.java Updates FieldConfig construction for new compression codec spec type.
pinot-core/src/test/java/org/apache/pinot/queries/BaseFSTBasedRegexpLikeQueriesTest.java Updates FieldConfig construction for new compression codec spec type.
pinot-core/src/test/java/org/apache/pinot/core/operator/transform/function/BaseTransformFunctionTest.java Updates FieldConfig construction for new compression codec spec type.
pinot-core/src/test/java/org/apache/pinot/core/data/manager/TableIndexingTest.java Updates FieldConfig construction for new compression codec spec type.
pinot-common/src/test/java/org/apache/pinot/common/utils/config/TableConfigSerDeUtilsTest.java Adds serde coverage for parameterized codec spec in table config.

@xiangfu0 xiangfu0 force-pushed the codex/parameterized-compression-codec branch 4 times, most recently from d522a5e to c1aeec4 Compare April 9, 2026 11:53
Copy link
Copy Markdown
Contributor Author

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Backward Compatibility & Segment Format Analysis

I've reviewed this PR for critical backward compatibility concerns with segment format changes. Here are my findings:

CRITICAL: Segment Rewrite Detection Logic Gap

Existing segments written before this PR will lack the new FORWARD_INDEX_COMPRESSION_CODEC metadata key. If a user later modifies only the compression level (e.g., changing from ZSTD to ZSTD(3)), the ForwardIndexHandler rewrite detection will fail to trigger because:

  1. Old segments have no persisted codec spec in metadata
  2. The rewrite logic compares current config against persisted metadata spec
  3. Missing metadata = no comparison possible = no rewrite triggered
  4. Result: Silent configuration skew - user expects rewrite, segments remain at old level

Recommendation: ForwardIndexHandler should explicitly handle missing codec metadata as "unknown level" state and trigger rewrites when config specifies explicit levels for old segments.

SAFE: Segment Format Read Compatibility

✓ New metadata key is additive (V1Constants.Column.FORWARD_INDEX_COMPRESSION_CODEC)
✓ Old code reading new segments: unaffected (skips unknown metadata)
✓ New code reading old segments: unaffected (spec is optional, defaults apply)
✓ No changes to forward-index header or chunk encoding format

SAFE: Configuration Serialization

✓ CompressionCodecSpec uses Jackson @JsonValue/@JsonCreator for stable string format
✓ Format is human-readable: CODEC or CODEC(level)
✓ Parser is strict (rejects malformed input early in config validation)
✓ Alias canonicalization (ZSTD → ZSTANDARD) handled consistently

SAFE: Legacy Configuration Compatibility

✓ Plain codec names (ZSTANDARD, GZIP, LZ4) preserve prior behavior
✓ Explicit levels default to library defaults when omitted
✓ FieldConfig compatibility constructors maintain config deserialization paths
✓ Comprehensive test coverage for serde paths

@xiangfu0 xiangfu0 force-pushed the codex/parameterized-compression-codec branch 2 times, most recently from 3433bb5 to 370ffb1 Compare April 12, 2026 05:17
@xiangfu0 xiangfu0 force-pushed the codex/parameterized-compression-codec branch from c3ced6a to 3503105 Compare April 13, 2026 09:28
public MultiValueFixedByteRawIndexCreator(File indexFile, ChunkCompressionType compressionType, int totalDocs,
DataType valueType, int maxNumberOfMultiValueElements, boolean deriveNumDocsPerChunk, int writerVersion,
int targetMaxChunkSizeBytes, int targetDocsPerChunk)
public MultiValueFixedByteRawIndexCreator(File indexFile, ChunkCompressionType compressionType,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment

public MultiValueVarByteRawIndexCreator(File baseIndexDir, ChunkCompressionType compressionType, String column,
int totalDocs, DataType valueType, int writerVersion, int maxRowLengthInBytes, int maxNumberOfElements,
int targetMaxChunkSizeBytes, int targetDocsPerChunk)
public MultiValueVarByteRawIndexCreator(File baseIndexDir, ChunkCompressionType compressionType,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add back comments

*/
public SingleValueFixedByteRawIndexCreator(File baseIndexDir, ChunkCompressionType compressionType, String column,
int totalDocs, DataType valueType, int writerVersion, int targetDocsPerChunk)
public SingleValueFixedByteRawIndexCreator(File baseIndexDir, ChunkCompressionType compressionType,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add back comments

public SingleValueVarByteRawIndexCreator(File baseIndexDir, ChunkCompressionType compressionType, String column,
int totalDocs, DataType valueType, int maxLength, boolean deriveNumDocsPerChunk, int writerVersion,
int targetMaxChunkSizeBytes, int targetDocsPerChunk)
public SingleValueVarByteRawIndexCreator(File baseIndexDir, ChunkCompressionType compressionType,
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add back comments

Support parameterized codec specs in forward index configs and propagate them through segment creation, load, and rewrite paths.
@xiangfu0 xiangfu0 force-pushed the codex/parameterized-compression-codec branch from 3503105 to 3ad95c2 Compare April 13, 2026 20:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

configuration Config changes (addition/deletion/change in behavior) feature New functionality

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants