Skip to content

Push element length and ASCII stats into Dictionary implementations#18189

Open
Jackie-Jiang wants to merge 1 commit intoapache:masterfrom
Jackie-Jiang:dictionary_stats
Open

Push element length and ASCII stats into Dictionary implementations#18189
Jackie-Jiang wants to merge 1 commit intoapache:masterfrom
Jackie-Jiang:dictionary_stats

Conversation

@Jackie-Jiang
Copy link
Copy Markdown
Contributor

@Jackie-Jiang Jackie-Jiang commented Apr 14, 2026

Summary

  • Move incremental computation of getLengthOfShortestElement, getLengthOfLongestElement and isAscii out of MutableColumnStatistics and into the Dictionary hierarchy, eliminating the segment-scan TODO
  • Add BaseConstantValueDictionary as abstract base for all constant-value dictionaries, providing default implementations for stats methods
  • Track element length and ASCII stats incrementally during index() in all MutableDictionary implementations (Int/Long/Float/Double/BigDecimal/String/Bytes × OnHeap/OffHeap)
  • Simplify MutableColumnStatistics to delegate stats to Dictionary instead of computing them via segment scan
  • Rewrite MutableColumnStatisticsTest to use real dictionary implementations instead of mocks, covering onHeap mutable, offHeap mutable, and constant-value dictionaries

Test plan

  • Existing unit tests pass (MutableColumnStatisticsTest rewritten with real dictionaries)
  • Verify MutableDictionaryTest still passes
  • Run pinot-segment-local and pinot-segment-spi module tests
  • Verify no downstream compile breaks in startree-pinot modules

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

@Jackie-Jiang Jackie-Jiang added ingestion Related to data ingestion pipeline refactor Code restructuring without changing behavior labels Apr 14, 2026
…mplementations

Move incremental computation of getLengthOfShortestElement, getLengthOfLongestElement,
isAscii, and getSortedValues out of MutableColumnStatistics and into the Dictionary
hierarchy. This eliminates the segment-scan TODO in MutableColumnStatistics by computing
stats on the fly during value indexing.

- Add BaseConstantValueDictionary as abstract base for all constant-value dictionaries
- Track element length and ASCII stats in all MutableDictionary implementations
  (Int/Long/Float/Double/BigDecimal/String/Bytes × OnHeap/OffHeap)
- Update SameValueMutableDictionary with stats tracking
- Simplify MutableColumnStatistics to delegate to Dictionary
- Remove getSortedValues from Dictionary SPI (unused after this change)
- Rewrite MutableColumnStatisticsTest to use real dictionaries instead of mocks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors stats collection during realtime segment sealing by pushing element-length and ASCII tracking into dictionary implementations, reducing work in MutableColumnStatistics and standardizing behavior across mutable and constant-value dictionaries.

Changes:

  • Add dictionary-level APIs for shortest/longest element length and ASCII detection, and delegate MutableColumnStatistics to these dictionary methods.
  • Introduce BaseConstantValueDictionary and migrate constant-value dictionaries to use it with stats-friendly defaults/overrides.
  • Rewrite MutableColumnStatisticsTest to use real on-heap/off-heap mutable dictionaries and constant-value dictionaries (instead of mocks).

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
pinot-spi/src/main/java/org/apache/pinot/spi/utils/Utf8Utils.java Removes encodedLength() and Guava Utf8 dependency from this utility (keeps encode/decode/isAscii).
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/reader/Dictionary.java Adds default dictionary stats APIs (sorted values, element lengths, ASCII) used for sealing-time stats.
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/MutableDictionary.java Provides default implementations for new stats APIs for fixed-width mutable dictionaries.
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/realtime/converter/stats/MutableColumnStatisticsTest.java Rewrites tests to use real dictionary implementations (on/off-heap + constant-value) and validate stats behavior.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/FixedIntArrayOffHeapIdMap.java Removes now-redundant getSortedValues() override (SPI provides a default).
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/DocIdDictionary.java Tightens indexOf(int) implementation (range-check + NULL_VALUE_INDEX) and makes _numDocs private.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueStringDictionary.java Migrates to BaseConstantValueDictionary, uses Utf8Utils, and implements element-length/ASCII stats.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueLongDictionary.java Migrates to BaseConstantValueDictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueIntDictionary.java Migrates to BaseConstantValueDictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueFloatDictionary.java Migrates to BaseConstantValueDictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueDoubleDictionary.java Migrates to BaseConstantValueDictionary.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueBytesDictionary.java Migrates to BaseConstantValueDictionary and implements element-length stats.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueBigDecimalDictionary.java Migrates to BaseConstantValueDictionary and implements element-length stats.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/BaseImmutableDictionary.java Removes redundant getSortedValues() override (SPI provides a default).
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/BaseConstantValueDictionary.java New base class for constant-value dictionaries with sealing-time stats defaults.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/StringOnHeapMutableDictionary.java Tracks min/max, element-length min/max, and ASCII incrementally; switches to Utf8Utils for encoding.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/StringOffHeapMutableDictionary.java Tracks min/max, element-length min/max, and ASCII incrementally; uses Utf8Utils encode/decode.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/SameValueMutableDictionary.java Adds element-length and ASCII stats for same-value dictionaries.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/LongOnHeapMutableDictionary.java Moves getValueType/indexOf implementations earlier; aligns with updated stats delegation.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/LongOffHeapMutableDictionary.java Same as above for off-heap.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/IntOnHeapMutableDictionary.java Same as above for on-heap.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/IntOffHeapMutableDictionary.java Same as above for off-heap.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/FloatOnHeapMutableDictionary.java Same as above for on-heap.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/FloatOffHeapMutableDictionary.java Same as above for off-heap.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/DoubleOnHeapMutableDictionary.java Same as above for on-heap.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/DoubleOffHeapMutableDictionary.java Same as above for off-heap.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BytesOnHeapMutableDictionary.java Tracks element-length min/max incrementally for BYTES dictionaries.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BytesOffHeapMutableDictionary.java Tracks element-length min/max incrementally for BYTES dictionaries.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BigDecimalOnHeapMutableDictionary.java Tracks element-length min/max incrementally for BIG_DECIMAL dictionaries.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BigDecimalOffHeapMutableDictionary.java Tracks element-length min/max incrementally for BIG_DECIMAL dictionaries.
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/stats/MutableColumnStatistics.java Removes dictionary-scan logic; delegates length/ASCII stats to dictionary APIs.
pinot-perf/src/main/java/org/apache/pinot/perf/aggregation/BenchmarkDistinctCountHLLThreshold.java Removes redundant getSortedValues() implementation from benchmark dictionary stub.

Comment on lines +141 to +149
/// Returns a sorted array of all values in the dictionary. For type INT/LONG/FLOAT/DOUBLE, primitive type array
/// will be returned; for type BIG_DECIMAL, `BigDecimal[]` will be returned; for type STRING, `String[]` will be
/// returned; for type BYTES, `ByteArray[]` will be returned.
///
/// This method is for the stats-collection phase when sealing the consuming segment. It should be overridden when
/// the dictionary is used in a mutable segment.
default Object getSortedValues() {
throw new UnsupportedOperationException();
}
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Apr 14, 2026

Codecov Report

❌ Patch coverage is 76.96970% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.34%. Comparing base (aa483d3) to head (dd384b0).

Files with missing lines Patch % Lines
...dictionary/BigDecimalOffHeapMutableDictionary.java 76.47% 4 Missing ⚠️
.../dictionary/BigDecimalOnHeapMutableDictionary.java 75.00% 4 Missing ⚠️
...che/pinot/segment/spi/index/reader/Dictionary.java 0.00% 4 Missing ⚠️
...impl/dictionary/StringOnHeapMutableDictionary.java 85.71% 2 Missing and 1 partial ⚠️
...t/segment/spi/index/mutable/MutableDictionary.java 0.00% 3 Missing ⚠️
.../impl/dictionary/BytesOnHeapMutableDictionary.java 86.66% 2 Missing ⚠️
...mpl/dictionary/DoubleOffHeapMutableDictionary.java 33.33% 2 Missing ⚠️
...impl/dictionary/DoubleOnHeapMutableDictionary.java 33.33% 2 Missing ⚠️
.../impl/dictionary/LongOffHeapMutableDictionary.java 33.33% 2 Missing ⚠️
...e/impl/dictionary/LongOnHeapMutableDictionary.java 33.33% 2 Missing ⚠️
... and 8 more
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #18189      +/-   ##
============================================
+ Coverage     63.31%   63.34%   +0.02%     
  Complexity     1627     1627              
============================================
  Files          3229     3230       +1     
  Lines        196705   196777      +72     
  Branches      30408    30427      +19     
============================================
+ Hits         124544   124646     +102     
+ Misses        62183    62157      -26     
+ Partials       9978     9974       -4     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.31% <76.96%> (+0.03%) ⬆️
java-21 63.28% <75.75%> (+0.01%) ⬆️
temurin 63.34% <76.96%> (+0.02%) ⬆️
unittests 63.34% <76.96%> (+0.02%) ⬆️
unittests1 55.27% <10.90%> (-0.02%) ⬇️
unittests2 35.00% <76.36%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Override
public DataType getValueType() {
return DataType.INT;
return (value >= 0 && value < _numDocs) ? value : NULL_VALUE_INDEX;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is behavior change, worth to confirm the usage of indexOf and insertionIndexOf are not mixed.

Copy link
Copy Markdown
Contributor

@xiangfu0 xiangfu0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion Related to data ingestion pipeline refactor Code restructuring without changing behavior

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants