Push element length and ASCII stats into Dictionary implementations#18189
Push element length and ASCII stats into Dictionary implementations#18189Jackie-Jiang wants to merge 1 commit intoapache:masterfrom
Conversation
…mplementations Move incremental computation of getLengthOfShortestElement, getLengthOfLongestElement, isAscii, and getSortedValues out of MutableColumnStatistics and into the Dictionary hierarchy. This eliminates the segment-scan TODO in MutableColumnStatistics by computing stats on the fly during value indexing. - Add BaseConstantValueDictionary as abstract base for all constant-value dictionaries - Track element length and ASCII stats in all MutableDictionary implementations (Int/Long/Float/Double/BigDecimal/String/Bytes × OnHeap/OffHeap) - Update SameValueMutableDictionary with stats tracking - Simplify MutableColumnStatistics to delegate to Dictionary - Remove getSortedValues from Dictionary SPI (unused after this change) - Rewrite MutableColumnStatisticsTest to use real dictionaries instead of mocks Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
e202c2c to
dd384b0
Compare
There was a problem hiding this comment.
Pull request overview
This PR refactors stats collection during realtime segment sealing by pushing element-length and ASCII tracking into dictionary implementations, reducing work in MutableColumnStatistics and standardizing behavior across mutable and constant-value dictionaries.
Changes:
- Add dictionary-level APIs for shortest/longest element length and ASCII detection, and delegate
MutableColumnStatisticsto these dictionary methods. - Introduce
BaseConstantValueDictionaryand migrate constant-value dictionaries to use it with stats-friendly defaults/overrides. - Rewrite
MutableColumnStatisticsTestto use real on-heap/off-heap mutable dictionaries and constant-value dictionaries (instead of mocks).
Reviewed changes
Copilot reviewed 31 out of 31 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
pinot-spi/src/main/java/org/apache/pinot/spi/utils/Utf8Utils.java |
Removes encodedLength() and Guava Utf8 dependency from this utility (keeps encode/decode/isAscii). |
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/reader/Dictionary.java |
Adds default dictionary stats APIs (sorted values, element lengths, ASCII) used for sealing-time stats. |
pinot-segment-spi/src/main/java/org/apache/pinot/segment/spi/index/mutable/MutableDictionary.java |
Provides default implementations for new stats APIs for fixed-width mutable dictionaries. |
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/realtime/converter/stats/MutableColumnStatisticsTest.java |
Rewrites tests to use real dictionary implementations (on/off-heap + constant-value) and validate stats behavior. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/utils/FixedIntArrayOffHeapIdMap.java |
Removes now-redundant getSortedValues() override (SPI provides a default). |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/DocIdDictionary.java |
Tightens indexOf(int) implementation (range-check + NULL_VALUE_INDEX) and makes _numDocs private. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueStringDictionary.java |
Migrates to BaseConstantValueDictionary, uses Utf8Utils, and implements element-length/ASCII stats. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueLongDictionary.java |
Migrates to BaseConstantValueDictionary. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueIntDictionary.java |
Migrates to BaseConstantValueDictionary. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueFloatDictionary.java |
Migrates to BaseConstantValueDictionary. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueDoubleDictionary.java |
Migrates to BaseConstantValueDictionary. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueBytesDictionary.java |
Migrates to BaseConstantValueDictionary and implements element-length stats. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/ConstantValueBigDecimalDictionary.java |
Migrates to BaseConstantValueDictionary and implements element-length stats. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/BaseImmutableDictionary.java |
Removes redundant getSortedValues() override (SPI provides a default). |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/segment/index/readers/BaseConstantValueDictionary.java |
New base class for constant-value dictionaries with sealing-time stats defaults. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/StringOnHeapMutableDictionary.java |
Tracks min/max, element-length min/max, and ASCII incrementally; switches to Utf8Utils for encoding. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/StringOffHeapMutableDictionary.java |
Tracks min/max, element-length min/max, and ASCII incrementally; uses Utf8Utils encode/decode. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/SameValueMutableDictionary.java |
Adds element-length and ASCII stats for same-value dictionaries. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/LongOnHeapMutableDictionary.java |
Moves getValueType/indexOf implementations earlier; aligns with updated stats delegation. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/LongOffHeapMutableDictionary.java |
Same as above for off-heap. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/IntOnHeapMutableDictionary.java |
Same as above for on-heap. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/IntOffHeapMutableDictionary.java |
Same as above for off-heap. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/FloatOnHeapMutableDictionary.java |
Same as above for on-heap. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/FloatOffHeapMutableDictionary.java |
Same as above for off-heap. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/DoubleOnHeapMutableDictionary.java |
Same as above for on-heap. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/DoubleOffHeapMutableDictionary.java |
Same as above for off-heap. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BytesOnHeapMutableDictionary.java |
Tracks element-length min/max incrementally for BYTES dictionaries. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BytesOffHeapMutableDictionary.java |
Tracks element-length min/max incrementally for BYTES dictionaries. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BigDecimalOnHeapMutableDictionary.java |
Tracks element-length min/max incrementally for BIG_DECIMAL dictionaries. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/impl/dictionary/BigDecimalOffHeapMutableDictionary.java |
Tracks element-length min/max incrementally for BIG_DECIMAL dictionaries. |
pinot-segment-local/src/main/java/org/apache/pinot/segment/local/realtime/converter/stats/MutableColumnStatistics.java |
Removes dictionary-scan logic; delegates length/ASCII stats to dictionary APIs. |
pinot-perf/src/main/java/org/apache/pinot/perf/aggregation/BenchmarkDistinctCountHLLThreshold.java |
Removes redundant getSortedValues() implementation from benchmark dictionary stub. |
| /// Returns a sorted array of all values in the dictionary. For type INT/LONG/FLOAT/DOUBLE, primitive type array | ||
| /// will be returned; for type BIG_DECIMAL, `BigDecimal[]` will be returned; for type STRING, `String[]` will be | ||
| /// returned; for type BYTES, `ByteArray[]` will be returned. | ||
| /// | ||
| /// This method is for the stats-collection phase when sealing the consuming segment. It should be overridden when | ||
| /// the dictionary is used in a mutable segment. | ||
| default Object getSortedValues() { | ||
| throw new UnsupportedOperationException(); | ||
| } |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #18189 +/- ##
============================================
+ Coverage 63.31% 63.34% +0.02%
Complexity 1627 1627
============================================
Files 3229 3230 +1
Lines 196705 196777 +72
Branches 30408 30427 +19
============================================
+ Hits 124544 124646 +102
+ Misses 62183 62157 -26
+ Partials 9978 9974 -4
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| @Override | ||
| public DataType getValueType() { | ||
| return DataType.INT; | ||
| return (value >= 0 && value < _numDocs) ? value : NULL_VALUE_INDEX; |
There was a problem hiding this comment.
This is behavior change, worth to confirm the usage of indexOf and insertionIndexOf are not mixed.
Summary
getLengthOfShortestElement,getLengthOfLongestElementandisAsciiout ofMutableColumnStatisticsand into theDictionaryhierarchy, eliminating the segment-scan TODOBaseConstantValueDictionaryas abstract base for all constant-value dictionaries, providing default implementations for stats methodsindex()in allMutableDictionaryimplementations (Int/Long/Float/Double/BigDecimal/String/Bytes × OnHeap/OffHeap)MutableColumnStatisticsto delegate stats toDictionaryinstead of computing them via segment scanMutableColumnStatisticsTestto use real dictionary implementations instead of mocks, covering onHeap mutable, offHeap mutable, and constant-value dictionariesTest plan
MutableColumnStatisticsTestrewritten with real dictionaries)MutableDictionaryTeststill passespinot-segment-localandpinot-segment-spimodule testsstartree-pinotmodules🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com