[core] Introduce metadata.stats-dense-store to reduce meta size for multiple columns table#4322
Good, we just encountered a similar problem today!
| + " none statistic mode is set.") | ||
| .linebreak() | ||
| .text( | ||
| "Note, When this mode is enabled, the sdk in reading engine requires at least" |
There was a problem hiding this comment.
1. Change "When" to "when".
2. The Paimon SDK in the reading engine requires at least version 0.9.1, or 1.0.0 or higher?
| return fieldId >= SYSTEM_FIELD_ID_START; | ||
| } | ||
|
|
||
| public static boolean isSystemField(String field) { |
There was a problem hiding this comment.
Why not add KEY_FIELD_PREFIX to SYSTEM_FIELD_NAMES?
There was a problem hiding this comment.
If KEY_FIELD_PREFIX is not in SYSTEM_FIELD_NAMES, then the function name "isSystemField" is inappropriate.
There was a problem hiding this comment.
So what problem would adding KEY_FIELD_PREFIX to SYSTEM_FIELD_NAMES solve? Can you write a unit test demo?
There was a problem hiding this comment.
One solution is to use a SYSTEM_FIELD_PREFIXS list and always use startsWith, but that is not good for performance.
Let's leave it as is for now.
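For readers following this thread, here is a minimal sketch of why putting the prefix into the name set would not help. It is not the PR's actual method body: the class name and the constant values are assumptions chosen for illustration. An exact-match set lookup never matches a prefixed key field such as _KEY_id, so the prefix needs its own startsWith check.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for the class under review; constant values are assumed.
final class SystemFieldSketch {
    static final String KEY_FIELD_PREFIX = "_KEY_";
    static final Set<String> SYSTEM_FIELD_NAMES =
            new HashSet<>(Arrays.asList("_SEQUENCE_NUMBER", "_VALUE_KIND"));

    // One prefix check for key fields, then a constant-time lookup for fixed names.
    static boolean isSystemField(String field) {
        return field.startsWith(KEY_FIELD_PREFIX) || SYSTEM_FIELD_NAMES.contains(field);
    }

    public static void main(String[] args) {
        System.out.println(isSystemField("_KEY_id"));
        System.out.println(isSystemField("_VALUE_KIND"));
        System.out.println(isSystemField("id"));
    }
}
```

Keeping a single prefix check next to the set lookup avoids scanning a whole prefix list with startsWith on every call, which is the performance concern raised above.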
| */ | ||
| public class ProjectedArray implements InternalArray { | ||
|
|
||
| protected final int[] indexMapping; |
There was a problem hiding this comment.
Should indexMapping, array, and ProjectedArray be private?
|
|
||
| this.keyStatsConverter = new SimpleStatsConverter(keyType); | ||
| this.valueStatsConverter = new SimpleStatsConverter(valueType); | ||
| this.keyStatsConverter = new SimpleStatsConverter(keyType, false); |
There was a problem hiding this comment.
this.keyStatsConverter = new SimpleStatsConverter(keyType);
| private final InternalArray array; | ||
| private final long notFoundValue; | ||
|
|
||
| protected NullCountsEvoArray(int[] indexMapping, InternalArray array, long notFoundValue) { |
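The constructor signature above suggests an array view that remaps positions and falls back to a default when a field is missing. The following is a simplified sketch under that assumption, not the PR's implementation: it uses a plain long[] in place of InternalArray, the class name is hypothetical, and treating a negative mapping as "field not present in the file" is an assumption.

```java
// Simplified stand-in: a long[] replaces InternalArray so the sketch is self-contained.
final class NullCountsEvoSketch {
    private final int[] indexMapping;
    private final long[] array;
    private final long notFoundValue;

    NullCountsEvoSketch(int[] indexMapping, long[] array, long notFoundValue) {
        this.indexMapping = indexMapping;
        this.array = array;
        this.notFoundValue = notFoundValue;
    }

    // The view has one slot per requested field, not per stored field.
    int size() {
        return indexMapping.length;
    }

    // Assumed convention: a negative mapping means the field is absent from the stored array.
    long getLong(int pos) {
        int mapped = indexMapping[pos];
        return mapped < 0 ? notFoundValue : array[mapped];
    }

    public static void main(String[] args) {
        // Stored null counts for two fields; the reader asks for three.
        NullCountsEvoSketch view =
                new NullCountsEvoSketch(new int[] {1, -1, 0}, new long[] {5L, 7L}, 100L);
        for (int i = 0; i < view.size(); i++) {
            System.out.println(view.getLong(i));
        }
    }
}
```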
| fileDeserializer.get(), fileDeserializer.get(), fileDeserializer.get()), | ||
| new IndexIncrement( | ||
| indexEntrySerializer.deserializeList(view), | ||
| indexEntrySerializer.deserializeList(view))); |
There was a problem hiding this comment.
Condition 'version <= 2' is always 'true'
| } else if (version == 2) { | ||
| DataFileMeta09Serializer serializer = new DataFileMeta09Serializer(); | ||
| return serializer::deserialize; | ||
| } else if (version == 3) { |
There was a problem hiding this comment.
public static final int DATA_FILE_META_VERSION_1 = 1;
public static final int DATA_FILE_META_VERSION_2 = 2;
public static final int DATA_FILE_META_VERSION_3 = 3;
There was a problem hiding this comment.
No need to do this.
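As a rough illustration of the version dispatch being discussed, the sketch below picks a decoder by version number and fails loudly on an unknown one. Everything in it is hypothetical: the class name, the String payloads, and the lambda decoders stand in for the real serializers, which are not reproduced here.

```java
import java.util.function.Function;

// Hypothetical sketch: the decoders just tag the payload with the format version.
final class VersionDispatchSketch {

    static Function<String, String> deserializerFor(int version) {
        if (version == 1) {
            return payload -> "v1:" + payload;
        } else if (version == 2) {
            return payload -> "v2:" + payload;
        } else if (version == 3) {
            return payload -> "v3:" + payload;
        } else {
            // Reject unknown versions instead of guessing a layout.
            throw new UnsupportedOperationException("Unsupported version: " + version);
        }
    }

    public static void main(String[] args) {
        System.out.println(deserializerFor(2).apply("meta"));
        System.out.println(deserializerFor(3).apply("meta"));
    }
}
```

With one exact branch per version, no range check such as version <= 2 is needed, which sidesteps the always-true condition flagged earlier in the thread.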
|
|
||
| public static final ConfigOption<String> METADATA_STATS_MODE = | ||
| key("metadata." + STATS_MODE_SUFFIX) | ||
| key("metadata.stats-mode") |
There was a problem hiding this comment.
This can be more intuitive in the code.
| null); | ||
| List<DataFileMeta> dataFiles = Collections.singletonList(dataFile); | ||
|
|
||
| LinkedHashMap<String, Pair<Integer, Integer>> dvRanges = new LinkedHashMap<>(); |
There was a problem hiding this comment.
Map<String, Pair<Integer, Integer>> dvRanges = new LinkedHashMap<>();
There was a problem hiding this comment.
IndexFileMeta requires a LinkedHashMap
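A small sketch of why the declared type matters here. The method below is a hypothetical stand-in for an API that takes a LinkedHashMap parameter, and int[] replaces Pair<Integer, Integer> to keep the example self-contained. A variable declared as Map could not be passed to it without a cast.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

final class DvRangesSketch {

    // Hypothetical stand-in for a consumer that insists on LinkedHashMap.
    static List<String> fileOrder(LinkedHashMap<String, int[]> dvRanges) {
        // LinkedHashMap iterates in insertion order.
        return new ArrayList<>(dvRanges.keySet());
    }

    public static void main(String[] args) {
        LinkedHashMap<String, int[]> dvRanges = new LinkedHashMap<>();
        dvRanges.put("b.parquet", new int[] {0, 10});
        dvRanges.put("a.parquet", new int[] {10, 20});
        // Declaring dvRanges as Map<String, int[]> would not compile at this call.
        System.out.println(fileOrder(dvRanges));
    }
}
```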
|
Please update this doc: https://paimon.apache.org/docs/master/flink/sql-ddl/#specify-statistics-mode
|
A question: metadata.stats-dense-store=true
| public static final ConfigOption<Boolean> METADATA_STATS_DENSE_STORE = | ||
| key("metadata.stats-dense-store") | ||
| .booleanType() | ||
| .defaultValue(false) |
There was a problem hiding this comment.
Why is the default value not true?
Are you worried that many users are running old versions of the Paimon SDK in their reading engines?
Correct!
|
Looks good to me! +1
|
It should be related to this change: #5035
Purpose
At present, Paimon uses BinaryRow to store statistical information. This is generally not a problem, but some businesses have tables with over 3000 columns.
In the BinaryRow structure, each field occupies a fixed 8 bytes, so for a 3000-column table one BinaryRow takes about 25 KB. SimpleStats has 3 BinaryRows, which brings it to roughly 100 KB per file, and 100,000 files reach GB-level storage. This is unacceptable.
This PR:
- Introduces metadata.stats-dense-store, a dense mode for storing SimpleStats and valueStatsCols in DataFileMeta.
- With metadata.stats-mode=none, valueStatsCols will be an empty list and SimpleStats will be empty.
- Setting fields.b.stats-mode=full enables stats for specific columns to allow data skipping; the meta storage will then contain only the b column.
Tests
- AppendOnlyFileStoreTableTest
- PrimaryKeyFileStoreTableTest
API and Format
Documentation
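To make the size estimate in the Purpose section concrete, here is a back-of-envelope sketch. It is a simplified model, not Paimon's actual layout: it counts only the fixed 8-byte slot per field across the 3 structures and ignores headers, null bits, and variable-length data, so it gives a lower bound. The class and method names are hypothetical.

```java
// Back-of-envelope model of stats metadata size; a lower bound under stated assumptions.
final class StatsSizeSketch {
    static final int BYTES_PER_FIELD = 8;       // fixed slot per field, per the description
    static final int STRUCTURES_PER_STATS = 3;  // the 3 per-column structures in the description

    // Bytes of stats per data file when stats are stored for the given number of columns.
    static long statsBytesPerFile(int statsColumns) {
        return (long) STRUCTURES_PER_STATS * BYTES_PER_FIELD * statsColumns;
    }

    static long totalBytes(int statsColumns, long files) {
        return statsBytesPerFile(statsColumns) * files;
    }

    public static void main(String[] args) {
        long files = 100_000L;
        // All 3000 columns carry a slot in every file.
        System.out.println("all columns: " + totalBytes(3000, files) + " bytes");
        // Dense store with stats enabled for a single column only.
        System.out.println("one column:  " + totalBytes(1, files) + " bytes");
    }
}
```

Under this model, 3000 columns cost 72,000 bytes per file and 7.2 GB across 100,000 files, the same order of magnitude as the figures quoted above, while a single stats column costs 2.4 MB in total.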