Parquet: Add row position reader #1254
df20b0b to 9b131c3
This looks good to me! Would be great if someone familiar with Parquet could take a second pass.
void setPageSource(PageReadStore pageStore);
default void setRowOffsetForRowGroup(long position) {}
Why not add the position to the page source? Then the two operations are tied together: the row offset is the start offset for the new pages.
I can change it to that. One question: do we mind changing the function signature in the public API?
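The suggestion above, passing the starting row position together with the page source, can be sketched as follows. This is only an illustration under stated assumptions: `PageStoreSketch`, `ValueReaderSketch`, `RecordingReader`, and `StructReaderSketch` are hypothetical stand-ins, not the actual Iceberg or Parquet types, and the two-argument `setPageSource` is one possible shape for the combined call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Parquet's PageReadStore; only the row count is modeled.
interface PageStoreSketch {
  long getRowCount();
}

// Assumed combined signature: the new pages and their starting row arrive together,
// so a reader can never see pages without the matching offset.
interface ValueReaderSketch<T> {
  T read();

  void setPageSource(PageStoreSketch pageStore, long rowPosition);
}

// Leaf reader that only records the last starting row position it was given.
final class RecordingReader implements ValueReaderSketch<Long> {
  long lastRowPosition = -1;

  @Override
  public Long read() {
    return lastRowPosition;
  }

  @Override
  public void setPageSource(PageStoreSketch pageStore, long rowPosition) {
    this.lastRowPosition = rowPosition;
  }
}

// Composite reader that forwards the page source and row position to each child.
final class StructReaderSketch implements ValueReaderSketch<List<Long>> {
  private final List<? extends ValueReaderSketch<Long>> children;

  StructReaderSketch(List<? extends ValueReaderSketch<Long>> children) {
    this.children = children;
  }

  @Override
  public List<Long> read() {
    List<Long> values = new ArrayList<>();
    for (ValueReaderSketch<Long> child : children) {
      values.add(child.read());
    }
    return values;
  }

  @Override
  public void setPageSource(PageStoreSketch pageStore, long rowPosition) {
    for (ValueReaderSketch<Long> child : children) {
      child.setPageSource(pageStore, rowPosition);
    }
  }
}
```

With a single call there is no ordering question between setting the pages and setting the offset, which is the coupling the review comment asks for. The cost is the signature change to a public interface that the reply raises.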
public static ParquetValueReader<Long> position() {
  return new PositionReader();
}
Can you move this to the top of the file with the other factory methods?
  return offsetToStartRowPosMap;
}

long[] getRowGroupsStartRowPos() {
How about naming this startPositions?
startPositions may be confused with rowGroup.startingPosition. How about startRowPositions?
  return shouldSkip;
}

private Map<Long, Long> generateRowGroupsStartRowPos() {
Why does this separately read the Parquet file to create a map that is used to initialize an array, when the starting position could be set for the array in the existing loop? I don't think this method is needed.
The existing loop iterates over row groups that have already been filtered by the read options. To get the starting row position of each row group, we need to read the Parquet file without any filter.
You're right. Good catch!
Can you add some comments to explain why this is needed for later?
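To illustrate why the unfiltered row groups are needed, here is a minimal sketch. It assumes each row group can be identified by its byte offset in the file, as the name `offsetToStartRowPosMap` suggests. `RowGroupSketch` and `StartPositions` are hypothetical names, not the classes in this PR.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a row group's metadata: byte offset in the file and row count.
final class RowGroupSketch {
  final long fileOffset;
  final long rowCount;

  RowGroupSketch(long fileOffset, long rowCount) {
    this.fileOffset = fileOffset;
    this.rowCount = rowCount;
  }
}

final class StartPositions {
  private StartPositions() {
  }

  // Walk ALL row groups in file order, accumulating row counts, so each
  // row group's offset maps to the position of its first row in the file.
  static Map<Long, Long> offsetToStartPos(List<RowGroupSketch> allRowGroups) {
    Map<Long, Long> result = new HashMap<>();
    long nextStart = 0;
    for (RowGroupSketch rowGroup : allRowGroups) {
      result.put(rowGroup.fileOffset, nextStart);
      nextStart += rowGroup.rowCount;
    }
    return result;
  }

  // Look up the start position of each row group that survived filtering.
  static long[] startRowPositions(
      List<RowGroupSketch> filteredRowGroups, Map<Long, Long> offsetToStartPos) {
    long[] positions = new long[filteredRowGroups.size()];
    for (int i = 0; i < positions.length; i += 1) {
      positions[i] = offsetToStartPos.get(filteredRowGroups.get(i).fileOffset);
    }
    return positions;
  }
}
```

For example, with row groups of 100, 200, and 50 rows where only the last two survive filtering, the start positions are 100 and 300. Accumulating counts over the filtered list alone would give 0 and 200, which is the bug the separate pass avoids.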
static class PositionReader implements ParquetValueReader<Long> {
  private long rowOffsetInCurrentRowGroup = -1;
  private long rowGroupRowOffsetInFile;
In general, try to be specific with names, but avoid unnecessary context. In this case, these names can be simpler: rowGroupStart and rowOffset would work fine. Extra context like InFile and InCurrent isn't adding clarity.
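With the shorter names, the position logic might look like the sketch below. This is an assumption-laden illustration rather than the merged code: `PositionReaderSketch` is a hypothetical class that omits the `ParquetValueReader` interface, and its `setPageSource` takes only the row position for brevity.

```java
// Hypothetical, simplified position reader using the suggested field names.
final class PositionReaderSketch {
  // Position of the current row group's first row within the file.
  private long rowGroupStart = 0;
  // Offset of the last row returned within the current row group; -1 means none yet.
  private long rowOffset = -1;

  // Returns the file-level position of the next row.
  long read() {
    rowOffset += 1;
    return rowGroupStart + rowOffset;
  }

  // Called when a new row group is loaded; resets the per-row-group offset.
  void setPageSource(long rowPosition) {
    this.rowGroupStart = rowPosition;
    this.rowOffset = -1;
  }
}
```

Calling `setPageSource(100)` and then `read()` twice yields 100 and 101. Moving to a row group that starts at 300 restarts the count from 300.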
  Assert.assertFalse("Should not have extra rows", actualRows.hasNext());
} finally {
  if (reader != null) {
    reader.close();
Why not use try-with-resources instead of a finally block?
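A minimal sketch of the try-with-resources form, assuming the reader implements `AutoCloseable`. `TrackedReader` and `TryWithResourcesSketch` are hypothetical stand-ins for the test's reader and are not part of this PR.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical reader that yields rows and remembers whether it was closed.
final class TrackedReader implements AutoCloseable, Iterable<Long> {
  private final List<Long> rows;
  boolean closed = false;

  TrackedReader(Long... rows) {
    this.rows = Arrays.asList(rows);
  }

  @Override
  public Iterator<Long> iterator() {
    return rows.iterator();
  }

  // Declared without "throws" so callers do not need a catch block.
  @Override
  public void close() {
    closed = true;
  }
}

final class TryWithResourcesSketch {
  private TryWithResourcesSketch() {
  }

  // The reader is closed automatically when the try block exits,
  // whether normally or because an assertion or exception was thrown.
  static long countRows(TrackedReader reader) {
    long count = 0;
    try (TrackedReader open = reader) {
      for (Long ignored : open) {
        count += 1;
      }
    }
    return count;
  }
}
```

Compared with the `finally` block, this removes the null check and the explicit `close()` call.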
}

long[] startRowPositions() {
  return rowGroupsStartRowPos;
Can we use the same name for the variable?
this.rowGroups = reader.getRowGroups();
this.shouldSkip = new boolean[rowGroups.size()];

Map<Long, Long> offsetToStartRowPosMap = generateRowGroupsStartRowPos();
How about naming this offsetToStartPos and similarly updating the method name? There's no need to include a type in the variable name, usually.
  return triplesRead < triplesCount;
}

public void setRowPosition(long rowPosition) {
Instead of adding this, can you update setPageSource like the other interface that changed?
protected long triplesRead = 0L;
protected long advanceNextPageCount = 0L;
protected Dictionary dictionary;
protected long rowPosition;
Is this needed? I don't see any uses.
Right, this and setRowPosition are no longer needed.
9ee93a0 to 07bb9fb
+1 Thanks @chenjunjiedada, it looks good now. Nice work catching that the metadata was already filtered using the file range, too.
this.shouldSkip = new boolean[rowGroups.size()];

// Fetch all row groups starting positions to compute the row offsets of the filtered row groups
Map<Long, Long> offsetToStartPos = generateOffsetToStartPos();
It just occurred to me (after merging this) that we may want to make this lazy, like we do in Avro. That way if the row positions are never used, we don't incur the cost of reading the footer another time.
I had considered applying a Caffeine cache to this. Let me think about it again and also check what Avro does. I will update this in the follow-up for the vectorization code path.
@chenjunjiedada yes, we should make this lazy. Do you have an issue to track improvements to the existing logic?
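One way to defer the extra footer read until row positions are actually requested is sketched below. This is a hedged illustration only: `LazyStartPositions` is a hypothetical helper, the `Supplier` stands in for whatever reads the unfiltered footer, and the sketch does not show how Avro handles this in Iceberg.

```java
import java.util.function.Supplier;

// Hypothetical lazy holder: the expensive load runs at most once, on first access.
// Not thread-safe; a reader used from a single thread is assumed.
final class LazyStartPositions {
  private final Supplier<long[]> loader;
  private long[] positions = null;

  LazyStartPositions(Supplier<long[]> loader) {
    this.loader = loader;
  }

  long[] get() {
    if (positions == null) {
      // The first caller pays for reading the footer; later callers reuse the result.
      positions = loader.get();
    }
    return positions;
  }
}
```

If no reader ever asks for positions, `loader` is never invoked and the second footer read is skipped entirely.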
This adds a position reader for the Parquet readers.
TODO: