Core: Support combining position deletes during writes #11222
}
}

public static <T extends StructLike> CharSequenceMap<PositionDeleteIndex> toPositionIndexes(
Purely to avoid breaking the API.
  private final Supplier<FileWriter<PositionDelete<T>, DeleteWriteResult>> writers;
  private final DeleteGranularity granularity;
- private final CharSequenceMap<Roaring64Bitmap> positionsByPath;
+ private final CharSequenceMap<PositionDeleteIndex> positionsByPath;
Is this an area we'd want to explore using Map<String, PositionDeleteIndex> instead of the CharSequenceMap? Doesn't need to be in this PR, more so just wondering
Actually, I think it'll probably make more sense to look at that when I do the update to use location instead of the deprecated path.
We may want to keep this as CharSequenceMap as writers may use arbitrary CharSequence implementations and it is a bit different from DataFile/DeleteFile structs.
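To illustrate the concern about arbitrary CharSequence implementations, here is a minimal, self-contained sketch. It is not Iceberg code: the class name, method names, and the file path are made up for illustration. It shows that a plain HashMap keyed by CharSequence misses a lookup when the stored key is a StringBuilder and the probe is a String with the same characters, while normalizing every key to String makes the lookup succeed.

```java
import java.util.HashMap;
import java.util.Map;

class CharSequenceKeyDemo {
  // Made-up path, used only for illustration.
  static final String PATH = "s3://bucket/data/file-a.parquet";

  // StringBuilder does not override equals/hashCode, so a String probe
  // with the same characters does not match the stored key.
  static boolean plainMapFindsKey() {
    Map<CharSequence, Integer> plain = new HashMap<>();
    plain.put(new StringBuilder(PATH), 1);
    return plain.containsKey(PATH);
  }

  // Converting every key to String gives content-based equality,
  // at the cost of one conversion per insert and per lookup.
  static boolean normalizedMapFindsKey() {
    Map<String, Integer> normalized = new HashMap<>();
    CharSequence key = new StringBuilder(PATH);
    normalized.put(key.toString(), 1);
    CharSequence probe = new StringBuilder(PATH);
    return normalized.containsKey(probe.toString());
  }

  public static void main(String[] args) {
    System.out.println("plain map finds key: " + plainMapFindsKey());
    System.out.println("normalized map finds key: " + normalizedMapFindsKey());
  }
}
```

A Map<String, ...> would therefore push a toString() call onto every writer that passes a non-String CharSequence, which is presumably the trade-off being weighed here.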
try {
  PositionDelete<T> positionDelete = PositionDelete.create();
  for (CharSequence path : sort(paths)) {
Another aspect I'm curious about, have we ever compared with using a TreeMap instead of sorting? It'll be the same time complexity in the end but interested in seeing if there's any significant differences in practice.
I'd say TreeMap starts to make sense if we access the collection in sorted order more than once. Otherwise, paying the extra cost during inserts may not be worth it.
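For readers following the trade-off, here is a small sketch of the two options under discussion. It is not the writer's actual code: the class and method names are hypothetical, and String keys with Integer counts stand in for paths and delete indexes. Both variants produce the same iteration order; they differ in when the ordering cost is paid.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class SortedAccessDemo {
  // Option 1: cheap hash inserts, then a single sort when sorted order is needed.
  static List<String> sortOnce(List<String> inserts) {
    Map<String, Integer> byPath = new HashMap<>();
    for (String path : inserts) {
      byPath.merge(path, 1, Integer::sum);
    }
    List<String> sorted = new ArrayList<>(byPath.keySet());
    Collections.sort(sorted);
    return sorted;
  }

  // Option 2: every insert pays O(log n) to keep the keys ordered at all times.
  static List<String> treeMapOrder(List<String> inserts) {
    Map<String, Integer> byPath = new TreeMap<>();
    for (String path : inserts) {
      byPath.merge(path, 1, Integer::sum);
    }
    return new ArrayList<>(byPath.keySet());
  }

  public static void main(String[] args) {
    List<String> inserts = java.util.Arrays.asList("c", "a", "b", "a");
    System.out.println(sortOnce(inserts));
    System.out.println(treeMapOrder(inserts));
  }
}
```

With a single sorted pass at close time, option 1 keeps the per-delete insert path cheap; option 2 becomes attractive only if the sorted view is read repeatedly.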
Force-pushed: 41fd3b0 to 19c1779
amogh-jahagirdar left a comment
Thanks @aokolnychyi! Had minor comments but I think this looks great overall.
}

private PositionDeleteIndex loadPreviousDeletes(CharSequence path) {
  return loadPreviousDeletes != null ? loadPreviousDeletes.apply(path) : null;
Nit: Would it make sense to default loadPreviousDeletes to a function implementation that just returns null (I think it'd be a one-line lambda, charSequence -> null)? Then I think we could remove this helper method and directly use loadPreviousDeletes.apply(path) on line 150.
I like that, let me update.
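A rough sketch of the suggested shape, assuming a two-constructor layout. The class name is hypothetical, and String stands in for PositionDeleteIndex so the example compiles on its own; the real writer's constructors may look different.

```java
import java.util.function.Function;

class PreviousDeletesDemo {
  // Never null: callers that have no previous deletes get a no-op loader.
  private final Function<CharSequence, String> loadPreviousDeletes;

  PreviousDeletesDemo() {
    // Default loader: reports "no previous deletes" for every path.
    this(path -> null);
  }

  PreviousDeletesDemo(Function<CharSequence, String> loadPreviousDeletes) {
    this.loadPreviousDeletes = loadPreviousDeletes;
  }

  String previousDeletesFor(CharSequence path) {
    // No null check on the function itself, so no separate helper is needed.
    return loadPreviousDeletes.apply(path);
  }

  public static void main(String[] args) {
    System.out.println(new PreviousDeletesDemo().previousDeletesFor("file-a"));
    System.out.println(
        new PreviousDeletesDemo(path -> "deletes:" + path).previousDeletesFor("file-a"));
  }
}
```

The call site still has to handle a null result, but the function reference itself no longer needs guarding.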
}

FileWriter<PositionDelete<T>, DeleteWriteResult> writer = writers.get();
List<DeleteFile> rewrittenDeleteFile = Lists.newArrayList();
rewrittenDeleteFiles?
Thanks, @amogh-jahagirdar!
This PR adds support for combining historical position deletes in writers, enabling sync maintenance.