Core: Prevent duplicate data/delete files #10007
| // update data | ||
| private final List<DataFile> newDataFiles = Lists.newArrayList(); | ||
| private final CharSequenceSet newDataFilePaths = CharSequenceSet.empty(); | ||
| private final CharSequenceSet newDeleteFilePaths = CharSequenceSet.empty(); |
Subclasses of DataFile / DeleteFile don't implement equals() / hashCode(), so I'm using a separate set to store the file paths.
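The path-based approach described in this comment can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code from this PR: `FileRef` and `PendingFiles` are hypothetical stand-ins, and a `HashSet<String>` takes the place of `CharSequenceSet`.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for a data file. Like the subclasses discussed above,
// it does not override equals()/hashCode(), so two instances with the same
// path are not equal to each other.
final class FileRef {
  private final String path;

  FileRef(String path) {
    this.path = path;
  }

  String path() {
    return path;
  }
}

// Hypothetical holder for the files pending in one commit.
final class PendingFiles {
  private final List<FileRef> files = new ArrayList<>();
  // Paths are plain strings, so set membership works by value.
  private final Set<String> paths = new HashSet<>();

  // Returns true if the file was added, false if its path was already pending.
  boolean add(FileRef file) {
    Objects.requireNonNull(file, "Invalid data file: null");
    if (paths.add(file.path())) {
      files.add(file);
      return true;
    }
    return false;
  }

  List<FileRef> files() {
    return files;
  }
}
```

Keeping the list next to the set preserves insertion order for the files themselves, while the set only answers whether a path has already been seen.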
| String tagSnapshotBName = "t2"; | ||
| table.newFastAppend().appendFile(FILE_B).appendFile(FILE_B).commit(); | ||
| table.newFastAppend().appendFile(FILE_B).appendFile(FILE_C).commit(); |
updating these tests so that they don't use duplicate data files
| this.hasNewFiles = true; | ||
| newFiles.add(file); | ||
| summaryBuilder.addedFile(spec, file); | ||
| Preconditions.checkNotNull(file, "Invalid data file: null"); |
aligns it with MergingSnapshotProducer.addFile(...)
fqaiser94 left a comment:
LGTM
For posterity's sake, might just want to call out in the PR description that this only prevents duplicate data/delete files within the same commit (not across commits).
| summaryBuilder.addedFile(spec, file); | ||
| Preconditions.checkNotNull(file, "Invalid data file: null"); | ||
| if (newFilePaths.add(file.path())) { | ||
| this.hasNewFiles = true; |
Nit: This flag could also be replaced by !newFilePaths.isEmpty(). We could also do this in a follow-up PR.
I agree, but I didn't want to do any additional refactoring as part of this PR unless people think I should include it. I can do a follow-up PR to refactor this.
amogh-jahagirdar left a comment:
Sorry for the super late review but this looks good to me!
Thanks for the reviews @danielcweeks @Fokko @fqaiser94 @amogh-jahagirdar
One of the reasons for preventing duplicate files is that adding/deleting the same data/delete file X times will update snapshot summary stats for each file, resulting in unexpected stats.
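To make the effect on stats concrete, here is a small sketch of how counting each added entry inflates totals when the same path appears more than once. `SummarySketch` and its fields are hypothetical and much simpler than Iceberg's actual summary builder. The numbers are illustrative and are not the ones from the tests in this PR.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified summary counter (not Iceberg's real summary builder).
final class SummarySketch {
  long addedFiles = 0;
  long addedFilesSize = 0;

  private void addedFile(long sizeInBytes) {
    addedFiles += 1;
    addedFilesSize += sizeInBytes;
  }

  // Counts every entry, so a path listed twice is counted twice.
  static SummarySketch countAll(List<String> paths, long sizeInBytes) {
    SummarySketch summary = new SummarySketch();
    for (String path : paths) {
      summary.addedFile(sizeInBytes);
    }
    return summary;
  }

  // Counts each distinct path once by tracking the paths already seen.
  static SummarySketch countDistinct(List<String> paths, long sizeInBytes) {
    SummarySketch summary = new SummarySketch();
    Set<String> seen = new HashSet<>();
    for (String path : paths) {
      if (seen.add(path)) {
        summary.addedFile(sizeInBytes);
      }
    }
    return summary;
  }
}
```

For the paths `a, a, b` at 10 bytes each, `countAll` reports 3 files and 30 bytes, while `countDistinct` reports 2 files and 20 bytes.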
I've added tests for all subclasses of SnapshotUpdate to see how stats behave with multiple files.
Snapshot summary stats can be off with duplicates
TestSnapshotSummary#rewriteWithDuplicateFiles shows, for example, how stats can be off when duplicates exist. In the example below, stats like "total-data-files"="5" / "total-files-size"="50" are far off:
With the fix from this PR, stats for the same example are now: