Skip to content

Core: Prevent duplicate data/delete files#10007

Merged
nastra merged 1 commit into
apache:mainfrom
nastra:duplicate-data-files-and-summary-stats
Jun 20, 2024
Merged

Core: Prevent duplicate data/delete files#10007
nastra merged 1 commit into
apache:mainfrom
nastra:duplicate-data-files-and-summary-stats

Conversation

@nastra
Copy link
Copy Markdown
Contributor

@nastra nastra commented Mar 20, 2024

One of the reasons for preventing duplicate files is that adding/deleting the same data/delete file X times will update snapshot summary stats for each file, resulting in unexpected stats.

I've added tests for all subclasses of SnapshotUpdate to see how stats behave with multiple files.

Snapshot summary stats can be off with duplicates

TestSnapshotSummary#rewriteWithDuplicateFiles shows e.g. how stats can be off when duplicates exist. In the below example, stats like "total-data-files"="5" / "total-files-size"="50" are far off:

{"added-data-files"="3", "added-files-size"="30", "added-records"="3", "changed-partition-count"="1", 
"deleted-data-files"="1", "deleted-duplicate-files"="2", "deleted-records"="1","removed-files-size"="10",
"total-data-files"="5", "total-delete-files"="0","total-equality-deletes"="0", "total-files-size"="50",
"total-position-deletes"="0", "total-records"="5"}

With the fix from this PR, stats for the same example are now:

{"added-data-files"="1", "added-files-size"="10", "added-records"="1", "changed-partition-count"="1", 
"deleted-data-files"="1", "deleted-records"="1", "removed-files-size"="10", 
"total-data-files"="1", "total-delete-files"="0", "total-equality-deletes"="0", "total-files-size"="10", 
"total-position-deletes"="0", "total-records"="1"}


Note that this only prevents duplicate data/delete files within a single commit (not across multiple commits).

@nastra nastra marked this pull request as draft March 20, 2024 15:27
@github-actions github-actions Bot added the core label Mar 20, 2024
@nastra nastra force-pushed the duplicate-data-files-and-summary-stats branch 3 times, most recently from 850abf9 to b7efe92 Compare March 21, 2024 10:41
Comment thread core/src/main/java/org/apache/iceberg/FastAppend.java
// update data
private final List<DataFile> newDataFiles = Lists.newArrayList();
private final CharSequenceSet newDataFilePaths = CharSequenceSet.empty();
private final CharSequenceSet newDeleteFilePaths = CharSequenceSet.empty();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

subclasses of DataFile / DeleteFile don't implement equals() / hashCode(), so using a separate set to store the file path


String tagSnapshotBName = "t2";
table.newFastAppend().appendFile(FILE_B).appendFile(FILE_B).commit();
table.newFastAppend().appendFile(FILE_B).appendFile(FILE_C).commit();
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updating these tests so that they don't use duplicate data files

this.hasNewFiles = true;
newFiles.add(file);
summaryBuilder.addedFile(spec, file);
Preconditions.checkNotNull(file, "Invalid data file: null");
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aligns it with MergingSnapshotProducer.addFile(...)

@nastra nastra changed the title Core: Prevent duplicate data files Core: Prevent duplicate data/delete files Mar 21, 2024
@nastra nastra marked this pull request as ready for review March 21, 2024 11:09
@nastra nastra force-pushed the duplicate-data-files-and-summary-stats branch from b7efe92 to 2b1fedd Compare March 21, 2024 14:57
@nastra nastra force-pushed the duplicate-data-files-and-summary-stats branch from 2b1fedd to ee40ad4 Compare March 21, 2024 15:05
@nastra nastra force-pushed the duplicate-data-files-and-summary-stats branch from ee40ad4 to 7becd19 Compare March 22, 2024 08:56
@nastra nastra requested a review from rdblue March 25, 2024 15:39
@nastra nastra force-pushed the duplicate-data-files-and-summary-stats branch from 7becd19 to 890f785 Compare April 24, 2024 08:17
Copy link
Copy Markdown
Contributor

@fqaiser94 fqaiser94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

For posterity's sake, might just want to call out in the PR description that this only prevents duplicate data/delete files within the same commit (not across commits).

Comment thread core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java Outdated
@nastra nastra force-pushed the duplicate-data-files-and-summary-stats branch from 890f785 to c6a251a Compare April 29, 2024 08:52
Copy link
Copy Markdown
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

summaryBuilder.addedFile(spec, file);
Preconditions.checkNotNull(file, "Invalid data file: null");
if (newFilePaths.add(file.path())) {
this.hasNewFiles = true;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: This flag could also be replaced by newFilePaths.length() > 0. We could also do this in a follow up PR.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree but I didn't want to do any additional refactorings as part of this PR, unless people think I should include it. I can do a follow-up PR to refactor this

Copy link
Copy Markdown
Contributor

@amogh-jahagirdar amogh-jahagirdar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the super late review but this looks good to me!

@nastra
Copy link
Copy Markdown
Contributor Author

nastra commented Jun 20, 2024

thanks for the reviews @danielcweeks @Fokko @fqaiser94 @amogh-jahagirdar

@nastra nastra merged commit 23a578e into apache:main Jun 20, 2024
@nastra nastra deleted the duplicate-data-files-and-summary-stats branch June 20, 2024 07:44
jasonf20 pushed a commit to jasonf20/iceberg that referenced this pull request Aug 4, 2024
sasankpagolu pushed a commit to sasankpagolu/iceberg that referenced this pull request Oct 27, 2024
zachdisc pushed a commit to zachdisc/iceberg that referenced this pull request Dec 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants