Skip to content

Support vectorized reading int96 timestamps in imported data#6962

Merged
jackye1995 merged 12 commits into
apache:masterfrom
yabola:newice
Apr 11, 2023
Merged

Support vectorized reading int96 timestamps in imported data#6962
jackye1995 merged 12 commits into
apache:masterfrom
yabola:newice

Conversation

@yabola

@yabola yabola commented Mar 1, 2023

Copy link
Copy Markdown
Contributor

Add vectorized read support for Parquet INT96 timestamps ( fix comment and vectorized reads are enabled by default after that PR).
Before there is only non-vectorized reading support(see #1184).
This is needed so that parquet files written by Spark, that used INT96 timestamps, are able to be read by Iceberg without having to rewrite these files. This is specially useful for migrations from spark or delta lake.

@yabola yabola changed the title Support vectorized read for timestampInt96 Support vectorized read for Int96 Mar 1, 2023
@yabola yabola changed the title Support vectorized read for Int96 Support vectorized read for timestamp Mar 1, 2023
@yabola yabola changed the title Support vectorized read for timestamp Support vectorized reading int96 timestamps in imported data Mar 1, 2023
@yabola yabola changed the title Support vectorized reading int96 timestamps in imported data Support vectorized reading int96 timestamps Mar 1, 2023
@yabola yabola changed the title Support vectorized reading int96 timestamps Support vectorized reading int96 timestamps in imported data Mar 1, 2023
@yabola

yabola commented Mar 1, 2023

Copy link
Copy Markdown
Contributor Author

@rdblue @aokolnychyi @gustavoatt If you have time, please take a look, thanks

@nastra nastra self-requested a review March 1, 2023 07:15
@rdblue

rdblue commented Mar 3, 2023

Copy link
Copy Markdown
Contributor

@nastra can you take a look at this?

@nastra

nastra commented Mar 7, 2023

Copy link
Copy Markdown
Contributor

@yabola I'll try to review it this week

@nastra nastra left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks mostly good to me


private static final long UNIX_EPOCH_JULIAN = 2_440_588L;

public static long extractTimestampInt96(ByteBuffer buffer) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we adjust Sparks version of this in TimestampInt96Reader to reuse TimestampUtil?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced all timestampint96 related places, please take a look~

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels like something that could live in ParquetUtil, instead of creating a new class?

ValuesAsBytesReader valuesReader,
int typeWidth,
byte[] byteArray) {
ByteBuffer buffer = valuesReader.getBuffer(12);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe add a comment what the 12 means here (8 bytes = time of day in nanos / 4 bytes = julian day)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


@Test
public void testTableWithInt96Timestamp() throws IOException {
try {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need the try-catch here? I think it would work fine without it

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is only the finally method here, it does a cleanup operation and I imitate what other UT wrote

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we always have a parquet_table that might need to be dropped. We could add that to @After and do DROP TABLE IF EXISTS

if (!catalog.tableExists(currentIdentifier)) {
return;
}
dropTable(currentIdentifier);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: whitespace missing after }

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a space

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like the newline is not added? Overall we encourage newline after any logical blocks like if, for, while, try, etc.

@nastra nastra requested a review from aokolnychyi March 13, 2023 13:58
@nastra

nastra commented Mar 13, 2023

Copy link
Copy Markdown
Contributor

@aokolnychyi could you review this one please?

@aokolnychyi

Copy link
Copy Markdown
Contributor

I'll try this week but can't promise.

cc @flyrain

@yabola

yabola commented Mar 20, 2023

Copy link
Copy Markdown
Contributor Author

@nastra @aokolnychyi @rdblue @flyrain Hi~ I have updated my PR, if you have time, please take a look, thank you

@yabola

yabola commented Mar 29, 2023

Copy link
Copy Markdown
Contributor Author

@nastra emmm, looks like there are no more comments, could you help take a look again?

@jackye1995

Copy link
Copy Markdown
Contributor

sorry for the lack of reviews, I will take a look during the weekend


@Test
public void testTableWithInt96Timestamp() throws IOException {
try {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we always have a parquet_table that might need to be dropped. We could add that to @After and do DROP TABLE IF EXISTS


private static final long UNIX_EPOCH_JULIAN = 2_440_588L;

public static long extractTimestampInt96(ByteBuffer buffer) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels like something that could live in ParquetUtil, instead of creating a new class?


return TimeUnit.DAYS.toMicros(julianDay - UNIX_EPOCH_JULIAN)
+ TimeUnit.NANOSECONDS.toMicros(timeOfDayNanos);
return TimestampUtil.extractTimestampInt96(byteBuffer);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the refactoring!

if (!catalog.tableExists(currentIdentifier)) {
return;
}
dropTable(currentIdentifier);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like the newline is not added? Overall we encourage newline after any logical blocks like if, for, while, try, etc.


private static final long UNIX_EPOCH_JULIAN = 2_440_588L;

public static long extractTimestampInt96(ByteBuffer buffer) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: add javadoc for public util method.

@yabola

yabola commented Apr 3, 2023

Copy link
Copy Markdown
Contributor Author

@jackye1995 Thank you for your review, I have updated my PR, please take a look~

@yabola

yabola commented Apr 8, 2023

Copy link
Copy Markdown
Contributor Author

It seems that the PR check is not triggered automatically and needs to be triggered manually...

@jackye1995

Copy link
Copy Markdown
Contributor

Running CI

@jackye1995 jackye1995 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly looks good to me! @nastra could you take another look?

@jackye1995 jackye1995 requested a review from nastra April 9, 2023 04:14

@JonasJ-ap JonasJ-ap left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM! Thank you for the contribution. To provide some additional feedback, I conducted some further testing on the reading of iceberg tables after migration from Delta Lake tables that contain an INT96 timestamp column. Based on my observations, the vectorized reading appears to be functioning perfectly.

BTW, I think this PR can close #4200.

.load(loadLocation(tableIdentifier))
.select("tmp_col")
.collectAsList();
Assert.assertEquals("Rows must match", expected, actual);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Assert.assertEquals("Rows must match", expected, actual);
Assertions.assertThat(actual).containsExactlyInAnyOrderElementsOf(expected);

How about using org.assertj.core.api.Assertions here? I think AsserJ can provide a more organized error message (with new line for each element of the list) when actual and expected differ. Also, based on the discussion here #7160 (comment), we may want to move from Junit4 to Junit5 + AsserJ in the future.

@yabola yabola Apr 9, 2023

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. Thank you very much for testing on Delta !

@jackye1995 jackye1995 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

@jackye1995

Copy link
Copy Markdown
Contributor

Thanks for the work! And thanks for the review @nastra @JonasJ-ap

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants