Skip to content

[SPARK-28596][SQL] Use Java 8 time API in the date_trunc() function#25329

Closed
MaxGekk wants to merge 2 commits into
apache:masterfrom
MaxGekk:date_trunc
Closed

[SPARK-28596][SQL] Use Java 8 time API in the date_trunc() function#25329
MaxGekk wants to merge 2 commits into
apache:masterfrom
MaxGekk:date_trunc

Conversation

@MaxGekk

@MaxGekk MaxGekk commented Aug 1, 2019

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

In the PR, I propose to re-implement DateTimeUtils.truncTimestamp and use Java 8 time API to adjust timestamps to YEAR, MONTH, QUARTER, WEEK, DAY, HOUR, MINUTE and SECOND. This simplifies implementation and will allow to easily support other level of truncation, see https://issues.apache.org/jira/browse/SPARK-28017.

Also I pass ZoneId instances to the function instead of TimeZone instances to avoid unnecessary conversions.

How was this patch tested?

By existing test suites. I am going to re-run the DateTime benchmark.

@SparkQA

SparkQA commented Aug 1, 2019

Copy link
Copy Markdown

Test build #108526 has finished for PR 25329 at commit a8008bc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks plausible to me.

@MaxGekk

MaxGekk commented Aug 4, 2019

Copy link
Copy Markdown
Member Author

Unfortunately, benchmarks show that this implementation is 2-3 times slower than current one. Most of the time, it spends inside of resolving zone ids to zone offsets. For example, the truncTimestamp method is bounded by the binary search by historical zone infos in DateTimeBenchmark:
Screen Shot 2019-08-04 at 10 43 30
I have changed the benchmark to avoid the search on zone history but truncTimestamp still spends most of the time in zone conversions. For recent year - 2019, it is bound by getting zone info from an internal cache:
Screen Shot 2019-08-04 at 10 52 16

I will close the PR since I don't know how to avoid the zone conversions. Maybe in newer versions of JDK, this will be faster, and we will come back to the changes again.

@dongjoon-hyun

Copy link
Copy Markdown
Member

Thank you for the benchmark, @MaxGekk . I also didn't expect the slowdown. When you close this PR, could you add a comment about this? (You can point the above comment.) That would be helpful when we revisit this with JDK11.

@MaxGekk MaxGekk changed the title [WIP][SPARK-28596][SQL] Use Java 8 time API in the date_trunc() function [SPARK-28596][SQL] Use Java 8 time API in the date_trunc() function Aug 5, 2019
@MaxGekk

MaxGekk commented Aug 5, 2019

Copy link
Copy Markdown
Member Author

I linked this to Build and Run Spark on JDK11

@MaxGekk

MaxGekk commented Aug 5, 2019

Copy link
Copy Markdown
Member Author

I am closing this PR because new implementation based on Java 8 API slows down date/timestamp truncations up to 3 times ( see details #25329 (comment)). SPARK-28596 has been linked to SPARK-24417 to try it on JDK11.

@MaxGekk MaxGekk closed this Aug 5, 2019
@dongjoon-hyun

Copy link
Copy Markdown
Member

Thank you for linking to JDK11. That will remind us definitely.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants