Skip to content

[SPARK-10177] [SQL] fix reading Timestamp in parquet from Hive#8400

Closed
davies wants to merge 6 commits into
apache:masterfrom
davies:timestamp_parquet
Closed

[SPARK-10177] [SQL] fix reading Timestamp in parquet from Hive#8400
davies wants to merge 6 commits into
apache:masterfrom
davies:timestamp_parquet

Conversation

@davies

@davies davies commented Aug 24, 2015

Copy link
Copy Markdown
Contributor

We misunderstood the Julian days and nanoseconds of the day in parquet (as TimestampType) from Hive/Impala, they are overlapped, so can't be added together directly.

In order to avoid the confusing rounding when do the converting, we use 2440588 as the Julian Day of epoch of unix timestamp (which should be 2440587.5).

@SparkQA

SparkQA commented Aug 24, 2015

Copy link
Copy Markdown

Test build #41466 has finished for PR 8400 at commit e96f92f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 24, 2015

Copy link
Copy Markdown

Test build #41474 has finished for PR 8400 at commit 9b1b9ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 25, 2015

Copy link
Copy Markdown

Test build #1688 has finished for PR 8400 at commit 9b1b9ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: removing the trailing //?

@yhuai

yhuai commented Aug 25, 2015

Copy link
Copy Markdown
Contributor

Can you explain what "they are overlapped" means?

@liancheng

Copy link
Copy Markdown
Contributor

I feel like Impala and Hive did Julian day conversion in the wrong way at first, but left it as is and made the wrong conversion logic the de facto "standard" way? I don't see an intuitive reason why we should store a Julian timestamp 12 hr later than the real Julian timestamp at the first place.

But making Spark SQL behave the same as Impala/Hive makes sense, even if they are doing the conversion wrong. Otherwise it would be pretty hard to deal with legacy data, since neither Spark SQL nor Hive writes version information into generated Parquet files.

@davies

davies commented Aug 25, 2015

Copy link
Copy Markdown
Contributor Author

@yhuai The Julian day of epoch of unix timestamp (1970-01-01 0:0:0) is 2440587.5, if we use integer for julian days, then it will be cut off at noon (12pm) of a day. But the nanoseconds is counted from 12am, so they are overlapped from 12am to 12pm. In Hive, it create a calendar from Julian day (which will have hours as 12pm), it then re-set the hours from nanoseconds.

In this patch, we use a trick to shift Julian days by 0.5 day then rounding 0.5 down to zero (In Hive, it rounds 0.5 up to 1), then we can add these two parts together.

@liancheng

Copy link
Copy Markdown
Contributor

LGTM now

@SparkQA

SparkQA commented Aug 25, 2015

Copy link
Copy Markdown

Test build #41520 has finished for PR 8400 at commit 05437e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng

Copy link
Copy Markdown
Contributor

Merging to master and branch-1.5.

asfgit pushed a commit that referenced this pull request Aug 25, 2015
We misunderstood the Julian days and nanoseconds of the day in parquet (as TimestampType) from Hive/Impala, they are overlapped, so can't be added together directly.

In order to avoid the confusing rounding when do the converting, we use `2440588` as the Julian Day of epoch of unix timestamp (which should be 2440587.5).

Author: Davies Liu <davies@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #8400 from davies/timestamp_parquet.

(cherry picked from commit 2f493f7)
Signed-off-by: Cheng Lian <lian@databricks.com>
@asfgit asfgit closed this in 2f493f7 Aug 25, 2015
@liancheng

Copy link
Copy Markdown
Contributor

(EDIT: The conclusion made in this comment is WRONG.)

Just would like to point out that our original DateTimeUtils doesn't do Julian date conversion right either. Here is a spark-shell snippet executed against 053d94f:

import java.sql._
import java.util._

import org.apache.hadoop.hive.ql.io.parquet.timestamp._
import org.apache.spark.sql.catalyst.util._

TimeZone.setDefault(TimeZone.getTimeZone("GMT"))
val timestamp = Timestamp.valueOf("1970-01-01 00:00:00")

val hiveNanoTime = NanoTimeUtils.getNanoTime(timestamp, false)
val hiveJulianDay = hiveNanoTime.getJulianDay
val hiveTimeOfDayNanos = hiveNanoTime.getTimeOfDayNanos

println(
  s"""Hive converts "$timestamp" to Julian timestamp:
     |(julianDay=$hiveJulianDay, timeOfDayNanos=$hiveTimeOfDayNanos)
   """.stripMargin)

val (sparkJulianDay, sparkTimeOfDayNanos) =
  DateTimeUtils.toJulianDay(DateTimeUtils.fromJavaTimestamp(timestamp))
println(
  s"""Spark converts "$timestamp" to Julian timestamp:
     |(julianDay=$sparkJulianDay, timeOfDayNanos=$sparkTimeOfDayNanos)
   """.stripMargin)

The result is:

Hive converts "1970-01-01 00:00:00.0" to Julian timestamp:
(julianDay=2440588, timeOfDayNanos=0)

Spark converts "1970-01-01 00:00:00.0" to Julian timestamp:
(julianDay=2440588, timeOfDayNanos=43200000000000)

while the correct answer should be:

(julianDay=2440587, timeOfDayNanos=43200000000000)

So Hive is 12 hours later than the expected Julian timestamp, while we were 24 hours later :)

@liancheng

Copy link
Copy Markdown
Contributor

Asked for more background about Hive's 12 hr offset on HIVE-6394, which is the original JIRA ticket for introducing timestamp support for ParquetHiveSerDe.

@davies

davies commented Aug 25, 2015

Copy link
Copy Markdown
Contributor Author

@liancheng Thanks for the comments, but you may made a mistake somewhere, I got this WITHOUT this PR:

scala> TimeZone.setDefault(TimeZone.getTimeZone("GMT"))
scala> val timestamp = Timestamp.valueOf("1970-01-01 00:00:00")
timestamp: java.sql.Timestamp = 1970-01-01 00:00:00.0

scala> val (sparkJulianDay, sparkTimeOfDayNanos) =
     |   DateTimeUtils.toJulianDay(DateTimeUtils.fromJavaTimestamp(timestamp))
sparkJulianDay: Int = 2440587
sparkTimeOfDayNanos: Long = 43200000000000

scala> println(
     |   s"""Spark converts "$timestamp" to Julian timestamp:
     |      |(julianDay=$sparkJulianDay, timeOfDayNanos=$sparkTimeOfDayNanos)
     |    """.stripMargin)
Spark converts "1970-01-01 00:00:00.0" to Julian timestamp:
(julianDay=2440587, timeOfDayNanos=43200000000000)

@davies

davies commented Aug 25, 2015

Copy link
Copy Markdown
Contributor Author

The result you posted for Spark is totally wrong (you can't get the timestamp back after one round trip), fortunately this is a fake alarm :)

For the Hive one, it bad but not wrong (different rounding method, you can still read the timestamp back).

In general, maybe using Julian day here is the root of bad things, use signed integer for number of days since epoch should be enough (could be negative).

@liancheng

Copy link
Copy Markdown
Contributor

@davies Yeah, you're right. Verified that I was actually running the posted snippet against a local revision which I used to debug the Julian conversion logic. Forgot to rebuild after checkout out the code back to the earlier master revision.

Anyway, behavior of the current master branch is consistent with Hive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants