[SPARK-10177] [SQL] fix reading Timestamp in parquet from Hive by davies · Pull Request #8400 · apache/spark

davies · 2015-08-24T19:19:06Z

We misinterpreted the Julian day and nanoseconds-of-day fields that Hive/Impala write to Parquet for TimestampType values: the two fields overlap, so they cannot be added together directly.

To avoid confusing rounding during the conversion, we use 2440588 as the Julian day of the Unix epoch (the exact value is 2440587.5).

Conflicts: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala

SparkQA · 2015-08-24T21:42:23Z

Test build #41466 has finished for PR 8400 at commit e96f92f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-24T23:46:28Z

Test build #41474 has finished for PR 8400 at commit 9b1b9ce.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2015-08-25T02:07:11Z

Test build #1688 has finished for PR 8400 at commit 9b1b9ce.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2015-08-25T02:54:47Z

Nit: remove the trailing //?

yhuai · 2015-08-25T03:57:06Z

Can you explain what "they are overlapped" means?

liancheng · 2015-08-25T04:56:34Z

I feel like Impala and Hive did the Julian day conversion the wrong way at first, but left it as is and made the wrong conversion logic the de facto "standard"? I don't see an intuitive reason why we should store a Julian timestamp 12 hr later than the real Julian timestamp in the first place.

But making Spark SQL behave the same as Impala/Hive makes sense, even if they are doing the conversion wrong. Otherwise it would be pretty hard to deal with legacy data, since neither Spark SQL nor Hive writes version information into generated Parquet files.

davies · 2015-08-25T05:30:17Z

@yhuai The Julian day of the Unix epoch (1970-01-01 00:00:00) is 2440587.5. If we use an integer for Julian days, each day is cut off at noon (12pm). But the nanoseconds are counted from midnight (12am), so the two fields overlap from 12am to 12pm. Hive creates a calendar from the Julian day (which has its hour set to 12pm) and then resets the hours from the nanoseconds.

In this patch, we use a trick: shift the Julian days by 0.5 day and then round the 0.5 down to zero (Hive rounds 0.5 up to 1), so that the two parts can be added together.
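To make the arithmetic concrete, here is a minimal sketch of a conversion that treats 2440588 as the integer Julian day of the Unix epoch and counts the time-of-day part from midnight. The object and method names are hypothetical and are not claimed to match the code in this patch; the timestamp is assumed to be microseconds since the epoch.

```scala
object JulianSketch {
  // Assumption: integer Julian day used for 1970-01-01 (the exact value is 2440587.5).
  val JulianDayOfEpoch: Long = 2440588L
  val MicrosPerDay: Long = 24L * 60 * 60 * 1000 * 1000

  // (julianDay, nanosOfDay) -> microseconds since the Unix epoch.
  def fromJulianDay(day: Int, nanosOfDay: Long): Long =
    (day - JulianDayOfEpoch) * MicrosPerDay + nanosOfDay / 1000L

  // Microseconds since the Unix epoch -> (julianDay, nanosOfDay).
  // floorDiv/floorMod keep the time-of-day part non-negative for pre-epoch values.
  def toJulianDay(micros: Long): (Int, Long) = {
    val julianMicros = micros + JulianDayOfEpoch * MicrosPerDay
    val day = Math.floorDiv(julianMicros, MicrosPerDay)
    val microsOfDay = Math.floorMod(julianMicros, MicrosPerDay)
    (day.toInt, microsOfDay * 1000L)
  }
}
```

With these definitions, the epoch itself maps to (2440588, 0), and any microsecond value survives a round trip through `toJulianDay` and `fromJulianDay`.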

liancheng · 2015-08-25T05:46:56Z

LGTM now

SparkQA · 2015-08-25T07:57:59Z

Test build #41520 has finished for PR 8400 at commit 05437e1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

liancheng · 2015-08-25T07:58:18Z

Merging to master and branch-1.5.

We misunderstood the Julian days and nanoseconds of the day in parquet (as TimestampType) from Hive/Impala: they overlap, so they can't be added together directly.

In order to avoid confusing rounding during the conversion, we use `2440588` as the Julian day of the Unix epoch (which should be 2440587.5).

Author: Davies Liu <davies@databricks.com>
Author: Cheng Lian <lian@databricks.com>

Closes #8400 from davies/timestamp_parquet.

(cherry picked from commit 2f493f7)
Signed-off-by: Cheng Lian <lian@databricks.com>

liancheng · 2015-08-25T10:28:13Z

(EDIT: The conclusion made in this comment is WRONG.)

Just would like to point out that our original DateTimeUtils doesn't do Julian date conversion right either. Here is a spark-shell snippet executed against 053d94f:

import java.sql._
import java.util._

import org.apache.hadoop.hive.ql.io.parquet.timestamp._
import org.apache.spark.sql.catalyst.util._

TimeZone.setDefault(TimeZone.getTimeZone("GMT"))
val timestamp = Timestamp.valueOf("1970-01-01 00:00:00")

val hiveNanoTime = NanoTimeUtils.getNanoTime(timestamp, false)
val hiveJulianDay = hiveNanoTime.getJulianDay
val hiveTimeOfDayNanos = hiveNanoTime.getTimeOfDayNanos

println(
  s"""Hive converts "$timestamp" to Julian timestamp:
     |(julianDay=$hiveJulianDay, timeOfDayNanos=$hiveTimeOfDayNanos)
   """.stripMargin)

val (sparkJulianDay, sparkTimeOfDayNanos) =
  DateTimeUtils.toJulianDay(DateTimeUtils.fromJavaTimestamp(timestamp))
println(
  s"""Spark converts "$timestamp" to Julian timestamp:
     |(julianDay=$sparkJulianDay, timeOfDayNanos=$sparkTimeOfDayNanos)
   """.stripMargin)

The result is:

Hive converts "1970-01-01 00:00:00.0" to Julian timestamp:
(julianDay=2440588, timeOfDayNanos=0)

Spark converts "1970-01-01 00:00:00.0" to Julian timestamp:
(julianDay=2440588, timeOfDayNanos=43200000000000)

while the correct answer should be:

(julianDay=2440587, timeOfDayNanos=43200000000000)

So Hive is 12 hours later than the expected Julian timestamp, while we were 24 hours later :)

liancheng · 2015-08-25T10:31:19Z

Asked for more background about Hive's 12 hr offset on HIVE-6394, which is the original JIRA ticket for introducing timestamp support for ParquetHiveSerDe.

davies · 2015-08-25T17:12:27Z

@liancheng Thanks for the comments, but you may have made a mistake somewhere. I got this WITHOUT this PR:

scala> TimeZone.setDefault(TimeZone.getTimeZone("GMT"))
scala> val timestamp = Timestamp.valueOf("1970-01-01 00:00:00")
timestamp: java.sql.Timestamp = 1970-01-01 00:00:00.0

scala> val (sparkJulianDay, sparkTimeOfDayNanos) =
     |   DateTimeUtils.toJulianDay(DateTimeUtils.fromJavaTimestamp(timestamp))
sparkJulianDay: Int = 2440587
sparkTimeOfDayNanos: Long = 43200000000000

scala> println(
     |   s"""Spark converts "$timestamp" to Julian timestamp:
     |      |(julianDay=$sparkJulianDay, timeOfDayNanos=$sparkTimeOfDayNanos)
     |    """.stripMargin)
Spark converts "1970-01-01 00:00:00.0" to Julian timestamp:
(julianDay=2440587, timeOfDayNanos=43200000000000)

davies · 2015-08-25T17:16:27Z

The result you posted for Spark is totally wrong (you can't get the timestamp back after a round trip). Fortunately, this is a false alarm :)

The Hive one is bad but not wrong (it uses a different rounding method, and you can still read the timestamp back).

In general, using Julian days may be the root of the trouble here; a signed integer for the number of days since the epoch (which could be negative) should be enough.
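As an illustration of that last suggestion only, here is a sketch that splits a timestamp into a signed day count since the Unix epoch plus a non-negative time of day. The names are hypothetical, and this is not a format that Hive, Impala, or Parquet is claimed to use.

```scala
object EpochDaySketch {
  val MicrosPerDay: Long = 24L * 60 * 60 * 1000 * 1000

  // Microseconds since the epoch -> (signed days since epoch, micros within the day).
  def split(micros: Long): (Int, Long) =
    (Math.floorDiv(micros, MicrosPerDay).toInt, Math.floorMod(micros, MicrosPerDay))

  // Inverse of split.
  def join(days: Int, microsOfDay: Long): Long =
    days.toLong * MicrosPerDay + microsOfDay
}
```

Because both parts share midnight as their origin, there is no half-day offset to reconcile; pre-epoch timestamps simply get a negative day count.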

liancheng · 2015-08-25T17:30:15Z

@davies Yeah, you're right. I verified that I was actually running the posted snippet against a local revision that I had used to debug the Julian conversion logic. I forgot to rebuild after checking the code back out to the earlier master revision.

Anyway, behavior of the current master branch is consistent with Hive.

liancheng and others added 5 commits August 24, 2015 15:32

Refactors ParquetHiveCompatibilitySuite and adds more test cases

808ae3b

read timestamp in parquet generated from Hive

809e164

Merge branch 'parquet_tests'

3491f2f

enable regression test

e96f92f

Merge branch 'master' of github.com:apache/spark into timestamp_parquet

9b1b9ce

Conflicts: sql/hive/src/test/scala/org/apache/spark/sql/hive/ParquetHiveCompatibilitySuite.scala

liancheng reviewed Aug 25, 2015

address comment

05437e1

asfgit closed this in 2f493f7 Aug 25, 2015
