From 60ddd9c7f5f87c14fefd1febeed2adfe9f8ae6da Mon Sep 17 00:00:00 2001 From: Matt Butrovich Date: Mon, 31 Mar 2025 16:38:23 -0400 Subject: [PATCH 1/5] Add int96_from_spark.parquet --- data/int96_from_spark.parquet | Bin 0 -> 495 bytes 1 file changed, 0 insertions(+), 0 deletions(-) create mode 100644 data/int96_from_spark.parquet diff --git a/data/int96_from_spark.parquet b/data/int96_from_spark.parquet new file mode 100644 index 0000000000000000000000000000000000000000..5b7fbefc0e0df9aed1d88e5dd57e1e5bb57fa415 GIT binary patch literal 495 zcmZXR!D|yy5XNUWF~quha+Y0LBs^$nWPR?kNt=dVL@7ldv^g!d*pXNe(s-~p1yte z{x<^Jcm+Rw`}O^jhZfDQU=`rPf(ZoQD3x|s!S(Dz1J`U)&~*JZs~C#>*^?r9sxpBl zU}fHP3<0*~dahSLG(4=ht#Y;El8UBn*(U})GU`=t(FE70nDVp{&I;H+4YXlOS@kNF z%Rr3d4DwAbhJs@FSt8$7dC?p6opTXy@(7&9e?+mBa$0U{q%_Y5J3LcbNTG%EaLn|J zp<*3Y=zNrD=Ch5SN@c-1kH)MovZ*CVol&}%sX~ivNK${4rXo&>pjPFxe?wVkJefS6 zfOBUSc|FT(vmI+84 Date: Mon, 31 Mar 2025 17:01:50 -0400 Subject: [PATCH 2/5] Add a description for int96_from_spark. --- data/int96_from_spark.md | 70 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) create mode 100644 data/int96_from_spark.md diff --git a/data/int96_from_spark.md b/data/int96_from_spark.md new file mode 100644 index 0000000..5922eaa --- /dev/null +++ b/data/int96_from_spark.md @@ -0,0 +1,70 @@ + + +`int96_from_spark.parquet` is generated by Apache Spark 3.4.3 with parquet-mr version 1.13.1. + +It has a single column of int96 type with 6 values. int96 typically represents a timestamp with +int32 representing the number of days since the epoch, and an int64 representing +nanoseconds. Due to its nanosecond resolution, many systems handle int96 timestamps by +converting the int32 days to nanoseconds and adding the two values to form a single +64-bit nanosecond timestamp. However, Spark's default timestamp resolution is microseconds, which +results in being able to read and write timestamps with a larger range of dates. 
+ +Note that this type is now deprecated in the Parquet spec. It exists only for systems that wish +to maintain compatibility with Apache Spark and other systems that still write this type. + +This file contains timestamps that are not all representable with 64-bit nanosecond timestamps. +It originates from [a test for DataFusion Comet](https://github.com/apache/datafusion-comet/blob/fa5910efd927e115d1717b5f0c78fad0ece75c6c/spark/src/test/scala/org/apache/comet/CometCastSuite.scala#L902), +and can be reproduced in a Spark shell with the code below: + +```scala +val values = Seq(Some("2024-01-01T12:34:56.123456"), Some("2024-01-01T01:00:00Z"), Some("9999-12-31T01:00:00-02:00"), Some("2024-12-31T01:00:00+02:00"), None, Some("290000-12-31T01:00:00+02:00")) +import org.apache.spark.sql.types.DataTypes +val df = values.toDF("str").select(col("str").cast(DataTypes.TimestampType).as("a")).coalesce(1) +df.write.parquet("int96_spark.parquet") +``` + +# File Metadata (from parquet-cli meta command) +``` +File path: int96_from_spark.parquet +Created by: parquet-mr version 1.13.1 (build db4183109d5b734ec5930d870cdae161e408ddba) +Properties: + org.apache.spark.version: 3.4.3 + org.apache.spark.sql.parquet.row.metadata: {"type":"struct","fields":[{"name":"a","type":"timestamp","nullable":true,"metadata":{}}]} +Schema: +message spark_schema { + optional int96 a; +} + + +Row group 0: count: 6 18.83 B records start: 4 total(compressed): 113 B total(uncompressed):113 B +-------------------------------------------------------------------------------- + type encodings count avg size nulls min / max +a INT96 S _ R 6 18.83 B 1 +``` + +# Column Index (from parquet-cli column-index command) +``` +row-group 0: +column index for column a: +NONE +offset index for column a: + offset compressed size first row index +page-0 81 36 0 +``` From 5cf995b79f185242c79437111cd491208255c1b8 Mon Sep 17 00:00:00 2001 From: Matt Butrovich Date: Mon, 31 Mar 2025 17:11:44 -0400 Subject: [PATCH 3/5] Fix type. 
--- data/int96_from_spark.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/data/int96_from_spark.md b/data/int96_from_spark.md index 5922eaa..216ef9d 100644 --- a/data/int96_from_spark.md +++ b/data/int96_from_spark.md @@ -37,7 +37,7 @@ and can be reproduced in a Spark shell with the code below: val values = Seq(Some("2024-01-01T12:34:56.123456"), Some("2024-01-01T01:00:00Z"), Some("9999-12-31T01:00:00-02:00"), Some("2024-12-31T01:00:00+02:00"), None, Some("290000-12-31T01:00:00+02:00")) import org.apache.spark.sql.types.DataTypes val df = values.toDF("str").select(col("str").cast(DataTypes.TimestampType).as("a")).coalesce(1) -df.write.parquet("int96_spark.parquet") +df.write.parquet("int96_from_spark.parquet") ``` # File Metadata (from parquet-cli meta command) From 05c2e10638241dd2443d7a7a35388f8726070365 Mon Sep 17 00:00:00 2001 From: Matt Butrovich Date: Mon, 31 Mar 2025 17:14:30 -0400 Subject: [PATCH 4/5] Add expected values. --- data/int96_from_spark.md | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/data/int96_from_spark.md b/data/int96_from_spark.md index 216ef9d..4293030 100644 --- a/data/int96_from_spark.md +++ b/data/int96_from_spark.md @@ -40,6 +40,11 @@ val df = values.toDF("str").select(col("str").cast(DataTypes.TimestampType).as(" df.write.parquet("int96_from_spark.parquet") ``` +As microseconds since the epoch, they correspond to: +``` +1704141296123456, 1704070800000000, 253402225200000000, 1735599600000000, null, 9089380393200000000 +``` + # File Metadata (from parquet-cli meta command) ``` File path: int96_from_spark.parquet From ffe455a16a8c8c0fa426c6b4b15e7621859a23cf Mon Sep 17 00:00:00 2001 From: Matt Butrovich Date: Tue, 1 Apr 2025 14:37:43 -0400 Subject: [PATCH 5/5] Update data/README.md --- data/README.md | 1 + 1 file changed, 1 insertion(+) diff --git a/data/README.md b/data/README.md index cc7909b..d9ab77c 100644 --- a/data/README.md +++ b/data/README.md @@ -58,6 +58,7 @@ | map_no_value.parquet | MAP with null 
values, MAP with INT32 keys and no values, and LIST column with same values as the MAP keys. See [map_no_value.md](map_no_value.md) | | page_v2_empty_compressed.parquet | An INT32 column with DataPageV2, all values are null, the zero-sized data is compressed using ZSTD | | unknown-logical-type.parquet | A file containing a column annotated with a LogicalType whose identifier has been set to an abitrary high value to check the behaviour of an old reader reading a file written by a new writer containing an unsupported type (see [related issue](https://github.com/apache/arrow/issues/41764)). | +| int96_from_spark.parquet | Single column of (deprecated) int96 values that originated as Apache Spark microsecond-resolution timestamps. Some values are outside the range typically representable by 64-bit nanosecond-resolution timestamps. See [int96_from_spark.md](int96_from_spark.md) for details. | TODO: Document what each file is in the table above.
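
Note on the int96 layout described in `int96_from_spark.md`: the sketch below is illustrative only and is not part of the patch. It assumes the conventional INT96 timestamp layout, 12 bytes holding a little-endian int64 nanoseconds-of-day followed by a little-endian int32 Julian day number, where Julian day 2440588 corresponds to 1970-01-01. The helper names (`encode_int96`, `decode_int96`, `to_micros`, `fits_in_int64_nanos`) are hypothetical. The sketch does not read the Parquet file; it rebuilds the 12-byte values from the expected microsecond values listed in the description and checks which of them would overflow a 64-bit nanosecond timestamp.

```python
import struct

# Assumption: the INT96 day field is a Julian day number, and
# Julian day 2440588 corresponds to the Unix epoch (1970-01-01).
JULIAN_DAY_OF_UNIX_EPOCH = 2_440_588
MICROS_PER_DAY = 86_400_000_000
NANOS_PER_DAY = 86_400_000_000_000
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def encode_int96(micros):
    """Hypothetical helper: pack microseconds since the epoch into 12 INT96 bytes."""
    days, micros_of_day = divmod(micros, MICROS_PER_DAY)
    # "<qi" = little-endian int64 (nanoseconds of day) + int32 (Julian day), no padding.
    return struct.pack("<qi", micros_of_day * 1000, days + JULIAN_DAY_OF_UNIX_EPOCH)


def decode_int96(raw):
    """Split 12 INT96 bytes into (days since the Unix epoch, nanoseconds of day)."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return julian_day - JULIAN_DAY_OF_UNIX_EPOCH, nanos_of_day


def to_micros(raw):
    """Microsecond-resolution conversion, truncating sub-microsecond digits."""
    days, nanos_of_day = decode_int96(raw)
    return days * MICROS_PER_DAY + nanos_of_day // 1000


def fits_in_int64_nanos(raw):
    """True if the value is representable as a signed 64-bit nanosecond timestamp."""
    days, nanos_of_day = decode_int96(raw)
    total_nanos = days * NANOS_PER_DAY + nanos_of_day  # Python ints do not overflow
    return INT64_MIN <= total_nanos <= INT64_MAX


if __name__ == "__main__":
    # Expected values from int96_from_spark.md (the null entry is omitted).
    expected_micros = [
        1704141296123456,
        1704070800000000,
        253402225200000000,
        1735599600000000,
        9089380393200000000,
    ]
    for micros in expected_micros:
        raw = encode_int96(micros)
        print(micros, to_micros(raw) == micros, fits_in_int64_nanos(raw))
```

Under these assumptions, the year-9999 and year-290000 values round-trip at microsecond resolution but fall outside the signed 64-bit nanosecond range, which is the case this test file is meant to exercise.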