From c86090d84c4e895c2ff1a659e6be99e58b8b5119 Mon Sep 17 00:00:00 2001
From: Zoltan Ivanfi
Date: Fri, 16 Jun 2017 17:59:39 +0200
Subject: [PATCH 1/4] PARQUET-686: Clarifications about min-max stats.

Changed some descriptions to reflect code changes that happened during code
review without updating the corresponding comments and documentation:

* Removed references to the SIGNED and UNSIGNED sort orders, which were
  removed in favour of a single TYPE_ORDER.
* Removed obsolete references to column_orders's effect on the min and max
  values, since those were declared obsolete instead and column_orders only
  affects the new min_value and max_value fields.
* Clarified ColumnOrder's purpose, since the purpose of a union containing a
  single empty struct was hard to grasp.
---
 LogicalTypes.md                | 20 ++++++++++----------
 src/main/thrift/parquet.thrift | 14 +++++++-------
 2 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/LogicalTypes.md b/LogicalTypes.md
index 29cf5272c..86361e359 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -37,7 +37,7 @@ may require additional metadata fields, as well as rules for those fields.
 `UTF8` may only be used to annotate the binary primitive type and indicates
 that the byte array should be interpreted as a UTF-8 encoded character string.
 
-The sort order used for `UTF8` strings is `UNSIGNED` byte-wise comparison.
+The sort order used for `UTF8` strings is unsigned byte-wise comparison.
 
 ## Numeric Types
 
@@ -57,7 +57,7 @@ allows.
 implied by the `int32` and `int64` primitive types if no other annotation is
 present and should be considered optional.
 
-The sort order used for signed integer types is `SIGNED`.
+The sort order used for signed integer types is signed.
 
 ### Unsigned Integers
 
@@ -74,7 +74,7 @@ allows.
 `UINT_8`, `UINT_16`, and `UINT_32` must annotate an `int32` primitive type and
 `UINT_64` must annotate an `int64` primitive type.
 
-The sort order used for unsigned integer types is `UNSIGNED`.
+The sort order used for unsigned integer types is unsigned.
 
 ### DECIMAL
 
@@ -104,7 +104,7 @@ integer. A precision too large for the underlying type (see below) is an error.
 A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
 `scale` and `precision` fields set, even if scale is 0 by default.
 
-The sort order used for `DECIMAL` values is `SIGNED`. The order is equivalent
+The sort order used for `DECIMAL` values is signed. The order is equivalent
 to signed comparison of decimal values.
 
 If the column uses `int32` or `int64` physical types, then signed comparison of
@@ -121,7 +121,7 @@ comparison.
 annotate an `int32` that stores the number of days from the Unix epoch, 1
 January 1970.
 
-The sort order used for `DATE` is `SIGNED`.
+The sort order used for `DATE` is signed.
 
 ### TIME\_MILLIS
 
@@ -129,7 +129,7 @@ The sort order used for `DATE` is `SIGNED`.
 without a date. It must annotate an `int32` that stores the number of
 milliseconds after midnight.
 
-The sort order used for `TIME\_MILLIS` is `SIGNED`.
+The sort order used for `TIME\_MILLIS` is signed.
 
 ### TIME\_MICROS
 
@@ -137,7 +137,7 @@ The sort order used for `TIME\_MILLIS` is `SIGNED`.
 without a date. It must annotate an `int64` that stores the number of
 microseconds after midnight.
 
-The sort order used for `TIME\_MICROS` is `SIGNED`.
+The sort order used for `TIME\_MICROS` is signed.
 
 ### TIMESTAMP\_MILLIS
 
@@ -145,7 +145,7 @@ The sort order used for `TIME\_MICROS` is `SIGNED`.
 millisecond precision. It must annotate an `int64` that stores the number of
 milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.
 
-The sort order used for `TIMESTAMP\_MILLIS` is `SIGNED`.
+The sort order used for `TIMESTAMP\_MILLIS` is signed.
 
 ### TIMESTAMP\_MICROS
 
@@ -153,7 +153,7 @@ The sort order used for `TIMESTAMP\_MILLIS` is `SIGNED`.
 microsecond precision. It must annotate an `int64` that stores the number of
 microseconds from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC.
 
-The sort order used for `TIMESTAMP\_MICROS` is `SIGNED`.
+The sort order used for `TIMESTAMP\_MICROS` is signed.
 
 ### INTERVAL
 
@@ -169,7 +169,7 @@ example, there is no requirement that a large number of days should be
 expressed as a mix of months and days because there is not a constant
 conversion from days to months.
 
-The sort order used for `INTERVAL` is `UNSIGNED`, produced by sorting by
+The sort order used for `INTERVAL` is unsigned, produced by sorting by
 the value of months, then days, then milliseconds with unsigned comparison.
 
 ## Embedded Types
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 47812abc1..d88afb39e 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -219,12 +219,12 @@ struct Statistics {
     * Values are encoded using PLAIN encoding, except that variable-length byte
     * arrays do not include a length prefix.
     *
-    * These fields encode min and max values determined by SIGNED comparison
+    * These fields encode min and max values determined by signed comparison
     * only. New files should use the correct order for a column's logical type
     * and store the values in the min_value and max_value fields.
     *
     * To support older readers, these may be set when the column order is
-    * SIGNED.
+    * signed.
     */
    1: optional binary max;
    2: optional binary min;
@@ -583,6 +583,8 @@ struct TypeDefinedOrder {}
 
 /**
  * Union to specify the order used for min, max, and sorting values in a column.
+ * This union takes the role of an enhanced enum that allows rich elements
+ * (which will be needed for a collation-based ordering in the future).
  *
  * Possible values are:
  * * TypeDefinedOrder - the column uses the order defined by its logical or
@@ -626,11 +628,9 @@ struct FileMetaData {
   6: optional string created_by
 
   /**
-   * Sort order used for each column in this file.
-   *
-   * If this list is not present, then the order for each column is assumed to
-   * be Signed. In addition, min and max values for INTERVAL or DECIMAL stored
-   * as fixed or bytes should be ignored.
+   * Sort order used for each column in this file. Each sort order corresponds
+   * to one column, determined by its position in the list, matching the
+   * position of the column in the schema.
    */
   7: optional list<ColumnOrder> column_orders;
 }

From f8fab0b13874f06f66b70bdb9c99a391848ebd87 Mon Sep 17 00:00:00 2001
From: Zoltan Ivanfi
Date: Tue, 20 Jun 2017 13:40:23 +0200
Subject: [PATCH 2/4] PARQUET-686: Minor improvements in Thrift comments.

---
 src/main/thrift/parquet.thrift | 45 +++++++++++++++++++++++++---------
 1 file changed, 34 insertions(+), 11 deletions(-)

diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index d88afb39e..6c24488f4 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -28,17 +28,6 @@ namespace java org.apache.parquet.format
  * with the encodings to control the on disk storage format.
  * For example INT16 is not included as a type since a good encoding of INT32
  * would handle this.
- *
- * When a logical type is not present, the type-defined sort order of these
- * physical types are:
- * * BOOLEAN - false, true
- * * INT32 - signed comparison
- * * INT64 - signed comparison
- * * INT96 - signed comparison
- * * FLOAT - signed comparison
- * * DOUBLE - signed comparison
- * * BYTE_ARRAY - unsigned byte-wise comparison
- * * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
  */
 enum Type {
   BOOLEAN = 0;
@@ -594,6 +583,40 @@ struct TypeDefinedOrder {}
  * for this column should be ignored.
  */
 union ColumnOrder {
+
+  /**
+   * The sort orders for logical types are:
+   *   UTF8 - unsigned byte-wise comparison
+   *   INT8 - signed comparison
+   *   INT16 - signed comparison
+   *   INT32 - signed comparison
+   *   INT64 - signed comparison
+   *   UINT8 - unsigned comparison
+   *   UINT16 - unsigned comparison
+   *   UINT32 - unsigned comparison
+   *   UINT64 - unsigned comparison
+   *   DECIMAL - signed comparison
+   *   DATE - signed comparison
+   *   TIME_MILLIS - signed comparison
+   *   TIME_MICROS - signed comparison
+   *   TIMESTAMP_MILLIS - signed comparison
+   *   TIMESTAMP_MICROS - signed comparison
+   *   INTERVAL - unsigned comparison
+   *   JSON - undefined
+   *   BSON - undefined
+   *   LIST - undefined
+   *   MAP - undefined
+   *
+   * In the absence of logical types, the sort order is determined by the physical type:
+   *   BOOLEAN - false, true
+   *   INT32 - signed comparison
+   *   INT64 - signed comparison
+   *   INT96 - signed comparison
+   *   FLOAT - signed comparison
+   *   DOUBLE - signed comparison
+   *   BYTE_ARRAY - unsigned byte-wise comparison
+   *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
+   */
   1: TypeDefinedOrder TYPE_ORDER;
 }
 

From 0c973f7dfd93d06d571afab05da2a0838b20c45d Mon Sep 17 00:00:00 2001
From: Zoltan Ivanfi
Date: Tue, 20 Jun 2017 15:08:09 +0200
Subject: [PATCH 3/4] PARQUET-686: Further clarifications.

---
 src/main/thrift/parquet.thrift | 19 +++++++++++++------
 1 file changed, 13 insertions(+), 6 deletions(-)

diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 6c24488f4..913e7ffef 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -571,9 +571,9 @@ struct RowGroup {
 struct TypeDefinedOrder {}
 
 /**
- * Union to specify the order used for min, max, and sorting values in a column.
- * This union takes the role of an enhanced enum that allows rich elements
- * (which will be needed for a collation-based ordering in the future).
+ * Union to specify the order used for the min_value and max_value fields for a
+ * column. This union takes the role of an enhanced enum that allows rich
+ * elements (which will be needed for a collation-based ordering in the future).
  *
  * Possible values are:
  * * TypeDefinedOrder - the column uses the order defined by its logical or
@@ -651,9 +651,16 @@ struct FileMetaData {
   6: optional string created_by
 
   /**
-   * Sort order used for each column in this file. Each sort order corresponds
-   * to one column, determined by its position in the list, matching the
-   * position of the column in the schema.
+   * Sort order used for the min_value and max_value fields of each column in
+   * this file. Each sort order corresponds to one column, determined by its
+   * position in the list, matching the position of the column in the schema.
+   *
+   * Without column_orders, the meaning of the min_value and max_value fields is
+   * undefined. To ensure well-defined behaviour, if min_value and max_value are
+   * written to a Parquet file, column_orders must be written as well.
+   *
+   * The obsolete min and max fields are always sorted by signed comparison
+   * regardless of column_orders.
    */
   7: optional list<ColumnOrder> column_orders;
 }

From a499d861c250d647fa9f0545eb5a301841f22a79 Mon Sep 17 00:00:00 2001
From: Zoltan Ivanfi
Date: Thu, 6 Jul 2017 17:20:17 +0200
Subject: [PATCH 4/4] Comparison rules updates.

Modified INT96 comparison to be unsigned in agreement with the actual code and
the theoretically correct ordering. Distinguish more explicitly between
byte-wise comparison and comparison of represented values when the two are not
equivalent.
---
 LogicalTypes.md                | 8 ++++++--
 src/main/thrift/parquet.thrift | 13 +++++++------
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/LogicalTypes.md b/LogicalTypes.md
index 86361e359..6e5c9db8f 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -104,8 +104,8 @@ integer. A precision too large for the underlying type (see below) is an error.
 A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
 `scale` and `precision` fields set, even if scale is 0 by default.
 
-The sort order used for `DECIMAL` values is signed. The order is equivalent
-to signed comparison of decimal values.
+The sort order used for `DECIMAL` values is signed comparison of the represented
+value.
 
 If the column uses `int32` or `int64` physical types, then signed comparison of
 the integer values produces the correct ordering. If the physical type is
@@ -184,6 +184,8 @@ string of valid JSON as defined by the [JSON specification][json-spec]
 
 [json-spec]: http://json.org/
 
+The sort order used for `JSON` is unsigned byte-wise comparison.
+
 ### BSON
 
 `BSON` is used for an embedded BSON document. It must annotate a `binary`
@@ -192,6 +194,8 @@ defined by the [BSON specification][bson-spec].
 
 [bson-spec]: http://bsonspec.org/spec.html
 
+The sort order used for `BSON` is unsigned byte-wise comparison.
+
 ## Nested Types
 
 This section specifies how `LIST` and `MAP` can be used to encode nested types
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 913e7ffef..3c51639d3 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -595,15 +595,16 @@ union ColumnOrder {
    *   UINT16 - unsigned comparison
    *   UINT32 - unsigned comparison
    *   UINT64 - unsigned comparison
-   *   DECIMAL - signed comparison
+   *   DECIMAL - signed comparison of the represented value
    *   DATE - signed comparison
    *   TIME_MILLIS - signed comparison
    *   TIME_MICROS - signed comparison
    *   TIMESTAMP_MILLIS - signed comparison
    *   TIMESTAMP_MICROS - signed comparison
    *   INTERVAL - unsigned comparison
-   *   JSON - undefined
-   *   BSON - undefined
+   *   JSON - unsigned byte-wise comparison
+   *   BSON - unsigned byte-wise comparison
+   *   ENUM - unsigned byte-wise comparison
    *   LIST - undefined
    *   MAP - undefined
    *
@@ -611,9 +612,9 @@ union ColumnOrder {
    *   BOOLEAN - false, true
    *   INT32 - signed comparison
    *   INT64 - signed comparison
-   *   INT96 - signed comparison
-   *   FLOAT - signed comparison
-   *   DOUBLE - signed comparison
+   *   INT96 (only used for legacy timestamps) - unsigned comparison
+   *   FLOAT - signed comparison of the represented value
+   *   DOUBLE - signed comparison of the represented value
    *   BYTE_ARRAY - unsigned byte-wise comparison
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    */