From 462fe277b0f897141fad89a23d6739da6c5945f3 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Tue, 5 Sep 2017 11:22:44 -0700
Subject: [PATCH 1/4] [MINOR][DOC] Add ORC in `Partition Discovery` section.

---
 docs/sql-programming-guide.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index ee231a934a3af..98a5fa70444b8 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -733,7 +733,7 @@ SELECT * FROM parquetTable
 
 Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
 table, data are usually stored in different directories, with partitioning column values encoded in
-the path of each partition directory. The Parquet data source is now able to discover and infer
+the path of each partition directory. The Parquet/ORC data sources are able to discover and infer
 partitioning information automatically. For example, we can store all our previously used
 population data into a partitioned table using the following directory structure, with two extra
 columns, `gender` and `country` as partitioning columns:
@@ -762,8 +762,8 @@ path
 
 {% endhighlight %}
 
-By passing `path/to/table` to either `SparkSession.read.parquet` or `SparkSession.read.load`, Spark SQL
-will automatically extract the partitioning information from the paths.
+By passing `path/to/table` to either `SparkSession.read.parquet`, `SparkSession.read.orc`, or `SparkSession.read.load`,
+Spark SQL will automatically extract the partitioning information from the paths.
 Now the schema of the returned DataFrame becomes:
 
 {% highlight text %}
@@ -784,7 +784,7 @@ can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, w
 
 Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
 by default. For the above example, if users pass `path/to/table/gender=male` to either
-`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not be considered as a
+`SparkSession.read.parquet`, `SparkSession.read.orc`, or `SparkSession.read.load`, `gender` will not be considered as a
 partitioning column. If users need to specify the base path that partition discovery
 should start with, they can set `basePath` in the data source options. For example,
 when `path/to/table/gender=male` is the path of the data and

From fd00fbd108c4cc4c8effbadea99f8228bfe1a460 Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Tue, 5 Sep 2017 12:42:49 -0700
Subject: [PATCH 2/4] All built-in data source supports it.

---
 docs/sql-programming-guide.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 98a5fa70444b8..dfac53c2b37b4 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -733,7 +733,7 @@ SELECT * FROM parquetTable
 
 Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
 table, data are usually stored in different directories, with partitioning column values encoded in
-the path of each partition directory. The Parquet/ORC data sources are able to discover and infer
+the path of each partition directory. All built-in data sources are able to discover and infer
 partitioning information automatically. For example, we can store all our previously used
 population data into a partitioned table using the following directory structure, with two extra
 columns, `gender` and `country` as partitioning columns:
@@ -762,8 +762,8 @@ path
 
 {% endhighlight %}
 
-By passing `path/to/table` to either `SparkSession.read.parquet`, `SparkSession.read.orc`, or `SparkSession.read.load`,
-Spark SQL will automatically extract the partitioning information from the paths.
+By passing `path/to/table` to either `SparkSession.read.parquet` or `SparkSession.read.load`, Spark SQL
+will automatically extract the partitioning information from the paths.
 Now the schema of the returned DataFrame becomes:
 
 {% highlight text %}
@@ -784,7 +784,7 @@ can be configured by `spark.sql.sources.partitionColumnTypeInference.enabled`, w
 
 Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
 by default. For the above example, if users pass `path/to/table/gender=male` to either
-`SparkSession.read.parquet`, `SparkSession.read.orc`, or `SparkSession.read.load`, `gender` will not be considered as a
+`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not be considered as a
 partitioning column. If users need to specify the base path that partition discovery
 should start with, they can set `basePath` in the data source options. For example,
 when `path/to/table/gender=male` is the path of the data and

From 128c7790a79392a048a6808ccc7412d3fd4d1a5d Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Tue, 5 Sep 2017 13:31:54 -0700
Subject: [PATCH 3/4] Address comments.

---
 docs/sql-programming-guide.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index dfac53c2b37b4..39e088c7212a2 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -733,8 +733,9 @@ SELECT * FROM parquetTable
 
 Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
 table, data are usually stored in different directories, with partitioning column values encoded in
-the path of each partition directory. All built-in data sources are able to discover and infer
-partitioning information automatically. For example, we can store all our previously used
+the path of each partition directory. All built-in data sources (including TEXT/CSV/JSON/ORC/Parquet)
+are able to discover and infer partitioning information automatically.
+For example, we can store all our previously used
 population data into a partitioned table using the following directory structure, with two extra
 columns, `gender` and `country` as partitioning columns:
 

From 018fdb381f7f9bbaed099086dd954f8ee1be2ecb Mon Sep 17 00:00:00 2001
From: Dongjoon Hyun
Date: Tue, 5 Sep 2017 14:09:37 -0700
Subject: [PATCH 4/4] Address comments.

---
 docs/sql-programming-guide.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 39e088c7212a2..032073bfc40dd 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -733,7 +733,7 @@ SELECT * FROM parquetTable
 
 Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
 table, data are usually stored in different directories, with partitioning column values encoded in
-the path of each partition directory. All built-in data sources (including TEXT/CSV/JSON/ORC/Parquet)
+the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet)
 are able to discover and infer partitioning information automatically.
 For example, we can store all our previously used
 population data into a partitioned table using the following directory structure, with two extra
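
Reviewer note, not part of the patches: the documentation text touched here describes partitioning column values as encoded in directory names, and says that discovery only looks under the path it is given unless `basePath` is set. The following is a minimal Python sketch of that idea, not Spark's implementation. The helper `parse_partition_values` and the file name `data.parquet` are hypothetical, values are kept as strings (no type inference), and `/`-separated paths are assumed.

```python
from pathlib import PurePosixPath


def parse_partition_values(file_path, base_path):
    """Collect {column: value} from `name=value` directories under base_path.

    Hypothetical helper for illustration only; it does not call Spark and
    keeps every value as a string.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(base_path))
    values = {}
    # The last component is the data file itself, so only directories are scanned.
    for directory in relative.parts[:-1]:
        name, sep, value = directory.partition("=")
        if sep and name:
            values[name] = value
    return values


# Hypothetical data file inside the directory layout described in the docs.
data_file = "path/to/table/gender=male/country=US/data.parquet"

# Starting at the table root, both partitioning columns are visible.
from_table_root = parse_partition_values(data_file, "path/to/table")

# Starting at the gender=male directory, `gender` lies above the starting
# point, so only `country` is found.
from_partition_dir = parse_partition_values(data_file, "path/to/table/gender=male")

# Moving the base path back to the table root recovers `gender`.
with_base_path = parse_partition_values(data_file, "path/to/table/")
```

In this sketch the second argument plays the role that `basePath` plays in the documented behavior: it fixes where the scan for `name=value` directories begins.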