[SPARK-25271][SQL][2.4] Hive ctas commands should use data source if it is convertible #30017

Closed
viirya wants to merge 1 commit into apache:branch-2.4 from viirya:SPARK-25271-2.4

@viirya viirya commented Oct 12, 2020

Member

What changes were proposed in this pull request?

In Spark 2.3.0 and earlier versions, the Hive CTAS command used the data source writer to write data into the table when the table was convertible. This behavior is controlled by configs such as HiveUtils.CONVERT_METASTORE_ORC and HiveUtils.CONVERT_METASTORE_PARQUET.

In 2.3.1, we dropped this optimization by mistake in the PR for SPARK-22977. Since then, the Hive CTAS command has only used the Hive SerDe to write data.

This patch restores the optimization in the Hive CTAS command by adding OptimizedCreateHiveTableAsSelectCommand, which uses the data source to write data.
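
To make the described behavior concrete, here is a minimal sketch in plain Python of the decision the patch restores. It is a simplified model, not Spark's planner code: `HiveTable`, `is_convertible`, and `choose_ctas_command` are hypothetical names, the config key strings are assumed to match the keys behind HiveUtils.CONVERT_METASTORE_ORC and HiveUtils.CONVERT_METASTORE_PARQUET, and the real rule may check further conditions that this PR description does not list.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed config keys behind HiveUtils.CONVERT_METASTORE_PARQUET / _ORC.
CONVERT_METASTORE_PARQUET = "spark.sql.hive.convertMetastoreParquet"
CONVERT_METASTORE_ORC = "spark.sql.hive.convertMetastoreOrc"


@dataclass
class HiveTable:
    """Hypothetical stand-in for a Hive table's catalog metadata."""
    name: str
    serde: Optional[str] = None  # SerDe class name, if the table declares one


def is_convertible(table: HiveTable, conf: Dict[str, bool]) -> bool:
    """Return True when the table's SerDe is Parquet or ORC and the matching
    config flag is enabled. Missing keys count as disabled in this sketch only;
    Spark's actual defaults are not modeled here."""
    serde = (table.serde or "").lower()
    parquet_ok = "parquet" in serde and conf.get(CONVERT_METASTORE_PARQUET, False)
    orc_ok = "orc" in serde and conf.get(CONVERT_METASTORE_ORC, False)
    return bool(parquet_ok or orc_ok)


def choose_ctas_command(table: HiveTable, conf: Dict[str, bool]) -> str:
    """Pick the command name that would write the CTAS result."""
    if is_convertible(table, conf):
        # Data source writer path restored by this patch.
        return "OptimizedCreateHiveTableAsSelectCommand"
    # Hive SerDe path, the only one available since 2.3.1.
    return "CreateHiveTableAsSelectCommand"
```

In this model, an ORC table with the ORC flag enabled selects the optimized command, while a text-format table, or an ORC table with the flag disabled, falls back to the Hive SerDe command.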

This is to backport #22514 to branch-2.4.

Why are the changes needed?

This bug was originally reported against 2.3.1 but was only fixed in 3.0. The fix should also be in branch-2.4 because the branch is LTS.

Does this PR introduce any user-facing change?

Yes. Users can set the config to use the built-in data source writer instead of the Hive SerDe in CTAS.
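
As a rough illustration of that user-facing switch, the sketch below builds the SQL statements a session might run before a CTAS. The helper `ctas_statements` and the table names are hypothetical, and the config keys are assumed to be the ones behind HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC; check the keys against the Spark version in use.

```python
from typing import List

# Assumed mapping from storage format to its conversion config key.
CONVERT_KEYS = {
    "parquet": "spark.sql.hive.convertMetastoreParquet",
    "orc": "spark.sql.hive.convertMetastoreOrc",
}


def ctas_statements(target: str, source: str, stored_as: str,
                    use_data_source_writer: bool) -> List[str]:
    """Return a SET statement followed by a Hive-syntax CTAS statement.

    Only formats with a conversion config (Parquet, ORC) are accepted,
    because other formats are always written by the Hive SerDe.
    """
    fmt = stored_as.lower()
    if fmt not in CONVERT_KEYS:
        raise ValueError(f"no conversion config for format: {stored_as}")
    flag = "true" if use_data_source_writer else "false"
    return [
        f"SET {CONVERT_KEYS[fmt]}={flag}",
        f"CREATE TABLE {target} STORED AS {fmt.upper()} AS SELECT * FROM {source}",
    ]
```

The returned strings could be passed one by one to a SQL entry point such as `spark.sql(...)` in a Hive-enabled session; the sketch itself does not depend on Spark.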

How was this patch tested?

Unit tests.

viirya commented Oct 12, 2020

Member Author

cc @dongjoon-hyun

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34304/

@dongjoon-hyun

Member

Thank you, @viirya.

@dongjoon-hyun

Member

cc @anuragmantri

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34304/

viirya commented Oct 12, 2020

Member Author

cc @cloud-fan

SparkQA commented Oct 12, 2020

Test build #129698 has finished for PR 30017 at commit e3ffaaf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • trait CreateHiveTableAsSelectBase extends DataWritingCommand
      • case class CreateHiveTableAsSelectCommand(
      • case class OptimizedCreateHiveTableAsSelectCommand(

@dongjoon-hyun dongjoon-hyun left a comment

Member

+1, LGTM. Thank you, @viirya and all.
Merged to branch-2.4.

dongjoon-hyun pushed a commit that referenced this pull request Oct 12, 2020
…it is convertible

### What changes were proposed in this pull request?

In Spark 2.3.0 and earlier versions, the Hive CTAS command used the data source writer to write data into the table when the table was convertible. This behavior is controlled by configs such as HiveUtils.CONVERT_METASTORE_ORC and HiveUtils.CONVERT_METASTORE_PARQUET.

In 2.3.1, we dropped this optimization by mistake in the PR for [SPARK-22977](https://github.com/apache/spark/pull/20521/files#r217254430). Since then, the Hive CTAS command has only used the Hive SerDe to write data.

This patch restores the optimization in the Hive CTAS command by adding OptimizedCreateHiveTableAsSelectCommand, which uses the data source to write data.

This is to backport #22514 to branch-2.4.

### Why are the changes needed?

This bug was originally reported against 2.3.1 but was only fixed in 3.0. The fix should also be in branch-2.4 because the branch is LTS.

### Does this PR introduce _any_ user-facing change?

Yes. Users can set the config to use the built-in data source writer instead of the Hive SerDe in CTAS.

### How was this patch tested?

Unit tests.

Closes #30017 from viirya/SPARK-25271-2.4.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

viirya commented Oct 12, 2020

Member Author

Thanks!

@viirya viirya deleted the SPARK-25271-2.4 branch December 27, 2023 18:28