[SPARK-17654] [SQL] Propagate bucketing information for Hive tables to / from Catalog by tejasapatil · Pull Request #15228 · apache/spark

tejasapatil · 2016-09-24T02:44:41Z

What changes were proposed in this pull request?

Currently Spark does not respect bucketing for Hive tables. This PR includes the following changes:

HiveClientImpl now extracts the table's bucketing information.
When writing table info to the metastore, MetastoreRelation now populates the bucketing information in the Hive Table object.
HiveTableScanExec now exposes outputPartitioning and outputOrdering according to the table's bucketing spec.
InsertIntoHiveTable now exposes requiredChildDistribution and requiredChildOrdering based on the target table's bucketing spec.

TODOs (which will be done in linked PRs and not this one):

ClusteredDistribution does not guarantee the number of partitions generated, which corresponds to the number of output bucket files created. Fixing this will require adding strict guarantees to ClusteredDistribution. I think it needs more thought and is better done incrementally rather than packed into this PR.
While writing to bucketed files, Hive's hashing function should be used. I have a PR open to implement Hive hashing natively in Spark: [SPARK-17495] [SQL] Add Hash capability semantically equivalent to Hive's #15047
Allow creating Hive bucketed tables
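
To illustrate why the hashing function in the second TODO matters, here is a minimal sketch in Python of how a row key could be mapped to a bucket file. It assumes Hive hashes an integer column to the integer value itself and picks the bucket as (hash & Integer.MAX_VALUE) % numBuckets. The function names are hypothetical and none of this is code from this PR. If the writer used a different hash than Hive, the same key would land in a different bucket file and readers could not rely on the bucketing.

```python
from collections import defaultdict

INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def hive_hash_int(value):
    """Assumed Hive hash of an int column: the value itself, wrapped to signed 32 bits."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value > INT_MAX else value


def bucket_id(hash_code, num_buckets):
    """Clear the sign bit, then take the remainder by the bucket count."""
    return (hash_code & INT_MAX) % num_buckets


def assign_buckets(keys, num_buckets):
    """Group keys by the bucket they would be written to (hypothetical helper)."""
    buckets = defaultdict(list)
    for key in keys:
        buckets[bucket_id(hive_hash_int(key), num_buckets)].append(key)
    return dict(buckets)


print(assign_buckets([10, 7, -1, 4], num_buckets=4))
```

Masking with Integer.MAX_VALUE keeps the bucket index non-negative even for negative hash codes, so every key maps to one of the num_buckets files.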

How was this patch tested?

Tested with Hive tables created locally. Adding a new test case would require implementing bucketed table creation, which is not supported yet :( Suggestions welcome.

SparkQA · 2016-09-24T04:52:56Z

Test build #65857 has finished for PR 15228 at commit caef89a.

This patch fails Spark unit tests.
This patch does not merge cleanly.
This patch adds no public classes.

tejasapatil closed this Sep 24, 2016

tejasapatil force-pushed the SPARK-17654_hive_extract_bucketing branch from caef89a to 7c38252 September 24, 2016 04:31

tejasapatil deleted the SPARK-17654_hive_extract_bucketing branch September 24, 2016 05:38

tejasapatil wants to merge 0 commits into apache:master from tejasapatil:SPARK-17654_hive_extract_bucketing
