[SPARK-17654] [SQL] Propagate bucketing information for Hive tables to / from Catalog #15228

Closed
tejasapatil wants to merge 0 commits into apache:master from tejasapatil:SPARK-17654_hive_extract_bucketing

Conversation

@tejasapatil

Contributor

What changes were proposed in this pull request?

Currently, Spark does not respect bucketing for Hive tables. This PR includes the following changes:

  • HiveClientImpl now extracts the table's bucketing information.
  • While writing table info to the metastore, MetastoreRelation now populates the bucketing information in the Hive Table object.
  • HiveTableScanExec now exposes outputPartitioning and outputOrdering according to the table's bucketing spec.
  • InsertIntoHiveTable now exposes requiredChildDistribution and requiredChildOrdering based on the target table's bucketing spec.
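
To illustrate why exposing bucketing to the planner matters, here is a minimal Python sketch. It is not Spark code: `bucket_id`, `write_bucketed`, and `bucketed_join` are hypothetical names, and the modulo-based bucketing function is a stand-in for a real hash. The idea is that when two tables are bucketed on the join key with the same bucket count, matching keys always land in the same bucket index, so a join can run bucket by bucket without moving rows between partitions.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count shared by both tables


def bucket_id(key: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Toy bucketing function: a stand-in, not Hive's or Spark's hash."""
    return key % num_buckets


def write_bucketed(rows, num_buckets: int = NUM_BUCKETS):
    """Group (key, value) rows into buckets, sorted by key within each bucket."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_id(key, num_buckets)].append((key, value))
    return {b: sorted(buckets[b]) for b in range(num_buckets)}


def bucketed_join(left, right):
    """Join two bucketed tables one bucket at a time; no row crosses buckets."""
    out = []
    for b in left:
        # Index the right-hand bucket by key for lookup.
        right_by_key = defaultdict(list)
        for key, value in right.get(b, []):
            right_by_key[key].append(value)
        for key, lvalue in left[b]:
            for rvalue in right_by_key[key]:
                out.append((key, lvalue, rvalue))
    return sorted(out)


orders = write_bucketed([(1, "o1"), (5, "o5"), (2, "o2"), (8, "o8")])
users = write_bucketed([(1, "alice"), (2, "bob"), (5, "eve"), (7, "dan")])
joined = bucketed_join(orders, users)
```

In this model, the per-bucket grouping plays the role of outputPartitioning and the within-bucket sort plays the role of outputOrdering; a planner that knows both can skip the shuffle and sort it would otherwise insert.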

TODOs (which will be done in linked PRs and not this one):

  • ClusteredDistribution does not guarantee the number of partitions it generates, which determines the number of output bucket files created. Fixing this will require adding strict guarantees to ClusteredDistribution. I think that needs more thought and is better done incrementally rather than packed into this PR.
  • While writing to bucketed files, Hive's hashing function should be used. I have a PR open to implement Hive hashing natively in Spark: [SPARK-17495] [SQL] Add Hash capability semantically equivalent to Hive's #15047
  • Allow creating Hive bucketed tables
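
The first two TODOs can be pictured with a small Python sketch. It is only a model under stated assumptions: `cluster`, `hash_a`, and `hash_b` are hypothetical, and neither hash is Hive's or Spark's actual function. It shows that rows can be correctly clustered by key and still be unusable as bucket files, either because the partition count differs from the table's bucket count or because a different hash put the keys in different buckets.

```python
def cluster(keys, num_partitions, hash_fn):
    """Place each key in partition hash_fn(key) mod num_partitions."""
    parts = [[] for _ in range(num_partitions)]
    for key in keys:
        parts[hash_fn(key) % num_partitions].append(key)
    return parts


def hash_a(key: int) -> int:
    """Stand-in for the hash the table's readers expect (an assumption)."""
    return key


def hash_b(key: int) -> int:
    """Stand-in for a different engine's hash (an assumption)."""
    return key * 31 + 7


TABLE_BUCKETS = 4  # hypothetical bucket count declared on the table
keys = list(range(8))

# Matches the table: 4 partitions, expected hash.
exact = cluster(keys, TABLE_BUCKETS, hash_a)

# Still clustered by key, but only 3 partitions, so 3 files instead of 4.
fewer = cluster(keys, 3, hash_a)

# Right partition count, different hash: keys land in other buckets.
other = cluster(keys, TABLE_BUCKETS, hash_b)
```

All three layouts keep equal keys together, which is all a clustering requirement asks for. Only `exact` would be readable as the table's bucket files, which is why both a partition-count guarantee and a matching hash function are needed on the write path.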

How was this patch tested?

Tested with Hive tables created locally. Adding a new test case would require implementing bucketed table creation, which is not supported :( Suggestions welcome.

@tejasapatil tejasapatil force-pushed the SPARK-17654_hive_extract_bucketing branch from caef89a to 7c38252 Compare September 24, 2016 04:31
@SparkQA

SparkQA commented Sep 24, 2016

Test build #65857 has finished for PR 15228 at commit caef89a.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@tejasapatil tejasapatil deleted the SPARK-17654_hive_extract_bucketing branch September 24, 2016 05:38
