Skip to content

[SPARK-18107][SQL] Insert overwrite statement runs much slower in spark-sql than it does in hive-client#15667

Closed
viirya wants to merge 4 commits into
apache:masterfrom
viirya:improve-hive-insertoverwrite
Closed

[SPARK-18107][SQL] Insert overwrite statement runs much slower in spark-sql than it does in hive-client#15667
viirya wants to merge 4 commits into
apache:masterfrom
viirya:improve-hive-insertoverwrite

Conversation

@viirya

@viirya viirya commented Oct 28, 2016

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

As reported on the jira, insert overwrite statement runs much slower in Spark, compared with hive-client.

It seems there is a patch HIVE-11940 which largely improves insert overwrite performance on Hive. HIVE-11940 is patched after Hive 2.0.0.

Because Spark SQL uses older Hive library, we can not benefit from such improvement.

The reporter verified that there is also a big performance gap between Hive 1.2.1 (520.037 secs) and Hive 2.0.1 (35.975 secs) on insert overwrite execution.

Instead of upgrading to Hive 2.0 in Spark SQL, which might not be a trivial task, this patch provides an approach to delete the partition before asking Hive to load data files into the partition.

Note: The case reported on the jira is insert overwrite to partition. Since Hive.loadTable also uses the function to replace files, insert overwrite to table should has the same issue. We can take the same approach to delete the table first. I will upgrade this to include this.

How was this patch tested?

Jenkins tests.

There are existing tests using insert overwrite statement. Those tests should be passed. I added a new test to specially test insert overwrite into partition.

For performance issue, as I don't have Hive 2.0 environment, this needs the reporter to verify it. Please refer to the jira.

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

@SparkQA

SparkQA commented Oct 28, 2016

Copy link
Copy Markdown

Test build #67686 has finished for PR 15667 at commit 81dbeb1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin

rxin commented Oct 28, 2016

Copy link
Copy Markdown
Contributor

cc @ericl

@viirya

viirya commented Oct 28, 2016

Copy link
Copy Markdown
Member Author

I checked the failed test. Special characters in the partition path cause the failure, e..g.,

alter table ppr_test add partition (ds = '12:4');
alter table ppr_test add partition (ds = '12%4');
alter table ppr_test add partition (ds = '12*4');

@ericl

ericl commented Oct 28, 2016

Copy link
Copy Markdown
Contributor

How does the performance look like before / after this patch?

@viirya

viirya commented Oct 28, 2016

Copy link
Copy Markdown
Member Author

@ericl I don't have the Hive environment to compare. We need to wait for issue reporter to verify that.

@snodawn

snodawn commented Oct 29, 2016

Copy link
Copy Markdown

Sorry, I'm late, for preparing the environment for testing cost me a lot of time. I have tested the performance before and after the patch. But it seems to improve a few after patching, where it costs 531 seconds before patching, and costs 518 seconds after patching. So I think I need to do more testing to find out the problem.

@viirya

viirya commented Oct 29, 2016

Copy link
Copy Markdown
Member Author

@snodawn Interesting...I will try to find out it too.

@viirya

viirya commented Oct 30, 2016

Copy link
Copy Markdown
Member Author

@snodawn Can you try it again? I've updated this.

@SparkQA

SparkQA commented Oct 30, 2016

Copy link
Copy Markdown

Test build #67785 has finished for PR 15667 at commit 5257d7f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@snodawn

snodawn commented Oct 31, 2016

Copy link
Copy Markdown

@viirya I have tested the new patch, which performs better than expected. Before patching, it costs about 500~600 seconds, but now it just cost me about 16 seconds to run the same statement. But it still runs slow when I run such sql:

insert overwrite table login4game partition(pt,dt) select distinct account_name,role_id,server,'1476979200' as recdate, 'mix' as platform, 'mix' as pid, 'mix' as dev, pt, dt from tbllog_login where pt='mix_en' and dt='2016-10-21';

It's the dynamic partition in hive, where we needn't to specify the partition value when inserting. I test it in hive 2.0.1, it costs 47.822 seconds, but in hive 1.2.1, it costs 574.33 seconds, as the same with what it does in spark, which is 526.44 seconds.

@viirya

viirya commented Oct 31, 2016

Copy link
Copy Markdown
Member Author

@snodawn I just updated this with a new commit. Do you use this new patch to test?

Yeah, current fixing doesn't consider dynamic partition. I would like to see if we can improve static partition. I will look into dynamic partition later.

@snodawn

snodawn commented Oct 31, 2016

Copy link
Copy Markdown

The execution logs in spark show me that, it does the same thing as what it does before I add the patch, which may be the reason why it runs so slow when running dynamic insert overwrite statement.

@viirya

viirya commented Oct 31, 2016

Copy link
Copy Markdown
Member Author

@snodawn Current fixing does not do anything for dynamic partition. So we can expect this.

@viirya

viirya commented Oct 31, 2016

Copy link
Copy Markdown
Member Author

@snodawn You just said the new patch runs better. Do you use the latest patch I just updated in about 1 hr ago? Thanks.

@snodawn

snodawn commented Oct 31, 2016

Copy link
Copy Markdown

Ok, I see. I haven't tested the newest code, I would try it later.

@viirya

viirya commented Oct 31, 2016

Copy link
Copy Markdown
Member Author

@snodawn Thanks. I expect it should perform as good as you tested. But there are few tests failed I fix in newer patch.

@SparkQA

SparkQA commented Oct 31, 2016

Copy link
Copy Markdown

Test build #67799 has finished for PR 15667 at commit 74f0f3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@snodawn

snodawn commented Oct 31, 2016

Copy link
Copy Markdown

@viirya I have tested the newest patch. It performs good in running the same sql as I ran before.

@viirya

viirya commented Oct 31, 2016

Copy link
Copy Markdown
Member Author

@snodawn Thanks! I will address dynamic partition in next commit.

throw new RuntimeException(
"Cannot remove partition directory '" + partitionPath.toString)
} else {
fs.mkdirs(partitionPath, pathPermission)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the mkdir necessary?

@viirya viirya Nov 1, 2016

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking Hive will complain if the dir is not existing. But looks like it won't. Let me remove this and see if all tests can be passed.

table.catalogTable.identifier.table,
partitionSpec)

var doOverwrite = overwrite

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: doHiveOverwrite?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. updated.

@ericl

ericl commented Nov 1, 2016

Copy link
Copy Markdown
Contributor

lgtm if tests pass

@viirya

viirya commented Nov 1, 2016

Copy link
Copy Markdown
Member Author

@ericl Dynamic partition would be more complicated. Should we do it in this or in follow-up?

@ericl

ericl commented Nov 1, 2016

Copy link
Copy Markdown
Contributor

Let's do it in a follow-up.

@SparkQA

SparkQA commented Nov 1, 2016

Copy link
Copy Markdown

Test build #67853 has finished for PR 15667 at commit bd22150.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@snodawn

snodawn commented Nov 1, 2016

Copy link
Copy Markdown

@viirya @ericl Thanks a lot. The new patch is also passed in my test. And I sincerely hope dynamic partition could be completed in the future, thanks.

@rxin

rxin commented Nov 1, 2016

Copy link
Copy Markdown
Contributor

Merging in master. Thanks.

@asfgit asfgit closed this in dd85eb5 Nov 1, 2016
@viirya

viirya commented Nov 1, 2016

Copy link
Copy Markdown
Member Author

@rxin @ericl Thanks!

@snodawn I will address dynamic partition in a follow-up pr.

uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…rk-sql than it does in hive-client

## What changes were proposed in this pull request?

As reported on the jira, insert overwrite statement runs much slower in Spark, compared with hive-client.

It seems there is a patch [HIVE-11940](apache/hive@ba21806) which largely improves insert overwrite performance on Hive. HIVE-11940 is patched after Hive 2.0.0.

Because Spark SQL uses older Hive library, we can not benefit from such improvement.

The reporter verified that there is also a big performance gap between Hive 1.2.1 (520.037 secs) and Hive 2.0.1 (35.975 secs) on insert overwrite execution.

Instead of upgrading to Hive 2.0 in Spark SQL, which might not be a trivial task, this patch provides an approach to delete the partition before asking Hive to load data files into the partition.

Note: The case reported on the jira is insert overwrite to partition. Since `Hive.loadTable` also uses the function to replace files, insert overwrite to table should has the same issue. We can take the same approach to delete the table first. I will upgrade this to include this.
## How was this patch tested?

Jenkins tests.

There are existing tests using insert overwrite statement. Those tests should be passed. I added a new test to specially test insert overwrite into partition.

For performance issue, as I don't have Hive 2.0 environment, this needs the reporter to verify it. Please refer to the jira.

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#15667 from viirya/improve-hive-insertoverwrite.
@viirya viirya deleted the improve-hive-insertoverwrite branch December 27, 2023 18:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants