Skip to content

[SPARK-18107][SQL][FOLLOW-UP] Insert overwrite statement runs much slower in spark-sql than it does in hive-client#15726

Closed
viirya wants to merge 3 commits into
apache:masterfrom
viirya:improve-hive-insertoverwrite-followup
Closed

[SPARK-18107][SQL][FOLLOW-UP] Insert overwrite statement runs much slower in spark-sql than it does in hive-client#15726
viirya wants to merge 3 commits into
apache:masterfrom
viirya:improve-hive-insertoverwrite-followup

Conversation

@viirya

@viirya viirya commented Nov 2, 2016

Copy link
Copy Markdown
Member

What changes were proposed in this pull request?

As reported on the jira, insert overwrite statement runs much slower in Spark, compared with hive-client.

We have addressed this issue for static partition at #15667. This is a follow-up pr for #15667 to address dynamic partition.

How was this patch tested?

Jenkins tests.

There are existing tests using insert overwrite statement. Those tests should be passed. I added a new test to specially test insert overwrite into dynamic partition.

For performance issue, as I don't have Hive 2.0 environment, this needs the reporter to verify it. Please refer to the jira.

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

@viirya

viirya commented Nov 2, 2016

Copy link
Copy Markdown
Member Author

cc @snodawn Would you like to test this patch for dynamic partition? Thanks.

@SparkQA

SparkQA commented Nov 2, 2016

Copy link
Copy Markdown

Test build #67945 has finished for PR 15726 at commit eae8f1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin

rxin commented Nov 2, 2016

Copy link
Copy Markdown
Contributor

cc @ericl

Do we need to do this for data source tables?

@ericl

ericl commented Nov 2, 2016

Copy link
Copy Markdown
Contributor

In datasource tables we already delete the partition beforehand, so this
should not be needed (we also don't follow the hive insert path so don't
know if the perf regression exists).

On Wed, Nov 2, 2016, 12:07 AM Reynold Xin notifications@github.com wrote:

cc @ericl https://github.com/ericl

Do we need to do this for data source tables?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#15726 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAA6SlK6-sNgly91PcU9d1v8qnwvNpdaks5q6Da5gaJpZM4Kmy8E
.

@snodawn

snodawn commented Nov 2, 2016

Copy link
Copy Markdown

@viirya Ok, I will try it soon.

@snodawn

snodawn commented Nov 2, 2016

Copy link
Copy Markdown

I have tested the new patch for dynamic partition. It still costs a long time in running overwrite statement as the same with hive 1.2.1. The execution logs show that when running in dynamic partition it move each file of the partition to .Trash instead of the whole partition, which may cost a lot of time in this way.

@viirya

viirya commented Nov 2, 2016

Copy link
Copy Markdown
Member Author

@snodawn Thanks for reporting this.

One thing I want to make sure is how you test that? Are you insert the partition first and then overwrite to the existing partition? Or you just use insert overwrite to write to a new partition, i.e., actually it is not overwriting?

@viirya

viirya commented Nov 2, 2016

Copy link
Copy Markdown
Member Author

@snodawn OK. I got the reason why dynamic partition is still much slower than Hive 2.0.

There is another patch to optimize dynamic partition, apache/hive@d297b51.

Basically it optimizes the sequential dynamic partition insertion as many asynchronous tasks with an executor pool.

We can also do it in InsertIntoHiveTable for dynamic partition. Don't know if it is worth? @ericl @rxin What do you think?

@ericl

ericl commented Nov 2, 2016

Copy link
Copy Markdown
Contributor

IIUC, you would have to call loadPartition in parallel for each new partition created instead of loadDynamicPartitions once? There might be an issue there since each Hive client operation in Spark currently holds a global lock. So it would all be serialized anyways.

@viirya

viirya commented Nov 3, 2016

Copy link
Copy Markdown
Member Author

@ericl Thanks. Looks like we have more than one level lock (at least two in HiveExternalCatalog, HiveClientImpl). This might hard to tackle.

Although it is still possibly to have a workaround by having customized methods to wrap those loadPartition as one task which obtains the locks, I think the work may not be worth.

@ericl @rxin Do you agree that?

@snodawn

snodawn commented Nov 3, 2016

Copy link
Copy Markdown

@viirya I test both inserting overwrite a new partition and a existing partition. Of cause, inserting overwrite a new partition runs faster.

@ericl

ericl commented Nov 3, 2016

Copy link
Copy Markdown
Contributor

Yeah, this sounds like more complexity than it's worth. We should probably fix the hive client locking issue first.

@viirya

viirya commented Nov 4, 2016

Copy link
Copy Markdown
Member Author

@ericl About the hive client locking issue, any thing you can suggest?

@SparkQA

SparkQA commented Nov 4, 2016

Copy link
Copy Markdown

Test build #68102 has finished for PR 15726 at commit 4624f1a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ericl

ericl commented Nov 4, 2016

Copy link
Copy Markdown
Contributor

Hm, iirc the issue is not super hard to fix, but basically since the hive
thrift client is not thread safe, only one client can use it at a time. We
would need some sort of hive client pool to solve the issue (search for
retryLocked in the hive client management code to see the global lock.)

On Thu, Nov 3, 2016, 9:14 PM UCB AMPLab notifications@github.com wrote:

Merged build finished. Test PASSed.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#15726 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AAA6Slm6mZjAdO3bnW30wira4IoDnSDMks5q6rE2gaJpZM4Kmy8E
.

@viirya

viirya commented Nov 4, 2016

Copy link
Copy Markdown
Member Author

@ericl yeah, I have checked the codes with retryLocked in HiveClientImpl. Do you mean we can create multiple hive client in the pool and serve concurrently?

@ericl

ericl commented Nov 4, 2016

Copy link
Copy Markdown
Contributor

@viirya I think there are a few options there, either HiveClientImpl can create multiple internal thrift clients (see private def client there), or the external catalog could create multiple clients.

@viirya

viirya commented Nov 5, 2016

Copy link
Copy Markdown
Member Author

@ericl Currently I prefer the first one, let HiveClientImpl create multiple internal thrift clients, since I don't like to change external catalog for this.

@ericl

ericl commented Nov 5, 2016

Copy link
Copy Markdown
Contributor

@viirya that makes sense to me

@viirya

viirya commented Nov 21, 2016

Copy link
Copy Markdown
Member Author

@ericl I am thinking this recently. What I am not very sure is this multiple hive client approach is safe to use under multiple thread environment. E.g., for now, because we synchronize on the single hive client, we run hive operations in sequence. Once we have multiple hive clients, would the concurrent hive operations conflict each other?

My first thought is because the hive operations use metastore, these operations would need to acquire some locks on the items (e.g., tables) in metastore before running. Is my guess correct or not?

@ericl

ericl commented Nov 22, 2016

Copy link
Copy Markdown
Contributor

@yhuai do you know if it would be safe to have multiple concurrent Hive operations in HiveClientImpl. From a cursory audit of the code it seems that only thread-local state is mutated for withHiveState so maybe it's no different from having multiple Spark clusters connect to the same metastore.

@viirya

viirya commented Jan 5, 2017

Copy link
Copy Markdown
Member Author

I would close this for now and may be reopen this when we get correct answer from @yhuai.

@viirya viirya closed this Jan 5, 2017
@yuananf

yuananf commented Apr 6, 2017

Copy link
Copy Markdown

@viirya Is it possible to upgrade the built-in hive-exec to resolve this problem? We are facing the same problem, insert overwrite dynamic partition is extremely slow, the data written is over in 5 minutes, but the following action takes more than 1 hour.

I believe the built-in hive-exec is this https://github.com/JoshRosen/hive/tree/release-1.2.1-spark2
Can we just upgrade this to release-2.1.1-spark2 or something else?

@grantnicholas

Copy link
Copy Markdown

@viirya @yuananf a quick check of recent spark releases shows this fix is not in. Any suggested workarounds in the meantime for dynamic partition insert overwrites?

It sounds like if the user does the logic of deleting the necessary partitions before running the dynamic insert overwrite query then hive will go down the "happy" performant path. This will require calculating the dynamic partitions before running the insert query, but if you can do that then this workaround will work right?

@kaiseu

kaiseu commented May 16, 2019

Copy link
Copy Markdown

Do we have any solutions so far to resolve or workaround this issue? Spark2.4.3 also encountered this problem.

@viirya

viirya commented May 16, 2019

Copy link
Copy Markdown
Member Author

Hmm, since Spark community is working on upgrading Hive version in Spark, I think once it is done, this shouldn't be an issue after that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants