[SPARK-5068][SQL] fix bug query data when path doesn't exists #3891

Closed
jeanlyn wants to merge 3 commits into apache:master from jeanlyn:SPARK-5068

Conversation

@jeanlyn

jeanlyn commented Jan 4, 2015

Contributor

The issue is described in SPARK-5068.
The purpose of this pull request is to avoid creating an RDD for a path that doesn't exist.

@AmplabJenkins

Can one of the admins verify this patch?

@jeanlyn

jeanlyn commented Jan 5, 2015

Contributor Author

Hi @marmbrus, can you please take a look and give some suggestions? Thanks.

Contributor

Space before {

@marmbrus

marmbrus commented Jan 5, 2015

Contributor

What is the rationale behind this change? It seems like the table is corrupted and you should know about it. Does Hive work in this case?

@jeanlyn

jeanlyn commented Jan 6, 2015

Contributor Author

Yes, Hive works in this situation. I found this issue in our production environment when I tried to use spark-sql to test some SQL that originally ran in Hive. I am not familiar with the business logic, but I think we should strengthen Spark's compatibility. Thanks for checking.

@marmbrus

marmbrus commented Jan 6, 2015

Contributor

Okay, that is reasonable and we should probably support this. So then the question is: can we do this check on the executor in parallel (or just catch the exception if it is thrown) instead of doing it serially when constructing the RDD?
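
A rough illustration of the options raised in this comment, written as plain Python rather than Spark code. Everything here is an assumption made for the sketch: threads stand in for executors, the local filesystem stands in for the table's storage, and the function names are hypothetical. It contrasts a serial existence check, a parallel one, and simply catching the error at read time.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def existing_paths_serial(paths):
    """Check each path one after another (the approach being questioned)."""
    return [p for p in paths if os.path.exists(p)]


def existing_paths_parallel(paths, max_workers=8):
    """Run the same checks concurrently; threads stand in for executors."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flags = list(pool.map(os.path.exists, paths))
    return [p for p, ok in zip(paths, flags) if ok]


def read_or_skip(path):
    """Alternative: attempt the read and treat a missing path as empty input."""
    try:
        with open(path) as f:
            return f.read().splitlines()
    except FileNotFoundError:
        return []


# Small demo on a temporary directory (hypothetical data).
root = tempfile.mkdtemp()
present = os.path.join(root, "part-00000")
missing = os.path.join(root, "part-00001")
with open(present, "w") as f:
    f.write("a\nb\n")

serial_result = existing_paths_serial([present, missing])
parallel_result = existing_paths_parallel([present, missing])
```

Both checks return the same paths; the difference is only in where the cost of the existence checks is paid.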

@jeanlyn

jeanlyn commented Jan 6, 2015

Contributor Author

Thanks for the suggestion! I will optimize this and commit later.

@marmbrus

marmbrus commented Jan 6, 2015

Contributor

ok to test

@SparkQA

SparkQA commented Jan 6, 2015

Test build #25090 has started for PR 3891 at commit 55636f3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 6, 2015

Test build #25090 has finished for PR 3891 at commit 55636f3.

  • This patch fails RAT tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25090/
Test FAILed.

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26511 has started for PR 3891 at commit 3ce56ba.

  • This patch does not merge cleanly.

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26513 has started for PR 3891 at commit 40d1c94.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26511 has finished for PR 3891 at commit 3ce56ba.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26511/
Test FAILed.

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26513 has finished for PR 3891 at commit 40d1c94.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class FPGrowthModel(val freqItemsets: RDD[(Array[String], Long)]) extends Serializable
    • class Node[T](val parent: Node[T]) extends Serializable
    • logDebug(s"Did not load class $name from REPL class server at $uri", e)
    • logError(s"Failed to check existence of class $name on REPL class server at $uri", e)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26513/
Test PASSed.

@srowen

srowen commented Feb 17, 2015

Member

Is this superseded by #3907 or #4356? If so, can this be closed?

@jeanlyn

jeanlyn commented Feb 20, 2015

Contributor Author

OK, I'll close this one.

jeanlyn closed this Feb 20, 2015
asfgit pushed a commit that referenced this pull request Apr 12, 2015
…ontext

This PR follows up on PRs #3907, #3891, and #4356.
Following the comments from marmbrus and liancheng, I use fs.globStatus to retrieve all FileStatus objects under the path(s), and then do the filtering locally.

[1]. Derive a pathPattern from the path and put it into pathPatternSet (hdfs://cluster/user/demo/2016/08/12 -> hdfs://cluster/user/demo/*/*/*).
[2]. Retrieve all FileStatus objects and cache them by updating existPathSet.
[3]. Do the filtering locally.
[4]. If a new pathPattern appears, repeat steps 1 and 2 (an external table may have more than one partition pathPattern).
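
The four steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the code from the commit: `glob.glob` on the local filesystem stands in for `fs.globStatus`, and `path_to_pattern`, `PartitionPathVerifier`, and the `num_partition_cols` parameter are hypothetical names. Only the `pathPatternSet` and `existPathSet` ideas come from the description.

```python
import glob
import os
import tempfile


def path_to_pattern(path, num_partition_cols):
    """Replace the last `num_partition_cols` path components with '*'."""
    if num_partition_cols < 1:
        raise ValueError("num_partition_cols must be at least 1")
    parts = path.rstrip(os.sep).split(os.sep)
    parts[-num_partition_cols:] = ["*"] * num_partition_cols
    return os.sep.join(parts)


class PartitionPathVerifier:
    """Lists each pattern once, then answers existence checks from the cache."""

    def __init__(self, num_partition_cols):
        self.num_partition_cols = num_partition_cols
        self.path_pattern_set = set()  # patterns that were already listed
        self.exist_path_set = set()    # every path found by those listings

    def exists(self, path):
        path = path.rstrip(os.sep)
        pattern = path_to_pattern(path, self.num_partition_cols)  # step 1
        if pattern not in self.path_pattern_set:                  # step 4
            self.path_pattern_set.add(pattern)
            # Step 2: one listing per pattern (stand-in for fs.globStatus).
            self.exist_path_set.update(glob.glob(pattern))
        return path in self.exist_path_set                        # step 3

    def filter_existing(self, paths):
        return [p for p in paths if self.exists(p)]


# Demo on a temporary directory laid out like <root>/2016/08/12.
root = tempfile.mkdtemp()
present = os.path.join(root, "2016", "08", "12")
missing = os.path.join(root, "2016", "08", "13")
os.makedirs(present)

verifier = PartitionPathVerifier(num_partition_cols=3)
existing = verifier.filter_existing([present, missing])
```

Because both demo paths share one pattern, the listing runs once and the second check is answered from the cached set.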

chenghao-intel jeanlyn

Author: lazymam500 <lazyman500@gmail.com>
Author: lazyman <lazyman500@gmail.com>

Closes #5059 from lazyman500/SPARK-5068 and squashes the following commits:

5bfcbfd [lazyman] move spark.sql.hive.verifyPartitionPath to SQLConf,fix scala style
e1d6386 [lazymam500] fix scala style
f23133f [lazymam500] bug fix
47e0023 [lazymam500] fix scala style,add config flag,break the chaining
04c443c [lazyman] SPARK-5068: fix bug when partition path doesn't exists #2
41f60ce [lazymam500] Merge pull request #1 from apache/master