[SPARK-2460] Optimize SparkContext.hadoopFile api#1385
…y in SparkContext.hadoopFile
Can one of the admins verify this patch?
Thanks for submitting the pull request. Did I read this correctly? The master branch deadlocks? If yes, we should file a JIRA for that also and make that more clear. If it is simply about optimizing an API to reduce code, it is a much lower priority issue. If it deadlocks, this needs to be a BLOCKER.
Is this related to the other conf-related concurrency issue that was fixed recently? #1273
@rxin and @aarondav, yes, the master branch deadlocks. It seems that the locks from #1273 and HADOOP-10456 together cause the problem. When running a Hive SQL self-join, hql("SELECT t1.a, t1.b, t1.c FROM table_A t1 JOIN table_A t2 ON (t1.a = t2.a)"), the program hangs. I think cleaning up the SparkContext.hadoopFile API is a better way to fix it; that way, we do not need the lock in
Do we still need this check? I think this may have been the only place we put the JobConf inside the cache.
Yes, there is no need to cache the JobConf if it is already in the broadcast.
Yes, I agree with you. broadcastedConf is cached by the blockManager in Broadcast.
…dd/drop partition command' (apache#1385)
### What changes were proposed in this pull request?
https://github.com/apache/spark/blob/cbffc12f90e45d33e651e38cf886d7ab4bcf96da/sql/hive/src/test/scala/org/apache/spark/sql/hive/StatisticsSuite.scala#L979
It should be `partDir2` instead of `partDir1`. Looks like it is a copy paste bug.
### Why are the changes needed?
Due to this test bug, the drop command was dropping a wrong (`partDir1`) underlying file in the test.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added extra underlying file location check.
Closes apache#36075 from kazuyukitanimura/SPARK-38786.
Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Chao Sun <sunchao@apple.com>
1. Use SparkContext.hadoopRDD() instead of instantiating HadoopRDD directly in SparkContext.hadoopFile.
SparkContext.hadoopRDD() adds the necessary security credentials to the JobConf before broadcasting it.
2. Broadcast the JobConf in HadoopRDD rather than the Configuration. This resolves the deadlock issue.
Currently, HadoopRDD broadcasts the Configuration, and each task (in the compute method) builds its JobConf from it.
The lock taken in
conf.synchronized {
  val newJobConf = new JobConf(conf)
  initLocalJobConfFuncOpt.map(f => f(newJobConf))
  HadoopRDD.putCachedMetadata(jobConfCacheKey, newJobConf)
  newJobConf
}
will conflict with the lock in Hadoop versions that contain the HADOOP-10456 fix:
hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/conf/Configuration.java
synchronized(Configuration.class) {
  REGISTRY.put(this, null);
}
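The interaction between the two locks can be illustrated with a small stand-alone sketch. It is not the Spark or Hadoop code: `instanceLock` and `classLock` are hypothetical stand-ins for the per-instance `conf` monitor and the `Configuration.class` monitor, and it assumes that some other code path takes them in the opposite order. Timed `tryLock` calls are used so that the demo reports the conflict instead of hanging.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

public class LockOrderSketch {

    // Takes `first`, waits until the other worker also holds its first lock,
    // then tries to take `second` with a timeout. A timeout counts as "blocked".
    static Thread worker(ReentrantLock first, ReentrantLock second,
                         CyclicBarrier barrier, AtomicInteger blocked) {
        return new Thread(() -> {
            first.lock();
            try {
                barrier.await();                       // both first locks are now held
                if (second.tryLock(200, TimeUnit.MILLISECONDS)) {
                    second.unlock();
                } else {
                    blocked.incrementAndGet();         // with plain monitors this would hang
                }
                barrier.await();                       // hold `first` until both attempts finish
            } catch (Exception e) {
                throw new RuntimeException(e);
            } finally {
                first.unlock();
            }
        });
    }

    // Returns how many of the two workers failed to get their second lock.
    static int countBlocked() throws InterruptedException {
        ReentrantLock instanceLock = new ReentrantLock(); // stand-in for conf.synchronized
        ReentrantLock classLock = new ReentrantLock();    // stand-in for synchronized(Configuration.class)
        CyclicBarrier barrier = new CyclicBarrier(2);
        AtomicInteger blocked = new AtomicInteger();

        Thread a = worker(instanceLock, classLock, barrier, blocked); // instance, then class
        Thread b = worker(classLock, instanceLock, barrier, blocked); // class, then instance
        a.start();
        b.start();
        a.join();
        b.join();
        return blocked.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("workers blocked on second lock: " + countBlocked());
    }
}
```

Both workers time out because each holds the lock the other needs. With `synchronized` blocks, which have no timeout, the same ordering would leave both threads waiting forever.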
We can reproduce the bug like this:
Hadoop version 2.4.1
Spark master branch
hql("SELECT t1.a, t1.b, t1.c FROM table_A t1 JOIN table_A t2 ON (t1.a = t2.a)")
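The idea behind change 2 can be sketched with a simplified model, assuming a plain `Map` as a hypothetical stand-in for the job configuration and a thread pool standing in for tasks. It ignores Spark's broadcast machinery and the per-JVM metadata cache; it only contrasts tasks that lock and copy a shared configuration with tasks that read a copy prepared once up front.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedJobConfSketch {
    // Counts how many times the configuration is copied.
    static final AtomicInteger copies = new AtomicInteger();

    // Hypothetical stand-in for `new JobConf(conf)`.
    static Map<String, String> copyConf(Map<String, String> base) {
        copies.incrementAndGet();
        return new HashMap<>(base);
    }

    static Map<String, String> baseConf() {
        Map<String, String> base = new HashMap<>();
        base.put("input.dir", "/tmp/in"); // illustrative key and value only
        return base;
    }

    // Runs `body` once per task on a small pool and waits for completion.
    static void runTasks(int tasks, Runnable body) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < tasks; i++) {
            pool.submit(body);
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }

    // Model of the old flow: every task locks the shared conf and copies it.
    static int copiesWhenEachTaskLocks(int tasks) throws InterruptedException {
        copies.set(0);
        Map<String, String> base = baseConf();
        runTasks(tasks, () -> {
            synchronized (base) {
                copyConf(base).get("input.dir");
            }
        });
        return copies.get();
    }

    // Model of the proposed flow: copy once up front, tasks only read.
    static int copiesWhenDriverShares(int tasks) throws InterruptedException {
        copies.set(0);
        Map<String, String> shared = Collections.unmodifiableMap(copyConf(baseConf()));
        runTasks(tasks, () -> shared.get("input.dir")); // no lock taken in the task
        return copies.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("per-task locking: " + copiesWhenEachTaskLocks(8) + " copies");
        System.out.println("shared copy:      " + copiesWhenDriverShares(8) + " copies");
    }
}
```

In the second variant the tasks never synchronize on the configuration, so there is no per-task lock that could participate in a lock-ordering conflict.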