Skip to content

[SPARK-18817][SparkR] set default spark-warehouse path to tempdir()#16247

Closed
bdwyer2 wants to merge 6 commits into
apache:masterfrom
bdwyer2:default_sparkr_spark_warehouse_fix
Closed

[SPARK-18817][SparkR] set default spark-warehouse path to tempdir()#16247
bdwyer2 wants to merge 6 commits into
apache:masterfrom
bdwyer2:default_sparkr_spark_warehouse_fix

Conversation

@bdwyer2

@bdwyer2 bdwyer2 commented Dec 10, 2016

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Set the default location of spark.sql.warehouse.dir to be compliant with the CRAN policy (https://cran.r-project.org/web/packages/policies.html) regarding writing files outside of the tmp directory. Previously a folder named spark-warehouse was created in the working directory when sparkR.session() was called.

See SPARK-15799 for discussion.
cc @shivaram

How was this patch tested?

Added new test and manually verified nothing was created in my working directory after running the following code:

sparkR.session(master = "local[*]",
               sparkConfig = list(spark.driver.memory = "2g"),
               enableHiveSupport = FALSE)

@HyukjinKwon

Copy link
Copy Markdown
Member

I may missed something but isn't tmpdir per R's session? Maybe I should have tried it first but I guess it wouldn't be accessible after the R session is ended and another one is started. Is this behaviour expected?

@shivaram

Copy link
Copy Markdown
Contributor

Jenkins, ok to test

@bdwyer2

bdwyer2 commented Dec 10, 2016

Copy link
Copy Markdown
Contributor Author

@HyukjinKwon yes but we are restricted by CRAN policies

Packages should not write in the users’ home filespace, nor anywhere else on the file system apart from the R session’s temporary directory

@shivaram

Copy link
Copy Markdown
Contributor

@HyukjinKwon That is a good point. As @bdwyer2 says the question is one of defaults. If the user does specify a more permanent location we will use that. I wonder if we should print a warning or info message as well ?

@HyukjinKwon

Copy link
Copy Markdown
Member

I see. Thank you both for your comments. (BTW, this should have a JIRA I think.)

@bdwyer2

bdwyer2 commented Dec 10, 2016

Copy link
Copy Markdown
Contributor Author

Should I open a JIRA under SPARK-15799 myself or leave that to one of the admins?

@SparkQA

SparkQA commented Dec 10, 2016

Copy link
Copy Markdown

Test build #69974 has finished for PR 16247 at commit c855c2c.

  • This patch fails R style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bdwyer2 bdwyer2 force-pushed the default_sparkr_spark_warehouse_fix branch from 21bf50b to 79caf1c Compare December 10, 2016 21:59
@shivaram

Copy link
Copy Markdown
Contributor

@bdwyer2 Feel free to open a JIRA as a sub-task under SPARK-15799

@bdwyer2 bdwyer2 changed the title [MINOR][SparkR] set default spark-warehouse path to tempdir() [SPARK-18817][SparkR] set default spark-warehouse path to tempdir() Dec 10, 2016
Comment thread R/pkg/R/sparkR.R
...) {

if (length(sparkConfig[["spark.sql.warehouse.dir"]]) == 0) {
sparkConfig[["spark.sql.warehouse.dir"]] <- tempdir()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I mentioned in the JIRA, I think this change is not enough. If the user has a hive-site.xml and the warehouse dir set inside that, this change will override that [1]. We might need to change the Scala code and/or add a new option to specify that a new default for SparkR.

[1]

val hiveWarehouseDir = sparkContext.hadoopConfiguration.get("hive.metastore.warehouse.dir")

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Plus this property spark.sql.warehouse.dir could also be set in spark-defaults.conf which isn't known by this method at this point. By setting it here it would override any other possible values from spark config or hive site.

@bdwyer2 bdwyer2 Dec 12, 2016

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about an argument named sparkWorkingDirectory default that to tempdir()?

@felixcheung felixcheung Dec 13, 2016

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Shivaram is talking about a spark property, not a parameter (if perhaps your camel casing is an indication)
Basically, we don't want to change spark.sql.warehouse.dir here because it could be set already in an earlier point (just not accessible here)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is the pb of circular dependency for getting the hive warehouse dir.

I think currently in sparkR, to create SparkSession, we need all those parameter and once spark session is there we can call

sparkContext <- SparkR:::callJMethod(sc, "sparkContext") 
haddopConf <- SparkR:::callJMethod(sparkContext, "hadoopConfiguration")
hivewarehouseDir <- SparkR:::callJMethod(hadoopConf, "get", "hive.metastore.warehouse.dir"") 

but the above logic requires that we have sparksession available but there is no spark session yet at this line of code. So there must a way to get the hive.conf directory from env.
we have to make decision that is it ok to get the env from system or not? something like the following

require(XML)
data <- xmlParse(file.path(Sys.env("HADOOP_HOME", "conf", "hive-site.xml")
xml_data <- xmlToList(data)

it will make sparkR dependent on xml parser . just some thoughts :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixcheung I'm confused. By "spark property" do you mean something passed to sparkR.session() via the sparkConfig argument?

@shivaram

Copy link
Copy Markdown
Contributor

@bdwyer2 One more thing: Is there a good way to test this ?

@SparkQA

SparkQA commented Dec 10, 2016

Copy link
Copy Markdown

Test build #69975 has finished for PR 16247 at commit 79caf1c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung

Copy link
Copy Markdown
Member

I think we should only change spark.sql.warehouse.dir when we are loading SparkR as a package. This should minimize changes in the case where we are running in cluster mode and so on.

For that purpose, let's check for R interactive().

Conceptually, we might want to pass the R tempdir() along as a property and then check again after config and hive config is applied, if it is still not set, take that property and set it to spark.sql.warehouse.dir.

@shivaram

shivaram commented Dec 11, 2016 via email

Copy link
Copy Markdown
Contributor

@felixcheung

felixcheung commented Dec 11, 2016

Copy link
Copy Markdown
Member

... that's actually what I meant with "we might want to pass the R tempdir() along as a property" <-- this would be a new property and not spark.sql.warehouse.dir

@bdwyer2

bdwyer2 commented Dec 12, 2016

Copy link
Copy Markdown
Contributor Author

@shivaram I can create a test to verify the output of list.files() is the same before and after running sparkR.session().

@shivaram

Copy link
Copy Markdown
Contributor

@bdwyer2 The test case idea sounds good !

Regarding the conf naming for warehouse dir lets also check with contributors who are more familiar with SQL.
cc @yhuai @cloud-fan

@SparkQA

SparkQA commented Dec 12, 2016

Copy link
Copy Markdown

Test build #70032 has finished for PR 16247 at commit cee3944.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Comment thread R/pkg/inst/tests/testthat/test_sparkR.R Outdated
test_that("sparkR.session", {
# nothing should be written outside the tempdir() without explicit user premission
inital_working_directory_files <- list.files()
sparkR.session()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might want to explicitly run some SQL query here and then call sparkR.stop() and then check the files

@SparkQA

SparkQA commented Dec 12, 2016

Copy link
Copy Markdown

Test build #70035 has finished for PR 16247 at commit 17b7bff.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bdwyer2

bdwyer2 commented Dec 12, 2016

Copy link
Copy Markdown
Contributor Author

I don't see how my last commit could of caused this

functions in sparkR.R: .....
SparkSQL functions: Spark package found in SPARK_HOME: /home/jenkins/workspace/SparkPullRequestBuilder
Error in handleErrors(returnStatus, conn) : 
  java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
	at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
	at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
	at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
	at org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:67)
	at org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:66)
	at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
	at scala.collection.TraversableLike$WithF
Calls: test_package ... sparkR.session -> callJStatic -> invokeJava -> handleErrors

@jodersky

Copy link
Copy Markdown
Member

jenkins, retest this please

@SparkQA

SparkQA commented Dec 12, 2016

Copy link
Copy Markdown

Test build #70040 has finished for PR 16247 at commit 17b7bff.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 13, 2016

Copy link
Copy Markdown

Test build #70047 has finished for PR 16247 at commit 84a110e.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan

Copy link
Copy Markdown
Contributor

a new property for default warehouse LGTM

@felixcheung

felixcheung commented Dec 13, 2016

Copy link
Copy Markdown
Member

re: test failure. it might be related to this change? the call stack is hidden, it should be trying to call into https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionState.scala and failed.

Is it possible that tempdir() is not writeable on Jenkins?

@HyukjinKwon

HyukjinKwon commented Dec 13, 2016

Copy link
Copy Markdown
Member

This test failure seems related with this PR. It seems because the previous Hive-enabled spark session is not closed properly between test_saprkSQL.R and test_sparkR.R here, in particular, I suspect the lock in derby via Hive client (not sure). It seems apparently related with SPARK-16027?

There was only single instance of Hive-enabled spark session in test_saprkSQL.R but the test here introduces another one. I manually tested for each after removing each other and it seems working fine exclusively.

@felixcheung

Copy link
Copy Markdown
Member

Possibly, SPARK-16027 was just a hack, the root issue remains I think

@shivaram

Copy link
Copy Markdown
Contributor

Yeah disabling Hive for the test is fine. @bdwyer2 Can you add the new config flag as well ? We can do one final pass of review after that

@SparkQA

SparkQA commented Dec 13, 2016

Copy link
Copy Markdown

Test build #70093 has finished for PR 16247 at commit 9142397.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung

felixcheung commented Dec 14, 2016 via email

Copy link
Copy Markdown
Member

@felixcheung

Copy link
Copy Markdown
Member

But yes other Spark config properties would be set by the user in sparkConfig parameter of sparkR.session method. We would just add to that like

sparkConfig[["spark.sql.warehouse.default.dir"]] <- tempdir()

without adding another parameter to sparkR.session()

@bdwyer2

bdwyer2 commented Dec 14, 2016

Copy link
Copy Markdown
Contributor Author

How would we access that value on the scala side? Would sparkContext.hadoopConfiguration.get("spark.sql.warehouse.default.dir") work?

I'm currently unable to compile Spark which makes experimenting with scala difficult.

@felixcheung

Copy link
Copy Markdown
Member

@shivaram

Copy link
Copy Markdown
Contributor

@bdwyer2 Let us know if you have problems setting up the environment - If so me or @felixcheung can open a new PR that includes your changes (we can still assign the JIRA as your contribution)

The reason I ask is that it will be good to get this in before the RC3 cut as this helps the SparkR CRAN release etc.

@bdwyer2

bdwyer2 commented Dec 14, 2016

Copy link
Copy Markdown
Contributor Author

@shivaram @felixcheung I'll close this PR so that one of you can take over in order to have it done in time for the RC.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants