[SPARK-12919][SPARKR] Implement dapply() on DataFrame in SparkR. #12493
Changes from all commits
51beb71
ccc610c
79fac3b
908bc28
1ecef08
ed29678
a3326d7
04d44e6
605814e
fefa98e
f9efa7f
af16c46
64395eb
21c856c
b39466c
2264b57
3efe9f5
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -21,6 +21,7 @@ | |
| NULL | ||
|
|
||
| setOldClass("jobj") | ||
| setOldClass("structType") | ||
|
|
||
| #' @title S4 class that represents a SparkDataFrame | ||
| #' @description DataFrames can be created using functions like \link{createDataFrame}, | ||
|
|
@@ -1125,6 +1126,66 @@ setMethod("summarize", | |
| agg(x, ...) | ||
| }) | ||
|
|
||
| #' dapply | ||
| #' | ||
| #' Apply a function to each partition of a DataFrame. | ||
| #' | ||
| #' @param x A SparkDataFrame | ||
| #' @param func A function to be applied to each partition of the SparkDataFrame. | ||
| #' func should have only one parameter, to which a data.frame corresponding | ||
| #' to each partition will be passed. | ||
| #' The output of func should be a data.frame. | ||
| #' @param schema The schema of the resulting DataFrame after the function is applied. | ||
| #' It must match the output of func. | ||
| #' @family SparkDataFrame functions | ||
| #' @rdname dapply | ||
| #' @name dapply | ||
| #' @export | ||
| #' @examples | ||
| #' \dontrun{ | ||
| #' df <- createDataFrame (sqlContext, iris) | ||
| #' df1 <- dapply(df, function(x) { x }, schema(df)) | ||
|
Member
Could we have a more elaborate example to explain how
Contributor
Author
Added. |
||
| #' collect(df1) | ||
| #' | ||
| #' # filter and add a column | ||
| #' df <- createDataFrame ( | ||
| #' sqlContext, | ||
| #' list(list(1L, 1, "1"), list(2L, 2, "2"), list(3L, 3, "3")), | ||
| #' c("a", "b", "c")) | ||
| #' schema <- structType(structField("a", "integer"), structField("b", "double"), | ||
|
Contributor
By the way, we already have a simpler, string-based way to define a schema in Scala and Python; we may also add that to R.
Contributor
Author
OK, I will investigate it. I will submit a new PR for this or reuse https://issues.apache.org/jira/browse/SPARK-11046
Contributor
@sun-rui - Just a note that it would be great to have the simpler schema specification for 2.0. Let me know if you have a new JIRA or whether we will use 11046, so we can track it for the release.
Contributor
Author
Yeah, let me do some investigation. |
||
| #' structField("c", "string"), structField("d", "integer")) | ||
| #' df1 <- dapply( | ||
| #' df, | ||
| #' function(x) { | ||
| #' y <- x[x[1] > 1, ] | ||
| #' y <- cbind(y, y[1] + 1L) | ||
| #' }, | ||
| #' schema) | ||
| #' collect(df1) | ||
| #' # the result | ||
| #' # a b c d | ||
| #' # 1 2 2 2 3 | ||
| #' # 2 3 3 3 4 | ||
| #' } | ||
| setMethod("dapply", | ||
| signature(x = "SparkDataFrame", func = "function", schema = "structType"), | ||
| function(x, func, schema) { | ||
| packageNamesArr <- serialize(.sparkREnv[[".packages"]], | ||
| connection = NULL) | ||
|
Contributor
Should we make
Contributor
Author
Could you explain more? I don't understand. |
||
|
|
||
| broadcastArr <- lapply(ls(.broadcastNames), | ||
| function(name) { get(name, .broadcastNames) }) | ||
|
|
||
| sdf <- callJStatic( | ||
| "org.apache.spark.sql.api.r.SQLUtils", | ||
| "dapply", | ||
| x@sdf, | ||
| serialize(cleanClosure(func), connection = NULL), | ||
| packageNamesArr, | ||
| broadcastArr, | ||
| schema$jobj) | ||
|
Contributor
If schema is
Contributor
Author
No. If schema is NULL, schema$jobj evaluates to NULL. |
||
| dataFrame(sdf) | ||
| }) | ||
|
|
||
| ############################## RDD Map Functions ################################## | ||
| # All of the following functions mirror the existing RDD map functions, # | ||
|
|
||
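A note on the `schema$jobj` reply in the review thread above: in base R, `$` applied to `NULL` returns `NULL` rather than raising an error. The sketch below shows this with a stand-in object; `fakeSchema` and its `"jvm-handle"` value are hypothetical and are not SparkR's real `structType`. Note also that the method signature in this diff requires `schema` to be a `structType`, so the `NULL` case only matters if that signature is relaxed.

```r
# Stand-in for a structType: a classed list carrying a 'jobj' element.
# This is a hypothetical object for illustration, not SparkR's implementation.
fakeSchema <- structure(list(jobj = "jvm-handle"), class = "structType")

# Extracting from a real object returns the stored handle.
withSchema <- fakeSchema$jobj

# Extracting from NULL does not fail; it simply yields NULL.
noSchema <- NULL
withoutSchema <- noSchema$jobj

print(withSchema)
print(is.null(withoutSchema))
```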
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2017,6 +2017,46 @@ test_that("Histogram", { | |
| df <- as.DataFrame(sqlContext, data.frame(x = c(1, 2, 3, 4, 100))) | ||
| expect_equal(histogram(df, "x")$counts, c(4, 0, 0, 0, 0, 0, 0, 0, 0, 1)) | ||
| }) | ||
|
|
||
| test_that("dapply() on a DataFrame", { | ||
| df <- createDataFrame ( | ||
| sqlContext, | ||
| list(list(1L, 1, "1"), list(2L, 2, "2"), list(3L, 3, "3")), | ||
| c("a", "b", "c")) | ||
| ldf <- collect(df) | ||
| df1 <- dapply(df, function(x) { x }, schema(df)) | ||
| result <- collect(df1) | ||
| expect_identical(ldf, result) | ||
|
|
||
|
|
||
| # Filter and add a column | ||
| schema <- structType(structField("a", "integer"), structField("b", "double"), | ||
| structField("c", "string"), structField("d", "integer")) | ||
| df1 <- dapply( | ||
| df, | ||
| function(x) { | ||
| y <- x[x$a > 1, ] | ||
| y <- cbind(y, y$a + 1L) | ||
| }, | ||
| schema) | ||
| result <- collect(df1) | ||
| expected <- ldf[ldf$a > 1, ] | ||
| expected$d <- expected$a + 1L | ||
| rownames(expected) <- NULL | ||
| expect_identical(expected, result) | ||
|
|
||
| # Remove the added column | ||
| df2 <- dapply( | ||
| df1, | ||
| function(x) { | ||
| x[, c("a", "b", "c")] | ||
| }, | ||
| schema(df)) | ||
| result <- collect(df2) | ||
| expected <- expected[, c("a", "b", "c")] | ||
| expect_identical(expected, result) | ||
| }) | ||
|
Contributor
Can we have more tests for chained dapply (with and without schema)?
Contributor
Also, I think it would be good to add other data types besides double in the schema.
Contributor
Author
Added. |
||
|
|
||
| unlink(parquetPath) | ||
| unlink(jsonPath) | ||
| unlink(jsonPathNa) | ||
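The per-partition functions used in the test above can be tried on a plain `data.frame` without a Spark session, since `func` receives an ordinary `data.frame`. This is a minimal local sketch and not part of the PR; `part`, `addOne`, and `dropD` are names chosen here for illustration.

```r
# A local data.frame standing in for one partition (same values as the test data).
part <- data.frame(a = 1:3, b = c(1, 2, 3), c = c("1", "2", "3"),
                   stringsAsFactors = FALSE)

# Same body as the "filter and add a column" function in the test,
# with an explicit return value added.
addOne <- function(x) {
  y <- x[x$a > 1, ]
  y <- cbind(y, y$a + 1L)
  y
}

# Same body as the "remove the added column" function in the test.
dropD <- function(x) {
  x[, c("a", "b", "c")]
}

out <- addOne(part)   # keeps the rows where a > 1 and appends a + 1
back <- dropD(out)    # chained step: drops the appended column again

print(dim(out))
print(out[[4]])
print(names(back))
```

Locally the fourth column gets an automatically generated name; in the doc example the result column is named `d`, as given in the schema, which is why the schema passed to `dapply` lists four fields.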
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -84,6 +84,13 @@ broadcastElap <- elapsedSecs() | |
| # as number of partitions to create. | ||
| numPartitions <- SparkR:::readInt(inputCon) | ||
|
|
||
| isDataFrame <- as.logical(SparkR:::readInt(inputCon)) | ||
|
Member
This might be beyond the scope of this JIRA/PR: when we have protocol changes like this, how do we make sure the peer has a matching implementation, so that we are not misinterpreting the byte stream? Should there be some sort of protocol version handshake? For example, here we are coercing an Int value into true/false, but the Int may not be 0 or 1.
Contributor
This is a good point. I think the assumption is that R worker processes are started using the same binary release as the JVM processes. But yeah, having a protocol version number or something like that might be interesting to explore.
Contributor
Author
Currently, SparkR is not a standalone package but an integral part of the Spark binary release. So it is assumed that the R worker script of the correct matching version is always invoked. The protocol between the JVM and the R worker is internal.
Member
Generally, I agree. It is possible, though, that a user could have an initialization or profile file that inadvertently loads a mismatched version of SparkR.
Contributor
Author
The "--vanilla" option when launching the R worker prevents this. |
||
|
|
||
| # If isDataFrame, then read column names | ||
| if (isDataFrame) { | ||
| colNames <- SparkR:::readObject(inputCon) | ||
| } | ||
|
|
||
| isEmpty <- SparkR:::readInt(inputCon) | ||
|
|
||
| if (isEmpty != 0) { | ||
|
|
@@ -100,7 +107,34 @@ if (isEmpty != 0) { | |
| # Timing reading input data for execution | ||
| inputElap <- elapsedSecs() | ||
|
|
||
| output <- computeFunc(partition, data) | ||
| if (isDataFrame) { | ||
| if (deserializer == "row") { | ||
| # Transform the list of rows into a data.frame | ||
| # Note that the optional argument stringsAsFactors for rbind is | ||
| # available since R 3.2.4. So we set the global option here. | ||
| oldOpt <- getOption("stringsAsFactors") | ||
| options(stringsAsFactors = FALSE) | ||
| data <- do.call(rbind.data.frame, data) | ||
| options(stringsAsFactors = oldOpt) | ||
|
|
||
| names(data) <- colNames | ||
| } else { | ||
| # Check to see if data is a valid data.frame | ||
| stopifnot(deserializer == "byte") | ||
| stopifnot(class(data) == "data.frame") | ||
| } | ||
|
Contributor
If deserializer is not
Contributor
Author
Done. |
||
| output <- computeFunc(data) | ||
| if (serializer == "row") { | ||
| # Transform the result data.frame back to a list of rows | ||
| output <- split(output, seq(nrow(output))) | ||
| } else { | ||
| # Serialize the output to a byte array | ||
| stopifnot(serializer == "byte") | ||
| } | ||
| } else { | ||
| output <- computeFunc(partition, data) | ||
| } | ||
|
|
||
| # Timing computing | ||
| computeElap <- elapsedSecs() | ||
|
|
||
|
|
||
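Two pieces of the worker change above can be tried in base R without Spark. First, `as.logical()` treats any non-zero integer as `TRUE`, which is the concern raised in the review about the `isDataFrame` flag; `readFlag` below is a hypothetical stricter alternative, not something in this PR. Second, the conversion from a list of rows to a `data.frame`, and the `split()` back into rows, are reproduced on sample data. Passing `stringsAsFactors = FALSE` directly to `rbind.data.frame` is an assumption that needs a recent R (the diff comment says the argument is available since R 3.2.4); the PR sets the global option instead.

```r
# as.logical() maps every non-zero integer to TRUE.
print(as.logical(c(0L, 1L, 2L)))

# Hypothetical stricter parser: accept only 0 or 1, fail on anything else.
readFlag <- function(v) {
  if (!(v %in% c(0L, 1L))) {
    stop("unexpected flag value: ", v)
  }
  v == 1L
}
strictOk <- readFlag(1L)
strictErr <- tryCatch(readFlag(2L), error = function(e) "rejected")

# Rows as they might arrive with the "row" deserializer: a list of lists.
rows <- list(list(1L, 1, "1"), list(2L, 2, "2"), list(3L, 3, "3"))
colNames <- c("a", "b", "c")

# List of rows -> data.frame, then restore the column names.
data <- do.call(rbind.data.frame, c(rows, list(stringsAsFactors = FALSE)))
names(data) <- colNames

# data.frame -> list of one-row data.frames, as done for the "row" serializer.
output <- split(data, seq(nrow(data)))
print(length(output))
```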
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -31,6 +31,7 @@ import org.apache.spark.annotation.{DeveloperApi, Experimental} | |
| import org.apache.spark.api.java.JavaRDD | ||
| import org.apache.spark.api.java.function._ | ||
| import org.apache.spark.api.python.PythonRDD | ||
| import org.apache.spark.broadcast.Broadcast | ||
| import org.apache.spark.rdd.RDD | ||
| import org.apache.spark.sql.catalyst._ | ||
| import org.apache.spark.sql.catalyst.analysis._ | ||
|
|
@@ -1980,6 +1981,23 @@ class Dataset[T] private[sql]( | |
| mapPartitions(func)(encoder) | ||
| } | ||
|
|
||
| /** | ||
| * Returns a new [[DataFrame]] that contains the result of applying a serialized R function | ||
| * `func` to each partition. | ||
| * | ||
| * @group func | ||
| */ | ||
|
Contributor
Maybe we can add a @since attribute in the comment?
Contributor
Author
Spark 2.0 is a good chance to add "since" for SparkR API methods, but I think we should do it consistently for all methods at once. I will submit a new JIRA issue for it.
Contributor
Author
|
||
| private[sql] def mapPartitionsInR( | ||
| func: Array[Byte], | ||
| packageNames: Array[Byte], | ||
| broadcastVars: Array[Broadcast[Object]], | ||
| schema: StructType): DataFrame = { | ||
| val rowEncoder = encoder.asInstanceOf[ExpressionEncoder[Row]] | ||
| Dataset.ofRows( | ||
| sparkSession, | ||
| MapPartitionsInR(func, packageNames, broadcastVars, schema, rowEncoder, logicalPlan)) | ||
| } | ||
|
|
||
| /** | ||
| * :: Experimental :: | ||
| * (Scala-specific) | ||
|
|
||
Please add a doc example.
Done.