Skip to content

[SPARK-14938][ML] replace RDD.map with Dataset.as#12718

Closed
zhengruifeng wants to merge 12 commits into
apache:masterfrom
zhengruifeng:use_dataset
Closed

[SPARK-14938][ML] replace RDD.map with Dataset.as#12718
zhengruifeng wants to merge 12 commits into
apache:masterfrom
zhengruifeng:use_dataset

Conversation

@zhengruifeng

@zhengruifeng zhengruifeng commented Apr 26, 2016

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Replace rdd with dataset in ML.
From:

dataset.select(col($(labelCol)).cast(DoubleType), f, w).rdd.map {
    case Row(label: Double, feature: Double, weight: Double) =>
        (label, feature, weight)
}

To:

dataset.select(col($(labelCol)).cast(DoubleType), f, w)
    .as[(Double, Double, Double)].rdd

How was this patch tested?

local build

@SparkQA

SparkQA commented Apr 26, 2016

Copy link
Copy Markdown

Test build #57049 has finished for PR 12718 at commit d3df6d4.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin

rxin commented Apr 27, 2016

Copy link
Copy Markdown
Contributor

cc @jkbradley this seems like a good idea...

@zhengruifeng zhengruifeng changed the title [SPARK-14938][ML] replace rdd with dataset [SPARK-14938][ML] replace some rdd.map with Dataset.as Apr 27, 2016
@zhengruifeng zhengruifeng changed the title [SPARK-14938][ML] replace some rdd.map with Dataset.as [SPARK-14938][ML] replace some RDD.map with Dataset.as Apr 27, 2016
@zhengruifeng

Copy link
Copy Markdown
Contributor Author

Now, Encoder for Vector is missing...

@SparkQA

SparkQA commented Apr 27, 2016

Copy link
Copy Markdown

Test build #57105 has finished for PR 12718 at commit a69e1fd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 27, 2016

Copy link
Copy Markdown

Test build #57114 has finished for PR 12718 at commit ed90f24.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng

zhengruifeng commented Apr 27, 2016

Copy link
Copy Markdown
Contributor Author

Vector, LabeledPoint, Instance and AFTPoint can not be used in Dataset.as now

@SparkQA

SparkQA commented Apr 28, 2016

Copy link
Copy Markdown

Test build #57222 has finished for PR 12718 at commit e57332a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick

MLnick commented Apr 28, 2016

Copy link
Copy Markdown
Contributor

We should definitely have an encoder for vector udts... cc @dbtsai @viirya

* and put it in an RDD with strong types.
*/
protected def extractLabeledPoints(dataset: Dataset[_]): RDD[LabeledPoint] = {
dataset.select(col($(labelCol)).cast(DoubleType), col($(featuresCol))).rdd.map {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need the col($(labelCol)).cast(DoubleType) since now all Predictors take any NumericType for labels, see #10355

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I will fix it

@viirya

viirya commented Apr 28, 2016

Copy link
Copy Markdown
Member

We have no implicit encoder for vector udt. But we can explicitly create it.

@MLnick

MLnick commented Apr 28, 2016

Copy link
Copy Markdown
Contributor

@viirya yeah, sorry I meant we should create an encoder that can be used in ml... whether an implicit or explicit.

@viirya

viirya commented Apr 28, 2016

Copy link
Copy Markdown
Member

encoder now supports UDTs. You just need to declare one before you want to use it, since sql implicit does not include implicit ones for them.

@MLnick

MLnick commented Apr 28, 2016

Copy link
Copy Markdown
Contributor

@viirya can you provide an example of how this works for use here in this PR?

@SparkQA

SparkQA commented Apr 28, 2016

Copy link
Copy Markdown

Test build #57225 has finished for PR 12718 at commit b2101b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya

viirya commented Apr 28, 2016

Copy link
Copy Markdown
Member

e.g., in BisectingKMeans, this patch changes

val data = dataset.select(col($(featuresCol))).rdd.map { case Row(point: Vector) => point }

to

val data = dataset.select(col($(featuresCol))).as[Vector].rdd

We can add:

implicit def vectorEncoder: Encoder[Vector] = ExpressionEncoder()

in the class.

@SparkQA

SparkQA commented Apr 28, 2016

Copy link
Copy Markdown

Test build #57231 has finished for PR 12718 at commit 2378829.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 28, 2016

Copy link
Copy Markdown

Test build #57235 has finished for PR 12718 at commit 530f037.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 28, 2016

Copy link
Copy Markdown

Test build #57238 has finished for PR 12718 at commit 9584026.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 28, 2016

Copy link
Copy Markdown

Test build #57245 has finished for PR 12718 at commit df5e917.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 28, 2016

Copy link
Copy Markdown

Test build #57246 has finished for PR 12718 at commit 8863991.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 28, 2016

Copy link
Copy Markdown

Test build #57247 has finished for PR 12718 at commit 9ef9a51.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng zhengruifeng changed the title [SPARK-14938][ML] replace some RDD.map with Dataset.as [SPARK-14938][ML] replace RDD.map with Dataset.as Apr 28, 2016
@zhengruifeng

Copy link
Copy Markdown
Contributor Author

@viirya Thanks. I updated this PR following your example.

@zhengruifeng

Copy link
Copy Markdown
Contributor Author

@jkbradley @mengxr @jaceklaskowski The new Dataset.as API is appled to ML in this PR.

case Row(label: Double, features: Vector) =>
LabeledPoint(label, features)
}
val input = dataset.select(col($(labelCol)).cast(DoubleType).as("label"),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, couldn't resist :) I'd change "label" to be a symbol 'label. Not very widely used, but think it deserves its place in the code.

@zhengruifeng zhengruifeng Apr 29, 2016

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jaceklaskowski Do you mean change from col($(labelCol)).cast(DoubleType).as("label") to col($(labelCol)).cast(DoubleType).as(getDefault(labelCol).get) ?

@jaceklaskowski

Copy link
Copy Markdown
Contributor

Other than the few places where you could use symbols not string literals LGTM. Excellent job! Thanks.

@zhengruifeng

Copy link
Copy Markdown
Contributor Author

This PR is too old and have many conflict with current master. I will close it.

@zhengruifeng zhengruifeng deleted the use_dataset branch September 30, 2016 04:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants