[SPARK-14938][ML] replace RDD.map with Dataset.as by zhengruifeng · Pull Request #12718 · apache/spark

zhengruifeng · 2016-04-26T23:24:16Z

What changes were proposed in this pull request?

Replace RDD.map over Row pattern matches with Dataset.as in ML.
From:

dataset.select(col($(labelCol)).cast(DoubleType), f, w).rdd.map {
    case Row(label: Double, feature: Double, weight: Double) =>
        (label, feature, weight)
}

To:

dataset.select(col($(labelCol)).cast(DoubleType), f, w)
    .as[(Double, Double, Double)].rdd
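
For readers who want to try the two forms side by side, here is a minimal sketch. It assumes Spark 2.x or later on the classpath and a local SparkSession; the object name, the toy data, and the plain column names (used in place of the $(labelCol) params) are illustrative and not taken from the PR.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper object, not part of the PR.
object DatasetAsSketch {
  // Returns the records produced by the Row-matching path and by Dataset.as.
  def run(): (Seq[(Double, Double, Double)], Seq[(Double, Double, Double)]) = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("dataset-as-sketch")
      .getOrCreate()
    import spark.implicits._ // brings the tuple Encoder into scope

    try {
      // Toy data standing in for the label, feature and weight columns.
      val df = Seq((1.0, 0.5, 1.0), (0.0, 1.5, 2.0)).toDF("label", "feature", "weight")
      val selected = df.select(col("label").cast(DoubleType), col("feature"), col("weight"))

      // Old form: pattern match on untyped Rows.
      val viaRows = selected.rdd.map {
        case Row(label: Double, feature: Double, weight: Double) =>
          (label, feature, weight)
      }.collect().toSeq

      // New form: let the tuple Encoder do the conversion (columns map by ordinal).
      val viaAs = selected.as[(Double, Double, Double)].rdd.collect().toSeq

      (viaRows, viaAs)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (viaRows, viaAs) = run()
    assert(viaRows == viaAs)
  }
}
```

Both paths should yield the same tuples. The second form drops the hand-written pattern match and leaves the conversion to the encoder.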

How was this patch tested?

local build

SparkQA · 2016-04-26T23:32:05Z

Test build #57049 has finished for PR 12718 at commit d3df6d4.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-04-27T07:35:00Z

cc @jkbradley this seems like a good idea...

zhengruifeng · 2016-04-27T08:29:32Z

Now, Encoder for Vector is missing...

SparkQA · 2016-04-27T08:42:41Z

Test build #57105 has finished for PR 12718 at commit a69e1fd.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-27T10:24:08Z

Test build #57114 has finished for PR 12718 at commit ed90f24.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2016-04-27T13:03:27Z

Vector, LabeledPoint, Instance and AFTPoint cannot be used with Dataset.as at the moment.

SparkQA · 2016-04-28T06:16:49Z

Test build #57222 has finished for PR 12718 at commit e57332a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-04-28T06:22:05Z

We should definitely have an encoder for vector udts... cc @dbtsai @viirya

MLnick · 2016-04-28T06:42:15Z

   * and put it in an RDD with strong types.
   */
  protected def extractLabeledPoints(dataset: Dataset[_]): RDD[LabeledPoint] = {
-    dataset.select(col($(labelCol)).cast(DoubleType), col($(featuresCol))).rdd.map {


You need to keep the col($(labelCol)).cast(DoubleType), since all Predictors now accept any NumericType for labels; see #10355.

zhengruifeng (reply, Apr 28, 2016): Thanks, I will fix it.
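
To illustrate why the cast matters, the sketch below builds a DataFrame whose label column is an integer. It assumes Spark 2.x or later and a local SparkSession; the object name, data, and column names are hypothetical. Without the cast, a Row pattern that expects Double labels does not match; with the cast, the typed conversion goes through.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper object, not part of the PR.
object LabelCastSketch {
  // Returns (whether the Double pattern matches the uncast rows, rows after the cast).
  def run(): (Boolean, Seq[(Double, Double)]) = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("label-cast-sketch")
      .getOrCreate()
    import spark.implicits._

    try {
      // The label is an Int here, as allowed once Predictors accept any NumericType.
      val df = Seq((1, 0.5), (0, 1.5)).toDF("label", "feature")

      // Without the cast the label is a boxed Integer, so a Double type pattern fails.
      val rawMatches = df.select(col("label"), col("feature")).rdd.map {
        case Row(_: Double, _: Double) => true
        case _ => false
      }.collect().forall(identity)

      // With the cast, every label is a Double and the tuple encoder applies.
      val casted = df.select(col("label").cast(DoubleType), col("feature"))
        .as[(Double, Double)].rdd.collect().toSeq

      (rawMatches, casted)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (rawMatches, casted) = run()
    assert(!rawMatches)
    assert(casted.forall { case (label, _) => label == 0.0 || label == 1.0 })
  }
}
```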

viirya · 2016-04-28T06:45:34Z

There is no implicit encoder for the Vector UDT, but we can create one explicitly.

MLnick · 2016-04-28T06:47:13Z

@viirya yeah, sorry, I meant we should create an encoder that can be used in ML, whether implicit or explicit.

viirya · 2016-04-28T06:49:28Z

Encoders now support UDTs. You just need to declare one before you use it, since the SQL implicits do not include implicit encoders for UDTs.

MLnick · 2016-04-28T06:52:34Z

@viirya can you provide an example of how this works for use here in this PR?

SparkQA · 2016-04-28T07:05:30Z

Test build #57225 has finished for PR 12718 at commit b2101b2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2016-04-28T07:16:57Z

e.g., in BisectingKMeans, this patch changes

val data = dataset.select(col($(featuresCol))).rdd.map { case Row(point: Vector) => point }

to

val data = dataset.select(col($(featuresCol))).as[Vector].rdd

We can add:

implicit def vectorEncoder: Encoder[Vector] = ExpressionEncoder()

in the class.
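
A self-contained version of this suggestion might look like the sketch below. It is written under several assumptions: Spark 2.x or later with spark-mllib on the classpath, the Vector type from org.apache.spark.ml.linalg (the PR itself may have targeted a different linalg package at the time), and a hypothetical object name and toy data. ExpressionEncoder lives in Spark's internal catalyst package, so its behavior may differ between Spark versions.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Encoder, Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of the PR.
object VectorEncoderSketch {
  // Explicitly declared encoder for the Vector UDT, as suggested in the comment.
  implicit def vectorEncoder: Encoder[Vector] = ExpressionEncoder()

  // Returns the vectors obtained via Row matching and via Dataset.as.
  def run(): (Seq[Vector], Seq[Vector]) = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("vector-encoder-sketch")
      .getOrCreate()

    try {
      // Toy single-column DataFrame of feature vectors.
      val df = spark.createDataFrame(Seq(
        Tuple1(Vectors.dense(1.0, 2.0)),
        Tuple1(Vectors.dense(3.0, 4.0))
      )).toDF("features")

      // Old form: pattern match on Row.
      val viaRows = df.select(col("features")).rdd
        .map { case Row(point: Vector) => point }
        .collect().toSeq

      // New form: relies on the implicit vectorEncoder declared above.
      val viaAs = df.select(col("features")).as[Vector].rdd.collect().toSeq

      (viaRows, viaAs)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (viaRows, viaAs) = run()
    assert(viaRows == viaAs)
  }
}
```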

SparkQA · 2016-04-28T07:53:41Z

Test build #57231 has finished for PR 12718 at commit 2378829.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-28T08:48:07Z

Test build #57235 has finished for PR 12718 at commit 530f037.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-28T09:34:52Z

Test build #57238 has finished for PR 12718 at commit 9584026.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-28T10:48:44Z

Test build #57245 has finished for PR 12718 at commit df5e917.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-28T11:01:59Z

Test build #57246 has finished for PR 12718 at commit 8863991.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-28T11:53:26Z

Test build #57247 has finished for PR 12718 at commit 9ef9a51.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2016-04-28T13:37:29Z

@viirya Thanks. I updated this PR following your example.

zhengruifeng · 2016-04-28T14:16:09Z

@jkbradley @mengxr @jaceklaskowski The new Dataset.as API is applied to ML in this PR.

jaceklaskowski · 2016-04-28T20:10:54Z

-      case Row(label: Double, features: Vector) =>
-        LabeledPoint(label, features)
-    }
+    val input = dataset.select(col($(labelCol)).cast(DoubleType).as("label"),


Sorry, couldn't resist :) I'd change "label" to a symbol, 'label. Symbols are not very widely used, but I think they deserve a place in the code.

zhengruifeng (reply, Apr 29, 2016): @jaceklaskowski Do you mean changing col($(labelCol)).cast(DoubleType).as("label") to col($(labelCol)).cast(DoubleType).as(getDefault(labelCol).get)?
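
For context on the symbol suggestion: Column.as has an overload that takes a Symbol, so a string alias and a symbol alias should produce the same column name. The sketch below assumes Spark 2.x or later and uses a hypothetical object name and toy data. It writes Symbol("label") rather than the 'label literal, because symbol literals are deprecated in Scala 2.13.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper object, not part of the PR.
object SymbolAliasSketch {
  // Returns the output column names for a String alias and a Symbol alias.
  def run(): (Seq[String], Seq[String]) = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("symbol-alias-sketch")
      .getOrCreate()
    import spark.implicits._

    try {
      val df = Seq((1, 0.5)).toDF("y", "x")

      // Alias given as a string literal.
      val viaString = df.select(col("y").cast(DoubleType).as("label")).columns.toSeq

      // Alias given as a Symbol; 'label is the literal form of Symbol("label").
      val viaSymbol = df.select(col("y").cast(DoubleType).as(Symbol("label"))).columns.toSeq

      (viaString, viaSymbol)
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (viaString, viaSymbol) = run()
    assert(viaString == viaSymbol)
  }
}
```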

jaceklaskowski · 2016-04-28T20:20:54Z

Other than the few places where you could use symbols instead of string literals, LGTM. Excellent job! Thanks.

zhengruifeng · 2016-09-30T04:44:02Z

This PR is too old and has many conflicts with the current master. I will close it.

zhengruifeng force-pushed the use_dataset branch from d3df6d4 to a69e1fd Compare April 27, 2016 08:21

zhengruifeng changed the title ~~[SPARK-14938][ML] replace rdd with dataset~~ [SPARK-14938][ML] replace some rdd.map with Dataset.as Apr 27, 2016

zhengruifeng changed the title ~~[SPARK-14938][ML] replace some rdd.map with Dataset.as~~ [SPARK-14938][ML] replace some RDD.map with Dataset.as Apr 27, 2016

zhengruifeng force-pushed the use_dataset branch from a69e1fd to ed90f24 Compare April 27, 2016 09:46

zhengruifeng added 5 commits April 28, 2016 13:42

replace rdd with dataset

2494924

fix a nit

dc3de9e

del unsupport vectorEncoder

474cbb1

del labeledpoint,instance,aftpoint

b481292

test Predictor.scala

e57332a

zhengruifeng force-pushed the use_dataset branch from ed90f24 to e57332a Compare April 28, 2016 05:50

test Predictor.scala 2

b2101b2

MLnick reviewed Apr 28, 2016

test Predictor.scala 3

2378829

add support of LabeledPoint,Instance,AFTPoint

530f037

test Vector in KMeans

9584026

add support for Vector

df5e917

fix scala style

8863991

fix build

9ef9a51

zhengruifeng changed the title ~~[SPARK-14938][ML] replace some RDD.map with Dataset.as~~ [SPARK-14938][ML] replace RDD.map with Dataset.as Apr 28, 2016

jaceklaskowski reviewed Apr 28, 2016

MLnick mentioned this pull request May 18, 2016

[SPARK-15339] [ML] ML 2.0 QA: Scala APIs and code audit for regression #13129

Closed

zhengruifeng closed this Sep 30, 2016

zhengruifeng deleted the use_dataset branch September 30, 2016 04:44
