[SPARK-14938][ML] replace RDD.map with Dataset.as #12718
Test build #57049 has finished for PR 12718 at commit
cc @jkbradley this seems like a good idea...
Force-pushed from d3df6d4 to a69e1fd
Now, Encoder for Vector is missing...
Force-pushed from a69e1fd to ed90f24
Vector, LabeledPoint, Instance and AFTPoint cannot be used in
Force-pushed from ed90f24 to e57332a
   * and put it in an RDD with strong types.
   */
  protected def extractLabeledPoints(dataset: Dataset[_]): RDD[LabeledPoint] = {
    dataset.select(col($(labelCol)).cast(DoubleType), col($(featuresCol))).rdd.map {
You need the col($(labelCol)).cast(DoubleType) cast, since all Predictors now accept any NumericType for labels; see #10355.
Thanks, I will fix it
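A minimal sketch of why the cast matters, assuming Spark 2.0+ with the `org.apache.spark.ml` packages on the classpath. The object name `LabelCastSketch` and the string-typed column parameters are hypothetical stand-ins for the `Predictor` params used in the PR. With an integer label column, the `Row(label: Double, ...)` pattern would throw a `scala.MatchError` unless the column is first cast to `DoubleType`.

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

object LabelCastSketch {
  // Hypothetical standalone helper: column names are plain strings here,
  // whereas the PR reads them from $(labelCol) and $(featuresCol).
  def extractLabeledPoints(
      dataset: Dataset[_],
      labelCol: String,
      featuresCol: String): RDD[LabeledPoint] = {
    // Cast first so that any numeric label type reaches the pattern as a Double.
    dataset.select(col(labelCol).cast(DoubleType), col(featuresCol)).rdd.map {
      case Row(label: Double, features: Vector) => LabeledPoint(label, features)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("label-cast-sketch")
      .getOrCreate()
    import spark.implicits._

    // Integer labels: accepted because of the cast inside extractLabeledPoints.
    val df = Seq(
      (1, Vectors.dense(0.5, 1.5)),
      (0, Vectors.dense(2.0, 3.0))
    ).toDF("label", "features")

    extractLabeledPoints(df, "label", "features").collect().foreach(println)
    spark.stop()
  }
}
```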
We have no implicit encoder for the Vector UDT, but we can create one explicitly.
@viirya yeah, sorry I meant we should create an encoder that can be used in
The encoder now supports UDTs. You just need to declare one before you use it, since the SQL implicits do not include implicit encoders for them.
@viirya can you provide an example of how this works for use here in this PR?
e.g., in to We can add: in the class.
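For illustration, a hedged sketch of declaring an encoder for a UDT-backed type explicitly. It is not necessarily the snippet referred to in the comment above. It assumes that `ExpressionEncoder`, which lives in Spark's internal `catalyst` package and may change between versions, can derive an encoder for `org.apache.spark.ml.linalg.Vector` through its registered UDT. The object name `VectorEncoderSketch` is hypothetical.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Dataset, Encoder, SparkSession}
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

object VectorEncoderSketch {
  // Assumption: the SQL implicits provide no Encoder[Vector],
  // so one is declared explicitly before it is needed.
  implicit val vectorEncoder: Encoder[Vector] = ExpressionEncoder[Vector]()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("vector-encoder-sketch")
      .getOrCreate()

    // A single-column DataFrame holding vectors.
    val df = spark.createDataFrame(Seq(
      Tuple1(Vectors.dense(1.0, 2.0)),
      Tuple1(Vectors.dense(3.0, 4.0))
    )).toDF("features")

    // The explicit encoder in scope lets .as[Vector] compile and resolve.
    val vectors: Dataset[Vector] = df.as[Vector]
    vectors.collect().foreach(v => println(v.size))
    spark.stop()
  }
}
```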
@viirya Thanks. I updated this PR following your example.
@jkbradley @mengxr @jaceklaskowski The new
      case Row(label: Double, features: Vector) =>
        LabeledPoint(label, features)
    }
    val input = dataset.select(col($(labelCol)).cast(DoubleType).as("label"),
Sorry, couldn't resist :) I'd change "label" to be a symbol 'label. Not very widely used, but I think it deserves its place in the code.
@jaceklaskowski Do you mean changing col($(labelCol)).cast(DoubleType).as("label") to col($(labelCol)).cast(DoubleType).as(getDefault(labelCol).get)?
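A rough sketch of the `Dataset.as` approach this thread discusses, assuming Spark 2.0+ and that the implicits from `SparkSession` can derive an encoder for the `LabeledPoint` case class. The object name `DatasetAsSketch` and the string-typed column parameters are hypothetical. The aliases matter because `.as[LabeledPoint]` resolves columns by the case class field names `label` and `features`, whatever the input columns are called.

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

object DatasetAsSketch {
  // Hypothetical standalone helper: returns a typed Dataset instead of an RDD.
  def extractLabeledPoints(
      dataset: Dataset[_],
      labelCol: String,
      featuresCol: String): Dataset[LabeledPoint] = {
    val spark = dataset.sparkSession
    import spark.implicits._ // assumed to supply the product encoder for LabeledPoint

    dataset.select(
      col(labelCol).cast(DoubleType).as("label"), // alias matches LabeledPoint.label
      col(featuresCol).as("features")             // alias matches LabeledPoint.features
    ).as[LabeledPoint]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("dataset-as-sketch")
      .getOrCreate()
    import spark.implicits._

    // Input columns deliberately use names other than "label" and "features".
    val df = Seq(
      (1, Vectors.dense(0.5, 1.5)),
      (0, Vectors.dense(2.0, 3.0))
    ).toDF("y", "x")

    extractLabeledPoints(df, "y", "x").collect().foreach(println)
    spark.stop()
  }
}
```

Compared with `rdd.map` over untyped rows, this keeps the data in a `Dataset` and moves the row-to-object conversion into the encoder.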
Other than the few places where you could use symbols instead of string literals, LGTM. Excellent job! Thanks.
This PR is too old and has many conflicts with the current master. I will close it.
What changes were proposed in this pull request?
Replace RDD with Dataset in ML.
From:
To:
How was this patch tested?
local build