[SPARK-7425] [ML] spark.ml Predictor should support other numeric types for label #10355
BenFradet wants to merge 64 commits into
This is still a WIP, but remarks are welcome.
Test build #47917 has finished for PR 10355 at commit
Test build #47918 has finished for PR 10355 at commit
Test build #47919 has finished for PR 10355 at commit
Test build #47921 has finished for PR 10355 at commit
I assume since this is a WIP you are still going to add test cases for the other predictors? Additionally, since ShortType and DecimalType also extend NumericType, I think the tests should include those cases.
It might be less verbose to create the DataFrame once and then add the other label column types to it. Something like:
val dfWithTypes = df
  .withColumn("shortLabel", df("label").cast(ShortType))
  .withColumn("longLabel", df("label").cast(LongType))
  .withColumn("intLabel", df("label").cast(IntegerType))
  .withColumn("floatLabel", df("label").cast(FloatType))
  .withColumn("decimalLabel", df("label").cast(DecimalType(10, 0)))
Then just change the label column between training runs. I'm not sure which way is better, but this would avoid copying the code ~5 times per test.
I think so too, thanks.
@sethah Yup, I plan to make it exhaustive but it'll take a bit of time.
It's not clear to me what this test is checking for. It looks like it only checks that no errors are thrown while training the tree. Maybe we should train a tree on a DataFrame whose label column is DoubleType and check that it equals the trees trained with the other label column types?
val doubleModel = dt.fit(dfWithDoubleLabels)
val intModel = dt.fit(dfWithIntLabels)
TreeTests.checkEqual(doubleModel, intModel)
Test build #47940 has finished for PR 10355 at commit
Test build #47996 has finished for PR 10355 at commit
Test build #47997 has finished for PR 10355 at commit
I noticed that AFTSurvivalRegression and IsotonicRegression use their own Should I make it support other numeric types in this PR or in another one?
Test build #48009 has finished for PR 10355 at commit
This is caused by AttributeFactory#fromStructField.
I don't know the best way to handle this; input is welcome.
Test build #48023 has finished for PR 10355 at commit
Test build #48024 has finished for PR 10355 at commit
cc @jkbradley
Also, what about the other things expecting a
Test build #49793 has finished for PR 10355 at commit
Test build #49794 has finished for PR 10355 at commit
Test build #49796 has finished for PR 10355 at commit
Failing tests do not seem related to the changes introduced, triggering a new build.
Jenkins, retest this please
Test build #49805 has finished for PR 10355 at commit
Pinging @jkbradley and @mengxr
Test build #52519 has finished for PR 10355 at commit
@mengxr @jkbradley is there any interest in this PR, or should I close it?
predictors accept numeric types and reject other types
Test build #54707 has finished for PR 10355 at commit
LGTM
No problem, thanks for reviewing.
Currently, the Predictor abstraction expects the input labelCol type to be DoubleType, but we should support other numeric types. This will involve updating the PredictorParams.validateAndTransformSchema method.
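As a rough illustration of the kind of schema check this implies, the sketch below accepts any label column whose type extends NumericType and rejects everything else. The object and method names (LabelSchemaCheck, checkNumericLabel) are hypothetical and are not Spark's actual implementation; only the classes from org.apache.spark.sql.types are real.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper for illustration only; not the code merged in this PR.
object LabelSchemaCheck {
  // Throws IllegalArgumentException if the column is missing or non-numeric.
  def checkNumericLabel(schema: StructType, labelCol: String): Unit = {
    // StructType.apply(name) throws IllegalArgumentException for unknown columns.
    val actual: DataType = schema(labelCol).dataType
    // ShortType, IntegerType, LongType, FloatType, DoubleType and DecimalType
    // all extend NumericType, so a single type test covers them.
    require(
      actual.isInstanceOf[NumericType],
      s"Column $labelCol must be of a numeric type but was actually of type $actual.")
  }
}

object LabelSchemaCheckExample extends App {
  val schema = StructType(Seq(
    StructField("label", IntegerType),
    StructField("text", StringType)))

  // Passes silently: IntegerType is a NumericType.
  LabelSchemaCheck.checkNumericLabel(schema, "label")

  // Fails the requirement: StringType is not numeric.
  val rejected = scala.util.Try(LabelSchemaCheck.checkNumericLabel(schema, "text"))
  println(s"text column rejected: ${rejected.isFailure}")
}
```

A check like this only validates the schema; a predictor would presumably still need to cast the label column to DoubleType before training, since the underlying algorithms operate on doubles.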