Skip to content

[SPARK-7425] [ML] spark.ml Predictor should support other numeric types for label#10355

Closed
BenFradet wants to merge 64 commits into
apache:masterfrom
BenFradet:SPARK-7425
Closed

[SPARK-7425] [ML] spark.ml Predictor should support other numeric types for label#10355
BenFradet wants to merge 64 commits into
apache:masterfrom
BenFradet:SPARK-7425

Conversation

@BenFradet

Copy link
Copy Markdown
Contributor

Currently, the Predictor abstraction expects the input labelCol type to be DoubleType, but we should support other numeric types. This will involve updating the PredictorParams.validateAndTransformSchema method.

@BenFradet

Copy link
Copy Markdown
Contributor Author

This is still a wip but remarks are welcome.

@SparkQA

SparkQA commented Dec 17, 2015

Copy link
Copy Markdown

Test build #47917 has finished for PR 10355 at commit c68ace1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 17, 2015

Copy link
Copy Markdown

Test build #47918 has finished for PR 10355 at commit 1027e83.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 17, 2015

Copy link
Copy Markdown

Test build #47919 has finished for PR 10355 at commit b014357.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 17, 2015

Copy link
Copy Markdown

Test build #47921 has finished for PR 10355 at commit 47cc81b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah

sethah commented Dec 17, 2015

Copy link
Copy Markdown
Contributor

I assume since this is a WIP you are still going to add test cases for the other predictors? Additionally, since ShortType and DecimalType also extend NumericType, I think the tests should include those cases.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be less verbose to create the dataframe once, and then add the other label column types to the same data frame. Something like:

val dfWithTypes = df
      .withColumn("shortLabel", df("label").cast(ShortType))
      .withColumn("longLabel", df("label").cast(LongType))
      .withColumn("intLabel", df("label").cast(IntegerType))
      .withColumn("floatLabel", df("label").cast(FloatType))
      .withColumn("decimalLabel", df("label").cast(DecimalType(10, 0)))

Then just change the label column between training. I'm not sure which way is better, but this would reduce copying the code ~5 times per test.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so too, thanks.

@BenFradet

Copy link
Copy Markdown
Contributor Author

@sethah Yup, I plan to make it exhaustive but it'll take a bit of time.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's not clear what this test is checking for. It looks like it just checks that no errors are thrown during training the tree. Maybe we should train a tree with a data frame with DoubleType as the column and check equality between training trees with other label column types?

val doubleModel = dt.fit(dfWithDoubleLabels)
val intModel = dt.fit(dfWithIntLabels)
TreeTests.checkEqual(doubleModel, intModel)

@SparkQA

SparkQA commented Dec 17, 2015

Copy link
Copy Markdown

Test build #47940 has finished for PR 10355 at commit 4ada320.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 18, 2015

Copy link
Copy Markdown

Test build #47996 has finished for PR 10355 at commit 5fd5069.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 18, 2015

Copy link
Copy Markdown

Test build #47997 has finished for PR 10355 at commit 0c6feab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BenFradet

Copy link
Copy Markdown
Contributor Author

I noticed that AFTSurvivalRegression and IsotonicRegression use their own validateAndTransformSchema method.

Should I make it support other numeric types in this PR or in another one?

@SparkQA

SparkQA commented Dec 18, 2015

Copy link
Copy Markdown

Test build #48009 has finished for PR 10355 at commit 67162db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is caused by AttributeFactory#fromStructField.

I don't know what's the best way to handle this, input welcome.

@BenFradet BenFradet changed the title [SPARK-7425] [ML] [WIP] spark.ml Predictor should support other numeric types for label [SPARK-7425] [ML] spark.ml Predictor should support other numeric types for label Dec 18, 2015
@SparkQA

SparkQA commented Dec 18, 2015

Copy link
Copy Markdown

Test build #48023 has finished for PR 10355 at commit 1003291.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 18, 2015

Copy link
Copy Markdown

Test build #48024 has finished for PR 10355 at commit 46ca0a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BenFradet

Copy link
Copy Markdown
Contributor Author

cc @jkbradley

@BenFradet

Copy link
Copy Markdown
Contributor Author

Also, what about the other things expecting a labelCol of DoubleType like the different evaluators?
Should this be handled in a follow-up jira/pr?

@SparkQA

SparkQA commented Jan 20, 2016

Copy link
Copy Markdown

Test build #49793 has finished for PR 10355 at commit 3c7b6c4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 20, 2016

Copy link
Copy Markdown

Test build #49794 has finished for PR 10355 at commit 2c3b4cf.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 20, 2016

Copy link
Copy Markdown

Test build #49796 has finished for PR 10355 at commit 2c55bb9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BenFradet

Copy link
Copy Markdown
Contributor Author

Failing tests do not seem related to the changes introduced, triggering a new build.
org.apache.spark.sql.hive.MetastoreDataSourcesSuite.insert into a table
org.apache.spark.sql.hive.MetastoreDataSourcesSuite.SPARK-8156:create table to specific database by 'use dbname'

@BenFradet

Copy link
Copy Markdown
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Jan 20, 2016

Copy link
Copy Markdown

Test build #49805 has finished for PR 10355 at commit 2c55bb9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BenFradet

Copy link
Copy Markdown
Contributor Author

Pinging @jkbradley and @mengxr

@SparkQA

SparkQA commented Mar 5, 2016

Copy link
Copy Markdown

Test build #52519 has finished for PR 10355 at commit dd18ddc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BenFradet

Copy link
Copy Markdown
Contributor Author

@mengxr @jkbradley is there any interest in this pr or should i close it?

@SparkQA

SparkQA commented Apr 1, 2016

Copy link
Copy Markdown

Test build #54707 has finished for PR 10355 at commit 718774b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley

Copy link
Copy Markdown
Member

LGTM
I'll go ahead and merge this with master before a conflict happens
Thanks for the PR!

@asfgit asfgit closed this in 36e8fb8 Apr 2, 2016
@BenFradet

Copy link
Copy Markdown
Contributor Author

No problem, thanks for reviewing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants