[SPARK-20351] [ML] Add trait hasTrainingSummary to replace the duplicate code #17654
hhbyyh wants to merge 7 commits into
Test build #75845 has finished for PR 17654 at commit
Test build #96194 has finished for PR 17654 at commit
Thanks for working on this; removing duplicated code is great. I'm curious why we couldn't remove some of the function calls to super and instead depend on inheritance. If the problem is the types on the setters, could we add another type parameter for the model?
  extends RegressionModel[Vector, GeneralizedLinearRegressionModel]
- with GeneralizedLinearRegressionBase with MLWritable {
+ with GeneralizedLinearRegressionBase with MLWritable
+ with HasTrainingSummary[GeneralizedLinearRegressionTrainingSummary]{
It looks like there still isn't a space here.
  override val numFeatures: Int = coefficients.size

  private[ml]
  override def setSummary(summary: Option[LinearRegressionTrainingSummary]): this.type =
Yeah, why do we need these definitions that just invoke the superclass method?
   *
   * @tparam T Summary instance type
   */
  @Since("2.3.0")
Let's target 3.0.0 at this point.
Gentle ping here: this is out of sync with master. If you've got the time to bring it up to date, that would be great.
Ah, just saw this. Will update it now.
Test build #100138 has finished for PR 17654 at commit
srowen left a comment
Looking fine except for a few nits and a question about whether to keep the overrides.
  /** Indicates whether a training summary exists for this model instance. */
  @Since("1.5.0")
  def hasSummary: Boolean = trainingSummary.isDefined
  override def summary: LinearRegressionTrainingSummary = super.summary
On the one hand, you don't need these overrides for this to work correctly, right? But I suppose they are necessary to preserve the @Since tag, which varies across implementations. These were mostly introduced in 1.5.0, though, and where they have a later @Since tag, it matches when the class was introduced. I think it would also be coherent, for Spark 3.0, to remove these overrides and mark the methods in the new trait as @Since 1.5.0. The result would be similar to what would have happened if the trait had been introduced at the start. I don't feel strongly about it, but what do you think? It would clean up the code a little more.
I got an error message from the Java side when removing the `summary` override:
/home/yuhao/workspace/github/hhbyyh/spark/mllib/src/test/java/org/apache/spark/ml/classification/JavaLogisticRegressionSuite.java:145: error: incompatible types: Object cannot be converted to LogisticRegressionTrainingSummary
[error] LogisticRegressionTrainingSummary summary = model.summary();
Ah OK nevermind then. Thanks for checking.
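For readers following along, here is a small sketch of the effect behind that error. It uses hypothetical stand-in names (`ErasedSummarySketch`, `FakeSummary`, `WithoutOverride`, `WithOverride`), not Spark's classes, and it only inspects the erased JVM return type through reflection. How javac reads the generic signature can depend on the Scala version, so treat this as an illustration of why the override helps, not as a reproduction of the build failure.

```scala
// Hypothetical stand-ins for illustration; none of these names exist in Spark.
trait ErasedSummarySketch[T] {
  protected var stored: Option[T] = None

  // T has no upper bound, so on the JVM this method's erased return type is Object.
  def summary: T = stored.getOrElse(throw new RuntimeException("no summary set"))
}

class FakeSummary

// Inherits `summary` only through the trait.
class WithoutOverride extends ErasedSummarySketch[FakeSummary]

// Re-declares `summary` with the concrete type, as the model classes in this PR do.
class WithOverride extends ErasedSummarySketch[FakeSummary] {
  override def summary: FakeSummary = super.summary
}

object ErasureDemo {
  // Looks up the public `summary` method and reports its erased return type.
  def erasedReturnType(c: Class[_]): Class[_] =
    c.getMethod("summary").getReturnType

  def main(args: Array[String]): Unit = {
    println(erasedReturnType(classOf[WithoutOverride]))
    println(erasedReturnType(classOf[WithOverride]))
  }
}
```

In this sketch the class without the override exposes `summary` with an erased return type of `Object`, while the class with the override also declares a method returning `FakeSummary`, which is what lets a Java caller assign the result without a cast.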
  /**
   * Gets summary of model on training set. An exception is
   * thrown if `trainingSummary == None`.
Nit: from the caller's perspective, they don't know what trainingSummary is. Maybe "if hasSummary is false"?
Sure. Thanks for checking.
Test build #100172 has finished for PR 17654 at commit
Test build #100210 has finished for PR 17654 at commit
Merged to master
Thanks for the review. @srowen
…te code

## What changes were proposed in this pull request?

Add a trait HasTrainingSummary to avoid code duplication related to training summaries. Currently all the training summaries use a similar pattern, which can be generalized:

```
private[ml] final var trainingSummary: Option[T] = None

def hasSummary: Boolean = trainingSummary.isDefined

def summary: T = trainingSummary.getOrElse...

private[ml] def setSummary(summary: Option[T]): ...
```

Classes with the trait need to override `setSummary`. For Java compatibility, they also have to override the `summary` method; otherwise the Java code will regard all the summary classes as Object due to a known issue with Scala.

## How was this patch tested?

Existing Java and Scala unit tests.

Closes apache#17654 from hhbyyh/hassummary.

Authored-by: Yuhao Yang <yuhao.yang@intel.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
…e in PySpark

## What changes were proposed in this pull request?

Python version of apache#17654

## How was this patch tested?

Existing Python unit test

Closes apache#23676 from huaxingao/spark26754.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
What changes were proposed in this pull request?

Add a trait HasTrainingSummary to avoid code duplication related to training summaries.
Currently all the training summaries use a similar pattern that can be generalized: a `private[ml]` `trainingSummary: Option[T]` field, a `hasSummary` check, a `summary` getter, and a `private[ml]` `setSummary` setter.

Classes with the trait need to override `setSummary`. For Java compatibility, they also have to override the `summary` method; otherwise the Java code will regard all the summary classes as Object due to a known issue with Scala.

How was this patch tested?

Existing Java and Scala unit tests.
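The pattern described above can be sketched roughly as follows. This is a minimal, self-contained approximation and not the merged Spark source: `HasTrainingSummarySketch`, `DemoSummary`, and `DemoModel` are hypothetical names, the `private[ml]` and `@Since` modifiers are dropped so the snippet compiles outside Spark, and `IllegalStateException` stands in for whatever exception the real trait throws.

```scala
// Sketch only: approximates the pattern from the PR description.
trait HasTrainingSummarySketch[T] {
  // Spark scopes this field to private[ml]; here it is simply private to the trait.
  private var trainingSummary: Option[T] = None

  /** Indicates whether a training summary exists for this model instance. */
  def hasSummary: Boolean = trainingSummary.isDefined

  /** Returns the training summary, or throws if hasSummary is false. */
  def summary: T = trainingSummary.getOrElse {
    throw new IllegalStateException(
      s"No training summary available for this ${getClass.getName}")
  }

  /** Stores the summary and returns the model itself so calls can be chained. */
  def setSummary(summary: Option[T]): this.type = {
    trainingSummary = summary
    this
  }
}

// Hypothetical summary and model types used only for this example.
final case class DemoSummary(numIterations: Int)

class DemoModel extends HasTrainingSummarySketch[DemoSummary] {
  // Narrowed return type, mirroring the override kept for Java callers.
  override def summary: DemoSummary = super.summary
}

object HasTrainingSummaryDemo {
  def main(args: Array[String]): Unit = {
    val model = new DemoModel
    println(model.hasSummary)
    model.setSummary(Some(DemoSummary(numIterations = 10)))
    println(model.summary.numIterations)
  }
}
```

Because `setSummary` returns `this.type`, the setter keeps the concrete model type without a second type parameter on the trait; the `summary` override exists only to pin the return type for Java callers.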