[SPARK-29235][ML][Pyspark]Support avgMetrics in read/write of CrossValidatorModel by shahidki31 · Pull Request #26038 · apache/spark

shahidki31 · 2019-10-06T21:07:59Z

What changes were proposed in this pull request?

Currently, PySpark does not write or read avgMetrics in CrossValidatorModel, whereas Scala does.

Why are the changes needed?

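Steps to reproduce:

# Assumes an active SparkSession named `spark`, as in the pyspark shell.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder

dataset = spark.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator, parallelism=2)
cvModel = cv.fit(dataset)
# Save the fitted model, then load it back to compare the metrics.
cvModel.write().save("/tmp/model")
cvModel2 = CrossValidatorModel.read().load("/tmp/model")
print(cvModel.avgMetrics)   # prints a non-empty result, as expected
print(cvModel2.avgMetrics)  # Bug: prints an empty result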

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually tested

Before patch:

>>> cvModel.write().save("/tmp/model_0")
>>> cvModel2 = CrossValidatorModel.read().load("/tmp/model_0")
>>> print(cvModel2.avgMetrics)
[]

After patch:

>>> cvModel2 = CrossValidatorModel.read().load("/tmp/model_2")
>>> print(cvModel2.avgMetrics[0])
0.5

SparkQA · 2019-10-06T21:13:37Z

Test build #111826 has finished for PR 26038 at commit 1c594da.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-06T21:37:11Z

Test build #111827 has finished for PR 26038 at commit 13e3a59.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

shahidki31 · 2019-10-08T10:05:56Z

cc @zhengruifeng Kindly review

zhengruifeng · 2019-10-09T01:39:36Z

In [1]: from pyspark.ml.classification import LogisticRegression

In [2]: from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [3]: from pyspark.ml.linalg import Vectors

In [4]: dataset = spark.createDataFrame(
   ...:     ...     [(Vectors.dense([0.0]), 0.0),
   ...:     ...      (Vectors.dense([0.4]), 1.0),
   ...:     ...      (Vectors.dense([0.5]), 0.0),
   ...:     ...      (Vectors.dense([0.6]), 1.0),
   ...:     ...      (Vectors.dense([1.0]), 1.0)] * 10,
   ...:     ...     ["features", "label"]).repartition(1)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-4-47bd70df4aa7> in <module>()
      1 dataset = spark.createDataFrame(
      2     ...     [(Vectors.dense([0.0]), 0.0),
----> 3     ...      (Vectors.dense([0.4]), 1.0),
      4     ...      (Vectors.dense([0.5]), 0.0),
      5     ...      (Vectors.dense([0.6]), 1.0),

TypeError: 'ellipsis' object is not callable

In [5]: dataset = spark.createDataFrame([(Vectors.dense([0.0]), 0.0),(Vectors.dense([0.4]), 1.0),(Vectors.dense([0.5]), 0.0),(Vectors.dense([0.6]), 1.0),(Vectors.dense([1.0]), 1.0)] * 10,["features", "label"]).repartition(1)

In [6]: lr = LogisticRegression()

In [7]: grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-045a988cd0ea> in <module>()
----> 1 grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()

NameError: name 'ParamGridBuilder' is not defined

In [8]: from pyspark.ml.tuning import *

In [9]: grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build()

In [10]: evaluator = BinaryClassificationEvaluator()

In [11]: tvs = TrainValidationSplit(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator, parallelism=1, seed=42)

In [12]: tvsModel = tvs.fit(dataset)
19/10/09 09:36:51 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
19/10/09 09:36:51 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS

In [13]: tvsModel.save("/tmp/model")

In [14]: tvsModel2 = TrainValidationSplitModel.load("/tmp/model")

In [15]: tvsModel.validationMetrics
Out[15]: [0.5, 0.8857142857142857]

In [16]: tvsModel2.validationMetrics
Out[16]: []

@shahidki31 The same issue also exists in TrainValidationSplitModel; can you fix it in this PR as well?
BTW, what about adding doctests for model save/load (also checking the loaded metrics)?
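
The round-trip check suggested here could be expressed as a small comparison helper. The sketch below is plain Python with no Spark dependency; `metrics_survive_round_trip` and its `tol` parameter are hypothetical names introduced for illustration, and the sample values are the ones printed in the session above.

```python
import math


def metrics_survive_round_trip(original, loaded, tol=1e-9):
    """Return True if `loaded` holds the same metrics as `original`.

    Hypothetical helper: a `loaded` list of a different length
    (for example the empty list reported here) counts as a failure.
    """
    if len(original) != len(loaded):
        return False
    return all(math.isclose(a, b, rel_tol=0.0, abs_tol=tol)
               for a, b in zip(original, loaded))


# Values taken from the session above.
before_save = [0.5, 0.8857142857142857]
after_load_buggy = []

print(metrics_survive_round_trip(before_save, after_load_buggy))   # False
print(metrics_survive_round_trip(before_save, list(before_save)))  # True
```

In a real doctest the two lists would come from the fitted model and the reloaded model (`validationMetrics` or `avgMetrics`) rather than from literals.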

shahidki31 · 2019-10-09T06:31:16Z

Thanks @zhengruifeng I will add metrics for TrainValidationSplitModel too.

srowen · 2019-10-16T16:57:30Z

If you'll make the changes @shahidki31 I think we can merge this.

shahidki31 · 2019-10-17T09:53:46Z

Thanks @srowen. I will update it today. Actually, there seems to be an issue: I think avgMetrics needs to be converted from a Java object to a Python object when reading.

SparkQA · 2019-10-17T20:00:56Z

Test build #112229 has finished for PR 26038 at commit 6068f66.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-17T21:33:42Z

Test build #112234 has finished for PR 26038 at commit b0f1975.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-17T21:39:30Z

Test build #112235 has finished for PR 26038 at commit 5a79a8a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

shahidki31 · 2019-10-17T22:10:56Z

Updated the PR. Locally verified.

SparkQA · 2019-10-17T23:19:05Z

Test build #112232 has finished for PR 26038 at commit b0f1975.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

SparkQA · 2019-10-17T23:43:06Z

Test build #112233 has finished for PR 26038 at commit 5e39d5a.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds the following public classes (experimental):
class _ValidatorParams(HasSeed):
class _CrossValidatorParams(_ValidatorParams):
class CrossValidator(Estimator, _CrossValidatorParams, HasParallelism, HasCollectSubModels,
class CrossValidatorModel(Model, _CrossValidatorParams, MLReadable, MLWritable):
class _TrainValidationSplitParams(_ValidatorParams):
class TrainValidationSplit(Estimator, _TrainValidationSplitParams, HasParallelism,
class TrainValidationSplitModel(Model, _TrainValidationSplitParams, MLReadable, MLWritable):

shahidki31 · 2019-10-17T23:54:30Z

retest this please

SparkQA · 2019-10-18T00:14:18Z

Test build #112240 has finished for PR 26038 at commit 5a79a8a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-18T00:24:34Z

Test build #112241 has finished for PR 26038 at commit 2755376.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-10-18T15:49:23Z

            self.uid,
            self.bestModel._to_java(),
-            _py2java(sc, []))
+            self.validationMetrics)


This seems fine but out of curiosity why is the _py2java call no longer needed here?

I think the conversion via _py2java already happens here:

spark/python/pyspark/ml/wrapper.py

Lines 60 to 69 in 9e42c52

def _new_java_obj(java_class, *args):
    """
    Returns a new Java object.
    """
    sc = SparkContext._active_spark_context
    java_obj = _jvm()
    for name in java_class.split("."):
        java_obj = getattr(java_obj, name)
    java_args = [_py2java(sc, arg) for arg in args]
    return java_obj(*java_args)

I compared the output with and without _py2java here; in both cases the written metadata file is the same. I'll add _py2java back here for consistency.
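
Why both variants can produce the same result can be illustrated with a toy model. This is a sketch, not PySpark's code: `FakeJavaObject`, `fake_py2java`, and `fake_new_java_obj` are hypothetical stand-ins, and the sketch assumes that the conversion leaves already-converted values unchanged.

```python
class FakeJavaObject:
    """Hypothetical stand-in for a value already on the JVM side."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, FakeJavaObject) and self.value == other.value


def fake_py2java(obj):
    """Toy conversion: wrap scalars, recurse into lists, and leave
    already-converted objects alone (the assumed idempotence)."""
    if isinstance(obj, FakeJavaObject):
        return obj
    if isinstance(obj, list):
        return [fake_py2java(x) for x in obj]
    return FakeJavaObject(obj)


def fake_new_java_obj(*args):
    """Mirrors the shape of the quoted wrapper: every argument is
    converted before it would reach the constructor."""
    return [fake_py2java(arg) for arg in args]


metrics = [0.5, 0.8857142857142857]

# Passing the raw list vs. passing a pre-converted list.
without_explicit = fake_new_java_obj("uid", metrics)
with_explicit = fake_new_java_obj("uid", fake_py2java(metrics))

print(without_explicit == with_explicit)  # True
```

Under that assumption the explicit call is redundant but harmless, which matches the observation that the written metadata is identical either way.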

SparkQA · 2019-10-18T18:53:38Z

Test build #112292 has finished for PR 26038 at commit 00c4258.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-10-19T20:24:20Z

Merged to master

shahidki31 · 2019-10-20T09:20:11Z

Thanks @srowen @zhengruifeng

Support avgMetrics

1c594da

style

13e3a59

dongjoon-hyun added the ML label Oct 7, 2019

srowen approved these changes Oct 8, 2019

zhengruifeng added the PYSPARK label Oct 9, 2019

update

6068f66

style

b0f1975

shahidki31 force-pushed the avgMetrics branch from 5e39d5a to b0f1975 Compare October 17, 2019 21:05

conflict resolve

5a79a8a

shahidki31 requested a review from srowen October 17, 2019 22:04

shahidki31 added 2 commits October 18, 2019 05:27

Merge branch 'master' of https://github.com/apache/spark into avgMetrics

5367e54

minor update

2755376

srowen reviewed Oct 18, 2019

address comment

00c4258

srowen approved these changes Oct 19, 2019

srowen closed this in 4a6005c Oct 19, 2019

shahidki31 deleted the avgMetrics branch October 20, 2019 09:20

zero323 mentioned this pull request Oct 23, 2019

Sync with changes merged after 6378d4bc06cd1bb1a209bd5fb63d10ef52d75eb4 zero323/pyspark-stubs#230

