[SPARK-29489][ML][PySpark] ml.evaluation support log-loss #26135
zhengruifeng wants to merge 7 commits into
- private val confusions = predictionAndLabels.map {
+ private lazy val confusions = predictionAndLabels.map {
If metricName == logloss, the confusion matrix is not needed, so I made this computation lazy.
    (prediction, label, 1.0)
  case other =>
    throw new IllegalArgumentException(s"Expected Row of tuples, got $other")
this(predictionAndLabels.rdd.map { r =>
Pattern matching will not work in PySpark, so I have to use r.get instead.
MultilabelMetrics also handles DataFrames this way.
Test build #112149 has finished for PR 26135 at commit
Test build #112150 has finished for PR 26135 at commit
Test build #112155 has finished for PR 26135 at commit
Test build #112163 has finished for PR 26135 at commit
def logLoss(eps: Double = 1e-15): Double = {
  require(eps > 0 && eps < 0.5, s"eps must be in range (0, 0.5), but got $eps")
  val loss1 = - math.log(eps)
  val loss2 = - math.log(1 - eps)
How about -math.log1p(-eps) here, since eps is going to be very small?
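The log1p suggestion can be checked numerically. The following is a minimal Python sketch, not code from the PR: it uses only the standard `math` module and the default eps from the diff, and compares both forms against the first-order approximation -log(1 - eps) ≈ eps, which holds for very small eps.

```python
import math

eps = 1e-15  # default eps taken from the diff above

# Naive form: 1 - eps is rounded to the nearest double before the log
# is taken, so part of eps is lost to rounding.
naive = -math.log(1 - eps)

# log1p form: evaluates log(1 + x) accurately for small |x|.
stable = -math.log1p(-eps)

# Relative deviation of each result from eps (the first-order Taylor term).
naive_rel_err = abs(naive - eps) / eps
stable_rel_err = abs(stable - eps) / eps
print(naive_rel_err, stable_rel_err)
```

With IEEE double precision, the naive form should deviate far more from eps than the log1p form does, which is the motivation for the suggestion.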
lazy val labels: Array[Double] = tpByClass.keys.toArray.sorted

/**
 * Returns the logLoss, aka logistic loss or cross-entropy loss.
You could just use a @return tag
Also log-loss rather than logLoss
Test build #112217 has finished for PR 26135 at commit
srowen left a comment
Looking OK pending tests and one very minor comment
/**
 * An auxiliary constructor taking a DataFrame.
- * @param predictionAndLabels a DataFrame with two double columns: prediction and label
+ * @param predictionAndLabels a DataFrame with columns: prediction, label, weight(optional)
Test build #112246 has finished for PR 26135 at commit
Merged to master, thanks @srowen for reviewing!
What changes were proposed in this pull request?
ml.MulticlassClassificationEvaluator & mllib.MulticlassMetrics support log-loss
Why are the changes needed?
log-loss is an important classification metric and is widely used in practice
Does this PR introduce any user-facing change?
Yes, add new option ("logloss") and a related param eps
How was this patch tested?
added testsuites & local tests referring to sklearn
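For readers unfamiliar with the metric, here is a rough illustration of what a log-loss with an eps parameter typically computes. This is a plain-Python sketch under stated assumptions, not the Spark implementation: the function name `log_loss`, its argument layout (one probability vector per row, indexed by the integer label), and the weighted-mean aggregation are all hypothetical choices made for this example. The clipping to [eps, 1 - eps] follows what the loss1/loss2 lines in the diff suggest.

```python
import math


def log_loss(labels, probabilities, weights=None, eps=1e-15):
    """Weighted mean of -log(p[label]), with p clipped to [eps, 1 - eps].

    Hypothetical helper for illustration; not the PR's code.
    """
    if not (0 < eps < 0.5):
        raise ValueError(f"eps must be in range (0, 0.5), but got {eps}")
    if weights is None:
        weights = [1.0] * len(labels)

    total_loss = 0.0
    total_weight = 0.0
    for label, probs, weight in zip(labels, probabilities, weights):
        # Probability the model assigned to the true class, clipped so
        # that the logarithm stays finite when the probability is 0 or 1.
        p = min(max(probs[int(label)], eps), 1 - eps)
        total_loss += -math.log(p) * weight
        total_weight += weight
    return total_loss / total_weight


# Two rows, two classes; each row's true class gets probability 0.9 and 0.8.
example = log_loss([0, 1], [[0.9, 0.1], [0.2, 0.8]])

# A confident but wrong prediction: the true class has probability 0.0,
# so the loss is capped at -log(eps) instead of being infinite.
clipped = log_loss([0], [[0.0, 1.0]])
```

The eps bound is what keeps the metric finite for confident wrong predictions. A smaller eps penalizes such rows more heavily.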