Tim/privacy experiments ces22 #180
…he model directories
This reverts commit 5c9d026.
…dict.py" This reverts commit a948aec.
This reverts commit e31f08d.
…ictions.csv" This reverts commit 2f7cbaa.
This reverts commit bed5f93.
This reverts commit 06277a9.
This reverts commit 258b294.
This reverts commit 2dbd04e.
…n predict.py" This reverts commit 3c7c4c2.
…y from the model directories" This reverts commit ead6749.
This reverts commit 298bd15.
…ataset" This reverts commit e8f3700.
… target value
The evaluate-classifier-roc.py and evaluate-classifier-statistics.py scripts
generate an ROC curve and generic metrics (respectively) for the predictions
in data/predictions.csv
Three open questions:
(1) Is the use of classes_ in scripts/predict.py correct?
(2) Even if the answer to (1) is "Yes," is use of that vector acceptable
given that there is apparently no other way to map the elements returned
by predict_proba() to values in the predicted column?
(3) Is the vector yt[] computed in evaluate-classifier-roc.py correct as the
first argument to roc_curve()?
Schaechtle left a comment
This looks mostly good. But we should fix the inline comments.
| # Create a new binary column that is 1 IFF the classifier prediction matches
| # the true value. This is what roc_curve() seems to want, but I'm not 100%
| # sure.
| yt = [1 if yv[i] == tv[i] else 0 for i in range(len(tv))]
Ah, yt in sklearn code is normally short for "y-test", meaning the test labels you want to predict, i.e. tv.
So either delete it or turn it into a binary integer vector.
| - yt = [1 if yv[i] == tv[i] else 0 for i in range(len(tv))]
| + yt = [int(v) for v in tv]
| # Compute ROC curve and ROC area
| fpr, tpr, thresholds = roc_curve(yt, yp)
Alternatively, if tv is already binary you could just put tv here.
| - fpr, tpr, thresholds = roc_curve(yt, yp)
| + fpr, tpr, thresholds = roc_curve(tv, yp)
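To make the suggestion concrete, here is a minimal sketch of calling roc_curve() with the true labels as the first argument and the positive-class probability as the second. The arrays are made-up toy values, not data from data/predictions.csv, and the names tv, yv and yp only mirror the script. It assumes scikit-learn is installed and that yp holds the probability of class 1 rather than the probability of whichever class was predicted.

```python
from sklearn.metrics import auc, roc_curve

# Toy values (hypothetical, for illustration only).
tv = [0, 0, 1, 1]              # true binary labels
yp = [0.1, 0.4, 0.35, 0.8]     # predicted probability of the positive class (1)
yv = [1 if p >= 0.5 else 0 for p in yp]  # hard predictions at a 0.5 cutoff

# The "prediction matches truth" indicator is a different vector from the
# true labels, so it is not what roc_curve() expects as y_true.
match = [1 if yv[i] == tv[i] else 0 for i in range(len(tv))]

# roc_curve(y_true, y_score): true labels first, scores second.
fpr, tpr, thresholds = roc_curve(tv, yp)
roc_auc = auc(fpr, tpr)
print(f"AUC = {roc_auc:.2f}")
```

In this toy case match is [1, 1, 0, 1] while tv is [0, 0, 1, 1], which shows why the two cannot be used interchangeably as the first argument.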
| # https://scikit-learn.org/stable/modules/generated/sklearn.
| # linear_model.LogisticRegression.html#sklearn.linear_model.
| # LogisticRegression.predict_proba
| probabilities = ml_model.predict_proba(X_test)
So, we need to decide which of the two binary classes we want to predict and always grab the corresponding probability column. Assuming the labels are encoded as 0 and 1, you could do the following:
| - probabilities = ml_model.predict_proba(X_test)
| + probabilities = ml_model.predict_proba(X_test)
| + j = list(ml_model.classes_).index(1)
Alternatively, if your true value is encoded as "true-value", you could run j = list(ml_model.classes_).index("true-value").
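A small self-contained sketch of this lookup, using a toy LogisticRegression with string labels. The training data, the label "yes" and the variable names positive_label and positive_proba are assumptions made for illustration; only predict_proba() and classes_ come from scikit-learn.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy training data (hypothetical): one feature, string-valued labels.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array(["no", "no", "yes", "yes"])
ml_model = LogisticRegression().fit(X_train, y_train)

X_test = np.array([[0.0], [3.0]])
probabilities = ml_model.predict_proba(X_test)  # shape (n_samples, n_classes)

# Columns of predict_proba() follow the order of ml_model.classes_,
# so look up the positive label's column once and reuse it for every row.
positive_label = "yes"  # assumed positive class for this sketch
j = list(ml_model.classes_).index(positive_label)
positive_proba = probabilities[:, j]
```

Because the column index is fixed once, every row reports the probability of the same class, which is the form roc_curve() needs for its score argument.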
Maybe for greatest generality we should specify the value to be predicted in params.yaml, then use that last method to create a new vector saying whether or not the predicted value equals that? That way we could handle dataset options like yes/no, true/false, without any preprocessing?
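One way this could look, sketched with a plain dict standing in for the parsed params.yaml. The key name positive_label and the helper binarize() are hypothetical, not names from this repository, and the same helper could be applied to either the true or the predicted column.

```python
# Stand-in for the parsed params.yaml; the key name is an assumption.
params = {"positive_label": "yes"}


def binarize(values, positive_label):
    """Return 1 where a value equals the positive label, else 0."""
    return [1 if v == positive_label else 0 for v in values]


# Works for yes/no, true/false, or any other two-valued encoding.
true_values = ["yes", "no", "no", "yes"]
y_true = binarize(true_values, params["positive_label"])
print(y_true)  # [1, 0, 0, 1]
```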
| # LogisticRegression.predict_proba
| probabilities = ml_model.predict_proba(X_test)
| for i in range(len(probabilities)):
| j = list(ml_model.classes_).index(results["prediction"][i])
.... and then you can delete this line.
This will become important when we add DVC stages for reproducibility (see following commit).
It's not needed if we save the resulting figure to file.
Will help make this part of a reproducible pipeline.
CAVEAT: Users need to change the positive label entry in params.yaml
This PR reverts a bunch of previous commits and in the end produces three net changes:
(1) Extend scripts/predict.py to add a column to data/predictions.csv with the probability of the predicted value for the target feature
(2) Add scripts/evaluate-classifier-statistics.py to compute generic metrics for the classifier output in data/predictions.csv
(3) Add scripts/evaluate-classifier-roc.py to generate a graphical ROC curve for the classifier output in data/predictions.csv
There are some open questions about the correctness and morality of using the classes_[] vector in scripts/predict.py and the correctness of the computed vector yt[] in scripts/evaluate-classifier-roc.py as an input to the roc_curve() function.