A summary of the advantages and disadvantages of different classification models [GeeksForGeeks]:
| model | linear? | bias | variance | pro | con |
|---|---|---|---|---|---|
| Logistic regression | linear | high | low | probabilistic approach | assumes a linear decision boundary |
| Naive Bayes | nonlinear | high | low | probabilistic approach, efficient, not biased by outliers, also works for nonlinear problems | assumes no interaction (conditional independence) between features |
| Tree | nonlinear | low | high | interpretable, no need for feature scaling | poor results on very small datasets; overfitting can easily occur |
| SVM | depends on kernel | low | high | not biased by outliers, not sensitive to overfitting | not a good choice for a large number of features or a large dataset |
| kNN | nonlinear | low | high | simple to understand, fast and efficient | need to tune the number of neighbours ‘k’ |
| Ensemble | nonlinear | low | high | powerful and accurate, good performance | no interpretability, overfitting can easily occur, needs hyperparameter tuning |
Summarized as follows (from Big Data Zone: Logistic Regression vs. Decision Tree):
| | Logistic Regression | Decision Tree |
|---|---|---|
| decision boundary | linear (b, d); works well as long as the classes are not perfectly separated | non-linear (a, c) |
| relatively small data size | yes | no |
| categorical data | needs label encoding or one-hot encoding (OHE) | yes |
| skewed data | need to up-weight the minority class or rebalance the data | can grow a full tree |
| outliers | can shift the decision boundary | not affected at the initial splits, but potentially at later ones |
| missing values | need to impute with the mean, mode, or median | yes; see "How do decision tree learning algorithms deal with missing values" |
| online learning | yes, via SGD | no |
The comparison can be visualized below (credit: Logistic Regression versus Decision Trees).
- For “relatively” very small dataset sizes, compare the performance of a discriminative logistic regression model against a related Naive Bayes classifier (a generative model) or an SVM, where the latter may be less susceptible to noise and outlier points. Even so, logistic regression is a great, robust model for simple classification tasks (by Sebastian Raschka).
- Logistic regression can be regarded as a one layer neural network (by Sebastian Raschka).
- One of the nice properties of logistic regression is that the logistic cost function (or max-entropy) is convex, and thus we are guaranteed to find the global cost minimum (by Sebastian Raschka).
Amazon’s Data Scientist Interview Practice Problems
Decision Trees: a tree-like model used to model decisions based on one or more conditions.
- Pros: easy to implement, intuitive, handles missing values
- Cons: high variance, inaccurate
Support Vector Machines: a classification technique that finds a hyperplane or a boundary between the two classes of data that maximizes the margin between the two classes. There are many planes that can separate the two classes, but only one plane can maximize the margin or distance between the classes.
- Pros: accurate in high dimensionality
- Cons: prone to over-fitting, does not directly provide probability estimates
Random Forests: an ensemble learning technique that builds off of decision trees. Random forests involve creating multiple decision trees using bootstrapped datasets of the original data and randomly selecting a subset of variables at each step of the decision tree. The model then selects the mode of all of the predictions of each decision tree.
- Pros: can achieve higher accuracy, handle missing values, feature scaling not required, can determine feature importance.
- Cons: black box, computationally intensive
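The bootstrap-and-vote idea can be sketched in plain numpy; the per-tree predictions below are hypothetical stand-ins for trees fit on bootstrapped datasets:

```python
import numpy as np

# Hypothetical per-tree class predictions for 5 samples from 7 trees
# (rows: trees, columns: samples); a real forest would produce these by
# fitting one decision tree per bootstrapped dataset.
tree_preds = np.array([
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 1],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
])

def majority_vote(preds):
    """Mode of each column: the per-sample majority class."""
    counts = np.apply_along_axis(np.bincount, 0, preds, minlength=preds.max() + 1)
    return counts.argmax(axis=0)

print(majority_vote(tree_preds))  # -> [0 1 1 0 1]
```

The mode (majority vote) over trees is exactly the "selects the mode of all of the predictions" step described above.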
In most cases, there is no perfect classifier. A common question is: which metric should we use for model selection, precision or recall? Should the classifier favor a high True Positive Rate (TPR) or a low False Positive Rate (FPR)? It depends on the domain and the business goal.
Recall the confusion matrix:
| | actual positive | actual negative |
|---|---|---|
| predicted positive | TP | FP |
| predicted negative | FN | TN |
The relevant metrics are
precision = TP/(TP+FP), recall = TP/(TP+FN)
and
TPR = recall, FPR = FP/(FP+TN)
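These definitions translate directly into code; a minimal sketch with hypothetical confusion-matrix counts:

```python
# Confusion-matrix metrics from the definitions above.
def classification_metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)   # recall is the same as TPR
    fpr = fp / (fp + tn)
    return precision, recall, fpr

# Made-up counts for illustration.
p, r, fpr = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(p, r, fpr)  # -> 0.8 0.888... 0.1818...
```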
Each probability threshold in a classifier determines a set of the above metrics. The relations between the probability threshold and the metrics are:
large threshold -> fewer predicted positives -> fewer TP and fewer FP -> lower TPR (recall) and lower FPR -> usually higher precision
small threshold -> more predicted positives -> more TP and more FP -> higher TPR (recall) and higher FPR -> usually lower precision
We can also define the review rate
review rate = N(prob > threshold)/N
where N is the number of data points.
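A small sketch (with made-up probabilities and labels) of how the threshold drives these metrics and the review rate:

```python
import numpy as np

# Hypothetical predicted probabilities and true labels.
probs  = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   1,   0,   0,   0])

def metrics_at_threshold(probs, labels, threshold):
    pred = probs > threshold
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    tn = np.sum(~pred & (labels == 0))
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    review_rate = pred.mean()   # N(prob > threshold) / N
    return tpr, fpr, review_rate

# Raising the threshold lowers TPR, FPR, and the review rate together.
print(metrics_at_threshold(probs, labels, 0.5))   # -> (0.75, 0.25, 0.5)
print(metrics_at_threshold(probs, labels, 0.85))  # -> (0.25, 0.0, 0.125)
```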
AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).
Interpretation: AUC represents the probability that a random positive example is ranked above a random negative example.
AUC provides an aggregate measure of performance across all possible classification thresholds. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example; for example, picture the examples arranged from left to right in ascending order of the logistic regression predictions (from Classification: ROC Curve and AUC).
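The ranking interpretation can be checked directly: over all positive/negative pairs, count how often the positive example scores higher (ties count as 1/2). A sketch with made-up scores:

```python
import numpy as np

def auc_by_ranking(scores, labels):
    """AUC as P(random positive scores higher than random negative)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]   # all positive/negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1,   1,   0,   1,   1,   0,   0,   0]
print(auc_by_ranking(scores, labels))  # -> 0.875
```

This pairwise count equals the area under the ROC curve computed by trapezoidal integration.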
Below are some examples of the question: when is precision important, and when is recall important? [Data Science: When is precision more important over recall?] [Cross Validated: How to determine the optimal threshold for a classifier and generate ROC curve?]. The answer depends on which error we want to minimize: does an FP or an FN cost more? Note that, as mentioned in the post, you could have 100% recall and yet a useless model: if the model always outputs a positive prediction, it has 100% recall but is completely uninformative.
Here I summarize the cases from the above posts:
- For rare-cancer modeling, a false negative is usually more disastrous than a false positive for a preliminary diagnosis. We want to minimize FN to get higher recall, so recall is a better measure than precision.
- For YouTube recommendations, FN is less of a concern. Precision matters more here: too many FPs annoy users, so we minimize FP.
- Imagine many free customers register on our website every day. The call-center team doesn't mind calling someone who isn't going to buy (so FP is not costly), but it is very important that every hot lead is always in our selection. That means the model needs high recall.
- For spam filtering, an FP occurs when a legitimate email message is wrongly classified as spam, which interferes with its delivery, and we may lose important messages. We therefore prefer more FNs over many FPs, so precision is more important.
- Say a machine learning model is created to predict, based on the weather, whether a certain day is a good day to launch satellites. If the model predicts that a good day is bad (FN), we miss the chance to launch; this is not such a big deal. But if the model predicts a good day when it is actually a bad day to launch (FP), the satellites may be destroyed and the cost of damages will be in the billions. This is a case where precision is more important.
- In the case of airport security, where a safety risk is the positive class, we want to make sure that every potential safety risk is investigated. In this case, we will have high recall at the expense of precision (many bags with no safety hazard will be investigated).
- Imagine that we want our website blocker for our child to allow only 'safe' websites to be shown. Here, a 'safe' website is the positive class, and we want the blocker to be absolutely certain that a website is safe, even if some safe websites are predicted to be in the negative (unsafe) class and are consequently blocked. That is, we want high precision at the expense of recall.
If there is no external business concern about low TPR or high FPR, one option is to weight them equally by choosing the threshold:
- which is the median value of the probability distribution,
- which maximizes TPR - FPR,
- which has the optimal F1 score [Cross Validated: How to determine the optimal threshold for a classifier and generate ROC curve?]:

F1 = 2 * precision * recall / (precision + recall)

where precision and recall are as defined above.
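A sketch of threshold selection by sweeping candidate thresholds and maximizing TPR - FPR or F1 (scores and labels are made up):

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   0,   1,   1,   0,   0,   0])

def sweep(scores, labels):
    """Return the thresholds maximizing TPR - FPR and F1, with their values."""
    best_j, best_f1 = (None, -np.inf), (None, -np.inf)
    for t in np.unique(scores):
        pred = scores >= t
        tp = np.sum(pred & (labels == 1)); fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1)); tn = np.sum(~pred & (labels == 0))
        tpr, fpr = tp / (tp + fn), fp / (fp + tn)
        f1 = 2 * tp / (2 * tp + fp + fn)   # equivalent to 2PR/(P+R)
        if tpr - fpr > best_j[1]:
            best_j = (t, tpr - fpr)
        if f1 > best_f1[1]:
            best_f1 = (t, f1)
    return best_j, best_f1

print(sweep(scores, labels))  # both criteria pick t = 0.4 on this toy data
```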
In imbalanced cases, how do fraud rates (the proportion of positive events) influence the metrics? The post [Sin-Yi Chou] has a very good discussion of this, and we can build intuition as follows.
Suppose the number of positive cases and the model performance stay the same; a lower fraud rate then means more negative events. At the same recall, precision may drop (more FPs for the same TPs), while the false positive rate stays roughly unchanged. We can therefore expect the ROC curve to remain similar while the precision-recall curve changes.
[Sin-Yi Chou] shows a comparison of ROC and PR curves at various positive rates (0.5, 0.1, 0.01) below. In (a), the ROC patterns are roughly independent of the positive rate. However, in (b), the PR curves differ significantly: as the positive rate decreases, the PR curves shift downward, and at the same recall, precision drops. This is consistent with our expectation.
On the other hand, the author also shows that PR curves are more useful for comparing model performance in imbalanced cases. At different positive rates, the ROC AUC is 0.8 on both examples A and B (see below), but the PR curves show an obvious difference between the examples. Thus, in the highly imbalanced case, the PR curve is the better indicator.
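This behaviour can be reproduced numerically: keep the class-conditional score distributions fixed and vary only the positive rate. The sketch below (synthetic Gaussian scores) shows ROC AUC staying roughly constant while precision at a fixed threshold collapses:

```python
import numpy as np

rng = np.random.default_rng(0)

def roc_auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC; assumes continuous scores (no ties)."""
    ranks = np.empty(len(scores))
    ranks[scores.argsort()] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

aucs, precisions = [], []
for n_neg in (1000, 9000, 99000):          # positive rates 0.5, 0.1, 0.01
    # Same class-conditional score distributions at every positive rate.
    pos = rng.normal(1.0, 1.0, 1000)
    neg = rng.normal(0.0, 1.0, n_neg)
    scores = np.concatenate([pos, neg])
    labels = np.concatenate([np.ones(1000), np.zeros(n_neg)])
    pred = scores > 1.0                    # a fixed threshold
    aucs.append(roc_auc(scores, labels))
    precisions.append((labels[pred] == 1).mean())

print(aucs)        # roughly constant across positive rates
print(precisions)  # drops sharply as the positive rate decreases
```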
The cross-entropy in its generic form, given a data record, is

H(p, q) = - sum_c p(c) log q(c)

where p is the true label distribution and q is the predicted distribution over the classes c.

The cross-entropy can be used as a loss to optimize with gradient descent in classification.

For binary classification, with true label y in {0, 1} and predicted probability ŷ, it reduces to

L = -[ y log ŷ + (1 - y) log(1 - ŷ) ]
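A minimal numpy implementation of the binary cross-entropy above, with clipping to avoid log(0) (inputs are made up):

```python
import numpy as np

# Binary cross-entropy, averaged over records; eps guards against log(0).
def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.6])
print(binary_cross_entropy(y_true, y_pred))  # -> ~0.2362
```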
See an example below (credit: Cross-entropy for classification).
The hypothesis function of logistic regression is the sigmoid, ŷ = σ(z) = 1/(1 + exp(-z)) with z = w·x + b, which maps the linear score z to a probability in (0, 1).
Note that the loss function of the logistic regression model is convex. The following are some detailed discussions:
- Why is the error function minimized in logistic regression convex?
- Logistic regression - Prove That the Cost Function Is Convex

In short, we can roughly argue that the second derivative (Hessian) of the loss is positive semi-definite, and that a non-negative linear combination of convex functions is again convex.
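A quick numerical sanity check of this convexity: for a single record with y = 1, the loss as a function of the score z is -log σ(z), whose second derivative is σ(z)(1 - σ(z)) >= 0:

```python
import numpy as np

# Per-record logistic loss as a function of the linear score z (for y = 1).
sigmoid = lambda z: 1 / (1 + np.exp(-z))
loss = lambda z: -np.log(sigmoid(z))

# Second derivative by central finite differences; analytically it equals
# sigmoid(z) * (1 - sigmoid(z)) >= 0, hence the loss is convex in z.
z = np.linspace(-5, 5, 101)
h = 1e-4
second = (loss(z + h) - 2 * loss(z) + loss(z - h)) / h**2

print(np.all(second >= 0))                                            # -> True
print(np.allclose(second, sigmoid(z) * (1 - sigmoid(z)), atol=1e-4))  # -> True
```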
For multiclass problems, say with K classes, the predicted distribution is usually given by the softmax

q(c) = exp(z_c) / sum_k exp(z_k)

where z_c is the model's score (logit) for class c [UFLDL Tutorial]. The cross-entropy of a single record with one-hot label y is then

L = - sum_c y_c log q(c) = - log q(c*)

where c* is the true class.
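A minimal softmax cross-entropy for a single record (the logits are made up; the max is subtracted for numerical stability):

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(z, true_class):
    """Cross-entropy of one record: -log q(c*) for the true class c*."""
    return -np.log(softmax(z)[true_class])

logits = np.array([2.0, 1.0, 0.1])   # hypothetical class scores
print(softmax(logits))               # a distribution summing to 1
print(cross_entropy(logits, 0))      # -> ~0.417
```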
When using deep learning models for classification tasks, we usually encounter two problems: overfitting and overconfidence. Overfitting is well studied and can be tackled with early stopping, dropout, weight regularization, etc. Label smoothing is a regularization technique that addresses both problems [Wanshun Wong].
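Label smoothing itself is a one-liner: the one-hot target is mixed with the uniform distribution over the K classes. A sketch with a hypothetical smoothing factor alpha = 0.1:

```python
import numpy as np

# Smooth a one-hot target: y_ls = (1 - alpha) * y_onehot + alpha / K.
def smooth_labels(one_hot, alpha=0.1):
    K = one_hot.shape[-1]
    return (1 - alpha) * one_hot + alpha / K

y = np.array([0.0, 0.0, 1.0, 0.0])   # one-hot target, K = 4 classes
print(smooth_labels(y))              # -> [0.025 0.025 0.925 0.025]
```

The smoothed target is then used in place of the one-hot vector inside the cross-entropy loss, discouraging the model from producing extreme logits.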
- [Sin-Yi Chou] Precision - Recall Curve, a Different View of Imbalanced Classifiers
- [Cross Validated: How to determine the optimal threshold for a classifier and generate ROC curve?] How to determine the optimal threshold for a classifier and generate ROC curve?
- [Data Science: When is precision more important over recall?] When is precision more important over recall?
- [GeeksForGeeks] Advantages and Disadvantages of different Classification Models
- [UFLDL Tutorial] Softmax Regression
- [Wanshun Wong] What is Label Smoothing?