Classification

Table of Contents:

1. Binary Classification Models

A summary of the advantages and disadvantages of different classification models [GeeksForGeeks]:

| Model | Linear? | Bias | Variance | Pros | Cons |
|---|---|---|---|---|---|
| Logistic regression | linear | high | low | probabilistic approach | assumes a linear decision boundary |
| Naive Bayes | nonlinear | high | low | probabilistic approach; efficient; not biased by outliers; also works for nonlinear problems | assumes no interaction between features |
| Decision tree | nonlinear | low | high | interpretable; no need for feature scaling | poor results on very small datasets; overfitting can easily occur |
| SVM | depends on kernel | low | high | not biased by outliers; not sensitive to overfitting | not a good choice for a very large number of features or samples |
| kNN | nonlinear | low | high | simple to understand; fast and efficient | the number of neighbours k must be tuned |
| Ensemble | nonlinear | low | high | powerful and accurate; good performance | not interpretable; overfitting can easily occur; needs hyperparameter tuning |

1.1 Logistic regression vs decision trees

Summarized as follows (from Big Data Zone: Logistic Regression vs. Decision Tree):

| | Logistic Regression | Decision Tree |
|---|---|---|
| Decision boundary | linear (b, d); works well if classes are not well separated | non-linear (a, c) |
| Relatively small data size | yes | no |
| Categorical data | needs enumeration or one-hot encoding | handled natively |
| Skewed data | increase the weight of the minority class or rebalance | grow a full tree |
| Outliers | change the decision boundary | not affected at the initial splits, but potentially at later ones |
| Missing values | impute with mean, mode, or median | handled natively (see "How do decision tree learning algorithms deal with missing values") |
| Online learning | yes, via SGD | no |

The comparison can be visualized below (credit: Logistic Regression versus Decision Trees):

*(figure: LR_vs_DT)*

1.2 Other classification model comparisons

  • For relatively small dataset sizes, compare the performance of a discriminative logistic regression model to a related Naive Bayes classifier (a generative model) or to SVMs, where the latter may be less susceptible to noise and outlier points. Even so, logistic regression is a great, robust model for simple classification tasks (by Sebastian Raschka).
  • Logistic regression can be regarded as a one layer neural network (by Sebastian Raschka).
  • One of the nice properties of logistic regression is that the logistic cost function (or max-entropy) is convex, and thus we are guaranteed to find the global cost minimum (by Sebastian Raschka).

1.3 Comparing decision trees, SVMs, and random forests

Amazon’s Data Scientist Interview Practice Problems

Decision Trees: a tree-like model used to model decisions based on one or more conditions.

  • Pros: easy to implement, intuitive, handles missing values
  • Cons: high variance, inaccurate

Support Vector Machines: a classification technique that finds a hyperplane or a boundary between the two classes of data that maximizes the margin between the two classes. There are many planes that can separate the two classes, but only one plane can maximize the margin or distance between the classes.

  • Pros: accurate in high dimensionality
  • Cons: prone to over-fitting, does not directly provide probability estimates

Random Forests: an ensemble learning technique that builds off of decision trees. Random forests involve creating multiple decision trees using bootstrapped datasets of the original data and randomly selecting a subset of variables at each step of the decision tree. The model then selects the mode of all of the predictions of each decision tree.

  • Pros: can achieve higher accuracy, handle missing values, feature scaling not required, can determine feature importance.
  • Cons: black box, computationally intensive

2. Binary Classification Metric

2.1 Precision and Recall

In most cases there is no perfect classifier. A common question is which metric to use for model selection: precision or recall? Should the classifier have a high True Positive Rate (TPR) or a low False Positive Rate (FPR)? It depends on the domain and our business goal.

Recall the confusion matrix:

| | actual positive | actual negative |
|---|---|---|
| predicted positive | TP | FP |
| predicted negative | FN | TN |

The relevant metrics are

precision = TP/(TP+FP), recall = TP/(TP+FN)

and

TPR = recall, FPR = FP/(FP+TN)

Each probability threshold in a classifier determines a set of the above metrics. The relation between the probability threshold and the metrics is:

  large threshold -> fewer positive predictions -> fewer TP and fewer FP -> lower TPR (recall), lower FPR -> usually higher precision
  small threshold -> more positive predictions -> more TP and more FP -> higher TPR (recall), higher FPR -> usually lower precision

We can also define the review rate:

review rate = N(prob > threshold)/N

where N is the number of data points.
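The counting above can be sketched in a few lines of Python. This is a minimal helper (the function name and example data are illustrative, not from the source):

```python
def threshold_metrics(probs, labels, threshold):
    # Count the confusion-matrix cells at a given probability threshold.
    tp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p > threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 1)
    tn = sum(1 for p, y in zip(probs, labels) if p <= threshold and y == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # = TPR
    fpr = fp / (fp + tn) if fp + tn else 0.0
    review_rate = (tp + fp) / len(probs)          # N(prob > threshold) / N
    return precision, recall, fpr, review_rate
```

For example, with `probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]`, `labels = [1, 1, 0, 1, 0, 0]`, and a threshold of 0.5, we get TP=2, FP=1, FN=1, TN=2, hence precision = recall = 2/3, FPR = 1/3, and a review rate of 0.5.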

2.2 AUC

AUC stands for "Area Under the ROC Curve": it measures the entire two-dimensional area underneath the ROC curve (think integral calculus) from (0,0) to (1,1).

Interpretation: AUC represents the probability that a randomly chosen positive example is ranked higher (given a larger predicted score) than a randomly chosen negative example.

AUC therefore provides an aggregate measure of performance across all possible classification thresholds (from Classification: ROC Curve and AUC).
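The ranking interpretation can be checked directly by counting, over all positive/negative pairs, how often the positive example gets the higher score (a hypothetical helper; ties are counted as 1/2, which matches the usual AUC convention):

```python
def auc_by_ranking(probs, labels):
    # AUC = P(score of a random positive > score of a random negative)
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With `probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]` and `labels = [1, 1, 0, 1, 0, 0]`, the positives win 8 of the 9 pairs, so the AUC is 8/9.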

2.3 Selection of precision or recall

2.3.a Business concern

Below are some examples that ask: when is precision important, and when is recall important? [Data Science: When is precision more important over recall?] [Cross Validated: How to determine the optimal threshold for a classifier and generate ROC curve?]. The answer depends on which we want to minimize: does an FP or an FN cost more? Note, as mentioned in the posts, that you could have 100% recall yet a useless model: if your model always outputs a positive prediction, it has 100% recall but is completely uninformative.

Here I summarize the cases from the above posts:

  1. For rare cancer data modeling, a false negative is usually more disastrous than a false positive for preliminary diagnoses. We want to minimize FN to have higher recall. So Recall is a better measure than precision.

  2. For YouTube recommendations, FN is less of a concern. Precision is better here (if too many FP, users feel annoyed, so minimize FP).

  3. Imagine that a lot of free customers register on our website every day. The call-center team doesn't mind calling someone who is not going to buy (so FP is not costly), but it is very important for us that every likely buyer is included in our selection. That means the model needs a high Recall.

  4. For spam filtering, an FP occurs when the spam filter wrongly classifies a legitimate email as spam and, as a result, interferes with its delivery, so we may lose important messages. Therefore we prefer more FN over many FP, and Precision is more important.

  5. Let us say that a machine learning model is created to predict whether a certain day is a good day to launch satellites, based on the weather. If the model accidentally predicts that a good day to launch is bad (FN), we miss the chance to launch. This is not such a big deal. However, if the model predicts that it is a good day but it is actually a bad day to launch (FP), then the satellites may be destroyed and the cost of the damage will be in the billions. This is a case where Precision is more important.

  6. In the case of airport security, where a safety risk is the positive class, we want to make sure that every potential safety risk is investigated. In this case, we will have high Recall at the expense of precision (a lot of bags where there are no safety hazards will be investigated).

  7. Imagine that we want to make sure that our web site blocker for our child only allows 'safe' websites to be shown. In this case, a 'safe' website is the positive class. Here, we want the blocker to be absolutely certain that the website is safe, even if some safe websites are predicted to be part of the negative or unsafe class and are consequently blocked. That is, we want high precision at the expense of recall.

2.3.b If no business concern

If there is no external business preference for a high TPR or a low FPR, one option is to weight precision and recall equally by choosing the threshold that maximizes the F1 score:

$$F_1 = \frac{2\textrm{P}\textrm{R}}{\textrm{P}+\textrm{R}}$$

where $P$ = Precision and $R$ = Recall.
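A minimal sketch of picking the threshold that maximizes $F_1$ (the function name and example data are made up for illustration):

```python
def best_f1_threshold(probs, labels, thresholds):
    # Sweep candidate thresholds and keep the one with the highest F1.
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        tp = sum(1 for p, y in zip(probs, labels) if p > t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p > t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p <= t and y == 1)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

With `probs = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]`, `labels = [1, 1, 0, 1, 0, 0]`, and candidate thresholds `[0.2, 0.5, 0.75]`, the threshold 0.75 wins with F1 = 0.8 (precision 1.0, recall 2/3).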

2.4 Effect of the fraud rate on precision, recall, and AUC

In imbalanced cases, how does the fraud rate (the proportion of positive events) influence the metrics? The post by [Sin-Yi Chou] has a very good discussion of this, and we can build intuition as follows.

Suppose the number of positive cases and the model performance stay the same; a lower fraud rate then means more negative events. Precision may drop even if recall stays the same, while the false positive rate does not change. We can therefore expect the ROC curve to remain similar but the precision-recall curve to change.

[Sin-Yi Chou] shows a comparison of ROC and PR curves at various positive rates (0.5, 0.1, 0.01) below. In (a) we can see that the ROC patterns are roughly independent of the positive rate.

*(figure: imbalanced_ROC_PR)*

However, in (b), the PR curves show significant differences. As the positive rate decreases, the PR curves shift downward: at the same recall, precision drops. This is consistent with our expectation.

The author also shows that PR curves are more useful for comparing model performance in imbalanced cases. At different positive rates, the ROC AUC is 0.8 for both examples A and B (see below), but the PR curves show an obvious difference between them. Thus, in the highly imbalanced case, the PR curve is a better indicator.

*(figure: ROC_PR_model)*
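The intuition that precision falls with the positive rate can also be checked algebraically: for a fixed operating point, precision $= \pi\,\mathrm{TPR} / \big(\pi\,\mathrm{TPR} + (1-\pi)\,\mathrm{FPR}\big)$, where $\pi$ is the positive rate. A small sketch (the function name and the chosen operating point are illustrative):

```python
def precision_at(tpr, fpr, pos_rate):
    # Precision implied by a fixed operating point (TPR, FPR)
    # at a given positive rate pi: expected TP and FP fractions.
    tp = pos_rate * tpr
    fp = (1 - pos_rate) * fpr
    return tp / (tp + fp)
```

Holding TPR = 0.8 and FPR = 0.2 fixed, precision is 0.8 at a positive rate of 0.5, drops to about 0.31 at 0.1, and to about 0.04 at 0.01, which matches the downward shift of the PR curves.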

3. Loss Function (Cost Function): Cross-Entropy

The generic form of the cross-entropy for a single data record is

$$\textrm{Cross-Entropy} = -\sum_c p_c \log q_c$$

where $c$ denotes class labels, $p_c$ is the probability of the target having class $c$, and $q_c$ is the predicted probability of class $c$. In classification, the cross-entropy is used as the loss to optimize with gradient descent (credit from Cross-entropy for classification).
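As a numeric sanity check, the generic form can be computed directly (a minimal sketch; the names are illustrative):

```python
import math

def cross_entropy(p, q):
    # -sum_c p_c * log(q_c); skip terms where the target probability is 0,
    # since 0 * log(q) contributes nothing.
    return -sum(pc * math.log(qc) for pc, qc in zip(p, q) if pc > 0)
```

For a one-hot target `p = [1, 0, 0]` and prediction `q = [0.7, 0.2, 0.1]`, the loss reduces to `-log(0.7) ≈ 0.357`: only the predicted probability of the true class matters.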

3.1 Binary

For binary classification, $c \in \{0, 1\}$. If we use a one-hot representation for $p$, i.e. $p = [1,0]$ for $y = 1$ and $p = [0,1]$ for $y = 0$, and the prediction $q$ comes from a sigmoid function, we arrive at the commonly seen cross-entropy ($h$ is the hypothesis function):

$$L(\theta, \symbf{x}) = - \big( y \cdot \log(h_{\theta}(\symbf{x})) + (1-y) \cdot \log{(1-h_{\theta}(\symbf{x}))} \big)$$

See an example below (credit from Cross-entropy for classification)

The hypothesis function $h_{\theta}(\symbf{x})$ for binary case is the sigmoid function $g(z)=1/(1+e^{-z})$, thus the loss function is:

$$L(\theta, \symbf{x}) = - \Big[ y \cdot \log \Big(\frac{1}{1+e^{-\theta^T \symbf{x}}} \Big) + (1-y) \cdot \log \Big(1- \frac{1}{1+e^{-\theta^T \symbf{x}}} \Big) \Big]$$
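A minimal sketch of this loss (the function names are illustrative; `z` stands for the logit $\theta^T \symbf{x}$):

```python
import math

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z}), the binary hypothesis function
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y, z):
    # -[ y*log(h) + (1-y)*log(1-h) ] with h = sigmoid(z)
    h = sigmoid(z)
    return -(y * math.log(h) + (1 - y) * math.log(1 - h))
```

At `z = 0` the model predicts 0.5, so the loss is `log(2) ≈ 0.693` regardless of the label, which is the familiar "maximally uncertain" value.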

Note that the loss function of the logistic regression model is convex. In short, we can roughly argue that the Hessian (the matrix of second derivatives) of the loss is positive semi-definite, and that any non-negative linear combination of the convex functions $-\log(h)$ and $-\log(1-h)$ is also convex.

3.2 Multiclass

For multiclass with $K$ classes, $c \in \{1, \dots, K\}$. If our target is a one-hot vector, i.e. $p = [1, 0, \dots, 0]$ for $y = 1$, $p = [0, 1, 0, \dots, 0]$ for $y = 2$, and $p = [0, 0, \dots, 1]$ for $y = K$, we arrive at the multiclass cost function [UFLDL Tutorial]:

$$L(\theta, \symbf{x}) = - \sum^K_{j=1}\symbf{I}(y=j)\log\big( h^{(j)}_{\theta}(\symbf{x}) \big) = - \sum^K_{j=1} \symbf{I}(y=j)\log \Big( \frac{e^{\theta^{(j)T}\symbf{x}}}{ \sum^K_{l=1} e^{\theta^{(l)T}\symbf{x}}} \Big)$$

where $\symbf{I} = 1$ for $y = j$; otherwise $\symbf{I}=0$.
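A minimal sketch of the softmax cross-entropy for a single example (the names are illustrative; subtracting the max logit is the usual numerical-stability trick and does not change the result):

```python
import math

def softmax(logits):
    # Convert logits theta^(j)T x into probabilities that sum to 1.
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def multiclass_cross_entropy(y_index, logits):
    # With a one-hot target, the loss is -log of the softmax
    # probability assigned to the true class.
    return -math.log(softmax(logits)[y_index])
```

For uniform logits `[0, 0, 0]` with $K = 3$, each class gets probability 1/3 and the loss is `log(3) ≈ 1.099`.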

The cross-entropy of a $K = 3$ example is shown below (credit from Cross-entropy for classification)

4. Label Smoothing

When using deep learning models for classification tasks, we usually encounter two problems: overfitting and overconfidence. Overfitting is well studied and can be tackled with early stopping, dropout, weight regularization, etc. Label smoothing is a regularization technique that addresses both problems [Wanshun Wong].
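The standard uniform label-smoothing scheme replaces the hard one-hot target with $(1-\epsilon)\,p + \epsilon/K$, so the true class gets $1 - \epsilon + \epsilon/K$ and every other class gets $\epsilon/K$. A minimal sketch (the function name is illustrative):

```python
def smooth_labels(one_hot, eps):
    # Mix the one-hot target with the uniform distribution over K classes;
    # the result still sums to 1 but has no exact 0 or 1 entries.
    k = len(one_hot)
    return [(1 - eps) * p + eps / k for p in one_hot]
```

For example, `smooth_labels([1, 0, 0], 0.1)` yields roughly `[0.933, 0.033, 0.033]`, which discourages the model from pushing its predicted probabilities to the overconfident extremes of 0 and 1.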

Reference