You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Copy file name to clipboardExpand all lines: 03 - Classification.ipynb
+9-9Lines changed: 9 additions & 9 deletions
Original file line number
Diff line number
Diff line change
@@ -20,7 +20,7 @@
20
20
"source": [
21
21
"## Binary Classification\n",
22
22
"\n",
23
-
"Let's start by looking at an example of *binary classification*, where the model must predict a label that belongs to one of two classes. In this exercsie, we'll train a binary classifier to predict whether or not a patient should be tested for diabetes based on some medical data.\n",
23
+
"Let's start by looking at an example of *binary classification*, where the model must predict a label that belongs to one of two classes. In this exercise, we'll train a binary classifier to predict whether or not a patient should be tested for diabetes based on some medical data.\n",
24
24
"\n",
25
25
"### Explore the data\n",
26
26
"\n",
@@ -48,7 +48,7 @@
48
48
"cell_type": "markdown",
49
49
"metadata": {},
50
50
"source": [
51
-
"This data consists of diagnostic information about some patients who have been tested for diabetes. Scroll to the right if necessary, and note that the final column in the dataset (**Diabetic**) contains the value ***0*** for patients who tested negative for diabetes, and ***1*** for patients who tested positive. This is the label that we will train our mode to predict; most of the other columns (**Pregnancies**,**PlasmaGlucose**,**DiastolicBloodPressure**, and so on) are the features we will use to predict the **Diabetic** label.\n",
51
+
"This data consists of diagnostic information about some patients who have been tested for diabetes. Scroll to the right if necessary, and note that the final column in the dataset (**Diabetic**) contains the value ***0*** for patients who tested negative for diabetes, and ***1*** for patients who tested positive. This is the label that we will train our model to predict; most of the other columns (**Pregnancies**,**PlasmaGlucose**,**DiastolicBloodPressure**, and so on) are the features we will use to predict the **Diabetic** label.\n",
52
52
"\n",
53
53
"Let's separate the features from the labels - we'll call the features ***X*** and the label ***y***:"
54
54
]
@@ -99,7 +99,7 @@
99
99
"cell_type": "markdown",
100
100
"metadata": {},
101
101
"source": [
102
-
"For some of the features, there's a noticable difference in the distribution for each label value. In particular, **Pregnancies** and **Age** show markedly different distributions for diabetic patients than for non-diabetic patients. These features may help predict whether or not a patient is diabetic.\n",
102
+
"For some of the features, there's a noticeable difference in the distribution for each label value. In particular, **Pregnancies** and **Age** show markedly different distributions for diabetic patients than for non-diabetic patients. These features may help predict whether or not a patient is diabetic.\n",
103
103
"\n",
104
104
"### Split the data\n",
105
105
"\n",
@@ -227,7 +227,7 @@
227
227
"\n",
228
228
"> note that the header row may not line up with the values!\n",
229
229
"\n",
230
-
"* *Precision*: Of the predictons the model made for this class, what proportion were correct?\n",
230
+
"* *Precision*: Of the predictions the model made for this class, what proportion were correct?\n",
231
231
"* *Recall*: Out of all of the instances of this class in the test dataset, how many did the model identify?\n",
232
232
"* *F1-Score*: An average metric that takes both precision and recall into account.\n",
233
233
"* *Support*: How many instances of this class are there in the test dataset?\n",
@@ -320,7 +320,7 @@
320
320
"cell_type": "markdown",
321
321
"metadata": {},
322
322
"source": [
323
-
"The decision to score a prediction as a 1 or a 0 depends on the threshold to which the predicted probabilties are compared. If we were to change the threshold, it would affect the predictions; and therefore change the metrics in the confusion matrix. A common way to evaluate a classifier is to examine the *true positive rate* (which is another name for recall) and the *false positive rate* for a range of possible thresholds. These rates are then plotted against all possible thresholds to form a chart known as a *received operator characteristic (ROC) chart*, like this:"
323
+
"The decision to score a prediction as a 1 or a 0 depends on the threshold to which the predicted probabilities are compared. If we were to change the threshold, it would affect the predictions; and therefore change the metrics in the confusion matrix. A common way to evaluate a classifier is to examine the *true positive rate* (which is another name for recall) and the *false positive rate* for a range of possible thresholds. These rates are then plotted against all possible thresholds to form a chart known as a *received operator characteristic (ROC) chart*, like this:"
324
324
]
325
325
},
326
326
{
@@ -383,10 +383,10 @@
383
383
"\n",
384
384
"In practice, it's common to perform some preprocessing of the data to make it easier for the algorithm to fit a model to it. There's a huge range of preprocessing transformations you can perform to get your data ready for modeling, but we'll limit ourselves to a few common techniques:\n",
385
385
"\n",
386
-
"- Scaling numeric features so they're on the same scale. This prevents feaures with large values from producing coefficients that disproportionately affect the predictions.\n",
386
+
"- Scaling numeric features so they're on the same scale. This prevents features with large values from producing coefficients that disproportionately affect the predictions.\n",
387
387
"- Encoding categorical variables. For example, by using a *one hot encoding* technique you can create individual binary (true/false) features for each possible category value.\n",
388
388
"\n",
389
-
"To apply these preprocessing transformations, we'll make use of a Scikit-Learn feature named *pipelines*. These enable us to define a set of preprocessing steps that end with an algorithm. You can then fit the entire pipeline to the data, so that the model encapsulates all of the preprocessing steps as well as the regression algorithm. This is useful, because when we want to use the model to predict values from new data, we need to apply the same transformations (based on the same statistical distributions and catagory encodings used with the training data).\n",
389
+
"To apply these preprocessing transformations, we'll make use of a Scikit-Learn feature named *pipelines*. These enable us to define a set of preprocessing steps that end with an algorithm. You can then fit the entire pipeline to the data, so that the model encapsulates all of the preprocessing steps as well as the regression algorithm. This is useful, because when we want to use the model to predict values from new data, we need to apply the same transformations (based on the same statistical distributions and category encodings used with the training data).\n",
390
390
"\n",
391
391
">**Note**: The term *pipeline* is used extensively in machine learning, often to mean very different things! In this context, we're using it to refer to pipeline objects in Scikit-Learn, but you may see it used elsewhere to mean something else.\n"
392
392
]
@@ -698,7 +698,7 @@
698
698
"cell_type": "markdown",
699
699
"metadata": {},
700
700
"source": [
701
-
"Now that we know what the feaures and labels in the data represent, let's explore the dataset. First, let's see if there are any missing (*null*) values."
701
+
"Now that we know what the features and labels in the data represent, let's explore the dataset. First, let's see if there are any missing (*null*) values."
0 commit comments