|
14 | 14 | "\n", |
15 | 15 | "$$y = f([x_1, x_2, x_3, ...])$$\n", |
16 | 16 | "\n", |
17 | | - "The goal of training the model is to find a function that performs some kind of calcuation to the *x* values that produces the result *y*. We do this by applying a machine learning *algorithm* that tries to fit the *x* values to a calculation that produces *y* reasonably accurately for all of the cases in the training dataset.\n", |
| 17 | + "The goal of training the model is to find a function that performs some kind of calculation to the *x* values that produces the result *y*. We do this by applying a machine learning *algorithm* that tries to fit the *x* values to a calculation that produces *y* reasonably accurately for all of the cases in the training dataset.\n", |
18 | 18 | "\n", |
19 | 19 | "There are lots of machine learning algorithms for supervised learning, and they can be broadly divided into two types:\n", |
20 | 20 | "\n", |
|
221 | 221 | "\n", |
222 | 222 | "- **holiday**: There are many fewer days that are holidays than days that aren't.\n", |
223 | 223 | "- **workingday**: There are more working days than non-working days.\n", |
224 | | - "- **weathersit**: Most days are category *1* (clear), with category *2* (mist and cloud) the next most common. There are comparitively few category *3* (light rain or snow) days, and no category *4* (heavy rain, hail, or fog) days at all.\n", |
| 224 | + "- **weathersit**: Most days are category *1* (clear), with category *2* (mist and cloud) the next most common. There are comparatively few category *3* (light rain or snow) days, and no category *4* (heavy rain, hail, or fog) days at all.\n", |
225 | 225 | "\n", |
226 | 226 | "Now that we know something about the distribution of the data in our columns, we can start to look for relationships between the features and the **rentals** label we want to be able to predict.\n", |
227 | 227 | "\n", |
|
278 | 278 | "cell_type": "markdown", |
279 | 279 | "metadata": {}, |
280 | 280 | "source": [ |
281 | | - "The plots show some variance in the relationship between some category values and rentals. For example, there's a clear difference in the distribution of rentals on weekends (**weekday** 0 or 6) and those during the working week (**weekday** 1 to 5). Similarly, there are notable differences for **holiday** and **workingday** categories. There's a noticable trend that shows different rental distributions in summer and fall months compared to spring and winter months. The **weathersit** category also seems to make a difference in rental distribution. The **day** feature we created for the day of the month shows little variation, indicating that it's probably not predictive of the number of rentals." |
| 281 | + "The plots show some variance in the relationship between some category values and rentals. For example, there's a clear difference in the distribution of rentals on weekends (**weekday** 0 or 6) and those during the working week (**weekday** 1 to 5). Similarly, there are notable differences for **holiday** and **workingday** categories. There's a noticeable trend that shows different rental distributions in summer and fall months compared to spring and winter months. The **weathersit** category also seems to make a difference in rental distribution. The **day** feature we created for the day of the month shows little variation, indicating that it's probably not predictive of the number of rentals." |
282 | 282 | ] |
283 | 283 | }, |
284 | 284 | { |
|
307 | 307 | "source": [ |
308 | 308 | "After separating the dataset, we now have numpy arrays named **X** containing the features, and **y** containing the labels.\n", |
309 | 309 | "\n", |
310 | | - "We *could* train a model using all of the data; but it's common practice in supervised learning to split the data into two subsets; a (typically larger) set with which to train the model, and a smaller \"hold-back\" set with which to validate the trained model. This enables us to evaluate how well the model performs when used with the validation dataset by comparing the predicted labels to the known labels. It's important to split the data *randomly* (rather than say, taking the first 70% of the data for training and keeping the rest for validation). This helps ensure that the two subsets of data are statistically comparable (so we validate the model with data that has a similar statistical distibution to the data on which it was trained).\n", |
| 310 | + "We *could* train a model using all of the data; but it's common practice in supervised learning to split the data into two subsets; a (typically larger) set with which to train the model, and a smaller \"hold-back\" set with which to validate the trained model. This enables us to evaluate how well the model performs when used with the validation dataset by comparing the predicted labels to the known labels. It's important to split the data *randomly* (rather than say, taking the first 70% of the data for training and keeping the rest for validation). This helps ensure that the two subsets of data are statistically comparable (so we validate the model with data that has a similar statistical distribution to the data on which it was trained).\n", |
(The corrected line above still joins "two subsets" to the list that follows with a semicolon; a colon reads more naturally: "…split the data into two subsets: a (typically larger) set with which to train the model…".)
311 | 311 | "\n", |
312 | 312 | "To randomly split the data, we'll use the **train_test_split** function in the **scikit-learn** library. This library is one of the most widely used machine learning packages for Python." |
313 | 313 | ] |
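The random split described above can be sketched with **train_test_split**. The arrays here are placeholders (the notebook builds **X** and **y** from the rentals data); the 70/30 proportion matches the common convention the text mentions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder feature matrix X and label vector y standing in for the rentals data
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = rng.rand(100)

# Hold back 30% of the rows, chosen at random, for validation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# X_train has 70 rows for training; X_test has 30 rows for validation
```

Passing `random_state` makes the random split reproducible, which is useful when comparing models trained on the same partition.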
|
567 | 567 | "\n", |
568 | 568 | "### Try an Ensemble Algorithm\n", |
569 | 569 | "\n", |
570 | | - "Ensemble algorithms work by combining multiple base estimators to produce an optimal model, either by appying an aggregate function to a collection of base models (sometimes referred to a *bagging*) or by building a sequence of models that build on one another to improve predictive performance (referred to as *boosting*).\n", |
| 570 | + "Ensemble algorithms work by combining multiple base estimators to produce an optimal model, either by applying an aggregate function to a collection of base models (sometimes referred to as *bagging*) or by building a sequence of models that build on one another to improve predictive performance (referred to as *boosting*).\n", |
571 | 571 | "\n", |
572 | 572 | "For example, let's try a Random Forest model, which applies an averaging function to multiple Decision Tree models for a better overall model." |
573 | 573 | ] |
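The Random Forest approach mentioned above can be sketched as follows. The data is synthetic, standing in for the bike-rental features and labels:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data standing in for the bike-rental features and labels
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X[:, 0] * 10 + rng.rand(200)

# A Random Forest averages the predictions of many decision trees,
# each fitted to a bootstrap sample of the training data (bagging)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

predictions = model.predict(X[:3])
```

Increasing `n_estimators` generally improves the averaged model at the cost of training time; each tree alone would overfit, but the aggregate is more stable.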
|
722 | 722 | "\n", |
723 | 723 | "We trained a model with data that was loaded straight from a source file, with only moderately successful results.\n", |
724 | 724 | "\n", |
725 | | - "In practice, it's common to perform some preprocessing of the data to make it easier for the algorithm to fit a model to it. There's a huge range of preprocessing trasformations you can perform to get your data ready for modeling, but we'll limit ourselves to a few common techniques:\n", |
| 725 | + "In practice, it's common to perform some preprocessing of the data to make it easier for the algorithm to fit a model to it. There's a huge range of preprocessing transformations you can perform to get your data ready for modeling, but we'll limit ourselves to a few common techniques:\n", |
726 | 726 | "\n", |
727 | 727 | "### Scaling numeric features\n", |
728 | 728 | "\n", |
|
738 | 738 | "| -- | --- | --- |\n", |
739 | 739 | "| 0.3 | 0.48| 0.65|\n", |
740 | 740 | "\n", |
741 | | - "There are multiple ways you can scale numeric data, such as calculating the minimum and maximum values for each column and assigning a proportional value between 0 and 1, or by using the mean and standard deviation of a normally distributed variable to mainatain the same *spread* of values on a different scale.\n", |
| 741 | + "There are multiple ways you can scale numeric data, such as calculating the minimum and maximum values for each column and assigning a proportional value between 0 and 1, or by using the mean and standard deviation of a normally distributed variable to maintain the same *spread* of values on a different scale.\n", |
742 | 742 | "\n", |
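Both scaling approaches described above have ready-made scikit-learn transformers. This sketch uses a single hypothetical numeric column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A hypothetical numeric column with values on an arbitrary scale
values = np.array([[1.0], [5.0], [10.0], [20.0]])

# Min-max scaling: assigns each value a proportional position between 0 and 1
minmax = MinMaxScaler().fit_transform(values)

# Standardization: uses the mean and standard deviation to keep the same
# spread of values on a different scale (mean 0, unit variance)
standard = StandardScaler().fit_transform(values)
```

In practice the scaler is fitted on the training data only, then applied to the validation data, so no information from the hold-back set leaks into training.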
743 | 743 | "### Encoding categorical variables\n", |
744 | 744 | "\n", |
|