Commit 89e9af4

Added challenges
1 parent b6175e2 commit 89e9af4

421 files changed

+279011
-21
lines changed



01 - Data Exploration.ipynb

Lines changed: 65 additions & 3 deletions
Original file line number | Diff line number | Diff line change
@@ -27,7 +27,7 @@
2727
"\n",
2828
"Suppose a college takes a sample of student grades for a data science class.\n",
2929
"\n",
30-
"Run the code in the cell below to see the data."
30+
"Run the code in the cell below by clicking the **► Run** button to see the data."
3131
]
3232
},
3333
{
@@ -612,7 +612,41 @@
612612
"cell_type": "markdown",
613613
"metadata": {},
614614
"source": [
615-
"DataFrames are amazingly versatile, and make it easy to manipulate data. Most DataFrame operations return a new copy of the DataFrame; so if you want to modify a DataFrame but keep the existing variable, you need to assign the result of the operation to the existing variable. For example, the following code sorts the student data into descending order of Grade, and assigns the resulting sorted DataFrame to the original **df_students** variable."
615+
"DataFrames are designed for tabular data, and you can use them to perform many of the kinds of data analytics operation you can do in a relational database; such as grouping and aggregating tables of data.\n",
616+
"\n",
617+
"For example, you can use the **groupby** method to group the student data into groups based on the **Pass** column you added previously, and count the number of names in each group - in other words, you can determine how many students passed and failed."
618+
]
619+
},
620+
{
621+
"cell_type": "code",
622+
"execution_count": null,
623+
"metadata": {},
624+
"outputs": [],
625+
"source": [
626+
"print(df_students.groupby(df_students.Pass).Name.count())"
627+
]
628+
},
629+
{
630+
"cell_type": "markdown",
631+
"metadata": {},
632+
"source": [
633+
"You can aggregate multiple fields in a group using any available aggregation function. For example, you can find the mean study time and grade for the groups of students who passed and failed the course."
634+
]
635+
},
636+
{
637+
"cell_type": "code",
638+
"execution_count": null,
639+
"metadata": {},
640+
"outputs": [],
641+
"source": [
642+
"print(df_students.groupby(df_students.Pass)['StudyHours', 'Grade'].mean())"
643+
]
644+
},
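The two grouping operations above can be sketched end-to-end on a small made-up DataFrame (the names and values here are hypothetical, not the notebook's actual student data):

```python
import pandas as pd

# Hypothetical student data for illustration only
df = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cal', 'Dee'],
    'StudyHours': [12.0, 8.0, 10.0, 6.0],
    'Grade': [64, 40, 57, 35],
    'Pass': [True, False, True, False]
})

# Count the names in each Pass group
counts = df.groupby(df.Pass).Name.count()
print(counts)

# Mean of several columns per group (note the double brackets:
# a list of column names selects multiple columns)
means = df.groupby(df.Pass)[['StudyHours', 'Grade']].mean()
print(means)
```

With this toy data, the count is 2 passes and 2 fails, and the per-group means make the pass/fail contrast in study time and grade easy to see.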
645+
{
646+
"cell_type": "markdown",
647+
"metadata": {},
648+
"source": [
649+
"DataFrames are amazingly versatile, and make it easy to manipulate data. Many DataFrame operations return a new copy of the DataFrame; so if you want to modify a DataFrame but keep the existing variable, you need to assign the result of the operation to the existing variable. For example, the following code sorts the student data into descending order of Grade, and assigns the resulting sorted DataFrame to the original **df_students** variable."
616650
]
617651
},
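The sort-and-reassign pattern described above can be sketched like this (hypothetical data; `sort_values` returns a new DataFrame rather than sorting in place):

```python
import pandas as pd

# Hypothetical student data for illustration only
df_students = pd.DataFrame({'Name': ['Ana', 'Ben', 'Cal'],
                            'Grade': [50.0, 97.0, 49.67]})

# sort_values returns a new, sorted copy - assign it back to the
# original variable to keep working with the sorted data
df_students = df_students.sort_values('Grade', ascending=False)
print(df_students)
```

Without the reassignment, `df_students` would still refer to the original, unsorted DataFrame.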
618652
{
@@ -1072,6 +1106,28 @@
10721106
"cell_type": "markdown",
10731107
"metadata": {},
10741108
"source": [
1109+
"In thie example, the datadt is small enough to clearly see that the value **1** is an outlier for the **StudyHours** column, so you can exclude it explicitly. In most real-world cases, it's easier to consider outliers as being values that fall below or above percentiles within which most of the data lie. For example, the following code uses the Pandas **quantile** function to exclude observations below the 0.01th percentile (the value above which 99% of the data reside)."
1110+
]
1111+
},
1112+
{
1113+
"cell_type": "code",
1114+
"execution_count": null,
1115+
"metadata": {},
1116+
"outputs": [],
1117+
"source": [
1118+
"q01 = df_students.StudyHours.quantile(0.01)\n",
1119+
"# Get the variable to examine\n",
1120+
"col = df_students[df_students.StudyHours>q01]['StudyHours']\n",
1121+
"# Call the function\n",
1122+
"show_distribution(col)"
1123+
]
1124+
},
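As a rough sketch with made-up values, the same quantile-based trimming can be applied at both ends of a distribution:

```python
import pandas as pd

# Made-up study-hours values with one extreme low outlier (1.0)
study_hours = pd.Series([1.0, 8.0, 9.0, 10.0, 10.5,
                         11.0, 12.0, 12.5, 13.0, 16.0])

q01 = study_hours.quantile(0.01)   # low threshold (1st percentile)
q99 = study_hours.quantile(0.99)   # high threshold (99th percentile)

# Keep only values strictly inside the two thresholds
trimmed = study_hours[(study_hours > q01) & (study_hours < q99)]
print(trimmed.min(), trimmed.max())
```

Here the extreme low value (1.0) and the extreme high value (16.0) both fall outside the thresholds and are excluded.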
1125+
{
1126+
"cell_type": "markdown",
1127+
"metadata": {},
1128+
"source": [
1129+
"> **Tip**: You can also eliminate outliers at the upper end of the distribution by defining a threshold at a high percentile value - for example, you could use the **quantile** function to find the 0.99 percentile below which 99% of the data reside.\n",
1130+
"\n",
10751131
"With the outliers removed, the box plot shows all data within the four quartiles. Note that the distribution is not symmetric like it is for the grade data though - there are some students with very high study times of around 16 hours, but the bulk of the data is between 7 and 13 hours; The few extremely high values pull the mean towards the higher end of the scale.\n",
10761132
"\n",
10771133
"Let's look at the density for this distribution."
@@ -1471,7 +1527,13 @@
14711527
"\n",
14721528
"- [NumPy](https://numpy.org/doc/stable/)\n",
14731529
"- [Pandas](https://pandas.pydata.org/pandas-docs/stable/)\n",
1474-
"- [Matplotlib](https://matplotlib.org/contents.html)"
1530+
"- [Matplotlib](https://matplotlib.org/contents.html)\n",
1531+
"\n",
1532+
"## Challenge: Analyze Flight Data\n",
1533+
"\n",
1534+
"If this notebook has inspired you to try exploring data for yourself, why not take on the challenge of a real-world dataset containing flight records from the US Department of Transportation? You'll find the challenge in the [/challenges/01 - Flights Challenge.ipynb](./challenges/01%20-%20Flights%20Challenge.ipynb) notebook!\n",
1535+
"\n",
1536+
"> **Note**: The time to complete this optional challenge is not included in the estimated time for this exercise - you can spend as little or as much time on it as you like!"
14751537
]
14761538
}
14771539
],

02 - Regression.ipynb

Lines changed: 19 additions & 12 deletions
@@ -23,6 +23,8 @@
2323
"\n",
2424
"In this notebook, we'll focus on *regression*, using an example based on a real study in which data for a bicycle sharing scheme was collected and used to predict the number of rentals based on seasonality and weather conditions. We'll use a simplified version of the dataset from that study.\n",
2525
"\n",
26+
"> **Citation**: The data used in this exercise is derived from [Capital Bikeshare](https://www.capitalbikeshare.com/system-data) and is used in accordance with the published [license agreement](https://www.capitalbikeshare.com/data-license-agreement).\n",
27+
"\n",
2628
"## Explore the Data\n",
2729
"\n",
2830
"The first step in any machine learning project is to explore the data that you will use to train a model. The goal of this exploration is to try to understand the relationships between its attributes; in particular, any apparent correlation between the *features* and the *label* your model will try to predict. This may require some work to detect and fix issues in the data (such as dealing with missing values, errors, or outlier values), deriving new feature columns by transforming or combining existing features (a process known as *feature engineering*), *normalizing* numeric features (values you can measure or count) so they're on a similar scale, and *encoding* categorical features (values that represent discrete categories) as numeric indicators.\n",
@@ -740,15 +742,23 @@
740742
"\n",
741743
"### Encoding categorical variables\n",
742744
"\n",
743-
"achine learning models work best with numeric features rather than text values, so you generally need to convert categorical features into numeric representations. A common technique is to use *one hot encoding* to create individual binary (0 or 1) features for each possible category value. For example, suppose your data includes the following categorical feature.\n",
745+
"achine learning models work best with numeric features rather than text values, so you generally need to convert categorical features into numeric representations. For example, suppose your data includes the following categorical feature. \n",
744746
"\n",
745747
"| Size |\n",
746748
"| ---- |\n",
747749
"| S |\n",
748750
"| M |\n",
749751
"| L |\n",
750752
"\n",
751-
"You could use one-hot encoding to translate the possible categories into binary columns like this:\n",
753+
"You can apply *ordinal encoding* to substitute a unique integer value for each category, like this:\n",
754+
"\n",
755+
"| Size |\n",
756+
"| ---- |\n",
757+
"| 0 |\n",
758+
"| 1 |\n",
759+
"| 2 |\n",
760+
"\n",
761+
"Another common technique is to use *one hot encoding* to create individual binary (0 or 1) features for each possible category value. For example, you could use one-hot encoding to translate the possible categories into binary columns like this:\n",
752762
"\n",
753763
"| Size_S | Size_M | Size_L |\n",
754764
"| ------- | -------- | -------- |\n",
@@ -960,18 +970,15 @@
960970
"cell_type": "markdown",
961971
"metadata": {},
962972
"source": [
963-
"## Learn More\n",
973+
"## Further Reading\n",
964974
"\n",
965-
"To learn more about Scikit-Learn, see the [Scikit-Learn documentation](https://scikit-learn.org/stable/user_guide.html)."
966-
]
967-
},
968-
{
969-
"cell_type": "markdown",
970-
"metadata": {},
971-
"source": [
972-
"## Citation\n",
975+
"To learn more about Scikit-Learn, see the [Scikit-Learn documentation](https://scikit-learn.org/stable/user_guide.html).\n",
976+
"\n",
977+
"## Challenge: Predict Real Estate Prices\n",
978+
"\n",
979+
"Think you're ready to create your own regression model? Try the challenge of predicting real estate property prices in the [/challenges/02 - Real Estate Regression Challenge.ipynb](./challenges/02%20-%20Real%20Estate%20Regression%20Challenge.ipynb) notebook!\n",
973980
"\n",
974-
"The data used in this exercise is derived from [Capital Bikeshare](https://www.capitalbikeshare.com/system-data) and is used in accordance with the published [license agreement](https://www.capitalbikeshare.com/data-license-agreement)."
981+
"> **Note**: The time to complete this optional challenge is not included in the estimated time for this exercise - you can spend as little or as much time on it as you like!"
975982
]
976983
}
977984
],

03 - Classification.ipynb

Lines changed: 8 additions & 2 deletions
@@ -1171,9 +1171,15 @@
11711171
"cell_type": "markdown",
11721172
"metadata": {},
11731173
"source": [
1174-
"## Learn More\n",
1174+
"## Further Reading\n",
11751175
"\n",
1176-
"Classification is one of the most common forms of machine learning, and by following the basic principles we've discussed in this notebook you should be able to train and evaluate classification models with scikit-learn. It's worth spending some time investigating classification algorithms in more depth, and a good starting point is the [Scikit-Learn documentation](https://scikit-learn.org/stable/user_guide.html)."
1176+
"Classification is one of the most common forms of machine learning, and by following the basic principles we've discussed in this notebook you should be able to train and evaluate classification models with scikit-learn. It's worth spending some time investigating classification algorithms in more depth, and a good starting point is the [Scikit-Learn documentation](https://scikit-learn.org/stable/user_guide.html).\n",
1177+
"\n",
1178+
"## Challenge: Classify Wines\n",
1179+
"\n",
1180+
"Feel like challenging yourself to train a classification model? Try the challenge in the [/challenges/03 - Wine Classification Challenge.ipynb](./challenges/03%20-%20Wine%20Classification%20Challenge.ipynb) notebook to see if you can classify wines into their grape varietals!\n",
1181+
"\n",
1182+
"> **Note**: The time to complete this optional challenge is not included in the estimated time for this exercise - you can spend as little or as much time on it as you like!"
11771183
]
11781184
}
11791185
],

04 - Clustering.ipynb

Lines changed: 16 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -291,9 +291,23 @@
291291
"cell_type": "markdown",
292292
"metadata": {},
293293
"source": [
294-
"In this notebook, you've explored clustering; an unsupervised form of machine learning.\n",
294+
"## Further Reading\n",
295295
"\n",
296-
"To learn more about clustering with scikit-learn, see the [scikit-learn documentation](https://scikit-learn.org/stable/modules/clustering.html)."
296+
"To learn more about clustering with scikit-learn, see the [scikit-learn documentation](https://scikit-learn.org/stable/modules/clustering.html).\n",
297+
"\n",
300+
"To learn more about the Python packages you explored in this notebook, see the following documentation:\n",
301+
"\n",
302+
"- [NumPy](https://numpy.org/doc/stable/)\n",
303+
"- [Pandas](https://pandas.pydata.org/pandas-docs/stable/)\n",
304+
"- [Matplotlib](https://matplotlib.org/contents.html)\n",
305+
"\n",
306+
"## Challenge: Cluster Unlabelled Data\n",
307+
"\n",
308+
"Now that you've seen how to create a clustering model, why not try for yourself? You'll find a clustering challenge in the [/challenges/04 - Clustering Challenge.ipynb](./challenges/04%20-%20Clustering%20Challenge.ipynb) notebook!\n",
309+
"\n",
310+
"> **Note**: The time to complete this optional challenge is not included in the estimated time for this exercise - you can spend as little or as much time on it as you like!"
297311
]
298312
}
299313
],

05b - Convolutional Neural Networks (PyTorch).ipynb

Lines changed: 9 additions & 1 deletion
@@ -539,7 +539,15 @@
539539
"cell_type": "markdown",
540540
"metadata": {},
541541
"source": [
542-
"In this notebook, you used PyTorch to train an image classification model based on a convolutional neural network."
542+
"## Further Reading\n",
543+
"\n",
544+
"To learn more about training convolutional neural networks with PyTorch, see the [PyTorch documentation](https://pytorch.org/).\n",
545+
"\n",
546+
"## Challenge: Safari Image Classification\n",
547+
"\n",
548+
"Hopefully this notebook has shown you the main steps in training and evaluating a CNN. Why not put what you've learned into practice with our Safari image classification challenge in the [/challenges/05 - Safari CNN Challenge.ipynb](./challenges/05%20-%20Safari%20CNN%20Challenge.ipynb) notebook?\n",
549+
"\n",
550+
"> **Note**: The time to complete this optional challenge is not included in the estimated time for this exercise - you can spend as little or as much time on it as you like!"
543551
]
544552
}
545553
],

05b - Convolutional Neural Networks (Tensorflow).ipynb

Lines changed: 9 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -396,7 +396,15 @@
396396
"cell_type": "markdown",
397397
"metadata": {},
398398
"source": [
399-
"In this notebook, you used Tensorflow to train an image classification model based on a convolutional neural network."
399+
"## Further Reading\n",
400+
"\n",
401+
"To learn more about training convolutional neural networks with TensorFlow, see the [TensorFlow documentation](https://www.tensorflow.org/overview).\n",
402+
"\n",
403+
"## Challenge: Safari Image Classification\n",
404+
"\n",
405+
"Hopefully this notebook has shown you the main steps in training and evaluating a CNN. Why not put what you've learned into practice with our Safari image classification challenge in the [/challenges/05 - Safari CNN Challenge.ipynb](./challenges/05%20-%20Safari%20CNN%20Challenge.ipynb) notebook?\n",
406+
"\n",
407+
"> **Note**: The time to complete this optional challenge is not included in the estimated time for this exercise - you can spend as little or as much time on it as you like!"
400408
]
401409
}
402410
],

challenges/01 - Flights Challenge.ipynb
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"# Flights Data Exploration Challenge\n",
8+
"\n",
9+
"In this challge, you'll explore a real-world dataset containing flights data from the US Department of Transportation.\n",
10+
"\n",
11+
"Let's start by loading and viewing the data."
12+
]
13+
},
14+
{
15+
"cell_type": "code",
16+
"execution_count": null,
17+
"metadata": {},
18+
"outputs": [],
19+
"source": [
20+
"import pandas as pd\n",
21+
"\n",
22+
"df_flights = pd.read_csv('data/flights.csv')\n",
23+
"df_flights.head()"
24+
]
25+
},
26+
{
27+
"cell_type": "markdown",
28+
"metadata": {},
29+
"source": [
30+
"The dataset contains observations of US domestic flights in 2013, and consists of the following fields:\n",
31+
"\n",
32+
"- **Year**: The year of the flight (all records are from 2013)\n",
33+
"- **Month**: The month of the flight\n",
34+
"- **DayofMonth**: The day of the month on which the flight departed\n",
35+
"- **DayOfWeek**: The day of the week on which the flight departed - from 1 (Monday) to 7 (Sunday)\n",
36+
"- **Carrier**: The two-letter abbreviation for the airline.\n",
37+
"- **OriginAirportID**: A unique numeric identifier for the departure aiport\n",
38+
"- **OriginAirportName**: The full name of the departure airport\n",
39+
"- **OriginCity**: The departure airport city\n",
40+
"- **OriginState**: The departure airport state\n",
41+
"- **DestAirportID**: A unique numeric identifier for the destination aiport\n",
42+
"- **DestAirportName**: The full name of the destination airport\n",
43+
"- **DestCity**: The destination airport city\n",
44+
"- **DestState**: The destination airport state\n",
45+
"- **CRSDepTime**: The scheduled departure time\n",
46+
"- **DepDelay**: The number of minutes departure was delayed (flight that left ahead of schedule have a negative value)\n",
47+
"- **DelDelay15**: A binary indicator that departure was delayed by more than 15 minutes (and therefore considered \"late\")\n",
48+
"- **CRSArrTime**: The scheduled arrival time\n",
49+
"- **ArrDelay**: The number of minutes arrival was delayed (flight that arrived ahead of schedule have a negative value)\n",
50+
"- **ArrDelay15**: A binary indicator that arrival was delayed by more than 15 minutes (and therefore considered \"late\")\n",
51+
"- **Cancelled**: A binary indicator that the flight was cancelled\n",
52+
"\n",
53+
"Your challenge is to explore the flight data to analyze possible factors that affect delays in departure or arrival of a flight.\n",
54+
"\n",
55+
"1. Start by cleaning the data.\n",
56+
" - Identify any null or missing data, and impute appropriate replacement values.\n",
57+
" - Identify and eliminate any outliers in the **DepDelay** and **ArrDelay** columns.\n",
58+
"2. Explore the cleaned data.\n",
59+
" - View summary statistics for the numeric fields in the dataset.\n",
60+
" - Determine the distribution of the **DepDelay** and **ArrDelay** columns.\n",
61+
" - Use statistics, aggregate functions, and visualizations to answer the following questions:\n",
62+
" - *What are the average (mean) departure and arrival delays?*\n",
63+
" - *How do the carriers compare in terms of arrival delay performance?*\n",
64+
" - *Is there a noticable difference in arrival delays for different days of the week?*\n",
65+
" - *Which departure airport has the highest average departure delay?*\n",
66+
" - *Do **late** departures tend to result in longer arrival delays than on-time departures?*\n",
67+
" - *Which route (from origin airport to destination airport) has the most **late** arrivals?*\n",
68+
" - *Which route has the highest average arrival delay?*\n",
69+
" \n",
70+
"Add markdown and code cells as required to create your solution.\n",
71+
"\n",
72+
"> **Note**: There is no single \"correct\" solution. A sample solution is provided in [01 - Flights Challenge.ipynb](01%20-%20Flights%20Solution.ipynb)."
73+
]
74+
},
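As a hedged starting point only - there is no single correct solution, and the values below are made-up stand-ins based on the field descriptions above - the cleaning steps might begin like this:

```python
import pandas as pd

# Hypothetical stand-in for data/flights.csv, using columns
# named in the dataset description
df_flights = pd.DataFrame({
    'DepDelay': [5, -2, None, 300, 10],
    'ArrDelay': [8, -5, 12, 290, None],
})

# Impute missing delay values (one choice: assume no delay)
df_flights = df_flights.fillna({'DepDelay': 0, 'ArrDelay': 0})

# Trim outliers outside an illustrative percentile range of DepDelay
lo, hi = df_flights.DepDelay.quantile([0.01, 0.90])
df_flights = df_flights[df_flights.DepDelay.between(lo, hi)]
print(df_flights.describe())
```

The imputation value and percentile thresholds are choices you should justify from your own exploration of the real data.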
75+
{
76+
"cell_type": "code",
77+
"execution_count": null,
78+
"metadata": {},
79+
"outputs": [],
80+
"source": [
81+
"# Your code to explore the data"
82+
]
83+
}
84+
],
85+
"metadata": {
86+
"kernelspec": {
87+
"display_name": "Python 3.6 - AzureML",
88+
"language": "python",
89+
"name": "python3-azureml"
90+
},
91+
"language_info": {
92+
"codemirror_mode": {
93+
"name": "ipython",
94+
"version": 3
95+
},
96+
"file_extension": ".py",
97+
"mimetype": "text/x-python",
98+
"name": "python",
99+
"nbconvert_exporter": "python",
100+
"pygments_lexer": "ipython3",
101+
"version": "3.6.9"
102+
}
103+
},
104+
"nbformat": 4,
105+
"nbformat_minor": 4
106+
}
