- Statistical Rigor
- Significance test
- How confident are we that a sample of data can prove or disprove an assumption?
- "Formalized framework for comparing and evaluating data"
- Running statistical significance tests
- Many tests make assumptions about the data's distribution
- Normal Distributions
- Two parameters
- Mean (mu)
- Std (sigma)
- Variance = sigma^2
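A quick sketch of the two parameters, estimated from a sample. The sample itself is made up (drawn with an assumed mean of 5 and std of 2), and `ddof=1` is my choice to get the sample rather than population estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sample: 10,000 draws from a normal with mean 5 and std 2
sample = rng.normal(loc=5.0, scale=2.0, size=10_000)

mu = sample.mean()
sigma = sample.std(ddof=1)     # sample standard deviation
variance = sample.var(ddof=1)  # sample variance, equal to sigma ** 2

print(mu, sigma, variance)
```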
-
null hypothesis
- statement to 'disprove' or reject with a test
-
Welch's T-test
- Used for comparing two samples which don't necessarily have the same variance or sample size
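- Formula: t = (mu_1 - mu_2) / sqrt(sigma_1^2 / n1 + sigma_2^2 / n2)
- In code:

    import math

    def welch_t(mu_1, mu_2, var_1, var_2, n1, n2):
        # Difference in sample means divided by the combined standard error
        difference = mu_1 - mu_2
        return difference / math.sqrt((var_1 / n1) + (var_2 / n2))

- Calculate degrees of freedom (aka nu)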
- In code:

    def welch_df(mu_1, mu_2, var_1, var_2, n1, n2):
        # mu_1 and mu_2 are unused; kept so the signature matches welch_t
        var_with_n_factored = (var_1 / n1 + var_2 / n2) ** 2
        return var_with_n_factored / (
            var_1 ** 2 / (n1 ** 2 * (n1 - 1)) + var_2 ** 2 / (n2 ** 2 * (n2 - 1))
        )

- Once you have t and nu, you can calculate p-value
- "p == probability of obtaining the t-statistic as extreme as the one observed, if null was true"
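The last step (t and nu to p-value) is not spelled out above. One way to sketch it, assuming a two-sided test and using the t-distribution's survival function from `scipy.stats`. The helper name `welch_p_value` and the summary statistics in the usage lines are made up for illustration:

```python
import math
from scipy import stats

def welch_p_value(mu_1, mu_2, var_1, var_2, n1, n2):
    """Two-sided p-value for Welch's t-test from summary statistics."""
    se_sq = var_1 / n1 + var_2 / n2
    t = (mu_1 - mu_2) / math.sqrt(se_sq)
    nu = se_sq ** 2 / (
        var_1 ** 2 / (n1 ** 2 * (n1 - 1)) + var_2 ** 2 / (n2 ** 2 * (n2 - 1))
    )
    # sf gives P(T > |t|) for one tail; double it to cover both tails
    p = 2 * stats.t.sf(abs(t), nu)
    return t, nu, p

# Hypothetical summary statistics for two samples
t, nu, p = welch_p_value(mu_1=5.0, mu_2=4.5, var_1=1.2, var_2=0.8, n1=30, n2=25)
print(t, nu, p)
```

Fed the sample means and sample variances of two lists, this should agree with `scipy.stats.ttest_ind(..., equal_var=False)`.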
-
ttests in Python
- Use function scipy.stats.ttest_ind
- With args: (list_1, list_2, equal_var=False)
- equal_var=False means it uses Welch's formula
- Returns tuple (t_stat, two_sided_p_value)
- To get a one-sided p-value, divide the two-sided p-value by two and check that t_stat is greater or less than 0 (depending on whether you're looking for positive or negative results)
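The one-sided conversion can be sketched like this. The two lists are made-up numbers and `one_sided_p` is a hypothetical helper, not part of scipy:

```python
from scipy import stats

# Hypothetical samples, made up for illustration
list_1 = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5]
list_2 = [4.2, 4.8, 4.5, 5.0, 4.1]

t_stat, two_sided_p = stats.ttest_ind(list_1, list_2, equal_var=False)

def one_sided_p(t_stat, two_sided_p, direction="greater"):
    """One-sided p-value for mean(list_1) > mean(list_2) ("greater") or < ("less")."""
    in_direction = t_stat > 0 if direction == "greater" else t_stat < 0
    half = two_sided_p / 2
    # If t points the wrong way, the one-sided p-value is the other tail
    return half if in_direction else 1 - half

p_greater = one_sided_p(t_stat, two_sided_p, "greater")
print(t_stat, two_sided_p, p_greater)
```

If the sign of t_stat doesn't match the direction being tested, the halved p-value is not the right number, which is why the sign check matters.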
-
Pandas refresher
- read dataframe from CSV

    import pandas
    df = pandas.read_csv('~/filename')

- filter dataframe

    middleweights = df[df.weightclass == 'middle']

-
Non-parametric tests
- a test that doesn't assume data is drawn from any particular underlying probability distribution
- Mann-Whitney U Test
- Tests null hypothesis that two populations are the same
u, p = scipy.stats.mannwhitneyu(x, y)
-
Non-normal data
- Shapiro-Wilk test
- Tests null hypothesis that the data was drawn from a normal distribution
w_test_stat, prob = scipy.stats.shapiro(data)
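One way the two ideas might fit together: check normality first, then pick the test. This is a sketch, not a rule. The data is made up (exponential, so clearly non-normal), the 0.05 cutoff is an assumption, and `looks_normal` is a hypothetical helper:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical samples: exponential data is clearly non-normal
x = rng.exponential(scale=1.0, size=200)
y = rng.exponential(scale=1.5, size=200)

alpha = 0.05  # assumed significance level

def looks_normal(data, alpha=alpha):
    """True if Shapiro-Wilk fails to reject normality at level alpha."""
    w_test_stat, prob = stats.shapiro(data)
    return prob > alpha

if looks_normal(x) and looks_normal(y):
    stat, p = stats.ttest_ind(x, y, equal_var=False)
    test_used = "welch"
else:
    # Non-parametric fallback: no normality assumption needed
    stat, p = stats.mannwhitneyu(x, y, alternative="two-sided")
    test_used = "mann-whitney"

print(test_used, stat, p)
```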
-
Machine learning
- "Branch of AI that focuses on constructing systems from large amounts of data to make predictions"
- Differs from stats because it's focused on making predictions rather than drawing conclusions
- Types of ML problems
- Supervised learning
- Spam filter
- Unsupervised learning
- Split photos into groups without telling it what groups to use
-
Prediction with Regression
- Takes data points as input variables and builds the equation that fits them most accurately
-
Linear Regression with Gradient Descent
- cost function: want to minimize J(theta)
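J(theta) isn't written out in these notes. A sketch assuming the usual squared-error cost, J(theta) = 1/(2m) * sum((predicted - actual)^2), which is the cost whose gradient gives an update of the form alpha / m * numpy.dot(errors, features). The toy data and the name `compute_cost` are made up:

```python
import numpy as np

def compute_cost(features, values, theta):
    """Assumed squared-error cost: J(theta) = 1/(2m) * sum((predicted - actual)^2)."""
    m = len(values)
    # numpy.dot of an (m, n) matrix with a length-n vector multiplies each row's
    # features by the matching coefficients and sums them: one prediction per row
    predicted_values = np.dot(features, theta)
    return np.sum((predicted_values - values) ** 2) / (2 * m)

# Toy data: the first column of ones acts as the intercept term
features = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
values = np.array([2.0, 4.0, 6.0])

cost_perfect = compute_cost(features, values, np.array([0.0, 2.0]))  # y = 2x fits exactly
cost_zero = compute_cost(features, values, np.array([0.0, 0.0]))
print(cost_perfect, cost_zero)
```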
-
How to minimize cost function
- Start with some Theta value
- For each Theta, update Theta values according to this equation

    theta = theta - alpha / m * numpy.dot((predicted_values - values), features)

-
Need to understand what the hell numpy.dot does. It's never explained.
-
Coefficient of determination (R-squared)
- Have a number of data values y[1] through to y[n]
- Have a bunch of predictions f[1] through to f[n]
- Average value for the data y_bar
- R^2 is:

    1 - ( sum((y[i] - f[i])^2) / sum((y[i] - y_bar)^2) )

- Point of it is to determine how well the coefficients you've gotten from linear regression fit the data
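The R^2 formula translates to numpy fairly directly. A minimal sketch; the function name and the sample numbers are made up:

```python
import numpy as np

def compute_r_squared(data, predictions):
    """R^2 = 1 - sum((y - f)^2) / sum((y - y_bar)^2)."""
    data = np.asarray(data, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    ss_res = np.sum((data - predictions) ** 2)   # error left over after the fit
    ss_tot = np.sum((data - data.mean()) ** 2)   # spread of the data around y_bar
    return 1 - ss_res / ss_tot

# Hypothetical data values and predictions
r_squared = compute_r_squared([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(r_squared)
```

A value close to 1 means the predictions explain most of the variation in the data. Always predicting y_bar gives exactly 0.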
-
Additional considerations
- Other types of linear regression
- Ordinary least squares regression
- Guaranteed to find optimal solution when performing linear regression
- Parameter estimation
- Putting confidence intervals on parameters
- Underfitting data
- Trying to fit linear model to non-linear data
- Gradient descent might find a "local minimum" that's not the actual minimum
- Could try random starting points and compare results
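A sketch of the "random start points" idea, with ordinary least squares (via `numpy.linalg.lstsq`) as the reference answer. The data is synthetic, and the learning rate and iteration count are assumptions picked to suit it. Worth noting: the squared-error cost for linear regression is convex, so here every start should land on the same answer; restarts matter more for non-convex costs.

```python
import numpy as np

def gradient_descent(features, values, theta, alpha, num_iterations):
    """Repeat the update theta = theta - alpha / m * dot(errors, features)."""
    m = len(values)
    for _ in range(num_iterations):
        predicted_values = np.dot(features, theta)
        theta = theta - alpha / m * np.dot(predicted_values - values, features)
    return theta

rng = np.random.default_rng(1)
# Synthetic data: y = 3 + 2x plus a little noise
x = rng.uniform(0, 1, size=100)
features = np.column_stack([np.ones_like(x), x])  # column of ones for the intercept
values = 3.0 + 2.0 * x + rng.normal(0, 0.1, size=100)

# Three random starting points for theta
thetas = [
    gradient_descent(features, values, rng.normal(size=2), alpha=0.5, num_iterations=5000)
    for _ in range(3)
]

# Ordinary least squares solves the same problem directly
theta_ols, *_ = np.linalg.lstsq(features, values, rcond=None)

for theta in thetas:
    print(theta, theta_ols)
```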
-
Kurt's Advice for Machine Learning practices
- With any problem:
- What do we know?
- What expectations do we have?
- Is there any intuition for the data?
- Pick which part of data science you are most interested in and focus on it
- Coding
- Maths/stats
- Communication



