01-Introduction.Rmd: 1 addition & 1 deletion
@@ -663,7 +663,7 @@ This application involves both non-normal data (number of stops by ethnic group

 1.**Kentucky Derby.** The next set of questions is related to the Kentucky Derby case study from this chapter.

-a. Discuss the pros and cons of using side-by-side boxplots vs. stacked histograms to illustrate the relationships between year and track condition in Figure \@ref(fig:bivariate).
+a. Discuss the pros and cons of using side-by-side boxplots vs. stacked histograms to illustrate the relationship between year and track condition in Figure \@ref(fig:bivariate).
 b. Why is a scatterplot more informative than a correlation coefficient to describe the relationship between speed of the winning horse and year in Figure \@ref(fig:bivariate).
 c. How might you incorporate a fourth variable, say number of starters, into Figure \@ref(fig:codeds)?
 d. Explain why $\epsilon_i$ in Equation \@ref(eq:model1) measures the vertical distance from a data point to the regression line.
@@ -708,7 +708,7 @@ We have convincing evidence that the Sex Conditional Model provides a significan

 *Note: *You may notice that the LRT is similar in spirit to the extra-sum-of-squares F-test used in linear regression. Recall that the extra-sum-of-squares F-test involves comparing two nested models. When the smaller model is true, the F-ratio follows an F-distribution which on average is 1.0. A large, unusual F-ratio provides evidence that the larger model provides a significant improvement.

-*Also note: * It might have been more logical to start by using Likelihood Ratio Test to determine whether the probability of having a boy differs significantly from 0.5. We leave this as an exercise.
+*Also note: * It might have been more logical to start by using a Likelihood Ratio Test to determine whether the probability of having a boy differs significantly from 0.5. We leave this as an exercise.

 ## Model 3: Stopping Rule Model (waiting for a boy)
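Reviewer note on the hunk above: the likelihood ratio test that the note refers to can be written out as a short worked equation. This is a sketch of the standard form, not text from the commit; the labels "reduced" and "larger", the symbol $d$, and $p_B$ (the probability of a boy) are notation assumed here for illustration.

```latex
% Standard LRT for two nested models (notation assumed, not taken from the commit).
\[
  \mathrm{LRT} = 2\left[\log L(\text{larger model}) - \log L(\text{reduced model})\right]
\]
% When the reduced model is true, the statistic is approximately chi-square,
% with d equal to the difference in the number of free parameters.
\[
  \mathrm{LRT} \;\overset{\text{approx.}}{\sim}\; \chi^2_{d}
\]
% For the exercise mentioned in the note: the reduced model fixes p_B = 0.5
% (no free parameters) and the larger model estimates p_B (one free parameter).
\[
  d = 1 - 0 = 1
\]
```

As with the extra-sum-of-squares F-test, a large value of the statistic relative to its reference distribution is evidence in favor of the larger model.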
-where $\E(Y) = 1/\lambda$, $\SD(Y) = 1/\lambda$. Figure \@ref(fig:multExp) displays three exponential distributions with different $\lambda$ values. As $\lambda$ increases, $\E(Y)$ tends towards 0, and distributions "die off" quicker.
+where $\E(Y) = 1/\lambda$ and $\SD(Y) = 1/\lambda$. Figure \@ref(fig:multExp) displays three exponential distributions with different $\lambda$ values. As $\lambda$ increases, $\E(Y)$ tends towards 0, and distributions "die off" quicker.

 (ref:multExp) Exponential distributions with $\lambda = 0.5, 1,$ and $5$.

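Reviewer note on the hunk above: the claim that $\E(Y)$ tends towards 0 as $\lambda$ increases can be checked with the three $\lambda$ values named in the figure caption. The arithmetic below is a sketch that uses only the formulas in the changed line; it writes $\mathrm{E}$ and $\mathrm{SD}$ in place of the source's `\E` and `\SD` macros.

```latex
% Mean and standard deviation of an exponential distribution are both 1/lambda.
\[
  \mathrm{E}(Y) = \mathrm{SD}(Y) = \frac{1}{\lambda}
\]
% Evaluated at the three values in the caption of Figure multExp:
\[
  \lambda = 0.5:\ \tfrac{1}{0.5} = 2, \qquad
  \lambda = 1:\ \tfrac{1}{1} = 1, \qquad
  \lambda = 5:\ \tfrac{1}{5} = 0.2
\]
```

The mean shrinks from 2 to 0.2 as $\lambda$ grows from 0.5 to 5, which matches the statement that the distributions "die off" quicker.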
@@ -586,7 +586,7 @@ In this course, we encounter $\chi^2$ distributions \index{chi-square distributi

 In general, $\chi^2$ distributions with $k$ degrees of freedom are right skewed with a mean $k$ and standard deviation $\sqrt{2k}$. Figure \@ref(fig:multChisq) displays chi-square distributions with different values of $k$.

-The $\chi^2$ distribution is a special case of gamma distributions. Specifically, a $\chi^2$ distribution with $k$ degrees of freedom can be expressed as a gamma distribution with $\lambda = 1/2$ and $r = k/2$.
+The $\chi^2$ distribution is a special case of a gamma distribution. Specifically, a $\chi^2$ distribution with $k$ degrees of freedom can be expressed as a gamma distribution with $\lambda = 1/2$ and $r = k/2$.

 (ref:multChisq) $\chi^2$ distributions with 1, 3, and 7 degrees of freedom..
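Reviewer note on the hunk above: the special-case statement can be verified by substitution. The sketch below assumes the rate parameterization of the gamma density (shape $r$, rate $\lambda$), which is consistent with the values $\lambda = 1/2$ and $r = k/2$ quoted in the changed line but is not shown in this diff.

```latex
% Gamma density, assuming shape r and rate lambda (parameterization assumed).
\[
  f(y) = \frac{\lambda^{r}}{\Gamma(r)}\, y^{r-1} e^{-\lambda y}, \qquad y > 0
\]
% Substitute lambda = 1/2 and r = k/2 to obtain the chi-square density with k df.
\[
  f(y) = \frac{1}{2^{k/2}\,\Gamma(k/2)}\, y^{k/2 - 1} e^{-y/2}
\]
% The gamma mean r/lambda and standard deviation sqrt(r)/lambda then give:
\[
  \mathrm{E}(Y) = \frac{k/2}{1/2} = k, \qquad
  \mathrm{SD}(Y) = \frac{\sqrt{k/2}}{1/2} = \sqrt{2k}
\]
```

Both results agree with the mean $k$ and standard deviation $\sqrt{2k}$ stated in the context line of the hunk.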
04-Poisson-Regression.Rmd: 1 addition & 1 deletion
@@ -732,7 +732,7 @@ table4ch4 <- c.data %>%
 kable(table4ch4, booktabs=T, caption = 'The mean and variance of the violent crime rate by region and type of institution.')
 ```

-```{r, boxtyperegion, fig.align="center",out.width="60%", fig.cap='Boxplot of violent crime rate by region and type of institution.',echo=FALSE, warning=FALSE, message=FALSE}
+```{r, boxtyperegion, fig.align="center",out.width="60%", fig.cap='Boxplot of violent crime rate by region and type of institution (colleges (C) on the left, and universities (U) on the right).',echo=FALSE, warning=FALSE, message=FALSE}
 #Insert boxplot without the outlier and combining S and SE
 ggplot(c.data, aes(x = region, y = nvrate, fill = type)) +
We use likelihood methods to estimate $\beta_0$ and $\beta_1$. As we had done in Chapter \@ref(ch-beyondmost), we can write the likelihood for this example in the following form:
Our interest centers on estimating $\hat{\beta_0}$ and $\hat{\beta_1}$, not $p_1$ or $p_0$. So we replace $p_1$ in the likelihood with an expression for $p_1$ in terms of $\beta_0$ and $\beta_1$ as in Equation \@ref(eq:pBehindform). Similarly, $p_0$ in Equation \@ref(eq:pNotBehindform) involves only $\beta_0$. After removing constants, the new likelihood looks like:
@@ -536,7 +536,7 @@ A deviance residual can be calculated for each observation using:

 When the number of trials is large for all of the observations and the models are appropriate, both sets of residuals should follow a standard normal distribution.

-The sum of the individual deviance residuals is referred to as the **deviance** or **residual deviance**. \index{residual deviance} The residual deviance is used to assess the model. As the name suggests, a model with a small deviance is preferred. In the case of binomial regression, when the denominators, $m_i$, are large and a model fits, the residual deviance follows a $\chi^2$ distribution with $n-p$ degrees of freedom (the residual degrees of freedom). Thus for a good fitting model the residual deviance should be approximately equal to its corresponding degrees of freedom. When binomial data meets these conditions, the deviance can be used for a goodness-of-fit test. The p-value for lack-of-fit is the proportion of values from a $\chi_{n-p}^2$ that are greater than the observed residual deviance.
+The sum of the individual deviance residuals is referred to as the **deviance** or **residual deviance**. \index{residual deviance} The residual deviance is used to assess the model. As the name suggests, a model with a small deviance is preferred. In the case of binomial regression, when the denominators, $m_i$, are large and a model fits, the residual deviance follows a $\chi^2$ distribution with $n-p$ degrees of freedom (the residual degrees of freedom). Thus for a good fitting model the residual deviance should be approximately equal to its corresponding degrees of freedom. When binomial data meets these conditions, the deviance can be used for a goodness-of-fit test. The p-value for lack-of-fit is the proportion of values from a $\chi_{n-p}^2$ distribution that are greater than the observed residual deviance.

 We begin a residual analysis of our interaction model by plotting the residuals against the fitted values in Figure \@ref(fig:resid). This kind of plot for binomial regression would produce two linear trends with similar negative slopes if there were equal sample sizes $m_i$ for each observation.

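Reviewer note on the hunk above: the goodness-of-fit test described in the changed paragraph can be summarized in symbols. This is a sketch under one stated assumption: it uses the usual definition in which the residual deviance $D$ is the sum of the squared deviance residuals $d_i$. The symbols $D$, $d_i$, and $D_{\text{obs}}$ are notation introduced here, not taken from the commit.

```latex
% Residual deviance, assuming the usual sum-of-squares definition.
\[
  D = \sum_{i=1}^{n} d_i^{2}
\]
% When the m_i are large and the model fits, D is approximately chi-square
% with n - p degrees of freedom, so its expected value is about n - p.
\[
  D \;\overset{\text{approx.}}{\sim}\; \chi^2_{n-p}, \qquad \mathrm{E}(D) \approx n - p
\]
% Lack-of-fit p-value: the upper tail area beyond the observed deviance.
\[
  \text{p-value} = P\!\left(\chi^2_{n-p} > D_{\text{obs}}\right)
\]
```

The middle line is why a well-fitting model should have a residual deviance close to its residual degrees of freedom: the mean of a $\chi^2$ distribution equals its degrees of freedom.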
@@ -652,7 +652,7 @@ We began by fitting a logistic regression model with `distance` alone. Then we a

 ## Case Study: Trying to Lose Weight

-The final case study uses individual-specific information so that our response, rather than the number of successes out of some number of trials, is simply a binary variable taking on values of 0 or 1 (for failure/success, no/yes, etc.). This type of problem---__binary logistic regression__---is exceedingly common in practice \index{binary logistic regression}. Here we examine characteristics of young people who are trying to lose weight. The prevalence of obesity among U.S. youth suggests that wanting to lose weight is sensible and desirable for some young people such as those with a high body mass index (BMI). On the flip side, there are young people who do not need to lose weight but make ill-advised attempts to do so nonetheless. A multitude of studies on weight loss focus specifically on youth and propose a variety of motivations for the young wanting to lose weight; athletics and the media are two commonly cited sources of motivation for losing weight for young people.
+The final case study uses individual-specific information so that our response, rather than the number of successes out of some number of trials, is simply a binary variable taking on values of 0 or 1 (for failure/success, no/yes, etc.). This type of problem---__binary logistic regression__---is exceedingly common in practice. \index{binary logistic regression} Here we examine characteristics of young people who are trying to lose weight. The prevalence of obesity among U.S. youth suggests that wanting to lose weight is sensible and desirable for some young people such as those with a high body mass index (BMI). On the flip side, there are young people who do not need to lose weight but make ill-advised attempts to do so nonetheless. A multitude of studies on weight loss focus specifically on youth and propose a variety of motivations for the young wanting to lose weight; athletics and the media are two commonly cited sources of motivation for losing weight for young people.

 Sports have been implicated as a reason for young people wanting to shed pounds, but not all studies are consistent with this idea. For example, a study by @Martinsen2009 reported that, despite preconceptions to the contrary, there was a higher rate of self-reported eating disorders among controls (non-elite athletes) as opposed to elite athletes. Interestingly, the kind of sport was not found to be a factor, as participants in leanness sports (for example, distance running, swimming, gymnastics, dance, and diving) did not differ in the proportion with eating disorders when compared to those in non-leanness sports. So, in our analysis, we will not make a distinction between different sports.

@@ -1303,5 +1303,5 @@ summary(model1a)
 4.__Trashball.__ Great for a rainy day! A fun way to generate overdispersed binomial data. Each student crumbles an 8.5 by 11 inch sheet and tosses it from three prescribed distances ten times each. The response is the number of made baskets out of 10 tosses, keeping track of the distance. Have the class generate and collect potential covariates, and include them in your data set (e.g., years of basketball experience, using a tennis ball instead of a sheet of paper, height). Some sample analysis steps:

 a. Create scatterplots of logits vs. continuous predictors (distance, height, shot number, etc.) and boxplots of logit vs. categorical variables (sex, type of ball, etc.). Summarize important trends in one or two sentences.
-b. Create a graph with empirical logits vs. distance plotted separately by sex. What might you conclude from this plot?
+b. Create a graph with empirical logits vs. distance plotted separately by type of ball. What might you conclude from this plot?
 c. Find a binomial model using the variables that you collected. Give a brief discussion on your findings.