Commit 8dd7556

Add changes from final hand edits to hard copy. Also improve preface for bookdown version.
1 parent e6ebd56 commit 8dd7556

35 files changed: +3019 additions, -2791 deletions

01-Introduction.Rmd (1 addition, 1 deletion)

@@ -663,7 +663,7 @@ This application involves both non-normal data (number of stops by ethnic group
 
 1. **Kentucky Derby.** The next set of questions is related to the Kentucky Derby case study from this chapter.
 
-a. Discuss the pros and cons of using side-by-side boxplots vs. stacked histograms to illustrate the relationships between year and track condition in Figure \@ref(fig:bivariate).
+a. Discuss the pros and cons of using side-by-side boxplots vs. stacked histograms to illustrate the relationship between year and track condition in Figure \@ref(fig:bivariate).
 b. Why is a scatterplot more informative than a correlation coefficient to describe the relationship between speed of the winning horse and year in Figure \@ref(fig:bivariate).
 c. How might you incorporate a fourth variable, say number of starters, into Figure \@ref(fig:codeds)?
 d. Explain why $\epsilon_i$ in Equation \@ref(eq:model1) measures the vertical distance from a data point to the regression line.

02-Beyond-Most-Least-Squares.Rmd (7 additions, 7 deletions)

@@ -267,7 +267,7 @@ Lik.f(nBoys = 30, nGirls = 20, nGrid = 50)
 # more precise MLE for p_B based on finer grid (more points)
 Lik.f(nBoys = 30, nGirls = 20, nGrid = 1000)
 
-## Another approach: using R's optimize command
+## Another approach: using R's optimize function
 ## Note that the log-likelihood is optimized here
 oLik.f <- function(pb){
 return(30*log(pb) + 20*log(1-pb))

@@ -421,9 +421,9 @@ Numfams <- c(930,951,582,666,666,530,186,177,173,
 148,151,125,182,159)
 Numchild <- c(930,951,1164,1332,1332,1060,558,531,
 519,444,453,375,546,477)
-Malesfemales <- c("97 boys to 100 girls"," ",
-"104 boys to 100 girls"," "," "," ",
-"104 boys to 100 girls"," "," "," "," "," "," "," ")
+Malesfemales <- c("97:100"," ",
+"104:100"," "," "," ",
+"104:100"," "," "," "," "," "," "," ")
 PB <- c("0.494", " ","0.511"," "," "," ","0.510"," "," ",
 " "," "," "," "," ")
 ```

@@ -708,7 +708,7 @@ We have convincing evidence that the Sex Conditional Model provides a significan
 
 *Note: *You may notice that the LRT is similar in spirit to the extra-sum-of-squares F-test used in linear regression. Recall that the extra-sum-of-squares F-test involves comparing two nested models. When the smaller model is true, the F-ratio follows an F-distribution which on average is 1.0. A large, unusual F-ratio provides evidence that the larger model provides a significant improvement.
 
-*Also note: * It might have been more logical to start by using Likelihood Ratio Test to determine whether the probability of having a boy differs significantly from 0.5. We leave this as an exercise.
+*Also note: * It might have been more logical to start by using a Likelihood Ratio Test to determine whether the probability of having a boy differs significantly from 0.5. We leave this as an exercise.
 
 ## Model 3: Stopping Rule Model (waiting for a boy)
 

@@ -763,8 +763,8 @@ Cont <- c("$\\bstop$",
 table9chp2 <- data.frame(Famcomp2, Numfamis3, Lik, Cont)
 colnames(table9chp2) <- c("Family Composition",
 "Num. families",
-" Likelihood",
-"Contribution")
+"Likelihood Contribution",
+" ")
 kable(table9chp2, booktabs = T, escape = F,
 caption="Likelihood contributions for NLSY families in Model 3: Waiting for a boy.")%>%
 column_spec(1, width = "3cm") %>%
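As an aside, the log-likelihood that `optimize` maximizes in the first hunk above has a closed-form optimum. A minimal, self-contained sketch of that computation (the function body and counts come from the hunk; the search interval is our choice):

```r
# Log-likelihood for P(boy) given 30 boys and 20 girls (constants dropped)
oLik.f <- function(pb){
  return(30*log(pb) + 20*log(1-pb))
}

# Maximize over (0, 1); the closed-form MLE is 30/(30 + 20) = 0.6
mle <- optimize(oLik.f, interval = c(0, 1), maximum = TRUE)
mle$maximum  # approximately 0.6
```

Calculus agrees: setting the derivative 30/p - 20/(1-p) to zero yields p = 0.6.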

03-Distribution-Theory.Rmd (2 additions, 2 deletions)

@@ -351,7 +351,7 @@ Suppose we have a Poisson process with rate $\lambda$, and we wish to model the
 f(y) = \lambda e^{-\lambda y} \quad \textrm{for} \quad y > 0,
 (\#eq:expRV)
 \end{equation}
-where $\E(Y) = 1/\lambda$, $\SD(Y) = 1/\lambda$. Figure \@ref(fig:multExp) displays three exponential distributions with different $\lambda$ values. As $\lambda$ increases, $\E(Y)$ tends towards 0, and distributions "die off" quicker.
+where $\E(Y) = 1/\lambda$ and $\SD(Y) = 1/\lambda$. Figure \@ref(fig:multExp) displays three exponential distributions with different $\lambda$ values. As $\lambda$ increases, $\E(Y)$ tends towards 0, and distributions "die off" quicker.
 
 (ref:multExp) Exponential distributions with $\lambda = 0.5, 1,$ and $5$.
 

@@ -586,7 +586,7 @@ In this course, we encounter $\chi^2$ distributions \index{chi-square distributi
 
 In general, $\chi^2$ distributions with $k$ degrees of freedom are right skewed with a mean $k$ and standard deviation $\sqrt{2k}$. Figure \@ref(fig:multChisq) displays chi-square distributions with different values of $k$.
 
-The $\chi^2$ distribution is a special case of gamma distributions. Specifically, a $\chi^2$ distribution with $k$ degrees of freedom can be expressed as a gamma distribution with $\lambda = 1/2$ and $r = k/2$.
+The $\chi^2$ distribution is a special case of a gamma distribution. Specifically, a $\chi^2$ distribution with $k$ degrees of freedom can be expressed as a gamma distribution with $\lambda = 1/2$ and $r = k/2$.
 
 (ref:multChisq) $\chi^2$ distributions with 1, 3, and 7 degrees of freedom..
 
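The chi-square/gamma relationship corrected in the second hunk is easy to spot-check numerically. A small sketch, assuming base R's `dchisq` and `dgamma` (where the `rate` argument plays the role of $\lambda$ and `shape` the role of $r$ from the hunk):

```r
# A chi-square density with k degrees of freedom matches a gamma density
# with rate lambda = 1/2 and shape r = k/2, at any y > 0
k <- 7
y <- seq(0.5, 20, by = 0.5)
all.equal(dchisq(y, df = k),
          dgamma(y, shape = k/2, rate = 1/2))  # TRUE
```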

04-Poisson-Regression.Rmd (1 addition, 1 deletion)

@@ -732,7 +732,7 @@ table4ch4 <- c.data %>%
 kable(table4ch4, booktabs=T, caption = 'The mean and variance of the violent crime rate by region and type of institution.')
 ```
 
-```{r, boxtyperegion, fig.align="center",out.width="60%", fig.cap='Boxplot of violent crime rate by region and type of institution.',echo=FALSE, warning=FALSE, message=FALSE}
+```{r, boxtyperegion, fig.align="center",out.width="60%", fig.cap='Boxplot of violent crime rate by region and type of institution (colleges (C) on the left, and universities (U) on the right).',echo=FALSE, warning=FALSE, message=FALSE}
 #Insert boxplot without the outlier and combining S and SE
 ggplot(c.data, aes(x = region, y = nvrate, fill = type)) +
 geom_boxplot() +

06-Logistic-Regression.Rmd (4 additions, 4 deletions)

@@ -194,7 +194,7 @@ p_0=\frac{e^{\beta_0}}{1+e^{\beta_0}}
 
 We use likelihood methods to estimate $\beta_0$ and $\beta_1$. As we had done in Chapter \@ref(ch-beyondmost), we can write the likelihood for this example in the following form:
 
-\[\Lik(p_1, p_0) = {28 \choose 22}p_1^{22}(1-p_1)^{2}
+\[\Lik(p_1, p_0) = {24 \choose 22}p_1^{22}(1-p_1)^{2}
 {180 \choose 141}p_0^{141}(1-p_0)^{39}\]
 
 Our interest centers on estimating $\hat{\beta_0}$ and $\hat{\beta_1}$, not $p_1$ or $p_0$. So we replace $p_1$ in the likelihood with an expression for $p_1$ in terms of $\beta_0$ and $\beta_1$ as in Equation \@ref(eq:pBehindform). Similarly, $p_0$ in Equation \@ref(eq:pNotBehindform) involves only $\beta_0$. After removing constants, the new likelihood looks like:

@@ -536,7 +536,7 @@ A deviance residual can be calculated for each observation using:
 
 When the number of trials is large for all of the observations and the models are appropriate, both sets of residuals should follow a standard normal distribution.
 
-The sum of the individual deviance residuals is referred to as the **deviance** or **residual deviance**. \index{residual deviance} The residual deviance is used to assess the model. As the name suggests, a model with a small deviance is preferred. In the case of binomial regression, when the denominators, $m_i$, are large and a model fits, the residual deviance follows a $\chi^2$ distribution with $n-p$ degrees of freedom (the residual degrees of freedom). Thus for a good fitting model the residual deviance should be approximately equal to its corresponding degrees of freedom. When binomial data meets these conditions, the deviance can be used for a goodness-of-fit test. The p-value for lack-of-fit is the proportion of values from a $\chi_{n-p}^2$ that are greater than the observed residual deviance.
+The sum of the individual deviance residuals is referred to as the **deviance** or **residual deviance**. \index{residual deviance} The residual deviance is used to assess the model. As the name suggests, a model with a small deviance is preferred. In the case of binomial regression, when the denominators, $m_i$, are large and a model fits, the residual deviance follows a $\chi^2$ distribution with $n-p$ degrees of freedom (the residual degrees of freedom). Thus for a good fitting model the residual deviance should be approximately equal to its corresponding degrees of freedom. When binomial data meets these conditions, the deviance can be used for a goodness-of-fit test. The p-value for lack-of-fit is the proportion of values from a $\chi_{n-p}^2$ distribution that are greater than the observed residual deviance.
 
 We begin a residual analysis of our interaction model by plotting the residuals against the fitted values in Figure \@ref(fig:resid). This kind of plot for binomial regression would produce two linear trends with similar negative slopes if there were equal sample sizes $m_i$ for each observation.
 

@@ -652,7 +652,7 @@ We began by fitting a logistic regression model with `distance` alone. Then we a
 
 ## Case Study: Trying to Lose Weight
 
-The final case study uses individual-specific information so that our response, rather than the number of successes out of some number of trials, is simply a binary variable taking on values of 0 or 1 (for failure/success, no/yes, etc.). This type of problem---__binary logistic regression__---is exceedingly common in practice \index{binary logistic regression}. Here we examine characteristics of young people who are trying to lose weight. The prevalence of obesity among U.S. youth suggests that wanting to lose weight is sensible and desirable for some young people such as those with a high body mass index (BMI). On the flip side, there are young people who do not need to lose weight but make ill-advised attempts to do so nonetheless. A multitude of studies on weight loss focus specifically on youth and propose a variety of motivations for the young wanting to lose weight; athletics and the media are two commonly cited sources of motivation for losing weight for young people.
+The final case study uses individual-specific information so that our response, rather than the number of successes out of some number of trials, is simply a binary variable taking on values of 0 or 1 (for failure/success, no/yes, etc.). This type of problem---__binary logistic regression__---is exceedingly common in practice. \index{binary logistic regression} Here we examine characteristics of young people who are trying to lose weight. The prevalence of obesity among U.S. youth suggests that wanting to lose weight is sensible and desirable for some young people such as those with a high body mass index (BMI). On the flip side, there are young people who do not need to lose weight but make ill-advised attempts to do so nonetheless. A multitude of studies on weight loss focus specifically on youth and propose a variety of motivations for the young wanting to lose weight; athletics and the media are two commonly cited sources of motivation for losing weight for young people.
 
 Sports have been implicated as a reason for young people wanting to shed pounds, but not all studies are consistent with this idea. For example, a study by @Martinsen2009 reported that, despite preconceptions to the contrary, there was a higher rate of self-reported eating disorders among controls (non-elite athletes) as opposed to elite athletes. Interestingly, the kind of sport was not found to be a factor, as participants in leanness sports (for example, distance running, swimming, gymnastics, dance, and diving) did not differ in the proportion with eating disorders when compared to those in non-leanness sports. So, in our analysis, we will not make a distinction between different sports.
 

@@ -1303,5 +1303,5 @@ summary(model1a)
 4. __Trashball.__ Great for a rainy day! A fun way to generate overdispersed binomial data. Each student crumbles an 8.5 by 11 inch sheet and tosses it from three prescribed distances ten times each. The response is the number of made baskets out of 10 tosses, keeping track of the distance. Have the class generate and collect potential covariates, and include them in your data set (e.g., years of basketball experience, using a tennis ball instead of a sheet of paper, height). Some sample analysis steps:
 
 a. Create scatterplots of logits vs. continuous predictors (distance, height, shot number, etc.) and boxplots of logit vs. categorical variables (sex, type of ball, etc.). Summarize important trends in one or two sentences.
-b. Create a graph with empirical logits vs. distance plotted separately by sex. What might you conclude from this plot?
+b. Create a graph with empirical logits vs. distance plotted separately by type of ball. What might you conclude from this plot?
 c. Find a binomial model using the variables that you collected. Give a brief discussion on your findings.
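The corrected coefficient ${24 \choose 22}$ in the first hunk matches the 22 successes out of 24 trials in the first likelihood factor. As a quick check that each binomial factor peaks near its sample proportion, here is an illustrative grid search (the counts come from the hunk; the grid itself is ours):

```r
# Maximize each binomial factor of the likelihood over a grid of probabilities
p <- seq(0.01, 0.99, by = 0.01)
p[which.max(dbinom(22, size = 24, prob = p))]    # 0.92, near 22/24
p[which.max(dbinom(141, size = 180, prob = p))]  # 0.78, near 141/180
```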

07-Correlated-Data.Rmd (3 additions, 3 deletions)

@@ -60,9 +60,9 @@ tModelName <- c("fit\\_1a\\_binom", "fit\\_1a\\_quasi", "fit\\_1b\\_binom", "fit
 "Model Name", "fit\\_2a\\_binom","fit\\_2a\\_quasi", "fit\\_2b\\_binom", "fit\\_2b\\_quasi")
 
 tBeta <- c("","","","","",
-"$\\beta_1$", "","","","")
+"$\\hat{\\beta}_1$", "","","","")
 tSEBeta <- c("","","","","",
-"SE $\\beta_1$", "","","","")
+"SE $\\hat{\\beta}_1$", "","","","")
 tTStat <- c("","","","","",
 "$t$", "","","","")
 tPVal <- c("","","","","",

@@ -81,7 +81,7 @@ tGOFP <- c("","X","","X","",
 "GOF p value", "","X","","X")
 
 scenarioSimTab <- tibble(tScenario, tModel, tModelName, tBeta, tSEBeta, tTStat, tPVal, tPhi, tEst, tCI, tMeanCount, tSDCount, tGOFP)
-colnames(scenarioSimTab) <- c("Scenario", "Model", "Model Name", "$\\beta_0$", "SE $\\beta_0$", "$t$", "p value", "$\\phi$", "Est prob", "CI prob", "Mean count", "SD count", "GOF p value")
+colnames(scenarioSimTab) <- c("Scenario", "Model", "Model Name", "$\\hat{\\beta}_0$", "SE $\\hat{\\beta}_0$", "$t$", "p value", "$\\phi$", "Est prob", "CI prob", "Mean count", "SD count", "GOF p value")
 
 kable(scenarioSimTab, booktabs=T, caption="Summary of simulations for Dams and Pups case study.", escape=F) %>%
 kable_styling(latex_options = "scale_down", font_size = 9) %>%
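The hats added to $\beta$ in this table mark estimates, and the table's point is that binomial and quasibinomial fits give identical estimates but different standard errors. A simulated sketch of that relationship (the data, sample sizes, and covariate here are invented for illustration and are not the book's Dams and Pups simulation):

```r
# Overdispersed binomial data: observation-specific success probabilities
set.seed(1)
n <- 10                                # trials per observation
p <- rbeta(100, 2, 2)                  # varying probabilities induce overdispersion
y <- rbinom(100, size = n, prob = p)   # overdispersed counts
x <- rnorm(100)                        # an arbitrary covariate

fit_binom <- glm(cbind(y, n - y) ~ x, family = binomial)
fit_quasi <- glm(cbind(y, n - y) ~ x, family = quasibinomial)

# Same coefficient estimates; quasibinomial SEs are the binomial SEs
# inflated by sqrt(phi-hat), the estimated dispersion
phi <- summary(fit_quasi)$dispersion
summary(fit_binom)$coefficients[, "Std. Error"] * sqrt(phi)
summary(fit_quasi)$coefficients[, "Std. Error"]  # identical to the line above
```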
