Read Regression and Other Stories, Chapters 13 “Logistic regression”, and 15 “Other generalized linear models”, (Gelman, Hill, and Vehtari 2020)
A detailed guide to generalized linear models.
Read An Introduction to Statistical Learning with Applications in R, Chapter 4 “Classification”, (James et al. 2021)
A complementary treatment of generalized linear models from a different perspective.
Read We Gave Four Good Pollsters the Same Raw Data. They Had Four Different Results, (Cohn 2016)
Details a situation in which different modelling choices, given the same dataset, result in different forecasts.
Key concepts and skills
Linear regression can be generalized for alternative types of dependent variables. For instance, logistic and Poisson regression have a binary and integer count dependent variable, respectively.
Linear models, covered in Chapter 12, have evolved substantially over the past century. Francis Galton, mentioned in Chapter 8, and others of his generation, some of whom were eugenicists, used linear regression in earnest in the late 1800s and early 1900s. Binary outcomes quickly became of interest and needed special treatment, leading to the development and wide adoption of logistic regression and similar methods in the mid-1900s (Cramer 2003). The generalized linear model framework came into being, in a formal sense, in the 1970s with Nelder and Wedderburn (1972). Generalized linear models (GLMs) broaden the types of outcomes that are allowed. We still model outcomes as a linear function, but we are less constrained. The outcome can be anything in the exponential family, and popular choices include the logistic distribution and the Poisson distribution. A further generalization of GLMs is generalized additive models (GAMs), where we broaden the structure of the explanatory side. We still explain the dependent variable as an additive function of various bits and pieces, but those bits and pieces can be functions. This framework was proposed by Hastie and Tibshirani (1990).
Shoulders of giants
Dr Robert Tibshirani is Professor in the Departments of Statistics and Biomedical Data Science at Stanford University. After earning a PhD in Statistics from Stanford University in 1981, he joined the University of Toronto as an assistant professor. He was promoted to full professor in 1994 and moved to Stanford in 1998. He made fundamental contributions including GAMs, mentioned above, and lasso regression, which is an approach to automated variable selection that will be covered in Chapter 16. He is an author of James et al. (2021). He was awarded the COPSS Presidents’ Award in 1996 and was appointed a Fellow of the Royal Society in 2019.
In this chapter we consider logistic, Poisson, and negative binomial regression.
13.2 Logistic regression
Linear regression is a useful way to better understand our data. But it assumes a continuous outcome variable that can take any number on the real line. We would like some way to use this same machinery when we cannot satisfy this condition. We turn to logistic and Poisson regression for binary and count outcome variables, respectively, noting they are still linear models, because the independent variables enter in a linear fashion.
Logistic regression, and its close variants, are useful in a variety of settings, from elections (Wang et al. 2015) through to horse racing (Chellel 2018; Bolton and Chapman 1986). We use logistic regression when the dependent variable is a binary outcome, such as 0 or 1, or “yes” or “no”. Although the presence of a binary outcome variable may sound limiting, there are a lot of circumstances in which the outcome either naturally falls into this situation or can be adjusted into it.
The foundation of this is the Bernoulli distribution where there is a certain probability, \(p\), of outcome “1” and the remainder, \(1-p\), for outcome “0”. We can use rbernoulli() to simulate data from the Bernoulli distribution.
library(tidyverse)

set.seed(853)

bernoulli_example <-
  tibble(draws = rbernoulli(n = 20, p = 0.1))

bernoulli_example |>
  pull(draws)
One reason to use logistic regression is that we will be modelling a probability, which is bounded between 0 and 1, whereas with linear regression we may end up with values outside this range. The foundation of logistic regression is the logit function:
\[
\mbox{logit}(x) = \log\left(\frac{x}{1-x}\right).
\] This maps values between 0 and 1 onto the real line. For instance, logit(0.1) = -2.2, logit(0.5) = 0, and logit(0.9) = 2.2.
To illustrate logistic regression, we will simulate data on whether it is a weekday or weekend, based on the number of cars on the road. We will assume that on weekdays the road is busier.
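A minimal sketch of such a simulation, where the threshold of 50 cars, the noise scale, and the variable names are illustrative assumptions:

set.seed(853)

week_or_weekday <-
  tibble(
    number_of_cars = sample.int(n = 100, size = 1000, replace = TRUE),
    noise = rnorm(n = 1000, mean = 0, sd = 10),
    # Assume busier roads (more than about 50 cars) indicate a weekday
    is_weekday = if_else(number_of_cars + noise > 50, 1, 0)
  ) |>
  select(-noise)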
We can use glm() from base R to do a quick estimation. In this case we will try to work out whether it is a weekday or weekend, based on the number of cars we can see. We are interested in estimating Equation 13.1:

\[
\mbox{Pr}(y_i=1) = \mbox{logit}^{-1}\left(\beta_0+\beta_1 x_i\right)
\tag{13.1}\]

where \(y_i\) is whether it is a weekday and \(x_i\) is the number of cars on the road. That is, we model \(y_i\) as a draw from a Bernoulli distribution whose probability is the inverse logit (sometimes called the sigmoid) of a linear function of the predictor: \(y_i \sim \mbox{Bernoulli}(\pi_i)\) with \(\mbox{logit}(\pi_i) = \beta_0 + \beta_1 x_i\).
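A sketch of that estimation, using the simulated dataset from above; setting the family to "binomial" is what gives logistic regression:

week_or_weekday_model <-
  glm(
    is_weekday ~ number_of_cars,
    data = week_or_weekday,
    family = "binomial"
  )

summary(week_or_weekday_model)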
The results of the estimation are in the first column of Table 13.2. The estimated coefficient on the number of cars is 0.19.
The interpretation of logistic regression coefficients is more complicated than for linear regression, as they relate to changes in the log-odds of the binary outcome. For instance, the estimate of 0.19 is the average change in the log-odds of it being a weekday associated with observing one extra car on the road. The coefficient is positive, which means an increase, but because the relationship between log-odds and probability is non-linear, the effect of a particular change on the probability depends on the baseline (e.g. an increase of 0.19 log-odds has a larger effect on the probability when the baseline log-odds are 0, compared with 2).
We can translate our estimate into the probability of it being a weekday, for a given number of cars. predictions() from marginaleffects (Arel-Bundock 2022) can be used to add the implied probability that it is a weekday for each observation.
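A sketch of that step, assuming the model object from above and the column naming ("predicted") of the marginaleffects version used in this chapter:

library(marginaleffects)

week_or_weekday_predictions <-
  predictions(week_or_weekday_model) |>
  as_tibble()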
And we can then graph the probability that our model implies, for each observation, of it being a weekday (Figure 13.1).
week_or_weekday_predictions |>
  mutate(is_weekday = factor(is_weekday)) |>
  ggplot(aes(
    x = number_of_cars,
    y = predicted,
    color = is_weekday
  )) +
  geom_jitter(width = 0.01, height = 0.01, alpha = 0.3) +
  labs(
    x = "Number of cars that were seen",
    y = "Estimated probability it is a weekday",
    color = "Was actually weekday"
  ) +
  theme_classic() +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")
Figure 13.1: Logistic regression probability results with simulated data of whether it is a weekday or weekend based on the number of cars that are around
The marginal effect at each observation is of interest because it provides a sense of how this probability is changing. It enables us to say that at the mean (which in this case is when we see 50 cars) the probability of it being a weekday increases by almost five percentage points if we were to see one more car (Table 13.1).
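One way to compute this, assuming a recent marginaleffects API in which slopes() reports marginal effects (earlier versions used a differently named function):

# Marginal effect of one more car, evaluated at the mean of the predictors
slopes(week_or_weekday_model, newdata = "mean")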
Table 13.1: Marginal effect of another car on the probability that it is a weekday, at the mean

Term            Slope      Standard error
number_of_cars  0.0471838  0.0035216
We could use tidymodels to run this if we wanted. To do that, we first need to change the class of our dependent variable into a factor because this is required for classification models.
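A sketch of that workflow, assuming the simulated week_or_weekday dataset from above; an 80/20 split is what would give the 800 training observations reported in Table 13.2:

library(tidymodels)

set.seed(853)

week_or_weekday_split <-
  week_or_weekday |>
  mutate(is_weekday = as_factor(is_weekday)) |> # Classification requires a factor
  initial_split(prop = 0.80)

week_or_weekday_train <- training(week_or_weekday_split)
week_or_weekday_test <- testing(week_or_weekday_split)

week_or_weekday_tidymodels <-
  logistic_reg(mode = "classification") |>
  set_engine("glm") |>
  fit(is_weekday ~ number_of_cars, data = week_or_weekday_train)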
As before, we can make a graph of the actual results compared with our estimates. But one nice aspect of this is that we could use our test dataset to evaluate our model’s forecasting ability more thoroughly, for instance through a confusion matrix, which specifies the count of each prediction by what the truth was. We find that the model does well on the held-out dataset. There were 90 observations where the model predicted it was a weekday, and it was actually a weekday, and 95 where the model predicted it was a weekend, and it was a weekend. It was wrong for 15 observations, and these were split across seven where it forecast a weekday, but it was a weekend, and eight where it was the opposite case. Note that a predicted class label implies an operating threshold: by default, predict() classifies an observation as a weekday when the implied probability is above 0.5. This threshold is a built-in assumption that can be adjusted, and adjusting it will change metrics such as sensitivity and specificity, which we discuss later.
We might be interested in inference, and so want to build a Bayesian model using rstanarm. Again, we will specify priors for our model, in this case the defaults that rstanarm uses:
\[
\begin{aligned}
y_i|\pi_i & \sim \mbox{Bern}(\pi_i) \\
\mbox{logit}(\pi_i) & = \left(\beta_0+\beta_1 x_i\right) \\
\beta_0 & \sim \mbox{Normal}(0, 2.5)\\
\beta_1 & \sim \mbox{Normal}(0, 2.5)
\end{aligned}
\] where \(y_i\) is whether it is a weekday (coded 0 or 1), \(x_i\) is the number of cars on the road, and \(\pi_i\) is the probability that observation \(i\) is a weekday.
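A sketch of the estimation, again assuming the simulated dataset; normal(0, 2.5) matches the priors written above:

library(rstanarm)

week_or_weekday_rstanarm <-
  stan_glm(
    is_weekday ~ number_of_cars,
    data = week_or_weekday,
    family = binomial(link = "logit"),
    prior = normal(location = 0, scale = 2.5),
    prior_intercept = normal(location = 0, scale = 2.5),
    seed = 853
  )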
Table 13.2: Forecasting and explanatory models of whether it is a weekday or weekend, based on the number of cars on the road
                  base R               tidymodels           rstanarm
(Intercept)       −9.645               −9.410               −9.464
                  [−11.213, −8.286]    [−11.125, −7.945]    [−10.999, −8.165]
number_of_cars    0.190                0.185                0.186
                  [0.164, 0.220]       [0.157, 0.218]       [0.162, 0.216]
Num.Obs.          1000                 800                  1000
R2                                                          0.779
AIC               359.8                299.0
BIC               369.6                308.3
Log.Lik.          −177.875                                  −177.899
F                 174.307
ELPD                                                        −179.8
ELPD s.e.                                                   13.9
LOOIC                                                       359.6
LOOIC s.e.                                                  27.9
WAIC                                                        359.6
RMSE              0.24                 0.24                 0.24
Table 13.2 makes it clear that each of the approaches is similar in this case. They agree on the direction of the effect of seeing an extra car on the probability of it being a weekday. And even the magnitude of the effect is estimated to be similar.
13.2.1 US political support
One area where logistic regression is often used is political polling. In many cases a voter must ultimately choose one option, and so issues are reduced, whether appropriately or not, to “support” or “not support”.
As a reminder, the workflow we advocate in this book is: \(\mbox{Plan} \rightarrow \mbox{Simulate} \rightarrow \mbox{Acquire} \rightarrow \mbox{Explore} \rightarrow \mbox{Share}\). While the focus here is the exploration of data using generalized linear models, we still need to do the other aspects. We begin by planning. In this case, we are interested in US political support. In particular, we are interested in whether we can forecast who a respondent is likely to vote for, based only on knowing their highest level of education and gender. That means we are interested in a dataset with variables for who an individual voted for, and some of their characteristics, for instance, gender and education. A quick sketch of such a dataset is Figure 13.2 (a). And we would like our model to average over these points. A quick sketch is Figure 13.2 (b).
(a) Quick sketch of a dataset that could be used to examine US political support
(b) Quick sketch of what we expect from the analysis before finalizing either the data or the analysis
Figure 13.2: Sketches of the expected dataset and analysis focus and clarify our thinking even if they will be updated later
We will simulate a dataset where the chance that a person supports Biden depends on their gender and education.
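One minimal way to do this, where the variable names and the functional form of the support probability are illustrative assumptions:

set.seed(853)

num_obs <- 1000

us_political_preferences <-
  tibble(
    education = sample(0:4, size = num_obs, replace = TRUE),
    gender = sample(0:1, size = num_obs, replace = TRUE),
    # Assume support increases with education and differs by gender
    support_prob = (education + gender) / 5
  ) |>
  mutate(
    supports_biden = if_else(runif(n = num_obs) < support_prob, "yes", "no")
  ) |>
  select(-support_prob)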
We can use the 2020 Cooperative Election Study (CES) (Schaffner, Ansolabehere, and Luks 2021), which is a long-standing annual survey of US political opinion. In 2020, there were 61,000 respondents who completed the post-election survey. While the sampling methodology, detailed in Ansolabehere, Schaffner, and Luks (2021, 13), is not a perfect random sample and relies on matching, it is a widely used and accepted approach that balances sampling concerns and cost.
We can access the CES using get_dataframe_by_name() from dataverse (Kuriwaki, Beasley, and Leeper 2022), which was an approach that was introduced in Chapter 7 and Chapter 10. We save the data that are of interest to us, and then refer to that saved dataset.
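A sketch of the download; the DOI is from Schaffner, Ansolabehere, and Luks (2021), while the filename and the CES variable names for registration, presidential vote, gender, and education are assumptions about how the files are stored on the Dataverse:

library(dataverse)

ces2020_raw <-
  get_dataframe_by_name(
    filename = "CES20_Common_OUTPUT_vv.csv", # Assumed filename
    dataset = "10.7910/DVN/E9N6PH",
    server = "dataverse.harvard.edu",
    .f = readr::read_csv
  ) |>
  select(votereg, CC20_410, gender, educ) # Assumed variable names

write_csv(ces2020_raw, "ces2020.csv")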
When we look at the actual data, there are concerns that we did not anticipate in our sketches. We only want respondents who are registered to vote, and we are only interested in those that voted for either Biden or Trump. We can filter to only those respondents and then add more informative labels. Gender of “female” and “male” is what is available from the CES.
Figure 13.3: The distribution of presidential preferences, by gender, and highest education
One approach is to use tidymodels to build a prediction-focused logistic regression model in the same way as before, i.e. a validation set approach (James et al. 2021, 176). In this case, the probability will be that of voting for Biden. A sketch follows, along with the resulting confusion matrix.
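This sketch assumes the cleaned ces2020 dataset has factor variables voted_for, gender, and education; the confusion matrix below could then be produced from the test set:

set.seed(853)

ces2020_split <- initial_split(ces2020, prop = 0.80)
ces2020_train <- training(ces2020_split)
ces2020_test <- testing(ces2020_split)

ces_tidymodels <-
  logistic_reg(mode = "classification") |>
  set_engine("glm") |>
  fit(voted_for ~ gender + education, data = ces2020_train)

ces_tidymodels |>
  predict(new_data = ces2020_test) |>
  cbind(ces2020_test) |>
  conf_mat(truth = voted_for, estimate = .pred_class)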
            Truth
Prediction  Trump  Biden
  Trump       656    519
  Biden      2834   4702
When we introduced tidymodels, we discussed the importance of randomly constructing training and test sets. We use the training dataset to estimate parameters, and then evaluate the model on the test set. It is natural to ask why we should be subject to the whims of randomness and whether we are making the most of our data. For instance, what if a good model is poorly evaluated because of some random inclusion in the test set? And what if we do not have a large test set?
One commonly used resampling method that goes some way to addressing this is \(k\)-fold cross-validation. In this approach we split the dataset into \(k\) different samples, or “folds”, without replacement. We fit the model to \(k-1\) folds and evaluate it on the held-out fold, and we repeat this \(k\) times, holding out each fold once, such that every observation is used for training \(k-1\) times and for testing once. The \(k\)-fold cross-validation estimate is then the average of the \(k\) fold-level results, for instance the average mean squared error (James et al. 2021, 181).
We can use vfold_cv() from tidymodels to create, say, 10 folds.
set.seed(853)

ces2020_10_folds <- vfold_cv(ces2020, v = 10)
We then fit our model across the different combinations of folds with fit_resamples(). In this case, the model will be fit 10 times.
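A sketch of that step, assuming the same model specification as above; save_pred = TRUE retains the fold-level predictions that are used later:

ces2020_cross_validation <-
  fit_resamples(
    object = logistic_reg(mode = "classification") |> set_engine("glm"),
    preprocessor = recipe(voted_for ~ gender + education, data = ces2020),
    resamples = ces2020_10_folds,
    control = control_resamples(save_pred = TRUE),
    metrics = metric_set(accuracy, sens, spec)
  )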
We might be interested to understand the performance of our model and we can use collect_metrics() to aggregate them across the folds (Table 13.3 (a)). These types of details would typically be mentioned in passing in the main content of the paper, but included in great detail in an appendix. The average accuracy of our model across the folds is 0.61, while the average sensitivity is 0.19 and the average specificity is 0.90.
Table 13.3: Average metrics across the 10 folds of a logistic regression to forecast voter preference

(a) Key performance metrics

Metric    Mean
accuracy  0.61
sens      0.19
spec      0.90

(b) Confusion matrix

Prediction  Truth  Freq     Proportion
Trump       Trump    327.5  0.08
Trump       Biden    267.7  0.06
Biden       Trump  1,428.3  0.33
Biden       Biden  2,331.9  0.54
What does this mean? Accuracy is the proportion of observations that were correctly classified. The result of 0.61 suggests the model is doing better than a coin toss, but not much more. Sensitivity is the proportion of true observations that are identified as true (James et al. 2021, 145). In this case that means the proportion of respondents who actually voted for Trump that the model predicted voted for Trump. And specificity is the proportion of false observations that are identified as false (James et al. 2021, 145). In this case it is the proportion of voters who voted for Biden that were predicted to vote for Biden. This confirms our initial thought that the model is having trouble identifying Trump supporters.
We can see this in more detail by looking at the confusion matrix (Table 13.3 (b)). When used with resampling approaches, such as cross-validation, the confusion matrix is computed for each fold and then averaged. The model is predicting Biden much more than we might expect from our knowledge of how close the 2020 election was. It suggests that our model may need additional variables to do a better job.
Finally, it may be the case that we are interested in individual-level results, and we can add these to our dataset with collect_predictions().
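A sketch of that step, joining the fold-level predictions back onto the dataset; this assumes the cross-validation object from above was created with save_pred = TRUE:

ces2020_with_predictions <-
  ces2020 |>
  mutate(.row = row_number()) |>
  left_join(
    collect_predictions(ces2020_cross_validation) |>
      select(.row, .pred_class),
    by = ".row"
  ) |>
  select(-.row)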
For instance, we can see that the model is essentially predicting support for Biden for all individuals apart from males whose highest education is no high school, high school graduate, or a 2-year degree (Table 13.4).
Table 13.4: The model is forecasting support for Biden for all females, and for many males, regardless of education

Gender  Education             Voted for  Predicted vote  Number
Female  No HS                 Trump      Biden              206
Female  No HS                 Biden      Biden              228
Female  High school graduate  Trump      Biden            3,204
Female  High school graduate  Biden      Biden            3,028
Female  Some college          Trump      Biden            1,842
Female  Some college          Biden      Biden            3,325
Female  2-year                Trump      Biden            1,117
Female  2-year                Biden      Biden            1,739
Female  4-year                Trump      Biden            1,721
Female  4-year                Biden      Biden            4,295
Female  Post-grad             Trump      Biden              745
Female  Post-grad             Biden      Biden            2,853
Male    No HS                 Trump      Trump              132
Male    No HS                 Biden      Trump              123
Male    High school graduate  Trump      Trump            2,054
Male    High school graduate  Biden      Trump            1,528
Male    Some college          Trump      Biden            1,992
Male    Some college          Biden      Biden            2,131
Male    2-year                Trump      Trump            1,089
Male    2-year                Biden      Trump            1,026
Male    4-year                Trump      Biden            2,208
Male    4-year                Biden      Biden            3,294
Male    Post-grad             Trump      Biden            1,248
Male    Post-grad             Biden      Biden            2,426
13.3 Poisson regression
When we have count data we should initially think to take advantage of the Poisson distribution. The Poisson distribution is governed by one parameter, \(\lambda\), which distributes probabilities over the non-negative integers and hence governs the shape of the distribution. The Poisson distribution has the interesting feature that the mean is also the variance, and so as the mean increases, so does the variance. The Poisson probability mass function is (Pitman 1993, 121): \[P_{\lambda}(k) = e^{-\lambda}\lambda^k/k!\mbox{, for }k=0,1,2,\dots\] We can simulate \(n=20\) draws from the Poisson distribution with rpois(), where \(\lambda\) is equal to three.
rpois(n = 20, lambda = 3)
[1] 2 1 1 2 1 5 4 1 4 6 4 4 3 3 2 3 7 2 5 2
We can also look at what happens to the distribution as we change the value of \(\lambda\) (Figure 13.4).
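One way to build such a figure, a sketch rather than the exact code behind Figure 13.4; the particular values of \(\lambda\) are illustrative assumptions:

set.seed(853)

expand_grid(lambda = c(1, 3, 7, 15), draw_number = 1:1000) |>
  mutate(draws = rpois(n = n(), lambda = lambda)) |> # rpois recycles lambda
  ggplot(aes(x = draws)) +
  geom_histogram(binwidth = 1) +
  facet_wrap(vars(lambda)) +
  theme_minimal() +
  labs(x = "Draw", y = "Count")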
Figure 13.4: The Poisson distribution is governed by the value of the mean, which is the same as its variance
To illustrate the situation, we could simulate data about the number of A grades that are awarded in each university course. In this simulated example, we consider three departments, each of which has many courses. Each course will award a different number of A grades.
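A sketch of such a simulation, where the number of classes per department and the department-level rates are illustrative assumptions; 26 classes in each of three departments gives the 78 observations reported in Table 13.5:

set.seed(853)

count_of_A <-
  tibble(
    department = as_factor(c(rep("1", 26), rep("2", 26), rep("3", 26))),
    number_of_A = c(
      rpois(n = 26, lambda = 4),  # Department 1 awards the fewest A grades
      rpois(n = 26, lambda = 9),
      rpois(n = 26, lambda = 20)  # Department 3 awards the most
    )
  )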
count_of_A |>
  ggplot(aes(x = number_of_A)) +
  geom_histogram(aes(fill = department), position = "dodge") +
  labs(
    x = "Number of A grades awarded",
    y = "Number of classes",
    fill = "Department"
  ) +
  theme_classic() +
  scale_fill_brewer(palette = "Set1")
Figure 13.5: Simulated number of A grades in various classes across three departments
Our simulated dataset has the number of A grades awarded by courses, which are structured within departments (Figure 13.5). In Chapter 15, we will take advantage of this departmental structure, but for now we just ignore it.
The model that we are interested in estimating is:
\[
\begin{aligned}
y_i|\lambda_i &\sim \mbox{Poisson}(\lambda_i)\\
\log(\lambda_i) & = \beta_0 + \beta_1 x_i
\end{aligned}
\] where \(y_i\) is the number of A grades awarded in course \(i\), and \(x_i\) is the department that the course belongs to.
We can use glm() from base R to get a quick sense of the data. This function is quite general, and we specify Poisson regression by setting the “family” parameter. The estimates are contained in the first column of Table 13.5.
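A sketch of the call, assuming the simulated count_of_A dataset from above; the object name grades_base is hypothetical:

grades_base <- # Hypothetical object name
  glm(
    number_of_A ~ department,
    data = count_of_A,
    family = "poisson"
  )

summary(grades_base)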
As with logistic regression, the interpretation of the coefficients from Poisson regression can be difficult. The coefficient on “department2” is the difference in the logs of the expected counts between departments, which is to say the log of their ratio. We expect \(e^{0.883} \approx 2.4\) times and \(e^{1.703} \approx 5.5\) times as many A grades in departments 2 and 3, respectively, compared with department 1 (Table 13.5).
If we were interested in prediction, then we could use tidymodels to estimate Poisson models with poissonreg (Kuhn and Frick 2022) (Table 13.5).
The results of this estimation are in the second column of Table 13.5. They are similar to the estimates from glm(), but the number of observations is smaller because of the train/test split.
Finally, we could build a Bayesian model and estimate it with rstanarm (Table 13.5).
Table 13.5: Forecasting and explanatory models of the number of A grades awarded, by department
              base R          tidymodels      rstanarm
(Intercept)   1.327           1.353           1.321
              [1.122, 1.519]  [1.138, 1.554]  [1.128, 1.509]
department2   0.883           0.861           0.884
              [0.651, 1.123]  [0.611, 1.118]  [0.654, 1.117]
department3   1.703           1.719           1.706
              [1.493, 1.924]  [1.495, 1.954]  [1.499, 1.921]
Num.Obs.      78              62              78
AIC           392.5           311.6
BIC           399.6           318.0
Log.Lik.      −193.274                        −193.355
F             147.022
ELPD                                          −196.2
ELPD s.e.                                     7.7
LOOIC                                         392.4
LOOIC s.e.                                    15.4
WAIC                                          392.4
RMSE          3.41            3.35            3.41
As with logistic regression, we can use marginaleffects (Arel-Bundock 2022) to help with interpreting these results. It may be useful to consider how we expect the number of A grades to change as we go from one department to another. Table 13.6 suggests that in our dataset classes in Department 2 tend to have around five additional A grades, compared with Department 1. And that classes in Department 3 tend to have around 17 more A grades, compared with Department 1.
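One way to obtain such comparisons, assuming a recent marginaleffects API and a fitted rstanarm model object (here hypothetically named grades_rstanarm):

# Average difference in expected A grades, by department, relative to department 1
avg_comparisons(grades_rstanarm, variables = "department")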
Table 13.6: The estimated difference in the number of A grades awarded at each department

Departmental comparison  Estimate  2.5%   97.5%
2 - 1                        5.32   4.01   6.70
3 - 1                       16.92  15.10  18.84
13.3.1 Letters used in Jane Eyre
In an earlier age, Edgeworth (1885) made counts of the dactyls in Virgil’s Aeneid (Friendly (2021) makes this dataset available with HistData::Dactyl). Inspired by this we could use gutenbergr (Johnston and Robinson 2022) to get the text of Jane Eyre by Charlotte Brontë. We could then consider the first 10 lines of each chapter, count the number of words, and count the number of times either “E” or “e” appears. We are interested to see whether the number of e/E’s increases as more words are used. If not, it could suggest that the distribution of e/E’s is not consistent, which could be of interest to linguists.
Following the workflow advocated in this book, we first sketch our dataset and model. A quick sketch of what the dataset could look like is Figure 13.6 (a). And a quick sketch of our model is Figure 13.6 (b).
(a) Planned counts, by line and chapter, in Jane Eyre
(b) Planned expected relationship between count of e/E and number of words in the line
Figure 13.6: Sketches of the expected dataset and analysis force us to consider what we are interested in
We will simulate a dataset in which the number of e/E letters per line follows the Poisson distribution (Figure 13.7).
count_of_e_simulation <-
  tibble(
    chapter = c(rep(1, 10), rep(2, 10), rep(3, 10)),
    line = rep(1:10, 3),
    number_words_in_line = runif(min = 0, max = 15, n = 30) |> round(0),
    number_e = rpois(n = 30, lambda = 10)
  )

count_of_e_simulation |>
  ggplot(aes(y = number_e, x = number_words_in_line)) +
  geom_point() +
  labs(
    x = "Number of words in line",
    y = "Number of e/E letters in the line"
  ) +
  theme_classic() +
  scale_fill_brewer(palette = "Set1")
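We can then get the text itself. A sketch of the download, using the Project Gutenberg ID (1260) that appears in the output below; the mirror argument is an assumption made to keep the call reproducible:

library(gutenbergr)

jane_eyre <-
  gutenberg_download(
    gutenberg_id = 1260,
    mirror = "https://gutenberg.pglaf.org/" # Assumed mirror
  )

jane_eyre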
# A tibble: 21,001 × 2
gutenberg_id text
<int> <chr>
1 1260 JANE EYRE
2 1260 AN AUTOBIOGRAPHY
3 1260 <NA>
4 1260 by Charlotte Brontë
5 1260 <NA>
6 1260 _ILLUSTRATED BY F. H. TOWNSEND_
7 1260 <NA>
8 1260 London
9 1260 SERVICE & PATON
10 1260 5 HENRIETTA STREET
# … with 20,991 more rows
We are interested in only those lines that have content, so we remove those empty lines that are just there for spacing. Then we can create counts of the number of times “e” or “E” occurs in that line, for the first 10 lines of each chapter. For instance, we can look at the first few lines and see that there are 5 e/E’s in the first line and 8 in the second.
jane_eyre_reduced <-
  jane_eyre |>
  filter(!is.na(text)) |> # Remove empty lines
  mutate(chapter = if_else(
    str_detect(text, "CHAPTER") == TRUE,
    text,
    NA_character_
  )) |> # Find start of chapter
  fill(chapter, .direction = "down") |> # Add chapter number to each line
  group_by(chapter) |>
  mutate(chapter_line = row_number()) |> # Add line number within each chapter
  filter(
    !is.na(chapter),
    chapter_line %in% c(2:11)
  ) |> # Start at 2 to get rid of text "CHAPTER I" etc
  select(text, chapter) |>
  mutate(
    chapter = str_remove(chapter, "CHAPTER "),
    chapter = str_remove(chapter, "—CONCLUSION"),
    chapter = as.integer(as.roman(chapter))
  ) # Change chapters to integers

jane_eyre_reduced <-
  jane_eyre_reduced |>
  mutate(count_e = str_count(text, "e|E")) |>
  # Based on: https://stackoverflow.com/a/38058033
  mutate(word_count = str_count(text, "\\w+"))
# A tibble: 6 × 4
# Groups: chapter [1]
chapter word_count count_e text
<int> <int> <int> <chr>
1 1 13 5 There was no possibility of taking a walk that day…
2 1 11 8 wandering, indeed, in the leafless shrubbery an ho…
3 1 12 9 but since dinner (Mrs. Reed, when there was no com…
4 1 14 3 the cold winter wind had brought with it clouds so…
5 1 11 7 so penetrating, that further outdoor exercise was …
6 1 1 1 question.
We can verify that the mean and variance of the number of e/E’s are roughly similar by plotting all of the data (Figure 13.8). The mean, in pink, is 6.7, and the variance, in blue, is 6.2. While they are not entirely the same, they are similar. We include the diagonal in Figure 13.8 (b) to help with thinking about the data. If the data were on the \(y=x\) line, then on average there would be one e/E per word. As the data are generally below this line, on average there is less than one e/E per word.
mean_e <- mean(jane_eyre_reduced$count_e)
variance_e <- var(jane_eyre_reduced$count_e)

jane_eyre_reduced |>
  ggplot(aes(x = count_e)) +
  geom_histogram() +
  geom_vline(xintercept = mean_e, linetype = "dashed", color = "#C64191") +
  geom_vline(xintercept = variance_e, linetype = "dashed", color = "#0ABAB5") +
  theme_minimal() +
  labs(
    y = "Count",
    x = "Number of e's per line for first ten lines"
  )

jane_eyre_reduced |>
  ggplot(aes(x = word_count, y = count_e)) +
  geom_jitter(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  theme_minimal() +
  labs(
    x = "Number of words in the line",
    y = "Number of e/E's in the line"
  )
(a) Distribution of the number of e/E’s
(b) Comparison of the number of e/E’s in the line and the number of words in the line
Figure 13.8: Number of ‘E’ or ‘e’ in the first ten lines of each chapter in Jane Eyre
We could consider the following model:
\[
\begin{aligned}
y_i|\lambda_i &\sim \mbox{Poisson}(\lambda_i)\\
\log(\lambda_i) & = \beta_0 + \beta_1 x_i\\
\beta_0 & \sim \mbox{Normal}(0, 2.5)\\
\beta_1 & \sim \mbox{Normal}(0, 2.5)
\end{aligned}
\] where \(y_i\) is the number of e/E’s in the line and \(x_i\) is the number of words in the line. We could estimate the parameters using stan_glm().
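A sketch of that estimation; the object name jane_e_counts matches its use in the plotting code below:

jane_e_counts <-
  stan_glm(
    count_e ~ word_count,
    data = jane_eyre_reduced,
    family = poisson(link = "log"),
    prior = normal(location = 0, scale = 2.5),
    prior_intercept = normal(location = 0, scale = 2.5),
    seed = 853
  )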
While we would normally be interested in the table of estimates, we have seen a few of those now, so rather than creating another we introduce plot_cap() from marginaleffects (Arel-Bundock 2022). We can use this to show the number of e/E’s predicted by the model, for each line, based on the number of words in that line. Figure 13.9 makes it clear that we expect a positive relationship.
plot_cap(jane_e_counts, condition = "word_count") +
  labs(
    x = "Number of words",
    y = "Average number of e/E in the first 10 lines"
  )
Figure 13.9: The predicted number of e/E’s in each line based on the number of words
13.4 Negative binomial regression
One of the major restrictions with Poisson regression is the assumption that the mean and the variance are the same. We can relax this assumption to allow over-dispersion and can use a close variant, negative binomial regression.
Poisson and negative binomial models go hand in hand. It is often the case that we will end up fitting both, and then comparing them. For instance, Maher (1982) considers both in the context of results from the English Football League and discusses situations in which one may be considered more appropriate than the other. Similarly, Smith (2002) considers the 2000 US presidential election and especially the issue of overdispersion in a Poisson analysis. And Osgood (2000) compares them in the case of crime data.
13.4.1 Mortality in Alberta
Consider, somewhat morbidly, that every year each individual either dies or does not. From the perspective of a geographic area, we could gather data on the number of people who died each year, by their cause of death. The Canadian province of Alberta makes available the number of deaths, by cause, since 2001, for the top 30 causes each year.
As always we first sketch our dataset and model. A quick sketch of what the dataset could look like is Figure 13.10 (a). And a quick sketch of our model is Figure 13.10 (b).
(a) Quick sketch of a dataset that could be used to examine cause of death in Alberta
(b) Quick sketch of what we expect from the analysis of cause of death in Alberta before finalizing either the data or the analysis
Figure 13.10: Sketches of the expected dataset and analysis for cause of death in Alberta
We will simulate a dataset of cause of death distributed following the negative binomial distribution.
alberta_death_simulation <-
  tibble(
    cause = rep(x = c("Heart", "Stroke", "Diabetes"), times = 10),
    year = rep(x = 2016:2018, times = 10),
    deaths = rnbinom(n = 30, size = 20, prob = 0.1)
  )

alberta_death_simulation
We download that data, and save it as “alberta_COD.csv”. We can look at the distribution of these deaths, by year and cause (Figure 13.11). We have truncated the full cause of death because some are quite long. As some causes are not always in the top 30 each year, not all causes have the same number of occurrences.
If we were to look at the top ten causes in 2021, we would notice a variety of interesting aspects (Table 13.7). For instance, we might expect that the most common causes would appear in all 21 years of our data. But we notice that the most common cause in 2021, “Other ill-defined and unknown causes of mortality”, appears in only three years. Naturally, “COVID-19, virus identified” appears in only two years, as there were no COVID deaths in Canada before 2020.
Table 13.7: Top ten causes of death in Alberta in 2021

Year  Cause                           Ranking  Total deaths  Number of years
2021  Other ill-defined and unkno...        1         3,362                3
2021  Organic dementia                      2         2,135               21
2021  COVID-19, virus identified            3         1,950                2
2021  All other forms of chronic ...        4         1,939               21
2021  Malignant neoplasms of trac...        5         1,552               21
2021  Acute myocardial infarction           6         1,075               21
2021  Other chronic obstructive p...        7         1,028               21
2021  Diabetes mellitus                     8           728               21
2021  Stroke, not specified as he...        9           612               21
2021  Accidental poisoning by and...       10           604                9
In this case, for simplicity, we will restrict ourselves to the five most common causes of death in 2021, that have been present every year.
alberta_cod_top_five <-
  alberta_cod |>
  filter(
    calendar_year == 2021,
    n == 21
  ) |>
  slice_max(order_by = desc(ranking), n = 5) |>
  pull(cause)

alberta_cod <-
  alberta_cod |>
  filter(cause %in% alberta_cod_top_five)
alberta_cod |>
  ggplot(aes(x = calendar_year, y = total_deaths, color = cause)) +
  geom_line(alpha = 0.5) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1") +
  labs(
    x = "Year",
    y = "Annual number of deaths in Alberta",
    color = "Cause"
  )
Figure 13.11: Annual number of deaths for the top 5 causes in 2021, since 2001, for Alberta, Canada
One thing that we notice is that the mean, 1,273, is different to the variance, 182,378 (Table 13.8).
Table 13.8: Summary statistics of the number of yearly deaths, by cause, in Alberta

              Min  Mean  Max   SD   Var     N
total_deaths  280  1273  2135  427  182378  105
We can implement negative binomial regression when using stan_glm() by specifying that distribution in “family”. In this case, we run both Poisson and negative binomial.
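A sketch of both estimations; the object names match the model-summary code below, and neg_binomial_2 is the negative binomial family that rstanarm provides:

cause_of_death_alberta_poisson <-
  stan_glm(
    total_deaths ~ cause,
    data = alberta_cod,
    family = poisson(link = "log"),
    seed = 853
  )

cause_of_death_alberta_neg_binomial <-
  stan_glm(
    total_deaths ~ cause,
    data = alberta_cod,
    family = neg_binomial_2(link = "log"),
    seed = 853
  )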
We can compare our different models (Table 13.9). We need to load broom.mixed (Bolker and Robinson 2022) because it is required under the hood.
library(broom.mixed)

coef_short_names <-
  c(
    "causeAll other forms of chronic ischemic heart disease" = "causeAll other forms of...",
    "causeMalignant neoplasms of trachea, bronchus and lung" = "causeMalignant neoplas...",
    "causeOrganic dementia" = "causeOrganic dementia",
    "causeOther chronic obstructive pulmonary disease" = "causeOther chronic obst..."
  )

modelsummary::modelsummary(
  list(
    "Poisson" = cause_of_death_alberta_poisson,
    "Negative binomial" = cause_of_death_alberta_neg_binomial
  ),
  statistic = "conf.int",
  coef_map = coef_short_names
)
Table 13.9: Modelling the most prevalent causes of death in Alberta, 2001-2021
                            Poisson           Negative binomial
causeAll other forms of...  0.442             0.439
                            [0.426, 0.458]    [0.231, 0.636]
causeMalignant neoplas...   0.224             0.226
                            [0.207, 0.241]    [0.019, 0.436]
causeOrganic dementia       0.001             0.002
                            [−0.016, 0.019]   [−0.203, 0.206]
causeOther chronic obst...  −0.214            −0.217
                            [−0.233, −0.195]  [−0.422, −0.008]
Num.Obs.                    105               105
Log.Lik.                    −5139.564         −771.297
ELPD                        −5312.5           −775.8
ELPD s.e.                   1099.1            10.0
LOOIC                       10625.0           1551.6
LOOIC s.e.                  2198.3            20.0
WAIC                        10793.7           1551.6
RMSE                        308.24            308.26
We could use posterior predictive checks to more clearly show that the negative binomial approach is a better choice for this circumstance (Figure 13.12).
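A sketch of those checks, using pp_check() from rstanarm on each of the fitted models:

pp_check(cause_of_death_alberta_poisson) +
  theme(legend.position = "bottom")

pp_check(cause_of_death_alberta_neg_binomial) +
  theme(legend.position = "bottom")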
Figure 13.12: Comparing posterior prediction checks for Poisson and negative binomial models
Finally, we can compare between the models using the resampling method leave-one-out (LOO) cross-validation. This is a variant of cross-validation, introduced earlier, where the size of each fold is one. That is to say, if there was a dataset with 100 observations, LOO is equivalent to 100-fold cross-validation. Strictly speaking, loo() does not actually carry out exact LOO-CV, because that would be too computationally intensive; instead it computes an approximation, which provides the expected log pointwise predictive density (ELPD), where higher is better. We can implement this in rstanarm with loo() for each model, and then compare between them with loo_compare(), as sketched below.
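A sketch of the comparison, using the model objects from above:

poisson_loo <- loo(cause_of_death_alberta_poisson, cores = 2)
neg_binomial_loo <- loo(cause_of_death_alberta_neg_binomial, cores = 2)

loo_compare(poisson_loo, neg_binomial_loo)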
In this case we find that the negative binomial model is a better fit than the Poisson, because ELPD is larger.
We have covered a variety of diagnostic aspects for Bayesian models. While it is difficult to be definitive about what is “enough” because it is context specific, the following checklist would be enough for most purposes. These would go into an appendix that was mentioned and cross-referenced in the model section of a paper:
Prior predictive checks.
Trace plots.
Rhat plots.
Posterior distributions.
Posterior predictive checks.
13.5 Exercises
Scales
(Plan) Consider the following scenario: A person is interested in the number of deaths, attributed to cancer, in Sydney. They collect data from the five largest hospitals. Please sketch out what that dataset could look like and then sketch a graph that you could build to show all observations.
(Simulate) Please further consider the scenario described and simulate the situation, along with three independent variables that are associated with the number of deaths, by cause. Please include at least ten tests based on the simulated data.
(Acquire) Please describe one possible source of such a dataset.
(Explore) Please use ggplot2 to build the graph that you sketched.
(Communicate) Please write two paragraphs about what you did.
Questions
When should we use logistic regression (pick one)?
Continuous dependent variable.
Binary dependent variable.
Count dependent variable.
We are interested in studying how voting intentions in the recent US presidential election vary by an individual’s income. We set up a logistic regression model to study this relationship. In this study, one possible dependent variable would be (pick one)?
Whether the respondent is a US citizen (yes/no)
The respondent’s personal income (high/low)
Whether the respondent is going to vote for Trump (yes/no)
Who the respondent voted for in 2016 (Trump/Clinton)
We are interested in studying how voting intentions in the recent US presidential election vary by an individual’s income. We set up a logistic regression model to study this relationship. In this study, one possible dependent variable would be (pick one)?
The race of the respondent (white/not white)
The respondent’s marital status (married/not)
Whether the respondent is registered to vote (yes/no)
Whether the respondent is going to vote for Biden (yes/no)
The mean of a Poisson distribution is equal to its?
Median.
Standard deviation.
Variance.
Please redo the tidymodels example of US elections but include additional variables. Which variable did you choose, and how did the performance of the model improve?
Please create the graph of the density of the Poisson distribution when \(\lambda = 75\).
From Gelman, Hill, and Vehtari (2020), what is the offset in Poisson regression?
Redo the Jane Eyre example, but for “A/a”.
Tutorial
Please consider Maher (1982) or Smith (2002). Build a simplified version of their model. Obtain some recent relevant data, estimate the model, and discuss your choice between Poisson and negative binomial regression.
Paper
At about this point the Murrumbidgee Paper from Appendix D would be appropriate.
Ansolabehere, Stephen, Brian Schaffner, and Sam Luks. 2021. “Guide to the 2020 Cooperative Election Study.” https://doi.org/10.7910/DVN/E9N6PH.
Arel-Bundock, Vincent. 2021. modelsummary: Summary Tables and Plots for Statistical Models and Data: Beautiful, Customizable, and Publication-Ready. https://CRAN.R-project.org/package=modelsummary.
Bolton, Ruth, and Randall Chapman. 1986. “Searching for Positive Returns at the Track.” Management Science 32 (August): 1040–60. https://doi.org/10.1287/mnsc.32.8.1040.
Edgeworth, Francis Ysidro. 1885. “Methods of Statistics.” Journal of the Statistical Society of London, 181–217.
Frick, Hannah, Fanny Chow, Max Kuhn, Michael Mahoney, Julia Silge, and Hadley Wickham. 2022. rsample: General Resampling Infrastructure. https://CRAN.R-project.org/package=rsample.
Goodrich, Ben, Jonah Gabry, Imad Ali, and Sam Brilleman. 2020. “rstanarm: Bayesian Applied Regression Modeling via Stan.” https://mc-stan.org/rstanarm.
Hastie, Trevor, and Robert Tibshirani. 1990. Generalized Additive Models. 1st ed. Boca Raton: Chapman & Hall/CRC.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2021. An Introduction to Statistical Learning with Applications in R. 2nd ed. Springer. https://www.statlearning.com.
Kuhn, Max, and Hadley Wickham. 2020. tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles. https://www.tidymodels.org.
Nelder, John, and Robert Wedderburn. 1972. “Generalized Linear Models.” Journal of the Royal Statistical Society: Series A (General) 135 (3): 370–84. https://doi.org/10.2307/2344614.
Osgood, D. Wayne. 2000. “Poisson-Based Regression Analysis of Aggregate Crime Rates.” Journal of Quantitative Criminology 16 (1): 21–43. https://doi.org/10.1023/a:1007521427059.
R Core Team. 2022. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Schaffner, Brian, Stephen Ansolabehere, and Sam Luks. 2021. “Cooperative Election Study Common Content, 2020.” Harvard Dataverse. https://doi.org/10.7910/DVN/E9N6PH.
Smith, Richard. 2002. “A Statistical Assessment of Buchanan’s Vote in Palm Beach County.” Statistical Science 17 (4): 441–57. https://doi.org/10.1214/ss/1049993203.
Wang, Wei, David Rothschild, Sharad Goel, and Andrew Gelman. 2015. “Forecasting Elections with Non-Representative Polls.” International Journal of Forecasting 31 (3): 980–91. https://doi.org/10.1016/j.ijforecast.2014.06.001.