13  Generalized linear models

Required material

Key concepts and skills

Key packages and functions

13.1 Introduction

Linear models, covered in Chapter 12, have evolved substantially over the past century. Francis Galton, mentioned in Chapter 8, and others of his generation, some of whom were eugenicists, used linear regression in earnest in the late 1800s and early 1900s. Binary outcomes quickly became of interest and needed special treatment, leading to the development and wide adaption of logistic regression and similar methods in the mid-1900s (Cramer 2003). The generalized linear model framework came into being, in a formal sense, in the 1970s with Nelder and Wedderburn (1972). Generalized linear models (GLMs) broaden the types of outcomes that are allowed. We still model outcomes as a linear function, but we are less constrained. The outcome can be anything in the exponential family, and popular choices include the logistic distribution and the Poisson distribution. A further generalization of GLMs is generalized additive models (GAMs) where we broaden the structure of the explanatory side. We still explain the dependent variable as an additive function of various bits and pieces, but those bits and pieces can be functions. This framework was proposed in the 1990s by Hastie and Tibshirani (1990).

Shoulders of giants

Dr Robert Tibshirani is Professor in the Departments of Statistics and Biomedical Data Science at Stanford University. After earning a PhD in Statistics from Stanford University in 1981, he joined the University of Toronto as an assistant professor. He was promoted to full professor in 1994 and moved to Stanford in 1998. He made fundamental contributions including GAMs, mentioned above, and lasso regression, which is a way of automated variable selection and will be covered in Chapter 16. He is an author of James et al. (2021). He was awarded the COPSS Presidents’ Award in 1996 and was appointed a Fellow of the Royal Society in 2019.

In this chapter we consider logistic, Poisson, and negative binomial regression.

13.2 Logistic regression

Linear regression is a useful way to better understand our data. But it assumes a continuous outcome variable that can take any number on the real line. We would like some way to use this same machinery when we cannot satisfy this condition. We turn to logistic and Poisson regression for binary and count outcome variables, respectively, noting they are still linear models, because the independent variables enter in a linear fashion.

Logistic regression, and its close variants, are useful in a variety of settings, from elections (Wang et al. 2015) through to horse racing (Chellel 2018; Bolton and Chapman 1986). We use logistic regression when the dependent variable is a binary outcome, such as 0 or 1, or “yes” or “no”. Although the presence of a binary outcome variable may sound limiting, there are a lot of circumstances in which the outcome either naturally falls into this situation or can be adjusted into it.

The foundation of this is the Bernoulli distribution where there is a certain probability, \(p\), of outcome “1” and the remainder, \(1-p\), for outcome “0”. We can use rbernoulli() to simulate data from the Bernoulli distribution.



bernoulli_example <-
  tibble(draws = rbernoulli(n = 20, p = 0.1))

bernoulli_example |> pull(draws)

One reason to use logistic regression is that we will be modelling a probability and so it will be bounded between 0 and 1. Whereas with linear regression we may end up with values outside this. The foundation of logistic regression is the logit function:

\[ \mbox{logit}(x) = \log\left(\frac{x}{1-x}\right). \] This will transpose values between 0 and 1 onto the real line. For instance, logit(0.1) = -2.2, logit(0.5) = 0, and logit(0.9) = 2.2.

To illustrate logistic regression, we will simulate data on whether it is a weekday or weekend, based on the number of cars on the road. We will assume that on weekdays the road is busier.


week_or_weekday <-
    number_of_cars = runif(n = 1000, min = 0, 100),
    noise = rnorm(n = 1000, mean = 0, sd = 10),
    is_weekday = if_else(number_of_cars + noise > 50, 1, 0)
  ) |>
  mutate(number_of_cars = round(number_of_cars)) |>

# A tibble: 1,000 × 2
   number_of_cars is_weekday
            <dbl>      <dbl>
 1             36          0
 2             12          0
 3             48          1
 4             32          0
 5              4          0
 6             40          0
 7             13          0
 8             24          0
 9             16          0
10             19          0
# … with 990 more rows

We can use glm() from base R to do a quick estimation. In this case we will try to work out whether it is a weekday or weekend, based on the number of cars we can see. We are interested in estimating Equation 13.1:1 \[ \mbox{Pr}(y_i=1) = \mbox{logit}^{-1}\left(\beta_0+\beta_1 x_i\right) \tag{13.1}\] where \(y_i\) is whether it is a weekday and \(x_i\) is the number of cars on the road.

week_or_weekday_model <-
    is_weekday ~ number_of_cars,
    data = week_or_weekday,
    family = "binomial"

The results of the estimation are in the first column of Table 13.2. The estimated coefficient on the number of cars is 0.19.

The interpretation of the logistic regression’s coefficients is more complicated than linear regression as they relate to changes in the log-odds of the binary outcome. For instance, the estimate of 0.19 is the average change in the log odds of it being a weekday with observing one extra car on the road. The coefficient is positive which means an increase, but as it is non-linear, if we want to specify a particular change, then this will be different for different baseline levels of the observation (e.g. an increase of 0.19 log odds has a larger impact when the baseline log odds are 0, compared to 2).

We can translate our estimate into the probability of it being a weekday, for a given number of cars. predictions() from marginaleffects (Arel-Bundock 2022) can be used to add the implied probability that it is a weekday for each observation.


week_or_weekday_predictions <-
  predictions(week_or_weekday_model) |>

# A tibble: 1,000 × 10
   rowid type   predi…¹ std.e…² stati…³  p.value conf.…⁴ conf.…⁵ is_we…⁶ numbe…⁷
   <int> <chr>    <dbl>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1     1 respo… 5.66e-2 1.36e-2    4.17 3.01e- 5 3.52e-2 8.98e-2       0      36
 2     2 respo… 6.31e-4 3.63e-4    1.74 8.23e- 2 2.04e-4 1.95e-3       0      12
 3     3 respo… 3.69e-1 3.31e-2   11.1  7.49e-29 3.07e-1 4.36e-1       1      48
 4     4 respo… 2.73e-2 8.08e-3    3.38 7.25e- 4 1.52e-2 4.85e-2       0      32
 5     5 respo… 1.38e-4 9.52e-5    1.45 1.46e- 1 3.59e-5 5.33e-4       0       4
 6     6 respo… 1.14e-1 2.09e-2    5.43 5.59e- 8 7.86e-2 1.61e-1       0      40
 7     7 respo… 7.63e-4 4.28e-4    1.78 7.49e- 2 2.54e-4 2.29e-3       0      13
 8     8 respo… 6.12e-3 2.49e-3    2.45 1.42e- 2 2.75e-3 1.36e-2       0      24
 9     9 respo… 1.35e-3 7.00e-4    1.93 5.42e- 2 4.86e-4 3.72e-3       0      16
10    10 respo… 2.38e-3 1.14e-3    2.09 3.63e- 2 9.32e-4 6.05e-3       0      19
# … with 990 more rows, and abbreviated variable names ¹​predicted, ²​std.error,
#   ³​statistic, ⁴​conf.low, ⁵​conf.high, ⁶​is_weekday, ⁷​number_of_cars

And we can then graph the probability that our model implies, for each observation, of it being a weekday (Figure 13.1).

week_or_weekday_predictions |>
  mutate(is_weekday = factor(is_weekday)) |>
    x = number_of_cars,
    y = predicted,
    color = is_weekday
  )) +
  geom_jitter(width = 0.01, height = 0.01, alpha = 0.3) +
    x = "Number of cars that were seen",
    y = "Estimated probability it is a weekday",
    color = "Was actually weekday"
  ) +
  theme_classic() +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")

Figure 13.1: Logistic regression probability results with simulated data of whether it is a weekday or weekend based on the number of cars that are around

The marginal effect at each observation is of interest because it provides a sense of how this probability is changing. It enables us to say that at the mean (which in this case is if we were to see 50 cars) the probability of it being a weekday, increases by almost five per cent if we were to see another one (Table 13.1).

marginaleffects(week_or_weekday_model, newdata = "mean") |>
  select(term, dydx, std.error) |>
    col.names = c("Term", "Slope", "Standard error")
Table 13.1: Marginal effect of another car on the probability that it is a weekday, at the mean
Term Slope Standard error
number_of_cars 0.0471838 0.0035216

We could use tidymodels to run this if we wanted. To do that, we first need to change the class of our dependent variable into a factor because this is required for classification models.



week_or_weekday <-
  week_or_weekday |>
  mutate(is_weekday = as_factor(is_weekday))

week_or_weekday_split <- initial_split(week_or_weekday, prop = 0.80)
week_or_weekday_train <- training(week_or_weekday_split)
week_or_weekday_test <- testing(week_or_weekday_split)

week_or_weekday_tidymodels <-
  logistic_reg(mode = "classification") |>
  set_engine("glm") |>
    is_weekday ~ number_of_cars,
    data = week_or_weekday_train

As before, we can make a graph of the actual results compared with our estimates. But one nice aspect of this is that we could use our test dataset to evaluate our model’s forecasting ability more thoroughly, for instance through a confusion matrix, which specifies the count of each prediction by what the truth was. We find that the model does well on the held-out dataset. There were 90 observations where the model predicted it was a weekday, and it was actually a weekday, and 95 where the model predicted it was a weekend, and it was a weekend. It was wrong for 15 observations, and these were split across seven where it forecast a weekday, but it was a weekend, and eight where it was the opposite case.2

week_or_weekday_tidymodels |>
  predict(new_data = week_or_weekday_test) |>
  cbind(week_or_weekday_test) |>
  conf_mat(truth = is_weekday, estimate = .pred_class)
Prediction  0  1
         0 90  8
         1  7 95

We might be interested in inference, and so want to build a Bayesian model using rstanarm. Again, we will specify priors for our model, but use the default priors that rstanarm uses:

\[ \begin{aligned} y_i|\pi_i & \sim \mbox{Bern}(\pi_i) \\ \mbox{logit}(\pi_i) & = \left(\beta_0+\beta_1 x_i\right) \\ \beta_0 & \sim \mbox{Normal}(0, 2.5)\\ \beta_1 & \sim \mbox{Normal}(0, 2.5) \end{aligned} \] where \(y_i\) is whether it is a weekday (actually 0 or 1), \(x_i\) is the number of cars on the road, and \(\pi_i\) is the probability that observations \(i\) is a weekday.


week_or_weekday_rstanarm <-
    is_weekday ~ number_of_cars,
    data = week_or_weekday,
    family = binomial(link = "logit"),
    prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_intercept = normal(location = 0, scale = 2.5, autoscale = TRUE),
    seed = 853

  file = "week_or_weekday_rstanarm.rds"

We can compare our different models (Table 13.2).

    "base R" = week_or_weekday_model,
    "tidymodels" = week_or_weekday_tidymodels,
    "rstanarm" = week_or_weekday_rstanarm
  statistic = "conf.int"
Table 13.2: Forecasting and explanatory models of whether it is day or night, based on the number of cars on the road
base R tidymodels rstanarm
(Intercept) −9.645 −9.410 −9.464
[−11.213, −8.286] [−11.125, −7.945] [−10.999, −8.165]
number_of_cars 0.190 0.185 0.186
[0.164, 0.220] [0.157, 0.218] [0.162, 0.216]
Num.Obs. 1000 800 1000
R2 0.779
AIC 359.8 299.0
BIC 369.6 308.3
Log.Lik. −177.875 −177.899
F 174.307
ELPD −179.8
ELPD s.e. 13.9
LOOIC 359.6
LOOIC s.e. 27.9
WAIC 359.6
RMSE 0.24 0.24 0.24

Table 13.2 makes it clear that each of the approaches is similar in this case. They agree on the direction of the effect of seeing an extra car on the probability of it being a weekday. And even the magnitude of the effect is estimated to be similar.

13.2.1 US political support

One area where logistic regression is often used is political polling. In many cases voting implies the need for one preference ranking, and so issues are reduced, whether appropriately or not, to “support” or “not support”.

As a reminder, the workflow we advocate in this book: \(\mbox{Plan} \rightarrow \mbox{Simulate} \rightarrow \mbox{Acquire} \rightarrow \mbox{Explore} \rightarrow \mbox{Share}\). While the focus here is the exploration of data using linear models, we still need to do the other aspects. We begin by planning. In this case, we are interested in US political support. In particular we are interested in whether we can forecast who a respondent is likely to vote for, based only on knowing their highest level of education and gender. That means we are interested in a dataset with variables for who an individual voted for, and some of their characteristics, for instance, gender and education. A quick sketch of such a dataset is Figure 13.2 (a). And we would like our model to average over these points. A quick sketch is Figure 13.2 (b).

(a) Quick sketch of a dataset that could be used to examine US political support

(b) Quick sketch of what we expect from the analysis before finalizing either the data of the analysis

Figure 13.2: Sketches of the expected dataset and analysis focus and clarify our thinking even if they will be updated later

We will simulate a dataset where the chance that a person supports Biden depends on their gender and education.

number_of_observations <- 1000

us_political_preferences <-
    education = sample(1:5, size = number_of_observations, replace = TRUE),
    gender = sample(1:2, size = number_of_observations, replace = TRUE),
    support_prob = 1 - 1 / (education + gender)
  ) |>
  rowwise() |>
  mutate(supports_biden = sample(
    c("yes", "no"),
    size = 1,
    replace = TRUE,
    prob = c(support_prob, 1 - support_prob)
  )) |>
  ungroup() |>
  select(-support_prob, supports_biden, gender, education)

# A tibble: 1,000 × 3
   education gender supports_biden
       <int>  <int> <chr>         
 1         4      2 yes           
 2         1      1 yes           
 3         5      2 yes           
 4         2      1 yes           
 5         3      2 yes           
 6         2      1 yes           
 7         5      2 yes           
 8         4      2 yes           
 9         1      2 yes           
10         4      1 no            
# … with 990 more rows

We can use the 2020 Cooperative Election Study (CES) (Schaffner, Ansolabehere, and Luks 2021), which is a long-standing annual survey of US political opinion. In 2020, there were 61,000 respondents who completed the post-election survey. While the sampling methodology, detailed in Ansolabehere, Schaffner, and Luks (2021, 13), is not a perfect random sample and relies on matching, it is widely used and accepted approach that balances sampling concerns and cost.

We can access the CES using get_dataframe_by_name() from dataverse (Kuriwaki, Beasley, and Leeper 2022), which was an approach that was introduced in Chapter 7 and Chapter 10. We save the data that are of interest to us, and then refer to that saved dataset.


ces2020 <-
    filename = "CES20_Common_OUTPUT_vv.csv",
    dataset = "10.7910/DVN/E9N6PH",
    server = "dataverse.harvard.edu",
    .f = readr::read_csv
  ) |>
  select(votereg, CC20_410, gender, educ)

write_csv(ces2020, "ces2020.csv")
ces2020 <-
    col_types =
        "votereg" = col_integer(),
        "CC20_410" = col_integer(),
        "gender" = col_integer(),
        "educ" = col_integer()

# A tibble: 61,000 × 4
   votereg CC20_410 gender  educ
     <int>    <int>  <int> <int>
 1       1        2      1     4
 2       2       NA      2     6
 3       1        1      2     5
 4       1        1      2     5
 5       1        4      1     5
 6       1        2      1     3
 7       2       NA      1     3
 8       1        2      2     3
 9       1        2      2     2
10       1        1      2     5
# … with 60,990 more rows

When we look at the actual data, there are concerns that we did not anticipate in our sketches. We only want respondents who are registered to vote, and we are only interested in those that voted for either Biden or Trump. We can filter to only those respondents and then add more informative labels. Gender of “female” and “male” is what is available from the CES.

ces2020 <-
  ces2020 |>
    votereg == 1,
    CC20_410 %in% c(1, 2)
  ) |>
    voted_for = if_else(CC20_410 == 1, "Biden", "Trump"),
    voted_for = as_factor(voted_for),
    gender = if_else(gender == 1, "Male", "Female"),
    education = case_when(
      educ == 1 ~ "No HS",
      educ == 2 ~ "High school graduate",
      educ == 3 ~ "Some college",
      educ == 4 ~ "2-year",
      educ == 5 ~ "4-year",
      educ == 6 ~ "Post-grad"
    education = factor(
      levels = c(
        "No HS",
        "High school graduate",
        "Some college",
  ) |>
  select(voted_for, gender, education)

In the end we are left with 43,554 respondents (Figure 13.3).

ces2020 |>
  ggplot(aes(x = education, fill = voted_for)) +
  stat_count(position = "dodge") +
  facet_wrap(facets = vars(gender)) +
  theme_minimal() +
    x = "Highest education",
    y = "Number of respondents",
    fill = "Voted for"
  ) +
  coord_flip() +
  scale_fill_brewer(palette = "Set1") +
  theme(legend.position = "bottom")

Figure 13.3: The distribution of presidential preferences, by gender, and highest education

One approach is to use tidymodels to build a prediction-focused logistic regression model in the same way as before i.e. a validation set approach (James et al. 2021, 176). In this case, the probability will be that of voting for Biden.


ces2020_split <- initial_split(ces2020, prop = 0.80)
ces2020_train <- training(ces2020_split)
ces2020_test <- testing(ces2020_split)

ces_tidymodels <-
  logistic_reg(mode = "classification") |>
  set_engine("glm") |>
    voted_for ~ gender + education,
    data = ces2020_train

parsnip model object

Call:  stats::glm(formula = voted_for ~ gender + education, family = stats::binomial, 
    data = data)

                  (Intercept)                     genderMale  
                       0.2157                        -0.4697  
educationHigh school graduate          educationSome college  
                      -0.1857                         0.3502  
              education2-year                education4-year  
                       0.2311                         0.6700  

Degrees of Freedom: 34842 Total (i.e. Null);  34836 Residual
Null Deviance:      47000 
Residual Deviance: 45430    AIC: 45440

And then evaluate it on the test set. It appears as though the model is having difficulty identifying Trump supporters.

ces_tidymodels |>
  predict(new_data = ces2020_test) |>
  cbind(ces2020_test) |>
  conf_mat(truth = voted_for, estimate = .pred_class)
Prediction Trump Biden
     Trump   656   519
     Biden  2834  4702

When we introduced tidymodels, we discussed the importance of randomly constructing training and test sets. We use the training dataset to estimate parameters, and then evaluate the model on the test set. It is natural to ask why we should be subject to the whims of randomness and whether we are making the most of our data. For instance, what if a good model is poorly evaluated because of some random inclusion in the test set? And what if we do not have a large test set?

One commonly used resampling method that goes some way to addressing this is \(k\)-fold cross-validation. In this approach we create \(k\) different samples, or “folds”, from the dataset without replacement. We then fit the model to the first \(k-1\) folds, and then evaluate it on the last fold. We do this \(k\) times, once for every fold, such that every observation will be used for training \(k-1\) times and for testing once. The \(k\)-fold cross-validation estimate is then the average mean squared error (James et al. 2021, 181).

We can use vfold_cv() from tidymodels to create, say, 10 folds.


ces2020_10_folds <- vfold_cv(ces2020, v = 10)

We then fit our model across the different combinations of folds with fit_resamples(). In this case, the model will be fit 10 times.

ces2020_cross_validation <-
    object = logistic_reg(mode = "classification") |> set_engine("glm"),
    preprocessor = recipe(
      voted_for ~ gender + education,
      data = ces2020
    resamples = ces2020_10_folds,
    metrics = metric_set(accuracy, sens, spec),
    control = control_resamples(save_pred = TRUE)

We might be interested to understand the performance of our model and we can use collect_metrics() to aggregate them across the folds (Table 13.3 (a)). These types of details would typically be mentioned in passing in the main content of the paper, but included in great detail in an appendix. The average accuracy of our model across the folds is 0.61, while the average sensitivity is 0.19 and the average specificity is 0.90.

collect_metrics(ces2020_cross_validation) |>
  select(.metric, mean) |>
    col.names = c(
    digits = 2,
    booktabs = TRUE,
    linesep = "",
    format.args = list(big.mark = ",")
conf_mat_resampled(ces2020_cross_validation) |>
  mutate(Proportion = Freq / sum(Freq)) |>
    digits = 2,
    booktabs = TRUE,
    linesep = "",
    format.args = list(big.mark = ",")

Table 13.3: Average metrics across the 10 folds of a logistic regression to forecast voter preference

(a) Key performance metrics
Metric Mean
accuracy 0.61
sens 0.19
spec 0.90
(b) Confusion matrix
Prediction Truth Freq Proportion
Trump Trump 327.5 0.08
Trump Biden 267.7 0.06
Biden Trump 1,428.3 0.33
Biden Biden 2,331.9 0.54

What does this mean? Accuracy is the proportion of observations that were correctly classified. The result of 0.61 suggests the model is doing better than a coin toss, but not much more. Sensitivity is the proportion of true observations that are identified as true (James et al. 2021, 145). In this case would mean that the model predicted a respondent voted for Trump and they did. And specificity is the proportion of false observations that are identified as false (James et al. 2021, 145). In this case it is the proportion of voters that voted for Biden, that were predicted to vote for Biden. This confirms our initial thought that model is having trouble identifying Trump supporters.

We can see this in more detail by looking at the confusion matrix (Table 13.3 (b)). When used with resampling approaches, such as cross-validation, the confusion matrix is computed for each fold and then averaged. The model is predicting Biden much more than we might expect from our knowledge of how close the 2020 election was. It suggests that our model may need additional variables to do a better job.

Finally, it may be the case that we are interested in individual-level results, and we can add these to our dataset with collect_predictions().

ces2020_with_predictions <-
    collect_predictions(ces2020_cross_validation) |>
      arrange(.row) |>
  ) |>

For instance, we can see that model is essentially predicting support for Biden for all individuals apart from males with no high school, high school graduates, or 2 years of college (Table 13.4).

ces2020_with_predictions |>
  group_by(gender, education, voted_for) |>
  count(.pred_class) |>
    col.names = c(
      "Voted for",
      "Predicted vote",
    digits = 0,
    booktabs = TRUE,
    linesep = "",
    format.args = list(big.mark = ",")
Table 13.4: The model is forecasting support for Biden for all females, and for many males, regardless of education
Gender Education Voted for Predicted vote Number
Female No HS Trump Biden 206
Female No HS Biden Biden 228
Female High school graduate Trump Biden 3,204
Female High school graduate Biden Biden 3,028
Female Some college Trump Biden 1,842
Female Some college Biden Biden 3,325
Female 2-year Trump Biden 1,117
Female 2-year Biden Biden 1,739
Female 4-year Trump Biden 1,721
Female 4-year Biden Biden 4,295
Female Post-grad Trump Biden 745
Female Post-grad Biden Biden 2,853
Male No HS Trump Trump 132
Male No HS Biden Trump 123
Male High school graduate Trump Trump 2,054
Male High school graduate Biden Trump 1,528
Male Some college Trump Biden 1,992
Male Some college Biden Biden 2,131
Male 2-year Trump Trump 1,089
Male 2-year Biden Trump 1,026
Male 4-year Trump Biden 2,208
Male 4-year Biden Biden 3,294
Male Post-grad Trump Biden 1,248
Male Post-grad Biden Biden 2,426

13.3 Poisson regression

When we have count data we should initially think to take advantage of the Poisson distribution. The Poisson distribution is governed by one parameter, \(\lambda\). This distributes probabilities over non-negative integers and hence governs the shape of the distribution. As such, the Poisson distribution has the interesting feature that the mean is also the variance, and so as the mean increases, so does the variance. The Poisson probability mass function is (Pitman 1993, 121): \[P_{\lambda}(k) = e^{-\lambda}\lambda^k/k!\mbox{, for }k=0,1,2,...\] We can simulate \(n=20\) draws from the Poisson distribution with rpois(), where \(\lambda\) here is equal to three.

rpois(n = 20, lambda = 3)
 [1] 2 1 1 2 1 5 4 1 4 6 4 4 3 3 2 3 7 2 5 2

We can also look at what happens to the distribution as we change the value of \(\lambda\) (Figure 13.4).


poisson_takes_shape <-
    lambda = c(),
    draw = c()

number_of_each <- 100

for (i in c(0, 1, 2, 4, 7, 10, 15, 25, 50)) {
  draws_i <-
      lambda = c(rep(i, number_of_each)),
      draw = c(rpois(n = number_of_each, lambda = i))

  poisson_takes_shape <- rbind(poisson_takes_shape, draws_i)

poisson_takes_shape |>
    lambda = paste("Lambda =", lambda),
    lambda = factor(
      levels = c(
        "Lambda = 0",
        "Lambda = 1",
        "Lambda = 2",
        "Lambda = 4",
        "Lambda = 7",
        "Lambda = 10",
        "Lambda = 15",
        "Lambda = 25",
        "Lambda = 50"
  ) |>
  ggplot(aes(x = draw)) +
  geom_density() +
    scales = "free_y"
  ) +
  theme_minimal() +
    x = "Integer",
    y = "Density"

Figure 13.4: The Poisson distribution is governed by the value of the mean, which is the same as its variance

To illustrate the situation, we could simulate data about the number of A grades that are awarded in each university course. In this simulated example, we consider three departments, each of which has many courses. Each course will award a different number of A grades.


class_size <- 26

count_of_A <-
    # From Chris DuBois: https://stackoverflow.com/a/1439843
    department = c(rep.int("1", 26), rep.int("2", 26), rep.int("3", 26)),
    course = c(
      paste0("DEP_1_", letters),
      paste0("DEP_2_", letters),
      paste0("DEP_3_", letters)
    number_of_A = c(
      rpois(n = class_size, lambda = 5),
      rpois(n = class_size, lambda = 10),
      rpois(n = class_size, lambda = 20)

# A tibble: 78 × 3
   department course  number_of_A
   <chr>      <chr>         <int>
 1 1          DEP_1_a           4
 2 1          DEP_1_b           2
 3 1          DEP_1_c           5
 4 1          DEP_1_d           4
 5 1          DEP_1_e           1
 6 1          DEP_1_f           4
 7 1          DEP_1_g           3
 8 1          DEP_1_h           3
 9 1          DEP_1_i           3
10 1          DEP_1_j           3
# … with 68 more rows
count_of_A |>
  ggplot(aes(x = number_of_A)) +
  geom_histogram(aes(fill = department), position = "dodge") +
    x = "Number of A grades awarded",
    y = "Number of classes",
    fill = "Department"
  ) +
  theme_classic() +
  scale_fill_brewer(palette = "Set1")

Figure 13.5: Simulated number of A grades in various classes across three departments

Our simulated dataset has the number of A grades awarded by courses, which are structured within departments (Figure 13.5). In Chapter 15, we will take advantage of this departmental structure, but for now we just ignore it.

The model that we are interested in estimating is:

\[ \begin{aligned} y_i|\lambda_i &\sim \mbox{Poisson}(\lambda_i)\\ \log(\lambda_i) & = \beta_0 + \beta_1 x_i \end{aligned} \] where \(y_i\) is the number of A grades awarded, and \(x_i\) is the course.

We can use glm() from base R to get a quick sense of the data. This function is quite general, and we specify Poisson regression by setting the “family” parameter. The estimates are contained in the first column of Table 13.5.

grades_base <-
    number_of_A ~ department,
    data = count_of_A,
    family = "poisson"

As with logistic regression, the interpretation of the coefficients from Poisson regression can be difficult. The interpretation of the coefficient on “department2” is that it is the log of the expected difference between departments. We expect \(e^{0.883} \approx 2.4\) and \(e^{1.703} \approx 5.5\) as many A grades in departments 2 and 3, respectively, compared with department 1 (Table 13.5).

If we were interested in prediction, then we could use tidymodels to estimate Poisson models with poissonreg (Kuhn and Frick 2022) (Table 13.5).



count_of_A_split <-
  initial_split(count_of_A, prop = 0.80)
count_of_A_train <- training(count_of_A_split)
count_of_A_test <- testing(count_of_A_split)

grades_tidymodels <-
  poisson_reg(mode = "regression") |>
  set_engine("glm") |>
    number_of_A ~ department,
    data = count_of_A_train

The results of this estimation are in the second column of Table 13.5. They are similar to the estimates from glm(), but the number of observations is less because of the split.

Finally, we could build a Bayesian model and estimate it with rstanarm (Table 13.5).

\[ \begin{aligned} y_i|\lambda_i &\sim \mbox{Poisson}(\lambda_i)\\ \log(\lambda_i) & = \beta_0 + \beta_1 x_i\\ \beta_0 & \sim \mbox{Normal}(0, 2.5)\\ \beta_1 & \sim \mbox{Normal}(0, 2.5) \end{aligned} \]

grades_rstanarm <-
    number_of_A ~ department,
    data = count_of_A,
    family = poisson(link = "log"),
    prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_intercept = normal(location = 0, scale = 2.5, autoscale = TRUE),
    seed = 853

  file = "grades_rstanarm.rds"

The results are contained in the third column of Table 13.5.

    "base R" = grades_base,
    "tidymodels" = grades_tidymodels,
    "rstanarm" = grades_rstanarm
  statistic = "conf.int"
Table 13.5: Forecasting and explanatory models of marathon times based on five-kilometer run times
base R tidymodels rstanarm
(Intercept) 1.327 1.353 1.321
[1.122, 1.519] [1.138, 1.554] [1.128, 1.509]
department2 0.883 0.861 0.884
[0.651, 1.123] [0.611, 1.118] [0.654, 1.117]
department3 1.703 1.719 1.706
[1.493, 1.924] [1.495, 1.954] [1.499, 1.921]
Num.Obs. 78 62 78
AIC 392.5 311.6
BIC 399.6 318.0
Log.Lik. −193.274 −193.355
F 147.022
ELPD −196.2
ELPD s.e. 7.7
LOOIC 392.4
LOOIC s.e. 15.4
WAIC 392.4
RMSE 3.41 3.35 3.41

As with logistic regression, we can use marginaleffects() from (Arel-Bundock 2022) to help with interpreting these results. It may be useful to consider how we expect the number of A grades to change as we go from one department to another. Table 13.6 suggests that in our dataset classes in Department 2 tend to have around five additional A grades, compared with Department 1. And that classes in Department 3, tend to have around 17 more A grades, compared with Department 1.3

marginaleffects(grades_rstanarm) |>
  summary() |>
  select(-type, -term) |>
    digits = 2,
    col.names = c("Departmental comparison", "Estimate", "2.5%", "97.5%"),
    booktabs = TRUE,
    linesep = ""
Table 13.6: The estimated difference in the number of A grades awarded at each department
Departmental comparison Estimate 2.5% 97.5%
2 - 1 5.32 4.01 6.70
3 - 1 16.92 15.10 18.84

13.3.1 Letters used in Jane Eyre

In an earlier age, Edgeworth (1885) made counts of the dactyls in Virgil’s Aeneid (Friendly (2021) makes this dataset available with HistData::Dactyl). Inspired by this we could use gutenbergr (Johnston and Robinson 2022) to get the text of Jane Eyre by Charlotte Brontë. We could then consider the first 10 lines of each chapter, count the number of words, and count the number of times either “E” or “e” appears. We are interested to see whether the number of e/E’s increases as more words are used. If not, it could suggest that the distribution of e/E’s is not consistent, which could be of interest to linguists.

Following the workflow advocated in this book, we first sketch our dataset and model. A quick sketch of what the dataset could look like is Figure 13.10 (a). And a quick sketch of our model is Figure 13.10 (b).

(a) Planned counts, by line and chapter, in Jane Eyre

(b) Planned expected relationship between count of e/E and number of words in the line

Figure 13.6: Sketches of the expected dataset and analysis force us to consider what we are interested in

We will simulate a dataset the number of e/E letters is distributed following the Poisson distribution (Figure 13.7).

count_of_e_simulation <-
    chapter = c(rep(1, 10), rep(2, 10), rep(3, 10)),
    line = rep(1:10, 3),
    number_words_in_line = runif(min = 0, max = 15, n = 30) |> round(0),
    number_e = rpois(n = 30, lambda = 10)

count_of_e_simulation |>
  ggplot(aes(y = number_e, x = number_words_in_line)) +
  geom_point() +
    x = "Number of words in line",
    y = "Number of e/E letters in the chapter"
  ) +
  theme_classic() +
  scale_fill_brewer(palette = "Set1")

Figure 13.7: Simulated counts of e/E letters

We can now gather and prepare our data.


gutenberg_id_of_janeeyre <- 1260

jane_eyre <-
    gutenberg_id = gutenberg_id_of_janeeyre,
    mirror = "https://gutenberg.pglaf.org/"


write_csv(jane_eyre, "jane_eyre.csv")

We will download it and then use our local copy to avoid overly imposing on the Project Gutenberg servers.

jane_eyre <- read_csv(
  col_types = cols(
    gutenberg_id = col_integer(),
    text = col_character()

# A tibble: 21,001 × 2
   gutenberg_id text                           
          <int> <chr>                          
 1         1260 JANE EYRE                      
 2         1260 AN AUTOBIOGRAPHY               
 3         1260 <NA>                           
 4         1260 by Charlotte Brontë            
 5         1260 <NA>                           
 6         1260 _ILLUSTRATED BY F. H. TOWNSEND_
 7         1260 <NA>                           
 8         1260 London                         
 9         1260 SERVICE & PATON                
10         1260 5 HENRIETTA STREET             
# … with 20,991 more rows

We are interested in only those lines that have content, so we remove those empty lines that are just there for spacing. Then we can create counts of the number of times “e” or “E” occurs in that line, for the first 10 lines of each chapter. For instance, we can look at the first few lines and see that there are 5 e/E’s in the first line and 8 in the second.

jane_eyre_reduced <-
  jane_eyre |>
  filter(!is.na(text)) |> # Remove empty lines
  mutate(chapter = if_else(
    str_detect(text, "CHAPTER") == TRUE,
  )) |> # Find start of chapter
  fill(chapter, .direction = "down") |> # Add chapter number to each line
  group_by(chapter) |>
  mutate(chapter_line = row_number()) |> # Add line number of each chapter
    chapter_line %in% c(2:11)
  ) |> # Start at 2 to get rid of text "CHAPTER I" etc
  select(text, chapter) |>
    chapter = str_remove(chapter, "CHAPTER "),
    chapter = str_remove(chapter, "—CONCLUSION"),
    chapter = as.integer(as.roman(chapter))
  ) # Change chapters to integers

jane_eyre_reduced <-
  jane_eyre_reduced |>
  mutate(count_e = str_count(text, "e|E")) |>
  # Based on: https://stackoverflow.com/a/38058033
  mutate(word_count = str_count(text, "\\w+"))
jane_eyre_reduced |>
  select(chapter, word_count, count_e, text) |>
# A tibble: 6 × 4
# Groups:   chapter [1]
  chapter word_count count_e text                                               
    <int>      <int>   <int> <chr>                                              
1       1         13       5 There was no possibility of taking a walk that day…
2       1         11       8 wandering, indeed, in the leafless shrubbery an ho…
3       1         12       9 but since dinner (Mrs. Reed, when there was no com…
4       1         14       3 the cold winter wind had brought with it clouds so…
5       1         11       7 so penetrating, that further outdoor exercise was …
6       1          1       1 question.                                          

We can verify that the mean and variance of the number of E/e’s is roughly similar by plotting all of the data (Figure 13.8). The mean, in pink, is 6.7, and the variance, in blue, is 6.2. While they are not entirely the same, they are similar. We include the diagonal in Figure 13.8 (b) to help with thinking about the data. If the data were on the \(y=x\) line, then on average the is one E/e per word. As it is below that point, means that on average there is less than one per word.

mean_e <- mean(jane_eyre_reduced$count_e)
variance_e <- var(jane_eyre_reduced$count_e)

jane_eyre_reduced |>
  ggplot(aes(x = count_e)) +
  geom_histogram() +
  geom_vline(xintercept = mean_e, linetype = "dashed", color = "#C64191") +
  geom_vline(xintercept = variance_e, linetype = "dashed", color = "#0ABAB5") +
  theme_minimal() +
    y = "Count",
    x = "Number of e's per line for first ten lines"

jane_eyre_reduced |>
  ggplot(aes(x = word_count, y = count_e)) +
  geom_jitter(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  theme_minimal() +
    x = "Number of words in the line",
    y = "Number of e/E's in the line"

(a) Distribution of the number of e/E’s

(b) Comparison of the number of e/E’s in the line and the number of words in the line

Figure 13.8: Number of ‘E’ or ‘e’ in the first ten lines of each chapter in Jane Eyre

We could consider the following model:

\[ \begin{aligned} y_i|\lambda_i &\sim \mbox{Poisson}(\lambda_i)\\ \log(\lambda_i) & = \beta_0 + \beta_1 x_i\\ \beta_0 & \sim \mbox{Normal}(0, 2.5)\\ \beta_1 & \sim \mbox{Normal}(0, 2.5) \end{aligned} \] where \(y_i\) is the number of e/E’s in the line and \(x_i\) is the number of words in the line. We could estimate the parameters using stan_glm().

jane_e_counts <-
    count_e ~ word_count,
    data = jane_eyre_reduced,
    family = poisson(link = "log"),
    seed = 853

  file = "jane_e_counts.rds"

While we would normally be interested in the table of estimates, as we have seen that a few times now, rather than again creating a table of the estimates, we introduce plot_cap() from marginaleffects (Arel-Bundock 2022). We can use this to show the number of e/E’s predicted by the model, for each line, based on the number of words in that line. Figure 13.9 makes it clear that we expect a positive relationship.

plot_cap(jane_e_counts, condition = "word_count") +
    x = "Number of words",
    y = "Average number of e/E in the first 10 lines"

Figure 13.9: The predicted number of e/E’s in each line based on the number of words

13.4 Negative binomial regression

One of the major restrictions with Poisson regression is the assumption that the mean and the variance are the same. We can relax this assumption to allow over-dispersion and can use a close variant, negative binomial regression.

Poisson and negative binomial models go hand in hand. It is often the case that we will end up fitting both, and then comparing them. For instance, Maher (1982) considers both in the context of results from the English Football League and discusses situations in which one may be considered more appropriate than the other. Similarly, Smith (2002) considers the 2000 US presidential election and especially the issue of overdispersion in a Poisson analysis. And Osgood (2000) compares them in the case of crime data.

13.4.1 Mortality in Alberta

Consider, somewhat morbidly, that every year each individual either dies or does not. From the perspective of a geographic area, we could gather data on the number of people who died each year, by their cause of death. The Canadian province of Alberta makes available the number of deaths, by cause, since 2001, for the top 30 causes each year.

As always we first sketch our dataset and model. A quick sketch of what the dataset could look like is Figure 13.10 (a) And a quick sketch of our model is Figure 13.10 (b)

(a) Quick sketch of a dataset that could be used to examine cause of death in Alberta

(b) Quick sketch of what we expect from the analysis of cause of death in Alberta before finalizing either the data or the analysis

Figure 13.10: Sketches of the expected dataset and analysis for cause of death in Alberta

We will simulate a dataset of cause of death distributed following the negative binomial distribution.

alberta_death_simulation <-
    cause = rep(x = c("Heart", "Stroke", "Diabetes"), times = 10),
    year = rep(x = 2016:2018, times = 10),
    deaths = rnbinom(n = 30, size = 20, prob = 0.1)

# A tibble: 30 × 3
   cause     year deaths
   <chr>    <int>  <int>
 1 Heart     2016    118
 2 Stroke    2017    167
 3 Diabetes  2018    216
 4 Heart     2016    153
 5 Stroke    2017    165
 6 Diabetes  2018    153
 7 Heart     2016    168
 8 Stroke    2017    204
 9 Diabetes  2018    190
10 Heart     2016    173
# … with 20 more rows

We download that data, and save it as “alberta_COD.csv”. We can look at the distribution of these deaths, by year and cause (Figure 13.11). We have truncated the full cause of death because some are quite long. As some causes are not always in the top 30 each year, not all causes have the same number of occurrences.

alberta_cod <-
    col_types = cols(
      `Calendar Year` = col_integer(),
      Cause = col_character(),
      Ranking = col_integer(),
      `Total Deaths` = col_integer()
  ) |>
  janitor::clean_names() |>
  add_count(cause) |>
  mutate(cause = str_trunc(cause, 30))

If we were to look at the top ten causes in 2021, we would notice a variety of interesting aspects (Table 13.7). For instance, we would expect that the most common causes would be in all 21 years of our data. But we notice that that most common cause, “Other ill-defined and unknown causes of mortality”, is only in three years. Naturally, “COVID-19, virus identified”, is only in two other years, as there were no COVID deaths in Canada before 2020.

alberta_cod |>
    calendar_year == 2021,
    ranking <= 10
  ) |>
  mutate(total_deaths = format(total_deaths, big.mark = ",")) |>
    col.names = c(
      "Total deaths",
      "Number of years"
    digits = 0,
    align = c("l", "r", "r", "r", "r"),
    booktabs = TRUE,
    linesep = ""
Table 13.7: Top ten causes of death in Alberta in 2021
Year Cause Ranking Total deaths Number of years
2021 Other ill-defined and unkno... 1 3,362 3
2021 Organic dementia 2 2,135 21
2021 COVID-19, virus identified 3 1,950 2
2021 All other forms of chronic ... 4 1,939 21
2021 Malignant neoplasms of trac... 5 1,552 21
2021 Acute myocardial infarction 6 1,075 21
2021 Other chronic obstructive p... 7 1,028 21
2021 Diabetes mellitus 8 728 21
2021 Stroke, not specified as he... 9 612 21
2021 Accidental poisoning by and... 10 604 9

In this case, for simplicity, we will restrict ourselves to the five most common causes of death in 2021, that have been present every year.

alberta_cod_top_five <-
  alberta_cod |>
    calendar_year == 2021,
    n == 21
  ) |>
  slice_max(order_by = desc(ranking), n = 5) |>

alberta_cod <-
  alberta_cod |>
  filter(cause %in% alberta_cod_top_five)
alberta_cod |>
  ggplot(aes(x = calendar_year, y = total_deaths, color = cause)) +
  geom_line(alpha = 0.5) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1") +
    x = "Year",
    y = "Annual number of deaths in Alberta",
    color = "Cause"

Figure 13.11: Annual number of deaths for the top 5 causes in 2021, since 2001, for Alberta, Canada

One thing that we notice is that the mean, 1,273, is different to the variance, 182,378 (Table 13.8).

Table 13.8: Summary statistics of the number of yearly deaths, by cause, in Alberta
Min Mean Max SD  Var N
total_deaths 280 1273 2135 427 182378 105

We can implement negative binomial regression when using stan_glm() by specifying that distribution in “family”. In this case, we run both Poisson and negative binomial.

cause_of_death_alberta_poisson <-
    total_deaths ~ cause,
    data = alberta_cod,
    family = poisson(link = "log"),
    seed = 853

cause_of_death_alberta_neg_binomial <-
    total_deaths ~ cause,
    data = alberta_cod,
    family = neg_binomial_2(link = "log"),
    seed = 853

We can compare our different models (Table 13.9) (we need to load broom.mixed (Bolker and Robinson 2022) because this is needed under the hood).


coef_short_names <- 
  c("causeAll other forms of chronic ischemic heart disease"
    = "causeAll other forms of...",
    "causeMalignant neoplasms of trachea, bronchus and lung"
    = "causeMalignant neoplas...",
    "causeOrganic dementia"
    = "causeOrganic dementia",
    "causeOther chronic obstructive pulmonary disease"
    = "causeOther chronic obst..."

    "Poisson" = cause_of_death_alberta_poisson,
    "Negative binomial" = cause_of_death_alberta_neg_binomial
  statistic = "conf.int",
  coef_map = coef_short_names
Table 13.9: Modelling the most prevelent cause of deaths in Alberta, 2001-2020
Poisson Negative binomial
causeAll other forms of... 0.442 0.439
[0.426, 0.458] [0.231, 0.636]
causeMalignant neoplas... 0.224 0.226
[0.207, 0.241] [0.019, 0.436]
causeOrganic dementia 0.001 0.002
[−0.016, 0.019] [−0.203, 0.206]
causeOther chronic obst... −0.214 −0.217
[−0.233, −0.195] [−0.422, −0.008]
Num.Obs. 105 105
Log.Lik. −5139.564 −771.297
ELPD −5312.5 −775.8
ELPD s.e. 1099.1 10.0
LOOIC 10625.0 1551.6
LOOIC s.e. 2198.3 20.0
WAIC 10793.7 1551.6
RMSE 308.24 308.26

We could use posterior predictive checks to more clearly show that the negative binomial approach is a better choice for this circumstance (Figure 13.12).

pp_check(cause_of_death_alberta_poisson) +
  theme(legend.position = "bottom")

pp_check(cause_of_death_alberta_neg_binomial) +
  theme(legend.position = "bottom")

(a) Poisson model

(b) Negative binomial model

Figure 13.12: Comparing posterior prediction checks for Poisson and negative binomial models

Finally, we can compare between the models using the resampling method leave-one-out (LOO) cross-validation. This is a variant of cross-validation, introduced earlier, where the size of each fold is one. That is to say, if there was a dataset with 100 observations this LOO is equivalent to 100-fold cross validation. We can implement this in rstanarm with loo() for each model, and then compare between them with loo_compare(). Strictly speaking LOO-CV is not actually done by loo(), because it would be too computationally intensive. Instead an approximation is done provides the expected log point wise predictive density (ELPD), where the higher the better.

poisson <- loo(cause_of_death_alberta_poisson, cores = 2)
neg_binomial <- loo(cause_of_death_alberta_neg_binomial, cores = 2)

loo_compare(poisson, neg_binomial)
                                    elpd_diff se_diff
cause_of_death_alberta_neg_binomial     0.0       0.0
cause_of_death_alberta_poisson      -4536.7    1089.5

In this case we find that the negative binomial model is a better fit than the Poisson, because ELPD is larger.

We have covered a variety of diagnostic aspects for Bayesian models. While it is difficult to be definitive about what is “enough” because it is context specific, the following checklist would be enough for most purposes. These would go into an appendix that was mentioned and cross-referenced in the model section of a paper:

  • Prior predictive checks.
  • Trace plots.
  • Rhat plots.
  • Posterior distributions.
  • Posterior predictive checks.

13.5 Exercises


  1. (Plan) Consider the following scenario: A person is interested in the number of deaths, attributed to cancer, in Sydney They collect data from the five largest hospitals. Please sketch out what that dataset could look like and then sketch a graph that you could build to show all observations.
  2. (Simulate) Please further consider the scenario described and simulate the situation, along with three independent variables that are associated the number of deaths, by cause. Please include at least ten tests based on the simulated data.
  3. (Acquire) Please describe one possible source of such a dataset.
  4. (Explore) Please use ggplot2 to build the graph that you sketched.
  5. (Communicate) Please write two paragraphs about what you did.


  1. When should we use logistic regression (pick one)?
    1. Continuous dependent variable.
    2. Binary dependent variable.
    3. Count dependent variable.
  2. We are interested in studying how voting intentions in the recent US presidential election vary by an individual’s income. We set up a logistic regression model to study this relationship. In this study, one possible dependent variable would be (pick one)?
    1. Whether the respondent is a US citizen (yes/no)
    2. The respondent’s personal income (high/low)
    3. Whether the respondent is going to vote for Trump (yes/no)
    4. Who the respondent voted for in 2016 (Trump/Clinton)
  3. We are interested in studying how voting intentions in the recent US presidential election vary by an individual’s income. We set up a logistic regression model to study this relationship. In this study, one possible dependent variable would be (pick one)?
    1. The race of the respondent (white/not white)
    2. The respondent’s marital status (married/not)
    3. Whether the respondent is registered to vote (yes/no)
    4. Whether the respondent is going to vote for Biden (yes/no)
  4. The mean of a Poisson distribution is equal to its?
    1. Median.
    2. Standard deviation.
    3. Variance.
  5. Please redo the tidymodels example of US elections but include additional variables. Which variable did you choose, and how did the performance of the model improve?
  6. Please create the graph of the density of the Poisson distribution when \(\lambda = 75\).
  7. From Gelman, Hill, and Vehtari (2020), what is the offset in Poisson regression?
  8. Redo the Jane Eyre example, but for “A/a”.


Please consider Maher (1982) or Smith (2002). Build a simplified version of their model. Obtain some recent relevant data, estimate the model, and discuss your choice between Poisson and negative binomial regression.


At about this point the Murrumbidgee Paper from Appendix D would be appropriate.