```
library(arrow)
library(poissonreg)
library(tidymodels)
library(tidyverse)
```

# Online Appendix H — Prediction

**Under construction**

**TODO**

- Improve the tidymodels section.
- Add Python.

**Prerequisites**

- Read
*An Introduction to Statistical Learning with Applications in R*, (James et al. [2013] 2021)- Focus on Chapter 6 “Linear Model Selection and Regularization”, which provides an overview of ridge and lasso regression.

- Read
*Python for Data Analysis*, Chapter 13, (McKinney [2011] 2022)- This chapter provides worked examples of data analysis in Python.

**Key concepts and skills**

**Software and packages**

`arrow`

(Richardson et al. 2023)`poissonreg`

(Kuhn and Frick 2022)`tidymodels`

(Kuhn and Wickham 2020)`parsnip`

(Kuhn and Vaughan 2022)`recipes`

(Kuhn and Wickham 2022)`rsample`

(Frick et al. 2022)`tune`

(Kuhn 2022)`yarkdstick`

(Kuhn, Vaughan, and Hvitfeldt 2022)

`tidyverse`

(Wickham et al. 2019)

## H.1 Introduction

As discussed in Chapter 12, models tend to be focused on either inference or prediction. There are, in general, different cultures depending on your focus. One reason for this is a different emphasis of causality, introduced in Chapter 14. I am talking very generally here, but often with inference we will be very concerned about causality, and with prediction we will be less so. That means the quality of our predictions will break-down when conditions are quite different from what our model was expecting—but how do we know when conditions are sufficiently different for us to be worried?

Another way for this cultural difference is because the rise of data science and machine learning in particular, has been substantially driven by the development of models in Python by people with a computer science or engineering background. This means there is an additional vocabulary difference because much of inference came out of statistics. Again, this is all speaking very broadly.

In this chapter, I begin with a focus on prediction using the R approach of `tidymodels`

. I then introduce one of those grey areas—and the reason that I have been trying to speak broadly—lasso regression. That was developed by statisticians, but is mostly used for prediction. Finally, I introduce all of this in Python.

## H.2 Prediction with `tidymodels`

### H.2.1 Linear models

When we are focused on prediction, we will often want to fit many models. One way to do this is to copy and paste code many times. This is okay, and it is the way that most people get started but it is prone to making errors that are hard to find. A better approach will:

- scale more easily;
- enable us to think carefully about over-fitting; and
- add model evaluation.

The use of `tidymodels`

(Kuhn and Wickham 2020) satisfies these criteria by providing a coherent grammar that allows us to easily fit a variety of models. Like the `tidyverse`

, it is a package of packages.

By way of illustration, we want to estimate the following model for the simulated running data:

\[ \begin{aligned} y_i | \mu_i &\sim \mbox{Normal}(\mu_i, \sigma) \\ \mu_i &= \beta_0 +\beta_1x_i \end{aligned} \]

where \(y_i\) refers to the marathon time of some individual \(i\) and \(x_i\) refers to their five-kilometer time. Here we say that the marathon time of some individual \(i\) is normally distributed with a mean of \(\mu\) and a standard deviation of \(\sigma\), where the mean depends on two parameters \(\beta_0\) and \(\beta_1\) and their five-kilometer time. Here “~” means “is distributed as”. We use this slightly different notation from earlier to be more explicit about the distributions being used, but this model is equivalent to \(y_i=\beta_0+\beta_1 x_i + \epsilon_i\), where \(\epsilon\) is normally distributed.

As we are focused on prediction, we are worried about over-fitting our data, which would limit our ability to make claims about other datasets. One way to partially address this is to split our dataset in two using `initial_split()`

.

```
<-
sim_run_data read_parquet(file = "outputs/data/running_data.parquet")
set.seed(853)
<-
sim_run_data_split initial_split(
data = sim_run_data,
prop = 0.80
)
sim_run_data_split
```

```
<Training/Testing/Total>
<160/40/200>
```

Having split the data, we then create training and test datasets with `training()`

and `testing()`

.

```
<- training(sim_run_data_split)
sim_run_data_train
<- testing(sim_run_data_split) sim_run_data_test
```

We have placed 80 per cent of our dataset into the training dataset. We will use that to estimate the parameters of our model. We have kept the remaining 20 per cent of it back, and we will use that to evaluate our model. Why might we do this? Our concern is the bias-variance trade-off, which haunts all aspects of modeling. We are concerned that our results may be too particular to the dataset that we have, such that they are not applicable to other datasets. To take an extreme example, consider a dataset with ten observations. We could come up with a model that perfectly hits those observations. But when we took that model to other datasets, even those generated by the same underlying process, it would not be accurate.

One way to deal with this concern is to split the data in this way. We use the training data to inform our estimates of the coefficients, and then use the test data to evaluate the model. A model that too closely matched the data in the training data would not do well in the test data, because it would be too specific to the training data. The use of this test-training split enables us the opportunity to build an appropriate model.

It is more difficult to do this separation appropriately than one might initially think. We want to avoid the situation where aspects of the test dataset are present in the training dataset because this inappropriately telegraphs what is about to happen. This is called data leakage. But if we consider data cleaning and preparation, which likely involves the entire dataset, it may be that some features of each are influencing each other. Kapoor and Narayanan (2022) find extensive data leakage in applications of machine learning that could invalidate much research.

To use `tidymodels`

we first specify that we are interested in linear regression with `linear_reg()`

. We then specify the type of linear regression, in this case multiple linear regression, with `set_engine()`

. Finally, we specify the model with `fit()`

. While this requires considerably more infrastructure than the base R approach detailed above, the advantage of this approach is that it can be used to fit many models; we have created a model factory, as it were.

```
<-
sim_run_data_first_model_tidymodels linear_reg() |>
set_engine(engine = "lm") |>
fit(
~ five_km_time + was_raining,
marathon_time data = sim_run_data_train
)
```

The estimated coefficients are summarized in the first column of Table 12.4. For instance, we find that on average in our dataset, five-kilometer run times that are one minute longer are associated with marathon times that are about eight minutes longer.

### H.2.2 Logistic regression

We can also use `tidymodels`

for logistic regression problems. To accomplish this, we first need to change the class of our dependent variable into a factor because this is required for classification models.

```
<-
week_or_weekday read_parquet(file = "outputs/data/week_or_weekday.parquet")
set.seed(853)
<-
week_or_weekday |>
week_or_weekday mutate(is_weekday = as_factor(is_weekday))
<- initial_split(week_or_weekday, prop = 0.80)
week_or_weekday_split <- training(week_or_weekday_split)
week_or_weekday_train <- testing(week_or_weekday_split)
week_or_weekday_test
<-
week_or_weekday_tidymodels logistic_reg(mode = "classification") |>
set_engine("glm") |>
fit(
~ number_of_cars,
is_weekday data = week_or_weekday_train
)
```

As before, we can make a graph of the actual results compared with our estimates. But one nice aspect of this is that we could use our test dataset to evaluate our model’s prediction ability more thoroughly, for instance through a confusion matrix, which specifies the count of each prediction by what the truth was. We find that the model does well on the held-out dataset. There were 90 observations where the model predicted it was a weekday, and it was actually a weekday, and 95 where the model predicted it was a weekend, and it was a weekend. It was wrong for 15 observations, and these were split across seven where it predicted a weekday, but it was a weekend, and eight where it was the opposite case.^{1}

```
|>
week_or_weekday_tidymodels predict(new_data = week_or_weekday_test) |>
cbind(week_or_weekday_test) |>
conf_mat(truth = is_weekday, estimate = .pred_class)
```

```
Truth
Prediction 0 1
0 90 8
1 7 95
```

#### H.2.2.1 US political support

One approach is to use `tidymodels`

to build a prediction-focused logistic regression model in the same way as before, i.e. a validation set approach (James et al. [2013] 2021, 176). In this case, the probability will be that of voting for Biden.

```
<-
ces2020 read_parquet(file = "outputs/data/ces2020.parquet")
set.seed(853)
<- initial_split(ces2020, prop = 0.80)
ces2020_split <- training(ces2020_split)
ces2020_train <- testing(ces2020_split)
ces2020_test
<-
ces_tidymodels logistic_reg(mode = "classification") |>
set_engine("glm") |>
fit(
~ gender + education,
voted_for data = ces2020_train
)
ces_tidymodels
```

```
parsnip model object
Call: stats::glm(formula = voted_for ~ gender + education, family = stats::binomial,
data = data)
Coefficients:
(Intercept) genderMale
0.2157 -0.4697
educationHigh school graduate educationSome college
-0.1857 0.3502
education2-year education4-year
0.2311 0.6700
educationPost-grad
0.9898
Degrees of Freedom: 34842 Total (i.e. Null); 34836 Residual
Null Deviance: 47000
Residual Deviance: 45430 AIC: 45440
```

And then evaluate it on the test set. It appears as though the model is having difficulty identifying Trump supporters.

```
|>
ces_tidymodels predict(new_data = ces2020_test) |>
cbind(ces2020_test) |>
conf_mat(truth = voted_for, estimate = .pred_class)
```

```
Truth
Prediction Trump Biden
Trump 656 519
Biden 2834 4702
```

When we introduced `tidymodels`

, we discussed the importance of randomly constructing training and test sets. We use the training dataset to estimate parameters, and then evaluate the model on the test set. It is natural to ask why we should be subject to the whims of randomness and whether we are making the most of our data. For instance, what if a good model is poorly evaluated because of some random inclusion in the test set? Further, what if we do not have a large test set?

One commonly used resampling method that goes some way to addressing this is \(k\)-fold cross-validation. In this approach we create \(k\) different samples, or “folds”, from the dataset without replacement. We then fit the model to the first \(k-1\) folds, and evaluate it on the last fold. We do this \(k\) times, once for every fold, such that every observation will be used for training \(k-1\) times and for testing once. The \(k\)-fold cross-validation estimate is then the average mean squared error (James et al. [2013] 2021, 181). For instance, `vfold_cv()`

from `tidymodels`

can be used to create, say, ten folds.

```
set.seed(853)
<- vfold_cv(ces2020, v = 10) ces2020_10_folds
```

The model can then be fit across the different combinations of folds with `fit_resamples()`

. In this case, the model will be fit ten times.

### H.2.3 Poisson regression

We can use `tidymodels`

to estimate Poisson models with `poissonreg`

(Kuhn and Frick 2022) (Table 13.4).

```
<-
count_of_A read_parquet(file = "outputs/data/count_of_A.parquet")
set.seed(853)
<-
count_of_A_split initial_split(count_of_A, prop = 0.80)
<- training(count_of_A_split)
count_of_A_train <- testing(count_of_A_split)
count_of_A_test
<-
grades_tidymodels poisson_reg(mode = "regression") |>
set_engine("glm") |>
fit(
~ department,
number_of_As data = count_of_A_train
)
```

The results of this estimation are in the second column of Table 13.4. They are similar to the estimates from `glm()`

, but the number of observations is less because of the split.

## H.3 Lasso regression

Dr Robert Tibshirani is Professor in the Departments of Statistics and Biomedical Data Science at Stanford University. After earning a PhD in Statistics from Stanford University in 1981, he joined the University of Toronto as an assistant professor. He was promoted to full professor in 1994 and moved to Stanford in 1998. He made fundamental contributions including GAMs, mentioned above, and lasso regression, which is a way of automated variable selection. He is an author of James et al. ([2013] 2021). He was awarded the COPSS Presidents’ Award in 1996 and was appointed a Fellow of the Royal Society in 2019.

## H.4 Prediction with Python

### H.4.1 Setup

We will use Python within VSCode, which is a free IDE from Microsoft that you can download here. You then install the Quarto and Python extensions.

### H.4.2 Data

*Read in data using parquet.*

*Manipulate using pandas*

### H.4.3 Model

#### H.4.3.1 scikit-learn

#### H.4.3.2 TensorFlow

## H.5 Exercises

### Scales

*(Plan)*Consider the following scenario:*Each day for a year your uncle and you play darts. Each round consists of throwing three darts each. At the end of each round you add the points that your three darts hit. So if you hit 3, 5, and 10, then your total score for that round is 18. Your uncle is somewhat benevolent, and if you hit a number less than 5, he pretends not to see that, allowing you the chance to re-throw that dart. Pretend that each day you play 15 rounds.*Please sketch out what that dataset could look like and then sketch a graph that you could build to show all observations.*(Simulate)*Please further consider the scenario described and simulate the situation. Compare your uncle’s total with your total if you didn’t get the chance to re-throw, and the total that actually end up with. Please include at least ten tests based on the simulated data.*(Acquire)*Please describe one possible source of such a dataset (or an equivalent sport or situation of interest to you).*(Explore)*Please use`ggplot2`

to build the graph that you sketched. Then use`rstanarm`

to build a model of who wins.*(Communicate)*Please write two paragraphs about what you did.

### Questions

### Tutorial

STUDENTS PROBABLY WON’T UNDERSTAND THAT IMPICIT IN THE PREDICTED LABEL IS AN OPERATING THRESHOLD, I WOULD HAVE AT LEAST ONE SENTENCE EXPLAINING THAT THIS IS A BUILT IN ASSUMPION IN THE PREDICT() METHOD THAT CAN BE ADJUSTED AND ADJUSTING THIS WILL IMPACT SENSITIVITY VS PRECISION WHICH CAN BE DISCUSSED LATER↩︎