15  Causality from observational data

Required material

Key concepts and skills

Key packages and functions

15.1 Introduction

Life is grand when we can conduct experiments and speak directly to causality. But there are circumstances in which we cannot run an experiment, yet we nonetheless want to be able to make causal claims. And data from outside experiments have value that experimental data do not. In this chapter we discuss the circumstances and methods that allow us to speak to causality using observational data. We use relatively simple methods, in sophisticated ways, drawing from statistics, but also from a variety of social sciences, including economics and political science, as well as epidemiology.

For instance, Dagan et al. (2021) use observational data to confirm the effectiveness of the Pfizer-BioNTech vaccine. They discuss how one concern with using observational data in this way is confounding, which is where we are concerned that there is some variable that affects both the explanatory and dependent variables and can lead to spurious relationships. Dagan et al. (2021) adjust for this by first making a list of potential confounders, such as age, sex, geographic location, and healthcare usage, and then adjusting for each of them by matching, one-to-one, people who were vaccinated with those who were not. The experimental data guided the use of observational data, and the larger size of the latter enabled a focus on specific age-groups and extent of disease.

This chapter is about using observational data in sophisticated ways: how we can nonetheless be comfortable making causal statements, even when we cannot run A/B tests or RCTs, and, indeed, the circumstances in which we may prefer not to run those, or to run observational-based approaches in addition to them. We cover three of the major methods: difference in differences, regression discontinuity, and instrumental variables.

15.2 Directed acyclic graphs

When we are discussing causality, it can help to be specific about what we mean. It is easy to get caught up in observational data and trick ourselves. It is important to think hard, and to use all the tools available to us. For instance, in that earlier example, Dagan et al. (2021) were able to use experimental data as a guide. Most of the time, we will not be so lucky as to have both experimental data and observational data available to us. One framework that can help with thinking hard about our data is the use of directed acyclic graphs (DAGs). A DAG is a fancy name for a flow diagram, and involves drawing arrows and lines between variables to indicate the relationships between them.

To construct them we use Graphviz, which is an open-source package for graph visualization and is built into Quarto. The code needs to be wrapped in a “dot” chunk rather than “R”, and the chunk options are set with “//|” instead of “#|”. Alternatives that do not require this include the use of DiagrammeR (Iannone 2020) and ggdag (Barrett 2021). We provide the whole chunk for the first DAG, but then only provide the code for the others.

```{dot}
//| label: fig-dot-firstdag
//| fig-cap: "We expect a causal relationship between x and y, where x influences y"
digraph D {
  node [shape=plaintext, fontname = "helvetica"];
  
  {rank=same x y};
  
  x -> y;
}
```


Figure 15.1: We expect a causal relationship between x and y, where x influences y

In Figure 15.1, we think that x causes y. We could build another DAG where the situation is less clear. To make the examples a little easier to follow, we will switch to thinking about a hypothetical relationship between income and happiness, with consideration of variables that could affect that relationship. In this first one we consider the relationship between income and happiness, along with education (Figure 15.2).

digraph D {
  
  node [shape=plaintext, fontname = "helvetica"];
  
  a [label = "Income"];
  b [label = "Happiness"];
  c [label = "Education"];
  
  { rank=same a b};
  
  a->b;
  c->{a, b};
}


Figure 15.2: Education is a confounder that affects the relationship between income and happiness
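
The same DAG can also be drawn without leaving R, using ggdag, which was mentioned above. This is a minimal sketch rather than a replacement for the Graphviz version; the node names are ours.

```r
library(ggdag)

# Education affects both income and happiness; income affects happiness
dagify(
  happiness ~ income + education,
  income ~ education
) |>
  ggdag() +
  theme_dag()
```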

In Figure 15.2, we think income causes happiness. But we also think that education causes happiness, and that education also causes income. That relationship is a “backdoor path”, and failing to adjust for education in a regression could overstate the extent of the relationship, or even create a spurious relationship, between income and happiness in our analysis. That is, we may think that changes in income are causing changes in happiness, but it could be that education is changing them both. That variable, in this case, education, is called a “confounder”.

Hernan and Robins (2020, 83) discuss an interesting case where a researcher was interested in whether one person looking up at the sky makes others look up at the sky also. There was a clear relationship between the responses of both people. But it was also the case that there was noise in the sky. So, it was unclear whether the second person looked up because the first person looked up, or they both looked up because of the noise. When using experimental data, randomization allows us to avoid this concern, but with observational data we cannot rely on that. It is also not the case that bigger data necessarily get around this problem for us. Instead, it is important to think carefully about the situation, and DAGs can help with that.

If there are confounders, but we are still interested in causal effects, then we need to adjust for them. One way is to include them in the regression. But the validity of this requires several assumptions. In particular, Gelman and Hill (2007, 169) warn that our estimate will only correspond to the average causal effect in the sample if we include all of the confounders and have the right model. Putting the second requirement to one side, and focusing only on the first, if we do not think about and observe a confounder, then it can be difficult to adjust for it. And this is an area where both domain expertise and theory can bring a lot to an analysis.
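
To see how adjusting for a confounder works, we can simulate a small example in which education drives both income and happiness and, purely for illustration, income has no effect on happiness at all (the numbers are made up). Leaving education out of the regression produces a spurious income coefficient, while including it removes that.

```r
library(tidyverse)

set.seed(853)

confounder_example <-
  tibble(
    education = rnorm(n = 1000, mean = 0, sd = 1),
    income = education + rnorm(n = 1000, mean = 0, sd = 1),
    happiness = education + rnorm(n = 1000, mean = 0, sd = 1)
  )

# Ignoring the confounder wrongly suggests an effect of income on happiness
lm(happiness ~ income, data = confounder_example)

# Adjusting for the confounder removes the spurious relationship
lm(happiness ~ income + education, data = confounder_example)
```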

In Figure 15.3 we again consider that income causes happiness. But, if income also causes children, and children also cause happiness, then we have a situation where it would be tricky to understand the effect of income on happiness.

digraph D {

  node [shape=plaintext, fontname = "helvetica"];
  
  a [label = "Income"];
  b [label = "Happiness"];
  c [label = "Children"];
  
  { rank=same a b};
  
  a->{b, c};
  c->b;
}


Figure 15.3: Children as a mediator between income and happiness

In Figure 15.3, children is called a “mediator” and we would not adjust for it if we were interested in the effect of income on happiness. If we were to adjust for it, then some of what we are attributing to income would be due to children.
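
Continuing with the same kind of made-up simulation, suppose income affects happiness both directly and through children (all numbers invented for illustration). Regressing happiness on income alone recovers the total effect, while also adjusting for the mediator leaves only the direct part.

```r
set.seed(853)

mediator_example <-
  tibble(
    income = rnorm(n = 1000, mean = 0, sd = 1),
    children = income + rnorm(n = 1000, mean = 0, sd = 1),
    happiness = income + children + rnorm(n = 1000, mean = 0, sd = 1)
  )

# The total effect of income, both direct and through children
lm(happiness ~ income, data = mediator_example)

# Adjusting for the mediator leaves only the direct effect
lm(happiness ~ income + children, data = mediator_example)
```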

Finally, in Figure 15.4 we have yet another similar situation, where we think that income causes happiness. But this time both income and happiness also cause exercise. For instance, if you have more money then it may be easier to exercise, but also it may be easier to exercise if you are happier.

digraph D {

  node [shape=plaintext, fontname = "helvetica"];
  
  a [label = "Income"];
  b [label = "Happiness"];
  c [label = "Exercise"];
  
  { rank=same a b};
  
  a->{b c};
  b->c;
}


Figure 15.4: Exercise as a collider affecting the relationship between income and happiness

In this case, exercise is called a “collider” and if we were to condition on it, then we would create a misleading relationship. Income influences exercise, but a person’s happiness also affects this. Exercise is a collider because both the independent and dependent variable of interest influence it.
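
As before, a small made-up simulation can show the danger. Suppose, purely for illustration, that income has no effect on happiness at all, but that both affect exercise. Conditioning on the collider manufactures a relationship between income and happiness that is not really there.

```r
set.seed(853)

collider_example <-
  tibble(
    income = rnorm(n = 1000, mean = 0, sd = 1),
    happiness = rnorm(n = 1000, mean = 0, sd = 1),
    exercise = income + happiness + rnorm(n = 1000, mean = 0, sd = 1)
  )

# Income and happiness are simulated to be unrelated, and we correctly find that
lm(happiness ~ income, data = collider_example)

# Conditioning on the collider manufactures a negative relationship
lm(happiness ~ income + exercise, data = collider_example)
```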

It is important to be clear about this: we must create the DAG ourselves, in the same way that we must put together the model ourselves. There is nothing that will create it for us. This means that we need to think carefully about the situation, because it is one thing to see something in the DAG and then do something about it, but it is another to not even know that it is there. McElreath (2020, 180) describes these as haunted DAGs. DAGs are helpful, but they are just a tool to help us think deeply about our situation.

When we are building models, it can be tempting to include as many independent variables as possible. DAGs show clearly why we need to be more thoughtful. For instance, if a variable is a confounder, then we would want to adjust for it, whereas if a variable was a collider then we would not. We can never know the truth, and we are informed by aspects such as theory, what we are interested in, research design, limitations of the data, or our own limitations as researchers, to name a few. Knowing the limits is as important as reporting the model. Data and models with flaws are still useful, if you acknowledge those flaws. The work of thinking about a situation is never done, and relies on others, which is why we need to make all our work as reproducible as possible.

15.3 Two common paradoxes

15.3.1 Simpson’s paradox

There are two situations where data can trick us that are so common that we will explicitly go through them. These are: 1) Simpson’s paradox, and 2) Berkson’s paradox. It is important to keep these situations in mind, and the use of DAGs can help identify them.

Simpson’s paradox occurs when we estimate some relationship for subsets of our data, but a different relationship when we consider the entire dataset (Simpson 1951). It is a particular case of the ecological fallacy, which is when we try to make claims about individuals, based on their group. For instance, it may be that there is a positive relationship between undergraduate grades and performance in graduate school in two departments when considering each department individually. But if undergraduate grades tended to be higher in one department than the other, while graduate school performance tended to be the opposite, then when we consider both departments together we may find a negative relationship between undergraduate grades and performance in graduate school. We can simulate some data to show this more clearly (Figure 15.5).

library(tidyverse)

set.seed(853)

number_in_each <- 1000

department_one <-
  tibble(
    undergrad = runif(n = number_in_each, min = 0.7, max = 0.9),
    noise = rnorm(n = number_in_each, 0, sd = 0.1),
    grad = undergrad + noise,
    type = "Department 1"
  )

department_two <-
  tibble(
    undergrad = runif(n = number_in_each, min = 0.6, max = 0.8),
    noise = rnorm(n = number_in_each, 0, sd = 0.1),
    grad = undergrad + noise + 0.3,
    type = "Department 2"
  )

both_departments <- rbind(department_one, department_two)

both_departments
# A tibble: 2,000 × 4
   undergrad   noise  grad type        
       <dbl>   <dbl> <dbl> <chr>       
 1     0.772 -0.0566 0.715 Department 1
 2     0.724 -0.0312 0.693 Department 1
 3     0.797  0.0770 0.874 Department 1
 4     0.763 -0.0664 0.697 Department 1
 5     0.707  0.0717 0.779 Department 1
 6     0.781 -0.0165 0.764 Department 1
 7     0.726 -0.104  0.623 Department 1
 8     0.749  0.0527 0.801 Department 1
 9     0.732 -0.0471 0.684 Department 1
10     0.738  0.0552 0.793 Department 1
# … with 1,990 more rows
both_departments |>
  ggplot(aes(x = undergrad, y = grad)) +
  geom_point(aes(color = type), alpha = 0.1) +
  geom_smooth(aes(color = type), method = "lm", formula = "y ~ x") +
  geom_smooth(
    method = "lm",
    formula = "y ~ x",
    color = "black"
  ) +
  labs(
    x = "Undergraduate results",
    y = "Graduate results",
    color = "Department"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")

Figure 15.5: Illustration of simulated data that shows Simpson’s paradox

Simpson’s paradox is often illustrated using real-world data from University of California, Berkeley, on graduate admissions (Bickel, Hammel, and O’Connell 1975). This paper was mentioned in Chapter 5 as having one of the greatest sub-titles ever published. More recently, as mentioned in its documentation, the “penguins” dataset from palmerpenguins (Horst, Presmanes Hill, and Gorman 2020) provides an example of Simpson’s paradox, using real-world data (Figure 15.6).

library(palmerpenguins)

penguins |>
  ggplot(aes(x = body_mass_g, y = bill_depth_mm)) +
  geom_point(aes(color = species), alpha = 0.1) +
  geom_smooth(aes(color = species), method = "lm", formula = "y ~ x") +
  geom_smooth(
    method = "lm",
    formula = "y ~ x",
    color = "black"
  ) +
  labs(
    x = "Body mass (grams)",
    y = "Bill depth (millimeters)",
    color = "Species"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")

Figure 15.6: Illustration of Simpson’s paradox in a dataset of penguin bill depth compared with their body mass

15.3.2 Berkson’s paradox

Berkson’s paradox occurs when we estimate some relationship based on the dataset that we have. But because the dataset is so selected, the relationship is different in a more general dataset (Berkson 1946). For instance, if we have a dataset of professional cyclists then we might find there is no relationship between their VO2 max and their chance of winning a bike race (Coyle et al. 1988; Podlogar, Leo, and Spragg 2022). But if we had a dataset of the general population then we might find a relationship between these two variables. The professional dataset has just been so selected that the relationship disappears; one cannot become a professional cyclist unless one has a good-enough VO2 max, but among professional cyclists everyone has a good-enough VO2 max. Again, we can simulate some data to show this more clearly (Figure 15.7).

set.seed(853)

number_of_pros <- 100

number_of_public <- 1000

professionals <-
  tibble(
    VO2 = runif(n = number_of_pros, min = 0.7, max = 0.9),
    chance_of_winning = runif(n = number_of_pros, min = 0.7, max = 0.9),
    type = "Professionals"
  )

general_public <-
  tibble(
    VO2 = runif(n = number_of_public, min = 0.6, max = 0.8),
    noise = rnorm(n = number_of_public, 0, sd = 0.03),
    chance_of_winning = VO2 + noise + 0.1,
    type = "Public"
  ) |>
  select(-noise)

professionals_and_public <- rbind(professionals, general_public)

professionals_and_public
# A tibble: 1,100 × 3
     VO2 chance_of_winning type         
   <dbl>             <dbl> <chr>        
 1 0.772             0.734 Professionals
 2 0.724             0.773 Professionals
 3 0.797             0.772 Professionals
 4 0.763             0.754 Professionals
 5 0.707             0.843 Professionals
 6 0.781             0.740 Professionals
 7 0.726             0.803 Professionals
 8 0.749             0.750 Professionals
 9 0.732             0.890 Professionals
10 0.738             0.821 Professionals
# … with 1,090 more rows
professionals_and_public |>
  ggplot(aes(x = VO2, y = chance_of_winning)) +
  geom_point(aes(color = type), alpha = 0.1) +
  geom_smooth(aes(color = type), method = "lm", formula = "y ~ x") +
  geom_smooth(method = "lm", formula = "y ~ x", color = "black") +
  labs(
    x = "VO2 max",
    y = "Chance of winning a bike race",
    color = "Type"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")

Figure 15.7: Illustration of simulated data that shows Berkson’s paradox

15.4 Difference in differences

The ideal situation of being able to conduct an experiment is rarely possible. Can we reasonably expect that Netflix would allow us to change prices? And even if they did once, would they let us do it again, and again, and again? Further, rarely can we explicitly create treatment and control groups. Finally, experiments can be expensive or unethical. Instead, we need to make do with what we have. Rather than our counterfactual coming to us through randomization, and hence us knowing that the two are the same but for the treatment, we try to identify groups that were similar but for the treatment, and hence any differences can be attributed to the treatment.

With observational data, sometimes there are differences between our two groups before we treat. Provided those pre-treatment differences satisfy assumptions that essentially amount to the differences being both consistent, and that we expect that consistency to continue in the absence of the treatment—the “parallel trends” assumption—then we can look to any difference in the differences as the effect of the treatment. One of the nice aspects of difference in differences analysis is that we can do it using relatively straightforward methods, for instance those described by Tang (2015): linear regression with a binary variable is enough to get started and do a convincing job.

Consider wanting to know the effect of a new tennis racket on serve speed. One way to test this would be to measure the difference between, say, Roger Federer’s serve speed without the tennis racket and the serve speed of an enthusiastic amateur, let us call them Ville, with the tennis racket. Yes, we would find a difference, but would we know how much to attribute to the tennis racket? Another way would be to consider the difference between Ville’s serve speed without the new tennis racket and Ville’s serve speed with the new tennis racket. But what if serves were just getting faster naturally over time? Instead, we combine the two approaches to look at the difference in the differences.

We begin by measuring Federer’s serve speed and compare it to Ville’s serve speed, both without the new racket. We then measure Federer’s serve speed again, and measure Ville’s serve speed with the new racket. That difference in the differences would then be the estimate of the effect of the new racket. There are a few key assumptions that we need to make for this analysis to be appropriate:

  1. Is there something else that may have affected only Ville, and not Federer that could affect Ville’s serve speed?
  2. Is it likely that Federer and Ville have the same trajectory of serve speed improvement? This is the “parallel trends” assumption, and it dominates many discussions of difference in differences analysis.
  3. Finally, is it likely that the variance of the serve speeds of Federer and Ville is the same?

Despite these requirements, difference in differences is a powerful approach because we do not need the treatment and control group to be the same before the treatment. We just need to have a good idea of how they differed.
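
Before building a fuller simulation, it can help to see the arithmetic on a made-up two-by-two table of average serve speeds (all of the numbers here are hypothetical).

```r
# Hypothetical average serve speeds, in km/h
federer_before <- 190
federer_after <- 195
ville_before <- 120
ville_after <- 140

# Ville improved by 20, while Federer, who did not get the new racket,
# improved by 5, so the difference in the differences is 20 - 5 = 15,
# which is our estimate of the effect of the new racket
(ville_after - ville_before) - (federer_after - federer_before)
```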

15.4.1 Simulated example

To be more specific about the situation, we simulate data. We will simulate a situation in which there is initially a difference of one between the serve speeds of the different people, and then after a new tennis racket, there is a difference of six. We can use a graph to illustrate the situation (Figure 15.8).

set.seed(853)

simulated_diff_in_diff <-
  tibble(
    person = rep(c(1:1000), times = 2),
    time = c(rep(0, times = 1000), rep(1, times = 1000)),
    treatment_group = rep(
      sample(
        x = 0:1,
        size = 1000,
        replace = TRUE
      ),
      times = 2
    )
  ) |>
  mutate(
    treatment_group = as.factor(treatment_group),
    time = as.factor(time)
  )

simulated_diff_in_diff <-
  simulated_diff_in_diff |>
  rowwise() |>
  mutate(
    serve_speed = case_when(
      time == 0 & treatment_group == 0 ~ rnorm(n = 1, mean = 5, sd = 1),
      time == 1 & treatment_group == 0 ~ rnorm(n = 1, mean = 6, sd = 1),
      time == 0 & treatment_group == 1 ~ rnorm(n = 1, mean = 8, sd = 1),
      time == 1 & treatment_group == 1 ~ rnorm(n = 1, mean = 14, sd = 1)
    )
  )

simulated_diff_in_diff
# A tibble: 2,000 × 4
# Rowwise: 
   person time  treatment_group serve_speed
    <int> <fct> <fct>                 <dbl>
 1      1 0     0                      4.43
 2      2 0     1                      6.96
 3      3 0     1                      7.77
 4      4 0     0                      5.31
 5      5 0     0                      4.09
 6      6 0     0                      4.85
 7      7 0     0                      6.43
 8      8 0     0                      5.77
 9      9 0     1                      6.13
10     10 0     1                      7.32
# … with 1,990 more rows
simulated_diff_in_diff |>
  ggplot(aes(
    x = time,
    y = serve_speed,
    color = treatment_group
  )) +
  geom_point(alpha = 0.2) +
  geom_line(aes(group = person), alpha = 0.1) +
  theme_minimal() +
  labs(
    x = "Time period",
    y = "Serve speed",
    color = "Person got a new racket"
  ) +
  scale_color_brewer(palette = "Set1") +
  theme(legend.position = "bottom")

Figure 15.8: Illustration of simulated data that shows a difference before and after getting a new tennis racket

As it is a straightforward example, we can obtain our estimate manually, by looking at the average difference of the differences. When we do that, we find that we estimate the effect of the new tennis racket to be 5.06, which is similar to what we simulated.

average_differences <-
  simulated_diff_in_diff |>
  pivot_wider(
    names_from = time,
    values_from = serve_speed,
    names_prefix = "time_"
  ) |>
  mutate(difference = time_1 - time_0) |>
  group_by(treatment_group) |>
  summarize(average_difference = mean(difference))

average_differences$average_difference[2] - average_differences$average_difference[1]
[1] 5.058414

And we can use linear regression to get the same result. The equation that we are interested in is:

\[Y_{i,t} = \beta_0 + \beta_1\mbox{Treatment}_i + \beta_2\mbox{Time}_t + \beta_3(\mbox{Treatment} \times\mbox{Time})_{i,t} + \epsilon_{i,t}\]

While we should include the separate aspects as well, it is the estimate of the interaction that we are interested in. In this case it is \(\beta_3\). And we find that our estimated effect is 5.06 (Table 15.1).

library(modelsummary)
library(rstanarm)

diff_in_diff_example_regression <-
  stan_glm(
    formula = serve_speed ~ treatment_group * time,
    data = simulated_diff_in_diff,
    family = gaussian(),
    prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_intercept = normal(0, 2.5, autoscale = TRUE),
    prior_aux = exponential(rate = 1, autoscale = TRUE),
    seed = 853
  )

saveRDS(
  diff_in_diff_example_regression,
  file = "diff_in_diff_example_regression.rds"
)
diff_in_diff_example_regression <-
  readRDS(file = "diff_in_diff_example_regression.rds")
library(modelsummary)

modelsummary(
  diff_in_diff_example_regression,
  statistic = "conf.int"
)
Table 15.1: Illustration of simulated data that shows a difference before and after getting a new tennis racket
|                          | Model 1        |
|:-------------------------|:---------------|
| (Intercept)              | 4.971          |
|                          | [4.887, 5.054] |
| treatment_group1         | 3.035          |
|                          | [2.910, 3.157] |
| time1                    | 1.006          |
|                          | [0.889, 1.125] |
| treatment_group1 × time1 | 5.057          |
|                          | [4.885, 5.232] |
| Num.Obs.                 | 2000           |
| R2                       | 0.927          |
| R2 Adj.                  | 0.927          |
| Log.Lik.                 | −2802.166      |
| ELPD                     | −2806.3        |
| ELPD s.e.                | 32.1           |
| LOOIC                    | 5612.5         |
| LOOIC s.e.               | 64.2           |
| WAIC                     | 5612.5         |
| RMSE                     | 0.98           |
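
As the model is so simple, we could also check the interaction estimate with ordinary least squares, which should give essentially the same answer as the Bayesian model.

```r
lm(
  serve_speed ~ treatment_group * time,
  data = simulated_diff_in_diff
) |>
  summary()
```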

15.4.2 Assumptions

If we want to use difference in differences, then we need to satisfy the assumptions. There were three that were touched on earlier, but here we will focus on one: the “parallel trends” assumption. The parallel trends assumption haunts everything to do with difference in differences analysis because we can never prove it, we can just be convinced of it, and try to convince others.

To see why we can never prove it, consider an example in which we want to know the effect of a new stadium on a professional sports team’s wins/losses. To do this we consider two teams: the Golden State Warriors and the Toronto Raptors. The Warriors changed stadiums at the start of the 2019-20 season, while the Raptors did not, so we will consider four time periods: the 2016-17 season, 2017-18 season, 2018-19 season, and finally the 2019-20 season, the one after the Warriors moved. The Raptors here act as our counterfactual. This means that we assume the relationship between the Warriors and the Raptors, in the absence of a new stadium, would have continued to change in a consistent way. But the fundamental problem of causal inference means that we can never know that for certain. We must present sufficient evidence to assuage any concerns that a reader may have.

There are four main threats to validity when we use difference in differences, and we need to address all of them (Cunningham 2021, 272–77):

  1. Non-parallel trends. The treatment and control groups may be on different trajectories to begin with, and so it can be difficult to convincingly argue for parallel trends. In this case, maybe try to find another factor to consider in your model that may adjust for some of that. This may require difference in difference in differences (in the earlier example, we could perhaps add the San Francisco 49ers as they are in the same broad geographic area as the Warriors). Or maybe re-think the analysis to see if we can construct a different control group. Adding additional earlier time periods may help, but may introduce more issues, which we touch on in the third point.
  2. Compositional differences. This is a concern when working with repeated cross-sections. What if the composition of those cross-sections change? For instance, if we are working at an app that is rapidly growing, and we want to look at the effect of some change. In our initial cross-section, we may have mostly young people, but in a subsequent cross-section, we may have more older people as the demographics of the app usage change. Hence our results may just be an age-effect, not an effect of the change that we are interested in.
  3. Long-term effects compared with reliability. As we discussed in Chapter 10, there is a trade-off between the length of the analysis that we run. As we run the analysis for longer there is more opportunity for other factors to affect the results. There is also increased chance for someone who was not treated to be treated. But, on the other hand, it can be difficult to convincingly argue that short-term results will continue in the long-term.
  4. Functional form dependence. This is less of an issue when the outcomes are similar, but if they are different then functional form may be responsible for some aspects of the results.

15.4.3 French newspaper prices between 1960 and 1974

In this case study we introduce Angelucci and Cagé (2019), and replicate its main findings.

The business model of newspapers was challenged by the internet and many local newspapers have closed. And this issue is not new. When television was introduced, there were similar concerns. Angelucci and Cagé (2019) use the introduction of television advertising in France, announced in 1967, to examine the effect of decreased advertising revenue on newspapers. They create a dataset of French newspapers from 1960 to 1974 and then use difference in differences to examine the effect of the reduction in advertising revenues on newspapers’ content and prices. The change that they focus on is the introduction of television advertising, which they argue affected national newspapers more than local newspapers. They find that this change results in both less journalism-content in the newspapers and lower newspaper prices. Focusing on this change, and analyzing it using difference in differences, is important because it allows us to disentangle a few competing effects. For instance, did newspapers become redundant because they could no longer charge high prices for their advertisements, or because consumers preferred to get their news from the television?

We can get free access to the data that underpin Angelucci and Cagé (2019) after registration. The dataset is in the Stata data format, “.dta”, which we can read with read_dta() from haven (Wickham and Miller 2020). The file that we are interested in is “Angelucci_Cage_AEJMicro_dataset.dta”, which is in the “dta” folder.

library(haven)

newspapers <- read_dta("Angelucci_Cage_AEJMicro_dataset.dta")

newspapers
# A tibble: 1,196 × 52
    year id_news local national after…¹   Had po_cst ps_cst etota…² ra_cst  ra_s
   <dbl>   <dbl> <dbl>    <dbl>   <dbl> <dbl>  <dbl>  <dbl>   <dbl>  <dbl> <dbl>
 1  1960       1     1        0       0     0   2.60   2.29  1.09e8 5.29e7  45.9
 2  1961       1     1        0       0     0   2.51   2.20  1.18e8 5.66e7  47.9
 3  1962       1     1        0       0     0   2.39   2.13  1.09e8 6.48e7  52.4
 4  1963       1     1        0       0     0   2.74   2.43  1.61e8 7.06e7  50.2
 5  1964       1     1        0       0     0   2.65   2.35  1.74e8 7.50e7  52.5
 6  1965       1     1        0       0     0   2.59   2.29  1.77e8 7.44e7  52.2
 7  1966       1     1        0       0     0   2.52   2.31  2.11e8 8.14e7  49.6
 8  1967       1     1        0       0     0   3.27   2.88  2.13e8 8.03e7  43.2
 9  1968       1     1        0       0     0   3.91   3.45  2.10e8 8.72e7  40.2
10  1969       1     1        0       0     0   3.67   3.28  2.39e8 1.03e8  45.7
# … with 1,186 more rows, 41 more variables: rs_cst <dbl>, rtotal_cst <dbl>,
#   profit_cst <dbl>, nb_journ <dbl>, qs_s <dbl>, qtotal <dbl>, pages <dbl>,
#   ads_q <dbl>, ads_s <dbl>, news_hole <dbl>, share_Hard <dbl>,
#   ads_p4_cst <dbl>, R_sh_edu_primaire_ipo <dbl>,
#   R_sh_edu_secondaire_ipo <dbl>, R_sh_edu_no_ipo <dbl>,
#   R_sh_pcs_agri_ipo <dbl>, R_sh_pcs_patron_ipo <dbl>,
#   R_sh_pcs_cadre_ipo <dbl>, R_sh_pcs_employes_ipo <dbl>, …

There are 1,196 observations in the dataset and 52 variables. Angelucci and Cagé (2019) are interested in the 1960-1974 time-period, which has around 100 newspapers. There are 14 national newspapers at the beginning of the period and 12 at the end. The key period is 1967, when the French government announced it would allow advertising on television. Angelucci and Cagé (2019) argue that national newspapers were affected by this change, but local newspapers were not. So, the national newspapers are the treatment group and the local newspapers are the control group.
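
As a quick check of those claims, we can count how many national newspapers appear in the dataset in each year; this is a sketch of ours, not part of the original replication code.

```r
# Number of national newspapers observed each year
newspapers |>
  filter(national == 1) |>
  count(year)
```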

We focus just on the headline difference in differences result and construct summary statistics (Table 15.2).

newspapers <-
  newspapers |>
  select(
    year,
    id_news,
    after_national,
    local,
    national, # Diff in diff variables
    ra_cst,
    ads_p4_cst,
    ads_s, # Advertising side dependents
    ps_cst,
    po_cst,
    qtotal,
    qs_s,
    rs_cst
  ) |> # Reader side dependents
  mutate(ra_cst_div_qtotal = ra_cst / qtotal) |>
  mutate(across(c(id_news, after_national, local, national), as.factor)) |>
  mutate(year = as.integer(year))

newspapers
# A tibble: 1,196 × 14
    year id_news after…¹ local natio…² ra_cst ads_p…³ ads_s ps_cst po_cst qtotal
   <int> <fct>   <fct>   <fct> <fct>    <dbl>   <dbl> <dbl>  <dbl>  <dbl>  <dbl>
 1  1960 1       0       1     0       5.29e7    NA    30.6   2.29   2.60 9.45e4
 2  1961 1       0       1     0       5.66e7    NA    38.4   2.20   2.51 9.63e4
 3  1962 1       0       1     0       6.48e7    24.0  31.6   2.13   2.39 9.73e4
 4  1963 1       0       1     0       7.06e7    50.3  27.2   2.43   2.74 1.01e5
 5  1964 1       0       1     0       7.50e7    48.6  31.1   2.35   2.65 1.02e5
 6  1965 1       0       1     0       7.44e7    47.5  47.8   2.29   2.59 1.05e5
 7  1966 1       0       1     0       8.14e7    46.2  29.7   2.31   2.52 1.26e5
 8  1967 1       0       1     0       8.03e7    87.9  49.4   2.88   3.27 1.29e5
 9  1968 1       0       1     0       8.72e7    84.1  26.9   3.45   3.91 1.32e5
10  1969 1       0       1     0       1.03e8    79.0  31.7   3.28   3.67 1.32e5
# … with 1,186 more rows, 3 more variables: qs_s <dbl>, rs_cst <dbl>,
#   ra_cst_div_qtotal <dbl>, and abbreviated variable names ¹​after_national,
#   ²​national, ³​ads_p4_cst
library(modelsummary)

datasummary_skim(newspapers)
Table 15.2: Summary statistics for French newspapers dataset (1960-1974)
|                   | Unique (#) | Missing (%) | Mean       | SD          | Min      | Median     | Max         |
|:------------------|-----------:|------------:|-----------:|------------:|---------:|-----------:|------------:|
| year              | 15         | 0           | 1967.0     | 4.3         | 1960.0   | 1967.0     | 1974.0      |
| ra_cst            | 1053       | 12          | 91531796.9 | 137207312.4 | 549717.2 | 35994710.0 | 864369088.0 |
| ads_p4_cst        | 558        | 32          | 86.4       | 75.3        | 3.8      | 69.0       | 327.2       |
| ads_s             | 988        | 13          | 18.7       | 9.7         | 1.6      | 16.9       | 59.6        |
| ps_cst            | 665        | 13          | 2.8        | 0.7         | 0.7      | 2.8        | 5.6         |
| po_cst            | 146        | 11          | 3.2        | 0.9         | 0.8      | 3.3        | 9.3         |
| qtotal            | 1052       | 11          | 130817.5   | 172954.3    | 1480.0   | 56775.2    | 1143676.0   |
| qs_s              | 914        | 10          | 27.2       | 22.7        | 0.7      | 22.5       | 100.1       |
| rs_cst            | 1047       | 13          | 97666503.6 | 125257120.3 | 255760.1 | 40736368.0 | 750715008.0 |
| ra_cst_div_qtotal | 1049       | 12          | 661.2      | 352.7       | 61.3     | 596.6      | 3048.4      |

We are interested in what happened from 1967 onward, especially in terms of advertising revenue, and whether that was different for national, compared with local newspapers (Figure 15.9). We use scales to adjust the y axis (Wickham and Seidel 2020).

library(scales)

newspapers |>
  mutate(type = if_else(local == 1, "Local", "National")) |>
  ggplot(aes(x = year, y = ra_cst)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(
    labels = dollar_format(
      prefix = "$",
      suffix = "M",
      scale = 0.000001
    )
  ) +
  labs(
    x = "Year",
    y = "Advertising revenue"
  ) +
  facet_wrap(
    vars(type),
    nrow = 2
  ) +
  theme_minimal() +
  geom_vline(xintercept = 1966.5, linetype = "dashed")

Figure 15.9: Revenue of French newspapers (1960-1974), by whether they were local or national

The model that we are interested in estimating is:

\[\mbox{ln}(y_{n,t}) = \beta_0 + \beta_1(\mbox{National binary}\times\mbox{1967 onward binary}) + \lambda_n + \gamma_y + \epsilon.\]

It is the \(\beta_1\) coefficient that we are especially interested in. We use \(\lambda_n\) as a fixed effect for each newspaper, and \(\gamma_y\) as a fixed effect for each year. We estimate the models using stan_glm().

library(modelsummary)
library(rstanarm)

ad_revenue <-
  stan_glm(
    formula = log(ra_cst) ~ after_national + id_news + year,
    data = newspapers,
    family = gaussian(),
    prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_intercept = normal(0, 2.5, autoscale = TRUE),
    prior_aux = exponential(rate = 1, autoscale = TRUE),
    seed = 853
  )

saveRDS(
  ad_revenue,
  file = "ad_revenue.rds"
)

ad_revenue_div_circulation <-
  stan_glm(
    formula = log(ra_cst_div_qtotal) ~ after_national + id_news + year,
    data = newspapers,
    family = gaussian(),
    prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_intercept = normal(0, 2.5, autoscale = TRUE),
    prior_aux = exponential(rate = 1, autoscale = TRUE),
    seed = 853
  )

saveRDS(
  ad_revenue_div_circulation,
  file = "ad_revenue_div_circulation.rds"
)

# Consumer side
subscription_price <-
  stan_glm(
    formula = log(ps_cst) ~ after_national + id_news + year,
    data = newspapers,
    family = gaussian(),
    prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_intercept = normal(0, 2.5, autoscale = TRUE),
    prior_aux = exponential(rate = 1, autoscale = TRUE),
    seed = 853
  )

saveRDS(
  subscription_price,
  file = "subscription_price.rds"
)

unit_price <-
  stan_glm(
    formula = log(po_cst) ~ after_national + id_news + year,
    data = newspapers,
    family = gaussian(),
    prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_intercept = normal(0, 2.5, autoscale = TRUE),
    prior_aux = exponential(rate = 1, autoscale = TRUE),
    seed = 853
  )

saveRDS(
  unit_price,
  file = "unit_price.rds"
)
ad_revenue <-
  readRDS(file = "ad_revenue.rds")

ad_revenue_div_circulation <-
  readRDS(file = "ad_revenue_div_circulation")

subscription_price <-
  readRDS(file = "subscription_price.rds")

unit_price <-
  readRDS(file = "unit_price.rds")

Looking at the advertising-side variables (the first two columns of Table 15.3) we find consistently negative coefficients. And looking at the consumer-side variables (the last two columns of Table 15.3) we find a negative coefficient for subscription price and a positive coefficient for unit price.

selected_variables <-
  c(
    "year" = "Year",
    "after_national1" = "After change"
  )

modelsummary(
  models = list(
    "Ad revenue" = ad_revenue,
    "Ad revenue over circulation" = ad_revenue_div_circulation,
    "Subscription price" = subscription_price,
    "Unit price" = unit_price
  ),
  fmt = 2,
  coef_map = selected_variables,
  statistic = "conf.int"
)
Table 15.3: Effect of changed television advertising laws on revenue of French newspapers (1960-1974)
|              | Ad revenue     | Ad revenue over circulation | Subscription price | Unit price   |
|:-------------|:---------------|:----------------------------|:-------------------|:-------------|
| Year         | 0.05           | 0.04                        | 0.05               | 0.05         |
|              | [0.04, 0.05]   | [0.04, 0.04]                | [0.04, 0.05]       | [0.04, 0.05] |
| After change | −0.23          | −0.15                       | −0.04              | 0.06         |
|              | [−0.29, −0.17] | [−0.21, −0.10]              | [−0.08, 0.00]      | [0.02, 0.10] |
| Num.Obs.     | 1052           | 1048                        | 1044               | 1063         |
| R2           | 0.984          | 0.896                       | 0.868              | 0.859        |
| R2 Adj.      | 0.983          | 0.886                       | 0.852              | 0.844        |
| Log.Lik.     | 336.539        | 441.471                     | 875.559            | 901.155      |
| ELPD         | 257.4          | 362.3                       | 793.5              | 821.7        |
| ELPD s.e.    | 34.4           | 45.6                        | 24.3               | 23.0         |
| LOOIC        | −514.8         | −724.6                      | −1586.9            | −1643.3      |
| LOOIC s.e.   | 68.9           | 91.2                        | 48.6               | 46.0         |
| WAIC         | −515.9         | −725.5                      | −1588.9            | −1644.6      |
| RMSE         | 0.17           | 0.16                        | 0.10               | 0.10         |

In general, we are able to replicate the main results of Angelucci and Cagé (2019) and find that in many cases there appears to be a difference from 1967 onward. Angelucci and Cagé (2019, 353–58) also include an excellent example of the discussion of interpretation, external validity, and robustness that is required for difference in differences models.

15.5 Propensity score matching

Difference in differences is a powerful analysis framework. But it can be tough to identify appropriate treatment and control groups. Alexander and Ward (2018) compare migrant brothers, where one brother had most of their education in a different country, and the other brother had most of their education in the US. Given the data that are available, this match provides a reasonable treatment and control group. But other matches could have given different results, for instance friends or cousins.

We can only match based on observable variables, for instance, age-group or education. Suppose that at two different times we compare smoking rates in 18-year-olds in one city with smoking rates in 18-year-olds in another city. This would be a coarse match because we know that there are many differences between 18-year-olds, even in terms of the variables that we commonly observe, say gender and education. One way to deal with this would be to create sub-groups: 18-year-old males with a high school education, etc. But then the sample sizes quickly become small. We also have the issue of how to deal with continuous variables. And is an 18-year-old really that different to a 19-year-old? Why not also compare with them?

One way to proceed is to consider a nearest neighbor approach, but this tends to have a limited ability to convey uncertainty. There can also be an issue when there are many variables, because we end up having to match in a high-dimensional space. This leads to propensity score matching. Here we will explain the process of propensity score matching, but it is not something that should be widely used (King and Nielsen 2019), and we will go through why that is the case.

Propensity score matching involves assigning some probability to each observation. We construct that probability based on the observation’s values for the independent variables, before treatment. That probability is our best guess at the probability of the observation being treated, regardless of whether it was treated or not. For instance, if 18-year-old males were treated but 19-year-old males were not, then as there is not much difference between 18-year-old males and 19-year-old males our assigned probability would be similar. We can then compare the outcomes of observations with similar propensity scores.

One advantage of propensity score matching is that it allows us to easily consider many independent variables at once, and it can be constructed using logistic regression. To be more specific we can simulate some data. We will pretend that we work for a large online retailer. We are going to treat some individuals with free shipping to see what happens to their average purchase.

set.seed(853)

sample_size <- 10000

purchase_data <-
  tibble(
    unique_person_id = c(1:sample_size),
    age = runif(
      n = sample_size,
      min = 18,
      max = 100
    ),
    city = sample(
      x = c("Toronto", "Montreal", "Calgary"),
      size = sample_size,
      replace = TRUE
    ),
    gender = sample(
      x = c("Female", "Male", "Other/decline"),
      size = sample_size,
      replace = TRUE,
      prob = c(0.49, 0.47, 0.02)
    ),
    income = rlnorm(
      n = sample_size,
      meanlog = 0.5,
      sdlog = 1
    )
  )

purchase_data
# A tibble: 10,000 × 5
   unique_person_id   age city     gender income
              <int> <dbl> <chr>    <chr>   <dbl>
 1                1  47.5 Calgary  Female  1.72 
 2                2  27.8 Montreal Male    1.54 
 3                3  57.7 Toronto  Female  3.16 
 4                4  43.9 Toronto  Male    0.636
 5                5  21.1 Toronto  Female  1.43 
 6                6  51.1 Calgary  Male    1.18 
 7                7  28.7 Toronto  Female  1.49 
 8                8  37.9 Toronto  Female  0.414
 9                9  31.0 Calgary  Male    0.384
10               10  33.5 Montreal Female  1.11 
# … with 9,990 more rows

Then we need to add some probability of being treated with free shipping. We will say that it depends on our variables, and that being younger, higher-income, male, and Toronto-based all make this treatment slightly more likely.

purchase_data <-
  purchase_data |>
  mutate(
    age_num = case_when(
      age < 30 ~ 3,
      age < 50 ~ 2,
      age < 70 ~ 1,
      TRUE ~ 0
    ),
    city_num = case_when(
      city == "Toronto" ~ 3,
      city == "Montreal" ~ 2,
      city == "Calgary" ~ 1,
      TRUE ~ 0
    ),
    gender_num = case_when(
      gender == "Male" ~ 3,
      gender == "Female" ~ 2,
      gender == "Other/decline" ~ 1,
      TRUE ~ 0
    ),
    income_num = case_when(
      income > 3 ~ 3,
      income > 2 ~ 2,
      income > 1 ~ 1,
      TRUE ~ 0
    )
  ) |>
  rowwise() |>
  mutate(
    sum_num = sum(age_num, city_num, gender_num, income_num),
    softmax_prob = exp(sum_num) / exp(12),
    free_shipping = sample(
      x = c(0:1),
      size = 1,
      replace = TRUE,
      prob = c(1 - softmax_prob, softmax_prob)
    )
  ) |>
  ungroup()

purchase_data <-
  purchase_data |>
  select(
    -age_num,
    -city_num,
    -gender_num,
    -income_num,
    -sum_num,
    -softmax_prob
  )

Finally, we need to have some measure of a person’s average spend. We want those with free shipping to be slightly higher than those without.

purchase_data <-
  purchase_data |>
  mutate(mean_spend = if_else(free_shipping == 1, 60, 50)) |>
  rowwise() |>
  mutate(average_spend = rnorm(1, mean_spend, sd = 5)) |>
  ungroup() |>
  select(-mean_spend) |>
  mutate(across(c(city, gender, free_shipping), as.factor))

purchase_data
# A tibble: 10,000 × 7
   unique_person_id   age city     gender income free_shipping average_spend
              <int> <dbl> <fct>    <fct>   <dbl> <fct>                 <dbl>
 1                1  47.5 Calgary  Female  1.72  0                      41.1
 2                2  27.8 Montreal Male    1.54  0                      55.7
 3                3  57.7 Toronto  Female  3.16  0                      56.5
 4                4  43.9 Toronto  Male    0.636 0                      50.5
 5                5  21.1 Toronto  Female  1.43  0                      44.7
 6                6  51.1 Calgary  Male    1.18  0                      48.8
 7                7  28.7 Toronto  Female  1.49  0                      52.8
 8                8  37.9 Toronto  Female  0.414 0                      52.4
 9                9  31.0 Calgary  Male    0.384 0                      47.6
10               10  33.5 Montreal Female  1.11  0                      49.2
# … with 9,990 more rows

We use matchit() from MatchIt (Ho et al. 2011) to implement logistic regression and create matched groups. We then use match.data() to get the data of matches, containing all 371 people who were actually treated with free shipping and, for each of them, the untreated person who is considered as similar to them, based on propensity score, as possible. The result is a dataset of 742 observations.

library(MatchIt)

matched_groups <- matchit(
  free_shipping ~ age + city + gender + income,
  data = purchase_data,
  method = "nearest",
  distance = "glm"
)

matched_groups
A matchit object
 - method: 1:1 nearest neighbor matching without replacement
 - distance: Propensity score
             - estimated with logistic regression
 - number of obs.: 10000 (original), 742 (matched)
 - target estimand: ATT
 - covariates: age, city, gender, income
matched_dataset <- match.data(matched_groups)

matched_dataset
# A tibble: 742 × 10
   unique_pe…¹   age city  gender income free_…² avera…³ dista…⁴ weights subcl…⁵
         <int> <dbl> <fct> <fct>   <dbl> <fct>     <dbl>   <dbl>   <dbl> <fct>  
 1           5  21.1 Toro… Female  1.43  0          44.7  0.141        1 278    
 2          20  30.0 Mont… Male    8.65  0          49.0  0.129        1 162    
 3          22  22.8 Toro… Male    0.898 0          50.1  0.282        1 334    
 4          38  41.3 Toro… Female  6.01  1          61.5  0.0779       1 120    
 5          43  24.7 Toro… Male    1.59  1          59.6  0.274        1 142    
 6          76  56.4 Toro… Male   15.0   0          51.8  0.199        1 290    
 7         102  48.1 Toro… Male    3.48  1          59.8  0.111        1 1      
 8         105  76.7 Toro… Male    2.84  0          45.1  0.0231       1 200    
 9         118  26.7 Toro… Female  0.315 0          56.4  0.0971       1 98     
10         143  36.3 Toro… Male   10.6   0          49.4  0.331        1 286    
# … with 732 more rows, and abbreviated variable names ¹​unique_person_id,
#   ²​free_shipping, ³​average_spend, ⁴​distance, ⁵​subclass
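
The distance that matchit() reports is the propensity score, which is just the fitted probability from a logistic regression of treatment on the covariates. As a check of our understanding, we can compute it directly; this is a sketch rather than part of the matching workflow.

```r
# Estimate each person's probability of being treated with free shipping
propensity_model <-
  glm(
    free_shipping ~ age + city + gender + income,
    data = purchase_data,
    family = binomial()
  )

# Attach the fitted probability, which is the propensity score
purchase_data |>
  mutate(propensity_score = predict(propensity_model, type = "response")) |>
  select(unique_person_id, free_shipping, propensity_score)
```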

Finally, we can estimate the effect of being treated on average spend using linear regression. We are particularly interested in the coefficient associated with the treatment variable, in this case free shipping.

propensity_score_regression <- lm(
  average_spend ~ age + city + gender + income + free_shipping,
  data = matched_dataset
)

modelsummary(propensity_score_regression)
|                     | Model 1   |
|:--------------------|:----------|
| (Intercept)         | 49.567    |
|                     | (0.817)   |
| age                 | 0.007     |
|                     | (0.011)   |
| cityMontreal        | 0.128     |
|                     | (0.736)   |
| cityToronto         | 0.586     |
|                     | (0.627)   |
| genderMale          | −1.100    |
|                     | (0.425)   |
| genderOther/decline | −1.999    |
|                     | (2.635)   |
| income              | 0.019     |
|                     | (0.027)   |
| free_shipping1      | 10.606    |
|                     | (0.383)   |
| Num.Obs.            | 742       |
| R2                  | 0.517     |
| R2 Adj.             | 0.513     |
| AIC                 | 4560.7    |
| BIC                 | 4602.1    |
| Log.Lik.            | −2271.326 |
| F                   | 112.455   |
| RMSE                | 5.17      |

We cover propensity score matching because it is widely used. But there are many issues with propensity score matching that mean that humility is needed (King and Nielsen 2019). These include:

  1. Unobservables. Propensity score matching cannot match on unobserved variables. This may be fine in a classroom setting, but in more realistic settings it will likely cause issues. It is very difficult to understand why individuals that appear to be so similar, would have received different treatments, unless there is something unobserved that causes the difference. As propensity score matching cannot account for these, it is difficult to know which features are actually being brought together.
  2. Modelling. The results of propensity score matching tend to be specific to the model that is used. As there is considerable flexibility as to which model is used, this enables researchers to pick through matches to find one that suits. Additionally, because the two regression steps (the matching and the analysis) are conducted separately, there is no propagation of uncertainty.

The fundamental problem of unobservables can never be shown to be inconsequential because that would require the unobserved data. Those who want to use propensity score matching, and other matching methods, need to be able to argue convincingly that it is appropriate. McKenzie (2021) presents a few cases where this is possible, for instance, when there are capacity limits. As is the common theme of this book, such cases will require focusing, with extreme zeal, on the data and a deep understanding of the situation that produced it.

15.6 Regression discontinuity design

Regression discontinuity design (RDD) was established by Thistlethwaite and Campbell (1960) and is a popular way to get causality when there is a continuous variable with cut-offs that determine treatment. Is there a difference between a student who gets 79 per cent and a student who gets 80 per cent? Probably not much, but one may get an A-, while the other may get a B+, and seeing that on a transcript could affect who gets a job, which could affect income. In this case the percentage is a “forcing variable” and the cut-off for an A- is a “threshold”. As the treatment is determined by the forcing variable, we need to control for that variable. And these seemingly arbitrary cut-offs can be seen all the time, hence there has been a great deal of work using RDD.

There is sometimes slightly different terminology used when it comes to RDD. For instance, Cunningham (2021) refers to the forcing function as a running variable. The exact terminology that is used does not matter provided we use it consistently.

15.6.1 Simulated example

To be more specific about the situation, we simulate data. We will consider the relationship between income and grades, and simulate there to be a change if a student gets at least 80 (Figure 15.10).

library(tidyverse)

set.seed(853)

number_of_observation <- 1000

rdd_example_data <- tibble(
  person = c(1:number_of_observation),
  mark = runif(number_of_observation, min = 78, max = 82),
  income = rnorm(number_of_observation, 10, 1)
)

## Make income more likely to be higher if the mark is at least 80
rdd_example_data <-
  rdd_example_data |>
  mutate(
    noise = rnorm(n = number_of_observation, mean = 2, sd = 1),
    income = if_else(mark >= 80, income + noise, income)
  )

rdd_example_data
# A tibble: 1,000 × 4
   person  mark income noise
    <int> <dbl>  <dbl> <dbl>
 1      1  79.4   9.43 1.87 
 2      2  78.5   9.69 2.26 
 3      3  79.9  10.8  1.14 
 4      4  79.3   9.34 2.50 
 5      5  78.1  10.7  2.21 
 6      6  79.6   9.83 2.47 
 7      7  78.5   8.96 4.22 
 8      8  79.0  10.5  3.11 
 9      9  78.6   9.53 0.671
10     10  78.8  10.6  2.46 
# … with 990 more rows
rdd_example_data |>
  ggplot(aes(
    x = mark,
    y = income
  )) +
  geom_point(alpha = 0.2) +
  geom_smooth(
    data = rdd_example_data |> filter(mark < 80),
    method = "lm",
    color = "black",
    formula = "y ~ x"
  ) +
  geom_smooth(
    data = rdd_example_data |> filter(mark >= 80),
    method = "lm",
    color = "black",
    formula = "y ~ x"
  ) +
  theme_minimal() +
  labs(
    x = "Mark",
    y = "Income ($)"
  )

Figure 15.10: Illustration of simulated data that shows an effect on income from getting a mark that is 80, compared with 79

We can use a binary variable with linear regression to estimate the effect. We expect the coefficient to be around two, which is what we simulated, and what we find (Table 15.4).

library(modelsummary)
library(rstanarm)

rdd_example_data <-
  rdd_example_data |>
  mutate(mark_80_and_over = if_else(mark < 80, 0, 1))

rdd_example <-
  stan_glm(
    formula = income ~ mark + mark_80_and_over,
    data = rdd_example_data,
    family = gaussian(),
    prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_intercept = normal(0, 2.5, autoscale = TRUE),
    prior_aux = exponential(rate = 1, autoscale = TRUE),
    seed = 853
  )

saveRDS(
  rdd_example,
  file = "rdd_example.rds"
)
rdd_example <-
  readRDS(file = "rdd_example.rds")
modelsummary(
  models = rdd_example,
  fmt = 2,
  statistic = "conf.int"
)
Table 15.4: Example of regression discontinuity with simulated data
|                  | Model 1        |
|:-----------------|:---------------|
| (Intercept)      | 5.22           |
|                  | [−5.32, 15.65] |
| mark             | 0.06           |
|                  | [−0.07, 0.19]  |
| mark_80_and_over | 1.89           |
|                  | [1.58, 2.19]   |
| Num.Obs.         | 1000           |
| R2               | 0.417          |
| R2 Adj.          | 0.415          |
| Log.Lik.         | −1591.847      |
| ELPD             | −1595.1        |
| ELPD s.e.        | 25.4           |
| LOOIC            | 3190.3         |
| LOOIC s.e.       | 50.9           |
| WAIC             | 3190.3         |
| RMSE             | 1.19           |

There are various caveats to this estimate that we will discuss, but the essentials of RDD are here. Given an appropriate set-up, and model, an RDD can compare favorably to randomized trials (Bloom, Bell, and Reiman 2020).

We could also implement RDD using rdrobust (Calonico et al. 2021). The advantage of this approach is that many common extensions are easily available.

library(rdrobust)

rdrobust(
  y = rdd_example_data$income,
  x = rdd_example_data$mark,
  c = 80,
  h = 2,
  all = TRUE
) |>
  summary()
Sharp RD estimates using local polynomial regression.

Number of Obs.                 1000
BW type                      Manual
Kernel                   Triangular
VCE method                       NN

Number of Obs.                  497          503
Eff. Number of Obs.             497          503
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   2.000        2.000
BW bias (b)                   2.000        2.000
rho (h/b)                     1.000        1.000
Unique Obs.                     497          503

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional     1.913     0.161    11.876     0.000     [1.597 , 2.229]     
Bias-Corrected     1.966     0.161    12.207     0.000     [1.650 , 2.282]     
        Robust     1.966     0.232     8.461     0.000     [1.511 , 2.422]     
=============================================================================

15.6.2 Assumptions

The key assumptions of RDD are (Cunningham 2021, 163):

  1. The cut-off is specific, fixed, and known to all.
  2. The forcing function is continuous.

The first assumption is largely about being unable to manipulate the cut-off, and ensures that the cut-off has meaning. The second assumption enables us to be confident that people on either side of the threshold are similar, apart from just happening to fall on either side of the threshold.

When we discussed randomized control trials and A/B testing in Chapter 10 the randomized assignment of the treatment meant that the control and treatment groups were the same, but for the treatment. Then we moved to difference in differences, and we assumed that there was a common trend between the treated and control groups. We allowed that the groups could be different, but that we could “difference out” their differences. Finally, we considered matching, and we said that even if the control and treatment groups seemed quite different, we were able to match, to some extent, those who were treated with a group that was similar to them in all ways, apart from the fact that they were not treated.

In regression discontinuity we consider a slightly different setting. The two groups are completely different in terms of the forcing variable. They are on either side of the threshold. There is no overlap at all. But we know the threshold and believe that those on either side are essentially matched. Let us consider the 2019 NBA Eastern Conference Semifinals, between the Toronto Raptors and the Philadelphia 76ers. Game 1: Raptors win 108-95; Game 2: 76ers win 94-89; Game 3: 76ers win 116-95; Game 4: Raptors win 101-96; Game 5: Raptors win 125-89; Game 6: 76ers win 112-101; and finally, Game 7: Raptors win 92-90, because of a ball that went in after bouncing on the rim four times. Was there really that much difference between the teams?

The continuity assumption is important, but we cannot test this as it is based on a counterfactual. Instead, we need to convince people of it. Ways to do this include:

  • Using a test/train set-up.
  • Trying different specifications. We are especially concerned if results do not broadly persist with just linear or quadratic functions.
  • Considering different subsets of the data.
  • Considering different windows (see the sketch after this list).
  • Being clear about uncertainty intervals, especially in graphs.
  • Discussing and assuaging concerns about the possibility of omitted variables.
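
For instance, a quick way to consider a different window with the simulated marks data is to re-estimate the model using only observations close to the threshold; here is a sketch using lm() rather than the Bayesian model from above.

```r
# Keep only marks within half a point of the threshold
narrow_window <-
  rdd_example_data |>
  filter(mark >= 79.5, mark <= 80.5)

lm(income ~ mark + mark_80_and_over, data = narrow_window)
```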

The threshold is also important. For instance, is there an actual shift or is there a non-linear relationship?

There are a variety of weaknesses of RDD including:

  • External validity may be difficult. For instance, when we think about the A-/B+ example, it is hard to see those results generalizing to, say, B-/C+ students.
  • The important responses are those that are close to the cut-off. This means that even if we have many A and B students, they do not help much. Hence, we need a lot of data or we may have concerns about our ability to support our claims (Green et al. 2009).
  • As the researcher, we have a lot of freedom to implement different options. This means that open science best practice becomes vital.

To this point we have considered “sharp” RDD. That is, the threshold is strict. But, in reality, the boundary is often a little less strict. For instance, consider the drinking age. There is a legal drinking age, say 18. If we looked at the proportion of people who have drunk alcohol, it likely increases in the years leading up to that age, rather than jumping from zero at the threshold.

In a sharp RDD setting, if we know the value of the forcing function then we know the outcome. For instance, if a student gets a mark of 80 then we know that they got an A-, but if they got a mark of 79 then we know that they got a B+. But with fuzzy RDD it is only known with some probability. We can say that a Canadian 19-year-old is more likely to have drunk alcohol than a Canadian 18-year-old, but the number of Canadian 18-year-olds who have drunk alcohol is not zero.

It may be possible to deal with fuzzy RDD settings with appropriate choice of model or data. It may also be possible to deal with them using instrumental variables.
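
As a rough sketch of how a fuzzy design can be estimated, rdrobust provides a fuzzy argument that takes the observed treatment status. The data and variable names here (fuzzy_rdd_data, drank_alcohol) are simulated and hypothetical, purely for illustration.

library(rdrobust)
library(tidyverse)

set.seed(853)

# Simulate a fuzzy setting: crossing age 19 raises the probability of drinking
# from 0.2 to 0.8, rather than from 0 to 1, and drinking lowers the outcome
fuzzy_rdd_data <-
  tibble(
    age = runif(n = 1000, min = 15, max = 23),
    drank_alcohol = rbinom(n = 1000, size = 1, prob = if_else(age >= 19, 0.8, 0.2)),
    outcome = 1 - 0.5 * drank_alcohol + rnorm(n = 1000, mean = 0, sd = 1)
  )

rdrobust(
  y = fuzzy_rdd_data$outcome,
  x = fuzzy_rdd_data$age,
  c = 19,
  fuzzy = fuzzy_rdd_data$drank_alcohol,
  all = TRUE
) |>
  summary()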

We want as “sharp” an effect as possible, but if the thresholds are known, then they will be gamed. For instance, there is a lot of evidence that people run for certain marathon times, and we know that people aim for certain grades. Similarly, from the other side, it is a lot easier for an instructor to just give out As than it is to have to justify Bs. One way to look at this is to consider how “balanced” the sample is on either side of the threshold. We can do this using histograms with appropriate bins. For instance, think of the age-heaping that we found in the cleaned Kenyan census data in Chapter 9.
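
As a sketch of that sort of balance check, using made-up marks data (the 80-mark A- threshold follows the earlier example, and the heaping is imposed deliberately):

library(tidyverse)

set.seed(853)

marks <-
  tibble(
    mark = c(
      round(rnorm(n = 1000, mean = 75, sd = 10)),
      rep(x = 80, times = 50) # Deliberate excess mass just at the A- cut-off
    )
  )

marks |>
  ggplot(aes(x = mark)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = 79.5, linetype = "dashed") +
  theme_minimal() +
  labs(x = "Mark", y = "Number of students")

A pile-up of observations just on one side of the threshold, as here, is a warning sign that the threshold is being gamed.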

Another key factor for RDD is the possible effect of the decision around the choice of model. For instance, Figure 15.11 illustrates the difference between a linear fit (Figure 15.11 (a)) and a polynomial fit (Figure 15.11 (b)).

some_data <-
  tibble(
    outcome = rnorm(n = 100, mean = 1, sd = 1),
    running_variable = c(1:100),
    location = "before"
  )

some_more_data <-
  tibble(
    outcome = rnorm(n = 100, mean = 2, sd = 1),
    running_variable = c(101:200),
    location = "after"
  )

both <-
  rbind(some_data, some_more_data)

both |>
  ggplot(aes(x = running_variable, y = outcome, color = location)) +
  geom_point(alpha = 0.5) +
  geom_smooth(formula = y ~ x, method = "lm") +
  theme_minimal() +
  theme(legend.position = "bottom")

both |>
  ggplot(aes(x = running_variable, y = outcome, color = location)) +
  geom_point(alpha = 0.5) +
  geom_smooth(formula = y ~ poly(x, 3), method = "lm") +
  theme_minimal() +
  theme(legend.position = "bottom")

(a) Linear

(b) Polynomial

Figure 15.11: Comparing the result of considering the same situation with different functions

The result is that our estimate is dependent on the choice of model. We see this issue occur often in RDD (Gelman 2019), and it is especially recommended that higher order polynomials not be used, and instead that the choice of model be either linear, quadratic, or some other smooth function (Gelman and Imbens 2019).
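
To see this dependence concretely, here is a minimal sketch using the simulated both dataset from above; the indicator is_after and the object name both_with_indicator are ours. The estimated jump at the threshold is the coefficient on is_after under each specification.

both_with_indicator <-
  both |>
  mutate(is_after = if_else(location == "after", 1, 0))

# Estimated discontinuity with a linear control for the running variable
lm(outcome ~ running_variable + is_after, data = both_with_indicator) |>
  coef()

# Estimated discontinuity with a cubic polynomial control for the running variable
lm(outcome ~ poly(running_variable, 3) + is_after, data = both_with_indicator) |>
  coef()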

15.6.3 Alcohol and crime in California

There are many opportunities to use regression discontinuity design. For instance, we often see it used in elections where one candidate barely wins. Looking at US House elections between 1942 and 2008, Caughey and Sekhon (2011) showed that there is actually considerable difference between bare winners and bare losers, and highlight that one of the advantages of regression discontinuity is that the assumptions can be tested. Another common use is where there is a somewhat arbitrary cut-off. For instance, in much of the USA the legal drinking age is 21. Carpenter and Dobkin (2015) consider the possible effect of alcohol on crime by comparing arrest and other records of those who are either side of 21 in California. They find those who are slightly over 21 are slightly more likely to be arrested than those slightly under 21. We will revisit Carpenter and Dobkin (2015) in the context of crime in California.

We can obtain their replication data (Carpenter and Dobkin 2014) from here. Carpenter and Dobkin (2015) consider a large number of variables and construct a rate, and average this rate over a fortnight, but for simplicity, we will just consider numbers for a few variables: assault, aggravated_assault, DUI, and traffic violations (Figure 15.12).

library(haven)
library(modelsummary)
library(rstanarm)
library(tidyverse)

carpenter_dobkin <-
  read_dta(
    "P01 Age Profile of Arrest Rates 1979-2006.dta"
  )
carpenter_dobkin_prepared <-
  carpenter_dobkin |>
  mutate(age = 21 + days_to_21 / 365) |>
  select(
    age,
    assault,
    aggravated_assault,
    dui,
    traffic_violations
  ) |>
  pivot_longer(
    cols = c(assault, aggravated_assault, dui, traffic_violations),
    names_to = "arrested_for",
    values_to = "number"
  )

carpenter_dobkin_prepared |>
  mutate(
    arrested_for =
      case_when(
        arrested_for == "assault" ~ "Assault",
        arrested_for == "aggravated_assault" ~ "Aggravated assault",
        arrested_for == "dui" ~ "DUI",
        arrested_for == "traffic_violations" ~ "Traffic violations"
      )
  ) |>
  ggplot(aes(x = age, y = number)) +
  geom_point(alpha = 0.05) +
  facet_wrap(
    facets = vars(arrested_for),
    scales = "free_y"
  ) +
  theme_minimal()

Figure 15.12: Comparing the number of arrests either side of turning 21 for selected reasons

carpenter_dobkin_aggravated_assault_only <-
  carpenter_dobkin_prepared |>
  filter(
    arrested_for == "aggravated_assault",
    abs(age - 21) < 2
  ) |>
  mutate(is_21_or_more = if_else(age < 21, 0, 1))
rdd_carpenter_dobkin <-
  stan_glm(
    formula = number ~ age + is_21_or_more,
    data = carpenter_dobkin_aggravated_assault_only,
    family = gaussian(),
    prior = normal(location = 0, scale = 2.5, autoscale = TRUE),
    prior_intercept = normal(0, 2.5, autoscale = TRUE),
    prior_aux = exponential(rate = 1, autoscale = TRUE),
    seed = 853
  )

saveRDS(
  rdd_carpenter_dobkin,
  file = "rdd_carpenter_dobkin.rds"
)
rdd_carpenter_dobkin <-
  readRDS(file = "rdd_carpenter_dobkin.rds")
modelsummary(
  models = rdd_carpenter_dobkin,
  fmt = 2,
  statistic = "conf.int"
)
Table 15.5: Examining the effect of alcohol on crime in California
                 Model 1
(Intercept)      145.54
                 [116.22, 174.71]
age              3.87
                 [2.41, 5.33]
is_21_or_more    13.24
                 [9.83, 16.63]
Num.Obs.         1459
R2               0.299
R2 Adj.          0.297
Log.Lik.         −6153.757
ELPD             −6157.3
ELPD s.e.        32.9
LOOIC            12314.6
LOOIC s.e.       65.7
WAIC             12314.6
RMSE             16.42

And the results are similar if we use rdrobust.

library(rdrobust)

rdrobust(
  y = carpenter_dobkin_aggravated_assault_only$number,
  x = carpenter_dobkin_aggravated_assault_only$age,
  c = 21,
  h = 2,
  all = TRUE
) |>
  summary()
Sharp RD estimates using local polynomial regression.

Number of Obs.                 1459
BW type                      Manual
Kernel                   Triangular
VCE method                       NN

Number of Obs.                  729          730
Eff. Number of Obs.             729          730
Order est. (p)                    1            1
Order bias  (q)                   2            2
BW est. (h)                   2.000        2.000
BW bias (b)                   2.000        2.000
rho (h/b)                     1.000        1.000
Unique Obs.                     729          730

=============================================================================
        Method     Coef. Std. Err.         z     P>|z|      [ 95% C.I. ]       
=============================================================================
  Conventional    14.126     1.918     7.364     0.000    [10.366 , 17.886]    
Bias-Corrected    16.708     1.918     8.709     0.000    [12.948 , 20.468]    
        Robust    16.708     2.879     5.804     0.000    [11.066 , 22.350]    
=============================================================================

15.7 Instrumental variables

Instrumental variables (IV) is an approach that can be handy when we have some type of treatment and control going on, but we have a lot of correlation with other variables and we possibly do not have a variable that actually measures what we are interested in. Adjusting for observables will not be enough to create a good estimate. Instead we find some variable—the eponymous instrumental variable—that is:

  1. correlated with the treatment variable, but
  2. not correlated with the outcome, other than through the treatment variable.

This solves our problem because the only way the instrumental variable can have an effect is through the treatment variable, and so we can adjust our understanding of the effect of the treatment variable appropriately. The trade-off is that instrumental variables must satisfy a bunch of different assumptions, and that, frankly, they are difficult to identify ex ante. Nonetheless, when we are able to use them, they are a powerful tool for speaking about causality.

The canonical instrumental variables example is smoking. These days we know that smoking causes cancer. But because smoking is correlated with a lot of other variables, for instance, education, it could be that it was actually education that causes cancer. RCTs may be possible, but they are likely to be troublesome in terms of speed and ethics, and so instead we look for some other variable that is correlated with smoking, but not, in and of itself, with lung cancer. In this case, we look to tax rates, and other policy responses, on cigarettes. As the tax rates on cigarettes are correlated with the number of cigarettes that are smoked, but not correlated with lung cancer, other than through their impact on cigarette smoking, through them we can assess the effect of cigarettes smoked on lung cancer.

To implement instrumental variables we first regress cigarette smoking on tax rates to get a coefficient on the instrumental variable, and then (in a separate regression) regress lung cancer on tax rates to again get a coefficient on the instrumental variable. Our estimate is then the ratio of these coefficients, which is described as a “Wald estimate” (Gelman and Hill 2007, 219).
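
As a sketch of this calculation, in notation of our own choosing rather than Gelman and Hill's: let \(Z_i\) be the tax rate faced by person \(i\), \(T_i\) the amount they smoke, and \(Y_i\) their lung cancer outcome. We estimate the two regressions

$$Y_i = \alpha_1 + \gamma Z_i + \epsilon_i, \qquad T_i = \alpha_2 + \delta Z_i + \eta_i,$$

and the Wald estimate of the effect of smoking on lung cancer is the ratio \(\hat{\gamma}/\hat{\delta}\).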

Sometimes instrumental variables are used in the context of random allocation of treatment, such as the Oregon Health Insurance Experiment introduced in Chapter 10. Recall the issue was that a lottery was used to select individuals who were allocated to apply for health insurance, but there was nothing forcing them to do this. Our approach would then be to consider the relationship between taking up health insurance and being selected, and then between various health outcomes and being selected. Our instrumental variable estimate, which would be the ratio, would estimate the effect only for those who took up health insurance because they were selected.

Following the language of Gelman and Hill (2007, 216), when we use instrumental variables we make a variety of assumptions including:

  • Ignorability of the instrument.
  • Correlation between the instrumental variable and the treatment variable.
  • Monotonicity.
  • Exclusion restriction.

As an aside, the history of instrumental variables is intriguing, and Stock and Trebbi (2003), via Cunningham (2021), provide a brief overview. The method was first published in Wright (1928). This is a book about the effect of tariffs on animal and vegetable oils. Why might instrumental variables be important in a book about tariffs on animal and vegetable oils? The fundamental problem is that the effect of tariffs depends on both supply and demand. But we only know prices and quantities, so we do not know what is driving the effect. We can use instrumental variables to pin down causality. The intriguing aspect is that the instrumental variables discussion is only in Appendix B. It would seem odd to relegate a major statistical breakthrough to an appendix. Further, Philip G. Wright, the book’s author, had a son, Sewall Wright, who had considerable expertise in statistics and the specific method used in Appendix B. Hence the mystery of Appendix B: did Philip or Sewall write it? Cunningham (2021), Stock and Trebbi (2003), and Angrist and Krueger (2001) all go into more detail, but on balance feel that it is likely that Philip did actually author the work.

15.7.1 Simulated example

Let us generate some data. We will explore a simulation related to the canonical example of health status, smoking, and tax rates. We are looking to explain how healthy someone is based on whether they smoke, via the tax rate on smoking. We are going to generate different tax rates by province. My understanding is that the tax rate on cigarettes is now pretty much the same in each of the provinces, but that this is fairly recent. We will pretend that Alberta had a low tax, and Nova Scotia had a high tax.

As a reminder, we are simulating data for illustrative purposes, so we need to impose the answer that we want. When you actually use instrumental variables you will be reversing the process.

library(broom)
library(modelsummary)
library(tidyverse)

set.seed(853)

number_of_observation <- 10000

iv_example_data <- tibble(
  person = c(1:number_of_observation),
  smoker = sample(
    x = c(0:1),
    size = number_of_observation,
    replace = TRUE
  )
)

Now we need to relate whether someone smokes to their health. We will model health status as a draw from the normal distribution, with either a high or low mean depending on whether the person smokes.

iv_example_data <-
  iv_example_data |>
  mutate(health = if_else(
    smoker == 0,
    rnorm(n = n(), mean = 1, sd = 1),
    rnorm(n = n(), mean = 0, sd = 1)
  ))
## Health will be, on average, one standard deviation higher for people who do not smoke.

Now we need a relationship between smoking and the province (because in this illustration, the provinces have different tax rates).

iv_example_data <-
  iv_example_data |>
  rowwise() |>
  mutate(
    province =
      case_when(
        smoker == 0 ~ sample(
          x = c("Nova Scotia", "Alberta"),
          size = 1,
          replace = FALSE,
          prob = c(1 / 2, 1 / 2)
        ),
        smoker == 1 ~ sample(
          x = c("Nova Scotia", "Alberta"),
          size = 1,
          replace = FALSE,
          prob = c(1 / 4, 3 / 4)
        )
      )
  ) |>
  ungroup()

iv_example_data <-
  iv_example_data |>
  mutate(tax = case_when(
    province == "Alberta" ~ 0.3,
    province == "Nova Scotia" ~ 0.5,
    TRUE ~ 9999999
  ))

iv_example_data$tax |> table()

 0.3  0.5 
6206 3794 
head(iv_example_data)
# A tibble: 6 × 5
  person smoker  health province      tax
   <int>  <int>   <dbl> <chr>       <dbl>
1      1      0  1.11   Alberta       0.3
2      2      1 -0.0831 Alberta       0.3
3      3      1 -0.0363 Alberta       0.3
4      4      0  2.48   Alberta       0.3
5      5      0  0.617  Alberta       0.3
6      6      0  0.748  Nova Scotia   0.5

Now we can look at our data.

iv_example_data |>
  mutate(smoker = as_factor(smoker)) |>
  ggplot(aes(x = health, fill = smoker)) +
  geom_histogram(position = "dodge", binwidth = 0.2) +
  theme_minimal() +
  labs(
    x = "Health rating",
    y = "Number of people",
    fill = "Smoker"
  ) +
  scale_fill_brewer(palette = "Set1") +
  facet_wrap(vars(province))

Finally, we can use the tax rate as an instrumental variable to estimate the effect of smoking on health.

health_on_tax <- lm(health ~ tax, data = iv_example_data)
smoker_on_tax <- lm(smoker ~ tax, data = iv_example_data)

coef(health_on_tax)["tax"] / coef(smoker_on_tax)["tax"]
       tax 
-0.8554502 

We find that if you smoke then your health is likely to be worse than if you do not smoke.

Equivalently, we can think of instrumental variables in a two-stage regression context.

first_stage <- lm(smoker ~ tax, data = iv_example_data)
smoker_hat <- first_stage$fitted.values
second_stage <- lm(health ~ smoker_hat, data = iv_example_data)

modelsummary(second_stage)
                Model 1
(Intercept)     0.916
                (0.045)
smoker_hat      −0.855
                (0.089)
Num.Obs.        10000
R2              0.009
R2 Adj.         0.009
AIC             30498.4
BIC             30520.1
Log.Lik.        −15246.211
F               92.159
RMSE            1.11

We can use iv_robust() from estimatr (Blair et al. 2021) to estimate the IV model. One nice reason for doing this is that it can help to keep everything organised and it adjusts the standard errors.

library(estimatr)
iv_robust(health ~ smoker | tax, data = iv_example_data) |>
  modelsummary()
                Model 1
(Intercept)     0.916
                (0.041)
smoker          −0.855
                (0.080)
Num.Obs.        10000
R2              0.197
R2 Adj.         0.197
AIC             28395.1
BIC             28416.7
RMSE            1.00

15.7.2 Assumptions

The set-up of instrumental variables is shown in Figure 15.13, which depicts education as a confounder of the relationship between income and happiness. A tax rebate likely only affects income, not education, and could be used as an instrumental variable.


digraph D {
  
  node [shape=plaintext, fontname = "helvetica"];
  a [label = "Income"]
  b [label = "Happiness"]
  c [label = "Education"]
  d [label = "Tax rebate"]
  { rank=same a b};
  
  a->b
  c->a
  c->b
  d->a
}


Figure 15.13: Education as a confounder for the relationship between income and happiness with tax rebate as an instrumental variable

As discussed earlier, there are a variety of assumptions that are made when using instrumental variables. The two most important are:

  1. Exclusion Restriction. This assumption is that the instrumental variable only affects the dependent variable through the independent variable of interest.
  2. Relevance. There must actually be a relationship between the instrumental variable and the independent variable.

There is typically a trade-off between these two. There are plenty of variables that satisfy one, precisely because they do not satisfy the other. Cunningham (2021, 211) describes how one test of a good instrument is if people are initially confused before you explain it to them, only to think it obvious in hindsight.

Relevance can be tested using regression and other tests for correlation. The exclusion restriction cannot be tested. We need to present evidence and convincing arguments. The difficult aspect is that the instrument needs to seem irrelevant because that is the implication of the exclusion restriction (Cunningham 2021, 225).
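
For instance, a minimal sketch of a relevance check, using the simulated smoking data from earlier (the object name relevance_check is ours), is to look at the strength of the first-stage relationship between the treatment and the instrument.

# First-stage regression of the treatment on the instrument; a strong relationship,
# for instance a large F-statistic, supports relevance
relevance_check <- lm(smoker ~ tax, data = iv_example_data)

summary(relevance_check)$fstatistic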

Instrumental variables is a useful approach because one can obtain causal estimates even without explicit randomization. Finding instrumental variables used to be a bit of a white whale, especially in academia. But there has been increased use of IV approaches downstream of A/B tests (Taddy 2019, 162).

For a long time, the canonical instrumental variable was rainfall, or more generally, the weather. However, the issue is that if the instrumental variable is correlated with other, potentially unobserved, variables, then they could be correlated with the variable of interest. This is a similar criticism to that above of propensity score matching. Mellon (2022) found a large number of variables have been linked to weather in instrumental variable papers. It would seem that the likelihood of incorrectly estimated effects in some of them is quite high.

When considering an instrumental variable approach, it is important to spend considerable amounts of time on both of these assumptions. Considering Figure 15.13, and following Mellon (2022), we would be especially concerned that this particular tax rebate affects only income and not some other variable that could itself be linked with our variables of interest. Approaches based on instrumental variables provide extensive freedom for the researcher, and Brodeur, Cook, and Heyes (2020) find they are more associated with p-hacking and selective reporting compared with RCTs and RDD. As with multiple imputation and propensity score matching, we recommend caution when using IV, and that it never be blindly turned to. Indeed, Betz, Cook, and Hollenbach (2018) go further and say that spatial instruments are rarely valid.

15.8 Exercises and tutorial

Exercises

  1. (Plan) Consider the following scenario: Two children will both look when an ambulance passes, but only the older one will look if a street car passes, and only the younger one will look when a bike passes. Please sketch out what that dataset could look like and then sketch a graph that you could build to show all observations.
  2. (Simulate) Please further consider the scenario described and simulate the situation.
  3. (Acquire) Please describe a possible source of such a dataset.
  4. (Explore) Please use ggplot2 to build the graph that you sketched.
  5. (Communicate) Please write two paragraphs about what you did.
  6. For three months Sharla Gelfand shared two functions each day: one that was new to them and another that they already knew and loved. Please go to the “Two Functions Most Days” GitHub repo, and find a package that they mention that you have never used. Find the relevant website for the package, and then in a paragraph or two, describe what the package does and a context in which it could be useful to you.
  7. Please again, go to Sharla’s “Two Functions Most Days” GitHub repo, and find a function that they mention that you have never used. Please look at the help file for that function, and then detail the arguments of the function, and a context in which it could be useful to you.
  8. What is propensity score matching? If you were matching people, then what are some of the features that you would like to match on? What sort of ethical questions does collecting and storing such information raise for you?
  9. Putting the ethical issues to one side, following King and Nielsen (2019), in at least two paragraphs, please describe some of the statistical concerns with propensity score matching.
  10. What is the key assumption when using difference in differences?
  11. Please read the fascinating article in The Markup about car insurance algorithms: https://themarkup.org/allstates-algorithm/2020/02/25/car-insurance-suckers-list, and tell me what you think. You may wish to focus on ethical, legal, social, statistical, or other, aspects.
  12. Please go to the GitHub page related to the fascinating article in The Markup about car insurance algorithms: https://github.com/the-markup/investigation-allstates-algorithm. What is great about their work? What could be improved?
  13. What are the fundamental features of regression discontinuity design?
  14. What are the conditions that are needed in order for regression discontinuity design to be able to be used?
  15. Can you think of a situation in your own life where regression discontinuity design may be useful?
  16. What are some threats to the validity of regression discontinuity design estimates?
  17. What is an instrumental variable?
  18. What are some circumstances in which instrumental variables might be useful?
  19. What conditions must instrumental variables satisfy?
  20. Who were some of the early instrumental variable authors?
  21. Can you please think of and explain an application of instrumental variables in your own life?
  22. What is the key assumption in difference in differences (pick one)?
    1. Parallel trends.
    2. Heteroscedasticity.
  23. If you are using regression discontinuity, what are some aspects to be aware of and think hard about (select all that apply)?
    1. Is the cut-off free of manipulation?
    2. Is the forcing function continuous?
    3. To what extent is the functional form driving the estimate?
    4. Would different fitted lines affect the results?
  24. What is the main reason that Oostrom (2022) finds that the outcome of an RCT can depend on who is funding it (pick one)?
    1. Publication bias
    2. Explicit manipulation
    3. Specialisation
    4. Larger number of arms
  25. What is the key coefficient of interest in Angelucci and Cagé (2019) (pick one)?
    1. \(\beta_0\)
    2. \(\beta_1\)
    3. \(\lambda\)
    4. \(\gamma\)
  26. The instrumental variable is (please pick all that apply):
    1. Correlated with the treatment variable.
    2. Not correlated with the outcome.
    3. Heteroskedastic.
  27. Who are the two candidates to have invented instrumental variables?
    1. Sewall Wright
    2. Philip G. Wright
    3. Sewall Cunningham
    4. Philip G. Cunningham
  28. What are the two main assumptions of instrumental variables?
    1. Exclusion Restriction.
    2. Relevance.
    3. Ignorability.
    4. Randomization.
  29. According to Meng (2021) “Data science can persuade via…” (pick all that apply):
    1. the careful establishment of evidence from fair-minded and high-quality data collection
    2. processing and analysis
    3. the honest interpretation and communication of findings
    4. large sample sizes
  30. According to Riederer (2021) if I have “disjoint treated and untreated groups partitioned by a sharp cut-off” then which method should I use to measure the local treatment effect at the juncture between groups (pick one)?
    1. regression discontinuity
    2. matching
    3. difference in differences
    4. event study methods
  31. According to Riederer (2021) (pick all that apply):
    1. data management
    2. domain knowledge
    3. probabilistic reasoning
    4. data science
  32. Consider an Australian 30-39 year old male living in Toronto with two children and a PhD. Which of the following do you think they would match most closely with and why (please explain in a paragraph or two)?
    1. An Australian 30-39 year old male living in Toronto with one child and a bachelors degree
    2. A Canadian 30-39 year old male living in Toronto with one child and a PhD
    3. An Australian 30-39 year old male living in Ottawa with one child and a PhD
    4. A Canadian 18-29 year old male living in Toronto with one child and a PhD
  33. What is a DAG (in your own words please)?
  34. What is a confounder (please select one answer)?
    1. A variable, z, that causes both x and y, where x also causes y.
    2. A variable, z, that is caused by both x and y, where x also causes y.
    3. A variable, z, that causes y and is caused by x, where x also causes y.
  35. What is a mediator (please select one answer)?
    1. A variable, z, that causes y and is caused by x, where x also causes y.
    2. A variable, z, that causes both x and y, where x also causes y.
    3. A variable, z, that is caused by both x and y, where x also causes y.
  36. What is a collider (please select one answer)?
    1. A variable, z, that causes both x and y, where x also causes y.
    2. A variable, z, that causes y and is caused by x, where x also causes y.
    3. A variable, z, that is caused by both x and y, where x also causes y.
  37. Please talk through a brief example of when you may want to be very careful about checking for Simpson’s paradox.
  38. Please talk through a brief example of when you may want to be very careful about checking for Berkson’s paradox.
  39. Kahneman, Sibony, and Sunstein (2021) say ‘… while correlation does not imply causation, causation does imply correlation. Where there is a causal link, we should find a correlation’. With reference to Cunningham (2021, chap. 1), are they right or wrong, and why?
  40. Please redo the French newspaper example, but for two variables on the advertising side: “ads_p4_cst”, and “ads_s”, and for two variables on the consumer side: “qtotal” and “qs_s”.

Tutorial

You are interested in the characteristics of people’s friendship groups and how those characteristics relate to individual-level outcomes, particularly economic measures.

You have access to individual-level data from a social media website, which contains information about social interactions (comments on posts, tags, etc) on the website, as well as a wide variety of individual-level characteristics.

  1. While the social media website is very popular, not everyone in the population you are interested in has an account, and not everyone that has an account is active on the website. Given you are interested in economic measures, what are some possible issues with using these data to make inferences about the broader population?
  2. The data do not contain information on individual-level income. But for around 20 per cent of the sample you have information on the “census block” of the individual. By way of background, a census block contains no more than 3,000 individuals. The median income of each census block is known. As such, you decide to estimate individual level income as follows:
    1. Regress the median income of each census block on a series of individual level characteristics (such as age, education, marital status, gender, …).
    2. Use these estimates to predict the income of individuals that do not have location information. Briefly discuss the advantages and disadvantages of this approach, particularly in how it could affect the study of income characteristics of friendship groups. Ensure that you address the ecological fallacy.
  3. Understandably, the social media website will not allow the unfettered distribution of individual-level data. What are some ways in which you might nonetheless enhance the reproducibility of your work?

This should take at least two pages.