8  Hunt data

Required material

Key concepts and skills

Key packages and functions

8.1 Introduction

This chapter is about hunting data with experiments. This is a situation in which we can explicitly control and vary that which we are interested in. The advantage of this is that identifying and estimating an effect should be clear. There is a treatment group that is subject to that which we are interested in, and a control group that is not. These are randomly split before treatment. And so, if they end up different, then it must be because of the treatment. Unfortunately, life is rarely so smooth. Arguing about how similar the treatment and control groups were tends to carry on indefinitely. And before we can estimate an effect, we need to be able to measure whatever it is that we are interested in, which is often surprisingly difficult.

By way of motivation, consider the situation of someone who moved to San Francisco in 2014. As soon as they moved, the Giants won the World Series and the Golden State Warriors began a historic streak of World Championships. They then moved to Chicago, and immediately the Cubs won the World Series for the first time in a hundred years. They then moved to Massachusetts, and the Patriots won the Super Bowl again, and again, and again. And finally, they moved to Toronto, where the Raptors immediately won the World Championship. Should a city pay them to move, or could municipal funds be better spent elsewhere?

One way to get at the answer would be to run an experiment. Make a list of the North American cities with major sports teams. Then roll a die to pick a city, send them to live there for a year, and measure the outcomes of the sports teams. With enough lifetimes, we could work it out. This would take a long time because we cannot both live in a city and not live in a city. This is the fundamental problem of causal inference: a person cannot be both treated and untreated. Experiments and randomized controlled trials are circumstances in which we try to randomly allocate some treatment, so that we can believe that everything else was the same (or at least ignorable). We use the Neyman-Rubin potential outcomes framework to formalize the situation (Holland 1986).

A treatment, \(t\), will often be a binary variable, that is either 0 or 1. It is 0 if the person, \(i\), is not treated, which is to say they are in the control group, and 1 if they are treated. We will typically have some outcome, \(Y_i\), of interest for that person, and that could be binary, categorical, multinomial, ordinal, continuous, or possibly even some other type. For instance, it could be vote choice, in which case we could measure whether the person is: ‘Conservative’ or ‘Not Conservative’; which party they support, say: ‘Conservative’, ‘Liberal’, ‘Democratic’, ‘Green’; or maybe a probability of support.

A treatment is then causal if \((Y_i|t=0) \neq (Y_i|t=1)\). That is to say, the outcome for person \(i\), given they were not treated, is different to their outcome given they were treated. If we could both treat and control the one individual at the one time, then we would know that it was only the treatment that had caused any change in outcome, as there could be no other factor to explain it. But the fundamental problem of causal inference remains: we cannot both treat and control the one individual at the one time. So, when we want to know the effect of the treatment, we need to compare it with a counterfactual. The counterfactual, introduced in Chapter 4, is what would have happened if the treated individual were not treated. As it turns out, this means one way to think of causal inference is as a missing data problem, where we are missing the counterfactual.

As we cannot compare treatment and control in one individual, we instead compare the average of two groups—those treated and those not. We are looking to estimate the counterfactual at a group level because of the impossibility of doing it at an individual level. Making this trade-off allows us to move forward but comes at the cost of certainty. We must instead rely on randomization, probabilities, and expectations.
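The missing-data view can be sketched with a simulation in which, unlike in reality, both potential outcomes are known for every person. This is a minimal base R sketch. The object names (`y_if_control`, `y_if_treated`), the seed, and the true effect of 2 are assumptions made for illustration and do not come from the text.

```r
set.seed(853)

n <- 10000

# Hypothetical potential outcomes: each person's outcome without treatment
# and with treatment. The true effect is set to 2 for everyone.
y_if_control <- rnorm(n, mean = 10, sd = 3)
y_if_treated <- y_if_control + 2

# Randomly allocate each person to treatment (1) or control (0).
treated <- sample(x = c(0, 1), size = n, replace = TRUE)

# Only one potential outcome is ever observed per person; the other is the
# missing counterfactual.
y_observed <- ifelse(treated == 1, y_if_treated, y_if_control)

# Individual effects cannot be recovered from y_observed, but the difference
# in group averages estimates the average effect.
estimated_effect <-
  mean(y_observed[treated == 1]) - mean(y_observed[treated == 0])
estimated_effect
```

The estimate should land close to, but not exactly on, the effect that was simulated, which is the cost of certainty described above.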

We usually consider a default of there being no effect and we look for evidence that would cause us to change our mind. As we are interested in what is happening in groups, we turn to expectations and notions of probability to express ourselves. Hence, we will make claims that apply on average. Maybe wearing fun socks really does make you have a lucky day, but on average, across the group, it is probably not the case. It is worth pointing out that we do not just have to be interested in the average effect. We may consider the median, or variance, or whatever. Nonetheless, if we were interested in the average effect, then one way to proceed would be to:

  1. divide the dataset in two—treated and not treated—and have a binary effect column;
  2. sum the column, then divide it by the length of the column; and
  3. compare the value of this division in the two groups.

This is an estimator, introduced in Chapter 4, which is a way of putting together a guess of something of interest. The estimand is the thing of interest, in this case the average effect, and the estimate is whatever our guess turns out to be. We can simulate data to illustrate the situation.



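library(tidyverse)

# Simulate 100 people: a random group and a random binary outcome for each.
treat_control <-
  tibble(
    group = sample(x = c("Treatment", "Control"), size = 100, replace = TRUE),
    binary_effect = sample(x = c(0, 1), size = 100, replace = TRUE)
  )

treat_control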

# A tibble: 100 × 2
   group     binary_effect
   <chr>             <dbl>
 1 Treatment             0
 2 Control               1
 3 Control               1
 4 Treatment             1
 5 Treatment             1
 6 Treatment             0
 7 Treatment             1
 8 Treatment             1
 9 Control               0
10 Control               0
# … with 90 more rows
treat_control |>
  group_by(group) |>
  summarize(
    result_of_being_treated = sum(binary_effect) / length(binary_effect)
  )
# A tibble: 2 × 2
  group     result_of_being_treated
  <chr>                       <dbl>
1 Control                     0.333
2 Treatment                   0.552

In this case, we draw either 0 or 1, 100 times, with each draw randomly allocated to the treatment or control group. The estimate of the effect of being treated is the difference between the two group proportions, \(0.552 - 0.333 \approx 0.22\).
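The three steps listed above reduce to a difference between the group means of a 0/1 column. The following is a minimal base R sketch of the same set-up. The seed and object names are assumptions, and a different seed will give a different estimate from the one reported.

```r
set.seed(853)

group <- sample(x = c("Treatment", "Control"), size = 100, replace = TRUE)
binary_effect <- sample(x = c(0, 1), size = 100, replace = TRUE)

# Steps 1 and 2: within each group, sum the column and divide by its length,
# which is the mean of a 0/1 column.
share_treated <- mean(binary_effect[group == "Treatment"])
share_control <- mean(binary_effect[group == "Control"])

# Step 3: compare the two groups.
estimate <- share_treated - share_control
estimate
```

Because `binary_effect` is drawn independently of `group` in this simulation, any non-zero estimate reflects sampling noise rather than an effect that was built in.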

More broadly, to tell causal stories we need to bring together both theory and a detailed knowledge of what we are interested in (Cunningham 2021, 4). In Chapter 7 we discussed gathering data that we observed about the world. In this chapter we are going to be more active about turning the world into the data that we need. As the researcher, we will decide what to measure and how, and we will need to define what we are interested in. We will be active participants in the data generating process. That is, if we want to use this data, then as researchers we must go out and hunt it.

In this chapter we cover experiments, especially constructing treatment and control groups, and appropriately considering their results. We go through implementing a survey. We discuss some aspects of ethical behavior in experiments through reference to the abhorrent Tuskegee Syphilis Study and ECMO experiment and go through various case studies. Finally, we turn to A/B testing, which is extensively used in industry, and consider a case study based on Upworthy data.

Ronald Fisher, the twentieth century statistician, and Francis Galton, the nineteenth century statistician, are the intellectual grandfathers of much of the work that we cover in this chapter. In some cases it is directly their work, in other cases it is work that built on their contributions. Both men believed in eugenics, amongst other things that are generally reprehensible. In the same way that art history acknowledges, say, Caravaggio as a murderer, while also considering his work and influence, so too must statistics and the data sciences more generally concern themselves with this past, at the same time as we try to build a better future.

8.2 Experiments and randomized controlled trials

8.2.1 Randomization

Correlation can be enough in some settings (Hill 1965), but to be able to make forecasts when things change, and the circumstances are slightly different, we need to understand causation. Economics went through a credibility revolution in the 2000s (Angrist and Pischke 2010). During this time economists looked back on previous work and realized it was not as reliable as it could be. There was increased concern with research design and use of experiments. A considerable increase in the use of experiments also occurred in political science in the 2000s and 2010s (Druckman and Green 2021).

The key is the counterfactual: what would have happened in the absence of the treatment. Ideally, we could keep everything else constant, randomly divide the world into two groups, and treat one and not the other. Then we could be pretty confident that any difference between the two groups was due to that treatment. The reason for this is that if we have some population and we randomly select two groups from it, then our two groups (provided they are both big enough) should have the same characteristics as the population. Randomized controlled trials (RCTs) and A/B testing attempt to get us as close to this ‘gold standard’ as we can hope.
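The qualification that both groups be big enough can be checked by simulation. This is a minimal base R sketch under assumed values: a hypothetical population with a single characteristic (age), and group sizes of 10 and 5,000 chosen only for contrast.

```r
set.seed(853)

# Hypothetical population with one characteristic: age in years.
population_age <- round(runif(n = 100000, min = 18, max = 90))

# Absolute gap in mean age between two randomly selected groups of a
# given size.
gap_between_groups <- function(group_size) {
  selected <- sample(population_age, size = 2 * group_size)
  group_one <- selected[1:group_size]
  group_two <- selected[(group_size + 1):(2 * group_size)]
  abs(mean(group_one) - mean(group_two))
}

# Average the gap over repeated draws, for small and for large groups.
typical_gap_small <- mean(replicate(200, gap_between_groups(10)))
typical_gap_large <- mean(replicate(200, gap_between_groups(5000)))

c(small = typical_gap_small, large = typical_gap_large)
```

With small groups the two means can differ by years purely by chance; with large groups the gap shrinks towards zero, which is what lets the control group stand in for the counterfactual.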

When we, and others such as Athey and Imbens (2017b), use such positive language to refer to these approaches, we do not mean to imply that they are perfect, just that they can be better than most of the other options. For instance, in Chapter 14 we will consider causality from observational data, and while this is sometimes all that we can do, the circumstances in which it is possible to evaluate both make it clear that approaches based on observational data are usually second-best (B. Gordon et al. 2019; B. R. Gordon, Moakler, and Zettelmeyer 2022). RCTs and A/B testing also bring other benefits, such as being able to design a study that focuses on a particular question and tries to uncover the mechanism by which the effect occurs (Alsan and Finkelstein 2021). But they are not perfect, and the embrace of RCTs has not been unanimous (Deaton 2010). One bedrock of experimental practice is that it be blinded, that is, a participant does not know whether they are in the treatment or control group. A failure to blind, especially with subjective outcomes, is grounds for the dismissal of an entire experiment (Edwards 2017). And ideally experiments are double-blind, that is, even the researcher does not know. Stolberg (2006) discusses an early example of a randomized double-blind trial in 1835 to evaluate the effect of homeopathic drugs where neither the participants nor the organizers knew who was in which group. Double-blinding is rarely achieved in RCTs and A/B testing. Again, this is not to say they are not useful. After all, in 1847 Semmelweis identified the benefit of having an intern wash their hands before delivering babies without a blinded study (Morange 2016, 121). Another major concern is with the extent to which the result found in the RCT generalizes to outside of that setting. Finally, there are typically few RCTs conducted over a long time, although it is possible this is changing and Bouguen et al. (2019) provide a large number of RCTs that could be followed up on to assess long-term effects.

What we hope to be able to do is to establish treatment and control groups that are the same, but for the treatment. This means that establishing the control group is critical because when we do that, we establish the counterfactual. We might be worried about, say, underlying trends, which is one issue with a before-and-after comparison, or selection bias, which could occur when we allow self-selection into the treatment group. Either of these issues could result in biased estimates. We use randomization to go some way to addressing these.
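A small simulation can show why self-selection biases an estimate and why randomization helps. This is a sketch under stated assumptions: a true effect of 1, a hypothetical `baseline` outcome, and a made-up rule in which people with a higher baseline are more likely to opt in. None of these values come from the text.

```r
set.seed(853)

n <- 20000
true_effect <- 1

# Hypothetical baseline outcome: what would happen without treatment.
baseline <- rnorm(n, mean = 5, sd = 2)

# Self-selection: people with a higher baseline are more likely to opt in.
chose_treatment <- rbinom(n, size = 1, prob = plogis(baseline - 5))
outcome_self <- baseline + true_effect * chose_treatment
estimate_self_selected <-
  mean(outcome_self[chose_treatment == 1]) -
  mean(outcome_self[chose_treatment == 0])

# Randomization: allocation ignores the baseline entirely.
assigned_treatment <- rbinom(n, size = 1, prob = 0.5)
outcome_random <- baseline + true_effect * assigned_treatment
estimate_randomized <-
  mean(outcome_random[assigned_treatment == 1]) -
  mean(outcome_random[assigned_treatment == 0])

c(self_selected = estimate_self_selected, randomized = estimate_randomized)
```

Under self-selection the comparison mixes the treatment effect with pre-existing differences between those who opted in and those who did not, so the estimate overshoots. Under random allocation the two groups share the same baseline on average and the estimate should sit near the simulated effect.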

To get started, we simulate a population, and then randomly sample from it. We will set it up so that half the population likes blue, and the other half likes white. And further, if someone likes blue then they almost surely prefer dogs, but if they like white then they almost surely prefer cats. The approach of heavily using simulation is a critical part of the workflow advocated in this book. This is because we know roughly what the outcomes should be from the analysis of simulated data. Whereas if we go straight to analyzing the real data, then we do not know if unexpected outcomes are due to our own analysis errors, or actual results. Another good reason it is useful to take this approach of simulation is that when you are working in teams the analysis can get started before the data collection and cleaning is completed. The simulation will also help the collection and cleaning team think about tests they should run on their data.


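number_of_people <- 5000

population <-
  tibble(
    person = c(1:number_of_people),
    favorite_color = sample(
      x = c("Blue", "White"),
      size = number_of_people,
      replace = TRUE
    )
  ) |>
  mutate(
    # Start with dog preference fully determined by favorite color.
    prefers_dogs_to_cats =
      if_else(favorite_color == "Blue", "Yes", "No"),
    # Note: size = 1 draws a single value that is recycled across all rows.
    noise = sample(1:10, size = 1),
    prefers_dogs_to_cats =
      if_else(
        noise <= 8, # No special reason for using 8 as cut-off
        prefers_dogs_to_cats,
        sample(
          c("Yes", "No"),
          size = 1
        )
      )
  ) |>
  select(-noise)

population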

# A tibble: 5,000 × 3
   person favorite_color prefers_dogs_to_cats
    <int> <chr>          <chr>               
 1      1 Blue           Yes                 
 2      2 White          No                  
 3      3 White          No                  
 4      4 Blue           Yes                 
 5      5 Blue           Yes                 
 6      6 Blue           Yes                 
 7      7 Blue           Yes                 
 8      8 Blue           Yes                 
 9      9 White          No                  
10     10 White          No                  
# … with 4,990 more rows
population |>
  group_by(favorite_color) |>
  count()
# A tibble: 2 × 2
# Groups:   favorite_color [2]
  favorite_color     n
  <chr>          <int>
1 Blue            2547
2 White           2453

Building on the terminology and concepts introduced in Chapter 6, we now construct a sampling frame that contains 80 per cent of the target population.


frame <-
  population |>
  mutate(
    # Each person has an 80 per cent chance of being in the frame.
    in_frame = sample(
      x = c(0, 1),
      size = number_of_people,
      replace = TRUE,
      prob = c(0.2, 0.8)
    )
  ) |>
  filter(in_frame == 1)

frame |>
  group_by(favorite_color) |>
  count()
# A tibble: 2 × 2
# Groups:   favorite_color [2]
  favorite_color     n
  <chr>          <int>
1 Blue            2023
2 White           1980

For now, we will set aside dog or cat preferences and focus on creating treatment and control groups based on favorite color only.


sample <-
  frame |>
  select(-prefers_dogs_to_cats) |>
  mutate(group = sample(
    x = c("Treatment", "Control"),
    size = nrow(frame),
    replace = TRUE
  ))

When we look at the mean for the two groups, we can see that the proportions that prefer blue or white are very similar to what we specified (Table 8.1).

sample |>
  group_by(group, favorite_color) |>
  count() |>
  ungroup() |>
  group_by(group) |>
  mutate(prop = n / sum(n)) |>
  knitr::kable(
    col.names = c("Group", "Preferred color", "Number", "Proportion"),
    digits = 2,
    booktabs = TRUE,
    linesep = "",
    format.args = list(big.mark = ",")
  )

Table 8.1: Proportion of the groups that prefer blue or white

| Group | Preferred color | Number | Proportion |
|---|---|---|---|
| Control | Blue | 987 | 0.50 |
| Control | White | 997 | 0.50 |
| Treatment | Blue | 1,036 | 0.51 |
| Treatment | White | 983 | 0.49 |

We randomized based on favorite color. But we should also find that we took dog or cat preferences along at the same time and will have a ‘representative’ share of people who prefer dogs to cats. Why should that happen when we have not randomized on these variables? Let us start by looking at our dataset (Table 8.2).

sample |>
  left_join(
    frame |> select(person, prefers_dogs_to_cats),
    by = "person"
  ) |>
  group_by(group, prefers_dogs_to_cats) |>
  count() |>
  ungroup() |>
  group_by(group) |>
  mutate(prop = n / sum(n)) |>
  knitr::kable(
    col.names = c(
      "Group",
      "Prefers dogs to cats",
      "Number",
      "Proportion"
    ),
    digits = 2,
    booktabs = TRUE,
    linesep = "",
    format.args = list(big.mark = ",")
  )

Table 8.2: Proportion of the treatment and control group that prefer dogs or cats

| Group | Prefers dogs to cats | Number | Proportion |
|---|---|---|---|
| Control | No | 997 | 0.50 |
| Control | Yes | 987 | 0.50 |
| Treatment | No | 983 | 0.49 |
| Treatment | Yes | 1,036 | 0.51 |

It is exciting to have a representative share on ‘unobservables’ (In this case, we do ‘observe’ them—to illustrate the point—but we did not select on them). We get this because the variables were correlated. But it will break down in several ways that we will discuss. It also assumes large enough groups. For instance, if we considered specific dog breeds, instead of dogs as an entity, we may not find ourselves in this situation. To check that the two groups are the same, we look to see if we can identify a difference between the two groups based on observables, theory, experience, and expert opinion. In this case we looked at the mean, but we could look at other aspects as well.
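One possible way to formalize whether we can identify a difference between the two groups is a two-sample test of equal proportions. The sketch below applies `prop.test()` from base R to the counts reported in Table 8.2. The choice of test is an assumption made for illustration; the text does not prescribe one, and such a test complements rather than replaces theory, experience, and expert opinion.

```r
# Counts reported in Table 8.2: people who prefer dogs, and group sizes.
prefers_dogs <- c(Control = 987, Treatment = 1036)
group_size <- c(Control = 987 + 997, Treatment = 1036 + 983)

# Share that prefers dogs in each group.
round(prefers_dogs / group_size, 2)

# Two-sample test of equal proportions. A large p-value means we cannot
# detect a difference between the groups on this characteristic.
balance_check <- prop.test(x = prefers_dogs, n = group_size)
balance_check$p.value
```

Failing to detect a difference is not proof that the groups are identical, and with a rare characteristic, such as a specific dog breed, the same check would be far less informative.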

This would traditionally bring us to Analysis of Variance (ANOVA). ANOVA was introduced around one hundred years ago by Fisher while he was working on statistical problems in agriculture. (Stolley (1991) provides additional interesting background on Fisher.) This is less unexpected than it may seem because historically agricultural research was closely tied to statistical innovation. In particular, often statistical methods were designed to answer agricultural questions such as ‘does fertilizer work?’ and were only later adapted to clinical trials (Yoshioka 1998). It was relatively easy to divide a field into ‘treated’ and ‘non-treated’, and the magnitude of any effect was large. While appropriate for that context, often these same statistical approaches are still taught today in introductory material, even when they are being applied to different circumstances to those they were designed for. For instance, these days, many researchers worry more about data quantity than data quality, and effect sizes may be small, which creates issues for these approaches (Bradley et al. 2021). It almost always pays to take a step back and think about what is being done and whether it is appropriate to these circumstances. We mention ANOVA here because of its importance historically, but it is a variant of linear regression and, in general, we would usually not directly use ANOVA day-to-day these days. There is nothing wrong with it in the right circumstances. But it is more than one hundred years old and the number of modern use-cases where it is the best option is small. A better option, in many cases, would be to actually build the model that underpins it ourselves, which we cover in Chapter 12.
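The claim that ANOVA is a variant of linear regression can be checked directly. This is a minimal sketch on simulated data, where the field-trial framing, the variable names, and the effect size are assumptions made for illustration. It fits the same comparison with `aov()` and with `lm()` and extracts the F statistic from each.

```r
set.seed(853)

# Hypothetical field trial: yield for treated and non-treated plots.
plots <- data.frame(
  fertilizer = rep(c("Non-treated", "Treated"), each = 50)
)
plots$yield <- rnorm(
  n = nrow(plots),
  mean = ifelse(plots$fertilizer == "Treated", 12, 10),
  sd = 2
)

# Classical one-way ANOVA.
anova_fit <- aov(yield ~ fertilizer, data = plots)
f_from_anova <- summary(anova_fit)[[1]][["F value"]][1]

# The same model written as a linear regression.
regression_fit <- lm(yield ~ fertilizer, data = plots)
f_from_regression <- summary(regression_fit)$fstatistic[["value"]]

c(anova = f_from_anova, regression = f_from_regression)
```

The two F statistics agree because the underlying model is the same. The regression additionally reports the estimated difference between the groups as a coefficient, which is usually the quantity of interest.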

8.2.2 Treatment and control

If the treated and control groups are the same in all ways and remain that way, but for the treatment, then we have internal validity, which is to say that our control will work as a counterfactual and our results can speak to a difference between the groups in that study. Internal validity means that our estimates of the effect of the treatment speak to the treatment and not some other aspect. It means that we can use our results to make claims about what happened in the experiment.

If the group to which we applied our randomization were representative of the broader population, and the experimental set-up were fairly similar to outside conditions, then we further could have external validity. That would mean that the difference that we find does not just apply in our own experiment, but also in the broader population. External validity means that we can use our experiment to make claims about what would happen outside the experiment. It is randomization that has allowed that to happen. It is worth noting that in practice we would not just rely on one experiment, but would instead consider that a contribution to a broader evidence-collection effort (Duflo 2020, 1955).

Shoulders of giants

Dr Esther Duflo is Abdul Latif Jameel Professor of Poverty Alleviation and Development Economics at MIT. After earning a PhD in Economics from MIT in 1999, she remained at MIT as an assistant professor, being promoted to full professor in 2003. One area of her research is applications in economic development and she uses randomized controlled trials to understand how to address poverty. One of her important books is Banerjee and Duflo (2011) and one of her important papers is Banerjee et al. (2015). She was awarded the Sveriges Riksbank Prize in Economic Sciences in Memory of Alfred Nobel in 2019 and The Prize in Economic Sciences (2019) provides an excellent overview of her work.

But this means we need randomization twice. Firstly, into the group that was subject to the experiment, and then secondly, between treatment and control. How do we think about this randomization, and to what extent does it matter?

We are interested in the effect of being treated. It may be that we charge different prices, which would be a continuous treatment variable, or that we compare different colors on a website, which would be a discrete treatment variable. Either way, we need to make sure that all the groups are otherwise the same. How can we be convinced of this? One way is to ignore the treatment variable and to examine all other variables, looking for whether we can detect a difference between the groups based on any other variables. For instance, if we are conducting an experiment on a website, then are the groups roughly similar in terms of, say:

  • Microsoft and Apple users?
  • Safari, Chrome, and Firefox users?
  • Mobile and desktop users?
  • Users from certain locations?

Further, are the groups representative of the broader population? These are all threats to the validity of our claims.
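A check of this kind can be sketched as a set of within-group shares, one table per characteristic. The visitor data below are simulated, and the characteristics, their shares, and the helper `share_within_group()` are all hypothetical.

```r
set.seed(853)

n <- 4000

# Hypothetical website visitors; the characteristics and their shares are
# made up for illustration.
visitors <- data.frame(
  device = sample(c("Mobile", "Desktop"), n, replace = TRUE,
                  prob = c(0.6, 0.4)),
  browser = sample(c("Safari", "Chrome", "Firefox"), n, replace = TRUE,
                   prob = c(0.3, 0.6, 0.1)),
  group = sample(c("Treatment", "Control"), n, replace = TRUE)
)

# Ignore the outcome and compare the make-up of the two groups: within each
# group (rows), what share has each characteristic (columns)?
share_within_group <- function(characteristic) {
  counts <- table(visitors$group, visitors[[characteristic]])
  round(prop.table(counts, margin = 1), 2)
}

balance <- lapply(c(device = "device", browser = "browser"), share_within_group)
balance
```

Rows that look alike suggest the randomization worked on those observables; a large gap on any characteristic would be a reason to look more closely before interpreting the results.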

But if done properly, that is if the treatment is truly independent, then we can estimate the average treatment effect (ATE). In a binary treatment variable setting this is: \[\mbox{ATE} = \mathbb{E}[Y|t=1] - \mathbb{E}[Y|t=0].\]

That is, the difference between the treated group, \(t = 1\), and the control group, \(t = 0\), when measured by the expected value of the outcome, \(Y\). The ATE becomes the difference between the two conditional expectations.
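In practice the expectations are unknown and are replaced by sample averages. As a sketch, writing \(n_1\) for the number of treated observations and \(n_0\) for the number of control observations (notation introduced here for illustration), the difference-in-means estimator of the ATE is: \[\widehat{\mbox{ATE}} = \frac{1}{n_1}\sum_{i:\, t_i = 1} Y_i - \frac{1}{n_0}\sum_{i:\, t_i = 0} Y_i.\]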

To illustrate this concept, we first simulate some data that shows an average difference of one between the treatment and control groups.


ate_example <- tibble(
  person = c(1:1000),
  was_treated = sample(
    x = c("Yes", "No"),
    size = 1000,
    replace = TRUE
  )
)

# Make the outcome one unit higher, on average, if treated.
ate_example <-
  ate_example |>
  rowwise() |>
  mutate(outcome = if_else(
    was_treated == "No",
    rnorm(n = 1, mean = 5, sd = 1),
    rnorm(n = 1, mean = 6, sd = 1)
  )) |>
  ungroup()

We can see the difference, which we simulated to be one, between the two groups in Figure 8.1. And we can compute the average between the groups and then the difference to see also that we roughly get back the result that we put in (Table 8.3).

ate_example |>
  ggplot(aes(
    x = outcome,
    fill = was_treated
  )) +
  geom_histogram(
    position = "dodge",
    binwidth = 0.2
  ) +
  theme_minimal() +
  labs(
    x = "Outcome",
    y = "Number of people",
    fill = "Person was treated"
  ) +
  scale_fill_brewer(palette = "Set1")

Figure 8.1: Simulated data showing a difference between the treatment and control group

ate_example |>
  group_by(was_treated) |>
  summarize(mean = mean(outcome)) |>
  pivot_wider(names_from = was_treated, values_from = mean) |>
  mutate(difference = Yes - No) |>
  knitr::kable(
    # Columns arrive in the order No, Yes, difference.
    col.names = c(
      "Average for not treated",
      "Average for treated",
      "Difference"
    ),
    digits = 2,
    booktabs = TRUE,
    linesep = ""
  )

Table 8.3: Average difference between the treatment and control groups for data simulated to have an average difference of one

| Average for not treated | Average for treated | Difference |
|---|---|---|
| 5 | 6.06 | 1.06 |

Unfortunately, there is often a difference between simulated data and reality. For instance, an experiment cannot run for too long otherwise people may be treated many times, or become inured to the treatment; but it cannot be too short otherwise we cannot measure longer term outcomes. We cannot have a ‘representative’ sample across every facet of a population, but if not, then the treatment and control may be different. Practical difficulties may make it difficult to follow up with certain groups and so we end up with a biased collection. Some questions to explore when working with real experimental data include:

  • How are the participants being selected into the frame for consideration?
  • How are they being selected for treatment? We would hope this is being done randomly, but this term is applied to a variety of situations. Additionally, early ‘success’ can lead to pressure to treat everyone, especially in medical settings.
  • How is treatment being assessed?
  • To what extent is random allocation ethical and fair? Some argue that shortages mean it is reasonable to randomly allocate, but that may depend on how linear the benefits are. It may also be difficult to establish definitions, and the power imbalance between those making these decisions and those being treated should be considered.

Bias and other issues are not the end of the world. But we need to think about them carefully. Selection bias, introduced in Chapter 4, can be adjusted for, but only if it is recognized. For instance, how would the results of a survey about the difficulty of a university course differ if only students who completed the course were surveyed, and not those who dropped out? We should always work to try to make our dataset as representative as possible when we are creating it, but it may be possible to use a model to adjust for some of the bias ex post. For instance, if there was a variable that was correlated with, say, attrition, then it could be added to the model either by itself, or as an interaction. Similarly, there may be correlation between individuals. For instance, if there was some ‘hidden variable’ that we did not know about that meant some individuals were correlated, then we could use wider standard errors. This needs to be done carefully and we discuss this further in Chapter 14. That said, if such issues can be anticipated, then it can be better to change the experiment. For instance, perhaps it would be possible to stratify by that variable.
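Stratifying by a variable can be sketched as randomizing separately within each level of that variable. The participants, the stratifying variable `year_of_study`, and the helper `assign_within_stratum()` below are hypothetical, and this is one simple way to stratify rather than a definitive design.

```r
set.seed(853)

# Hypothetical participants with a variable we expect to matter, for
# instance one that is correlated with attrition.
participants <- data.frame(
  person = 1:200,
  year_of_study = sample(c("First", "Second", "Third"),
                         size = 200, replace = TRUE)
)

# Within a stratum, shuffle an (almost) even split of labels so that
# treatment and control are balanced on the stratifying variable.
assign_within_stratum <- function(stratum_size) {
  labels <- rep(c("Treatment", "Control"), length.out = stratum_size)
  sample(labels)
}

participants$group <- NA_character_
for (stratum in unique(participants$year_of_study)) {
  in_stratum <- participants$year_of_study == stratum
  participants$group[in_stratum] <- assign_within_stratum(sum(in_stratum))
}

counts <- table(participants$year_of_study, participants$group)
counts
```

Allocation is still random within each stratum, but by construction the treatment and control counts in any stratum differ by at most one, so the groups cannot end up unbalanced on that variable by bad luck.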

8.2.3 Fisher’s tea party

Fisher introduced an experiment designed to see if a person can distinguish between a cup of tea where the milk was added first, or last. We begin by preparing eight cups of tea: four with milk added first and the other four with milk added last. We then randomize the order of all eight cups. We tell the taster, whom we will call ‘Ian’, about the experimental set-up: there are eight cups of tea, four of each type, he will be given cups of tea in a random order, and his task is to group them into two groups.

One of the nice aspects of this experiment is that we can do it ourselves. There are a few things to be careful of in practice, including that: the quantities of milk and tea are consistent; the groups are marked in some way that the taster cannot see; and the order is randomized.

Another nice aspect of this experiment is that we can calculate the chance that Ian is able to randomly get the groupings correct. To decide if his groupings were likely to have occurred at random, we need to calculate the probability this could happen. First, we count the number of successes out of the four that were chosen. There are: \({8 \choose 4} = \frac{8!}{4!(8-4)!}=70\) possible outcomes (Fisher 1949, 14). This notation means there are eight items in the set and we are choosing four of them, and is used when the order of choice does not matter.

We are asking Ian to group the cups, not to identify which is which, and so there are two ways for him to be perfectly correct. He could either correctly identify all the ones that were milk-first (one outcome out of 70) or correctly identify all the ones that were tea-first (one outcome out of 70). This means the probability of this event is: \(\frac{2}{70} \approx 0.029\) or about 3 per cent.

As Fisher (1949, 15) makes clear, this now becomes a judgement call. We need to consider the weight of evidence that we require before we accept the groupings did not occur by chance and that Ian was well-aware of what he was doing. We need to decide what evidence it takes for us to be convinced. If there is no possible evidence that would dissuade us from the view that we held coming into the experiment, say, that there is no difference between milk-first and tea-first, then what is the point of doing an experiment? We expect that if Ian got it completely right, then the reasonable person would accept that he was able to tell the difference.

What if he is almost-perfect? By chance, there are 16 ways for a person to be ‘off-by-one’. Being off-by-one means swapping one cup from each type: one of the four milk-first cups is placed with the tea-first cups, which can happen in \({4 \choose 1} = 4\) ways, and one of the four tea-first cups is placed with the milk-first cups, which can again happen in \({4 \choose 1} = 4\) ways. These two choices are independent, so there are \(4 \times 4 = 16\) such outcomes and the probability is \(\frac{4\times 4}{70} \approx 0.229\). And so on. Given there is an almost 23 per cent chance of being off-by-one just by randomly grouping the teacups, this outcome probably would not convince us that Ian can tell the difference between tea-first and milk-first.
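These counts can be verified by listing every possible choice. The following base R sketch labels the cups 1 to 8 and, as an assumption made only for the enumeration, treats cups 1 to 4 as the milk-first ones.

```r
# Label the cups 1 to 8 and suppose cups 1 to 4 were milk-first.
milk_first <- 1:4

# Every way of choosing four cups out of eight: one column per choice.
choices <- combn(8, 4)
number_of_choices <- ncol(choices)

# For each choice, how many of the chosen cups were actually milk-first?
number_correct <- apply(
  choices, 2, function(chosen) sum(chosen %in% milk_first)
)
table(number_correct)

# Perfect grouping: all four milk-first cups, or all four tea-first cups.
prob_perfect <- sum(number_correct %in% c(0, 4)) / number_of_choices

# Off-by-one, counted as in the text: three of the four chosen are right.
prob_off_by_one <- sum(number_correct == 3) / number_of_choices

c(perfect = prob_perfect, off_by_one = prob_off_by_one)
```

The enumeration gives 70 possible choices, two of which produce a perfect grouping and 16 of which are off-by-one, matching the fractions \(\frac{2}{70}\) and \(\frac{16}{70}\) used above.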

To claim that something is experimentally demonstrable, it is not enough for it to be shown once. Instead, we need to come to know the features of an experiment in which such a result is reliably found (Fisher 1949, 16). We are looking to thoroughly interrogate our data and our experiments, and to think precisely about the analysis methods we are using. Rather than searching for meaning in constellations of stars, we want to make it as easy as possible for others to reproduce our work. It is only in that way that our conclusions stand a chance of holding up in the long-term.

8.3 Surveys

Having decided what to measure, one common way to get values is to use a survey. This is especially challenging, and there is an entire field, survey research, focused on this. Edelman, Vittert, and Meng (2021) make it clear that there are no new problems here, and the challenges that we face today are closely related to those that were faced in the past. There are many ways to implement surveys, and this choice really matters. For some time, the only choice was face-to-face surveys, where an enumerator would be tasked with conducting the survey with a respondent. Eventually surveys began to be conducted over the telephone, again by an enumerator. One issue is that it was found that in both these settings there were considerable interviewer effects (Elliott et al. 2022). The internet brought about a third era of survey research, characterized by low participation rates and increased options for data collection (Groves 2011). The use of surveys remains popular and an invaluable way to get data. While face-to-face and telephone surveys are still used and have an important role to play, we now focus on designing internet-based surveys and introduce one way to implement them.

There are many dedicated survey platforms, such as Survey Monkey and Qualtrics, that are largely internet-based. In general, the focus of those platforms is on putting together the survey form and they expect that we already have contact details for the sample of interest. Some other platforms, such as Mechanical Turk and Prolific, focus on providing that audience, and we can then ask that audience to do more than just take a survey. While providing an audience is a useful feature, it usually comes with higher costs. When using platforms like those, and other providers, it is vital to be aware of who is in the sample (Levay, Freese, and Druckman 2016; Enns and Rothschild 2022). Finally, platforms such as Facebook also provide the ability to run a survey. One especially common approach, because it is free, is to use Google Forms.

The survey form needs to be considered within the context of the broader research and with special concern for the respondent. The most important aspect is to test the survey before releasing it more broadly. Light, Singer, and Willett (1990, 213), talking about running pilot studies to evaluate higher education, say that there is no occasion in which a pilot will not bring improvements, and that they are almost always worth it. In the case of surveys, we go further and say that the fundamental rule of surveys is that if you do not have time, or budget, to test the survey, then do not bother doing the survey. The wording of a survey is crucial (Tourangeau, Rips, and Rasinski 2000, 23). When designing the survey, we need to have survey questions that are conversational and flow from one to the next, grouped within topics (Elson 2018). But we should also consider the cognitive load that we place on the respondent, and vary the difficulty of the questions.

When designing a survey, the critical task is to keep the respondent front-of-mind (Dillman, Smyth, and Christian 2014, 94). Drawing on Swain (1985), all questions need to be relevant and able to be answered by the respondent. The wording of the questions should be based on what the respondent would be comfortable with. The decision between different question types turns on minimizing error and the burden that we impose on the respondent. In general, if there are a small number of clear options then multiple-choice questions are appropriate. In that case, the responses should usually be mutually exclusive and collectively exhaustive. If they are not mutually exclusive, then this needs to be signaled in the text of the question. It is also important that units are specified, and that standard concepts are used, to the extent possible.

Open text boxes may be appropriate if there are a lot of potential answers, although this will increase the time the respondent spends completing the survey and increase the time it will take to analyze the answers. It is important that only one question is asked at a time and that all questions be asked in a neutral way that does not lead to one particular response. We want to avoid ambiguous or double-barreled questions. The subject-matter of the survey will also affect the appropriate choice of question type. For instance, potentially “threatening” topics may be better considered with open-ended questions (Blair et al. 1977).

All surveys need to have an introduction that specifies a title for the survey, who is conducting it, their contact details, and the purpose of the survey. It should also include a statement about protecting confidentiality and ethics review board permission, if appropriate.

When doing surveys, it is critical to ask the right person. For instance, Lichand and Wolf (2022) consider child labor, the extent of which is typically based on surveys of parents. When questions were instead asked of children themselves, a considerable under-reporting by parents was found. One especially exciting area of research is making better use of surveys delivered on smartphones, in particular using voice, rather than text, to allow open-ended survey questions. Gavras et al. (2022) find that collecting voice responses in this way, compared with typed answers, is associated with responses that are more “intuitive and spontaneous”.

Finally, returning to the reason for doing surveys in the first place, while doing all this, it is important to also keep what we are interested in measuring in mind.

8.4 RCT examples

8.4.1 The Oregon Health Insurance Experiment

In the US, unlike many developed countries, basic health insurance is not necessarily available to all residents, even those on low incomes. The Oregon Health Insurance Experiment involved low-income adults in Oregon, a state in the north-west of the US, from 2008 to 2010 (Finkelstein et al. 2012).

Oregon funded 10,000 places in the state-run Medicaid program, which provides health insurance for people with low incomes. A lottery was used to allocate these places and this was judged fair because it was expected, correctly as it turned out, that demand for places would exceed the supply. In the end, 89,824 individuals signed up.

The draws were conducted over a six-month period and 35,169 individuals were selected (the household of those who actually won the draw was given the opportunity) but only 30 per cent of them turned out to be eligible and complete the paperwork. The insurance lasted indefinitely. This random allocation of insurance allowed the researchers to understand the effect of health insurance.

The reason that this random allocation is important is that it is not usually possible to compare those with and without insurance because the type of people who sign up to get health insurance differs from those who do not. That decision is ‘confounded’ with other variables and results in selection bias.
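To make the selection-bias concern concrete, the following is a minimal simulation sketch. Everything in it is an assumption made for illustration: the variable names, the latent health score, the true effect of 0.2, and the way that less healthy people are made more likely to sign up. It is not based on the Oregon data. It compares a naive with-and-without comparison under self-selection against the same comparison when allocation is by lottery.

```r
set.seed(853) # arbitrary seed so that the sketch is reproducible

n <- 10000

# Hypothetical latent health: higher values mean healthier people
latent_health <- rnorm(n)

# Assumed true effect of insurance on the outcome
true_effect <- 0.2

# Self-selection: less healthy people are more likely to sign up
signed_up <- rbinom(n, size = 1, prob = plogis(-latent_health))

# Lottery: allocation is independent of latent health
won_lottery <- rbinom(n, size = 1, prob = 0.5)

noise <- rnorm(n, sd = 0.5)
outcome_self_selected <- latent_health + true_effect * signed_up + noise
outcome_lottery <- latent_health + true_effect * won_lottery + noise

# Naive comparison of those who signed up with those who did not
naive_difference <-
  mean(outcome_self_selected[signed_up == 1]) -
  mean(outcome_self_selected[signed_up == 0])

# The same comparison when allocation was random
lottery_difference <-
  mean(outcome_lottery[won_lottery == 1]) -
  mean(outcome_lottery[won_lottery == 0])

c(naive = naive_difference, lottery = lottery_difference)
```

Under these assumptions the naive difference is pulled well away from the true effect, and can even have the wrong sign, because it mixes the effect of insurance with pre-existing differences in health. The lottery-based difference should be close to the assumed effect.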

As the opportunity to apply for health insurance was randomly allocated, the researchers were able to evaluate the health and earnings of those who received health insurance and compare them to those who did not. To do this they used administrative data, such as hospital discharge data, matched credit reports, and, uncommonly, mortality records. This collection of data is limited, and so they included a survey conducted by mail.

The specifics of this are not important, and we will have more to say in Chapter 12, but they use a statistical model, Equation 8.1, to analyze the results (Finkelstein et al. 2012):

\[ y_{ihj} = \beta_0 + \beta_1\mbox{Lottery} + X_{ih}\beta_2 + V_{ih}\beta_3 + \epsilon_{ihj} \tag{8.1}\]

Equation 8.1 explains various \(j\) outcomes (such as health) for an individual \(i\) in household \(h\) as a function of an indicator variable as to whether household \(h\) was selected by the lottery. Hence, it is the \(\beta_1\) coefficient that is of particular interest. That is our estimate of the mean difference between the treatment and control groups. To complete the specification of Equation 8.1, \(X_{ih}\) is a set of variables that are correlated with the probability of being treated. These adjust for that impact to a certain extent. An example of that is the number of individuals in a household. And finally, \(V_{ih}\) is a set of variables that are not correlated with the lottery. These variables include demographics, hospital discharge and lottery draw.
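A rough sketch of how a model with the shape of Equation 8.1 might be fit in R is below. It uses simulated data, not the Oregon data, and the data-generating process is assumed: household size stands in for \(X_{ih}\), larger households are made more likely to be selected, the true lottery effect is set to 0.3, and the \(V_{ih}\) terms are left out. The names `simulated_lottery` and `lottery_model` are hypothetical.

```r
set.seed(853) # arbitrary seed so that the sketch is reproducible

n <- 5000

simulated_lottery <- data.frame(
  # Hypothetical covariate correlated with the chance of selection
  household_size = sample(1:3, size = n, replace = TRUE)
)

# Assumption: larger households are more likely to be selected
simulated_lottery$lottery <-
  rbinom(n, size = 1, prob = 0.2 + 0.1 * simulated_lottery$household_size)

# Assumed data-generating process with a true lottery effect of 0.3
simulated_lottery$outcome <-
  1 + 0.3 * simulated_lottery$lottery -
  0.2 * simulated_lottery$household_size +
  rnorm(n)

# Linear model: the coefficient on `lottery` plays the role of beta_1
lottery_model <- lm(
  outcome ~ lottery + factor(household_size),
  data = simulated_lottery
)

coef(lottery_model)["lottery"]
```

The coefficient on `lottery` should be near the assumed 0.3. This sketch ignores details that matter in practice, such as how standard errors are computed when several individuals share a household.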

Similar to earlier studies such as Brook et al. (1984), Finkelstein et al. (2012) found that the treatment group used more health care, including both primary and preventive care as well as hospitalizations, but had lower out-of-pocket medical expenditures. More generally, the treatment group reported better physical and mental health.

8.4.2 Civic Honesty Around The Globe

Trust is not something that we think about regularly, but it is actually fairly fundamental to most interactions, both economic and personal. For instance, many of us get paid after we do some work–we are trusting our employer will make good, and vice versa. If you get paid in advance then they are trusting you. In a strictly naive, one-shot, transaction-cost-less world, this does not make sense. If you get paid in advance, the incentive is for you to take the money and run in the last pay period before you quit, and through backward induction everything falls apart. We do not live in such a world. For one thing there are transaction costs, for another, generally, we have repeated interactions, and finally, the world usually ends up being fairly small.

Understanding the extent of honesty in different countries may help us to explain economic development and other aspects of interest such as tax compliance, but it is hard to measure. We cannot ask people how honest they are–the liars would lie, resulting in a lemons problem (Akerlof 1970). This is a situation of adverse selection, where the liars know they are liars, but others do not. To get around this Cohn et al. (2019a) conduct an experiment in 355 cities across 40 countries where they ‘turn in’ a wallet that is either empty or contains the local equivalent of US$13.45. They are interested in whether the ‘recipient’ attempts to return the wallet. They find that generally wallets with money were more likely to be returned (Cohn et al. 2019a, 1).

In total Cohn et al. (2019a) ‘turn in’ 17,303 wallets to various institutions including banks, museums, hotels, and police stations. The importance of such institutions to an economy is generally well-accepted (Acemoglu, Johnson, and Robinson 2001) and they are common across most countries. Importantly, for the experiment, they usually have a reception area where the wallet could be turned in (Cohn et al. 2019a, 1).

In the experiment a research assistant turned in the wallet to an employee at the reception area, using a set form of words. The research assistant had to note various features of the setting, such as the gender, age-group, and busyness of the ‘recipient’. The wallets were transparent and contained a key, a grocery list, and a business card with a name and email address. The outcome of interest was whether an email was sent to the unique email address on the business card in the wallet. The grocery list was included to signal that the owner of the wallet was a local. The key was included as something that was only useful to the owner of the wallet, and never the recipient, in contrast to the cash, to adjust for altruism. The language and currency were adapted to local conditions.

The primary treatment in the experiment is whether the wallet contained money or not. The key outcome was whether an attempt was made to return the wallet. It was found that the median response time was 26 minutes, and that if an email was sent then it usually happened within a day (Cohn et al. 2019b, 10).

Using the data for the paper that is made available (Cohn 2019) we can see that considerable differences were found between countries (Figure 8.2). But in almost all countries wallets with money were more likely to be returned than wallets without. The experiments were conducted across 40 countries, which were chosen based on them having enough cities with populations of at least 100,000, as well as the ability for the research assistants to safely visit and withdraw cash. Within those countries, the cities were chosen starting with the largest ones and there were usually 400 observations in each country (Cohn et al. 2019b, 5). Cohn et al. (2019a) further conducted the experiment with the equivalent of US$94.15 in three countries–Poland, the UK, and the US–and found that reporting rates further increased.

Figure 8.2: Comparison of the proportion of wallets handed in, by country, depending on whether they contained money

In addition to the experiments, Cohn et al. (2019a) conducted surveys that allowed them to understand some reasons for their findings. During the survey, participants were given one of the scenarios and then asked to answer questions. The use of surveys also allowed them to be specific about the respondents. The survey involved 2,525 respondents (829 in the UK, 809 in Poland, and 887 in the US) (Cohn et al. 2019b, 36). Participants were chosen using attention checks and demographic quotas based on age, gender, and residence, and they received US$4.00 for their participation (Cohn et al. 2019b, 36). The survey did not find that larger rewards were expected for turning in a wallet with more money. But it did find that failure to turn in a wallet with more money caused the respondent to feel more like they had stolen money.

8.5 A/B testing

The past two decades have probably seen the most experiments ever run, likely by several orders of magnitude. This is because of the extensive use of A/B testing at tech firms (Kohavi et al. 2012) for many aspects and especially around push notifications (Boykis 2022). For a long time decisions such as what font to use were based on the Highest Paid Person’s Opinion (HIPPO) (Christian 2012). These days, many large tech companies have extensive infrastructure for experiments, and they term them A/B tests because of the comparison of two groups: one that gets treatment A and the other that either gets treatment B or does not see any change (Salganik 2018, 185). We could additionally consider more than two options, in which case we typically change to using the terminology of ‘arms’ of the experiment.

Every time you are online you are probably subject to tens, hundreds, or potentially thousands, of different A/B tests. This is especially the case if you use apps like TikTok. While, at their heart, they are just experiments that use sensors to measure data that need to be analyzed, they have many special features that are interesting in their own right. For instance, Kohavi, Tang, and Xu (2020, 3) discuss the example of Microsoft’s search engine Bing. They used A/B testing to examine how to display an advertisement. Based on these tests, they ended up lengthening the title on the advertisement. They found this caused revenue to increase by 12 per cent, or around $100 million annually, without any significant trade-off being measured. That all said, sometimes the hardest part of A/B testing is ensuring that the design enables suitable randomness, rather than the actual calculation of the ATE.
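As an illustration of how light the calculation side can be once randomization is sound, here is a minimal sketch that estimates an average treatment effect (ATE) as a difference in click rates between two arms. The data are simulated and the click probabilities of 10 and 12 per cent are assumptions chosen for the example. They are not taken from the Bing case.

```r
set.seed(853) # arbitrary seed so that the sketch is reproducible

n_per_arm <- 20000

# Hypothetical click probabilities for each arm
clicked_a <- rbinom(n_per_arm, size = 1, prob = 0.10)
clicked_b <- rbinom(n_per_arm, size = 1, prob = 0.12)

# Difference in means as an estimate of the average treatment effect
estimated_ate <- mean(clicked_b) - mean(clicked_a)

# Two-sample test of equal proportions; the interval is for B minus A
ate_test <- prop.test(
  x = c(sum(clicked_b), sum(clicked_a)),
  n = c(n_per_arm, n_per_arm)
)

estimated_ate
ate_test$conf.int
```

The estimate is only trustworthy if users were allocated to the arms at random, which is why so much of the effort goes into design and delivery.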

We use the term A/B test to strictly refer to the situation in which we are primarily implementing an experiment through a technology stack about something that is primarily of the internet, for instance a change to a website or similar and measured with sensors rather than a survey. While at their heart they are just experiments, A/B testing has a range of specific concerns, and Bosch and Revilla (2022) detail some from a statistical perspective. There is something different about doing tens of thousands of small experiments all the time, compared with the RCT set-up of conducting one experiment over the course of months.

RCTs are often, though not exclusively, done in academia or by government agencies, but much of A/B testing occurs in industry. This means that if you are in industry and want to introduce A/B testing to your firm there can be aspects such as culture, relationship building, and nuance, that become important. It can be difficult to convince a manager to run an experiment. Indeed, sometimes it can be easier to experiment by not delivering, or delaying, a change that has already been decided on, which creates a control group rather than a treatment group (Salganik 2018, 188). This may especially be the case where there is a particularly incorrigible HIPPO. Sometimes the most difficult aspect of A/B testing, and conducting experiments more generally, is not the statistics, it is the politics. This is not unique to A/B testing and looking at the history of biology, we see that even aspects such as germ theory were not resolved through experiment, but instead through ideology and social standing (Morange 2016, 124), that is to say, politics.

When conducting A/B testing, as with all experiments, we need to be concerned with delivery. In the case of a traditional experiment, it is usually clear how it is being delivered. For instance, we may have the person come to a doctor’s clinic and then inject them with either a drug or a placebo. But in the case of A/B testing, it is less obvious. For instance, should it be run ‘server-side’, meaning to make a change to a website, or ‘client-side’, meaning to change an app (Kohavi, Tang, and Xu 2020, 153)? This decision affects our ability to both conduct the experiment and to gather data from it. Urban, Sreenivasan, and Kannan (2016) provide an overview of where A/B testing fits into the Netflix workflow, assuming an app installed on a PS4.

Consider first the effect of the delivery decision on our ability to conduct the experiment. It is relatively easy and normal to update a website all the time. This means that small changes can be easily implemented if the experiment is conducted server-side. But in the case of a client-side implementation of an app, conducting an experiment becomes a bigger deal. For instance, the release may need to go through an app store, and so would need to be part of a regular release cycle. There is also a selection concern because some users will not update their app and there is the possibility that they are different from those that do regularly update the app.

Now consider the effect of the delivery decision on our ability to gather data from the experiment. Again, server-side is less of a big deal because we get the data anyway as part of the user interacting with the website. But in the case of an app, the user may use the app offline or with limited data upload, which then requires a data transmission protocol or caching. This could in turn affect user experience, especially as some phones place limits on various aspects.

The effect of all this is that we need to plan. For instance, results are unlikely to be available the day after a change to an app, whereas they are likely available the day after a change to a website. Further, we may need to consider our results in the context of different devices and platforms, potentially using, say, regression which will be covered in Chapter 12.

The second aspect of concern is instrumentation. When we conduct a traditional experiment then we might, for instance, ask respondents to fill out a survey. But this is usually not done with A/B testing, and we usually use various sensors (Kohavi, Tang, and Xu 2020, 162). One approach is to put a cookie on the user’s device, but different users will clear these at different rates. Another approach is to use a beacon, such as forcing the user to download a tiny image from a server, so that we know when they have completed some action. For instance, this is a commonly used approach to know when a user has opened an email. There are practical concerns around when the beacon loads, for instance, if it is before the main content loads then the user experience may be worse, but if it is after then our sample may be biased.

The third aspect of concern is what we are randomizing over (Kohavi, Tang, and Xu 2020, 166). In the case of traditional experiments, this is usually clear and it is often a person, or sometimes various groups of people. But in the case of A/B testing it can be less clear. For instance, are we randomizing over the page, the session, or the user?

To think about this, let us consider color. For instance, say we are interested in whether we should change our logo from red to blue on the homepage. If we are randomizing at the page level, then if the user goes to some other page of our website, and then back to the homepage, the logo could switch between colors. If we are randomizing at the session level, then while it could be blue while they use the website this time, if they close it and come back then it could be red. Finally, if we are randomizing at a user level then possibly it would always be red for one user, but always blue for another.
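The three levels of randomization in the logo example can be sketched with a small, made-up log of homepage views. The data frame `page_views`, the user and session identifiers, and the helper `assign_by()` are all hypothetical and exist only for this illustration. The idea is to draw one color per distinct unit and then map that draw back to every view belonging to the unit.

```r
set.seed(853) # arbitrary seed so that the sketch is reproducible

# Hypothetical log of homepage views: one row per view
page_views <- data.frame(
  user = c("u1", "u1", "u1", "u2", "u2", "u3", "u3", "u3"),
  session = c("s1", "s1", "s2", "s3", "s3", "s4", "s5", "s5")
)

logo_colors <- c("red", "blue")

# Draw one color per distinct unit, then map it back to every view
assign_by <- function(unit) {
  distinct_units <- unique(unit)
  drawn <- sample(logo_colors, size = length(distinct_units), replace = TRUE)
  drawn[match(unit, distinct_units)]
}

# Page level: every view is its own unit, so colors can switch at any time
page_views$color_by_page <- assign_by(seq_len(nrow(page_views)))
# Session level: constant within a session, may change between sessions
page_views$color_by_session <- assign_by(page_views$session)
# User level: constant for a user across all of their sessions
page_views$color_by_user <- assign_by(page_views$user)

page_views
```

In a production system the user-level draw would need to persist between visits, for instance by storing the assignment, which this sketch does not attempt.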

The extent to which this matters depends on a trade-off between consistency and importance. For instance, if we are A/B testing product prices then consistency is likely a feature. But if we are A/B testing background colors then consistency might not be as important. On the other hand, if we are A/B testing the position of a log-in button then it might be important that we not move that around too much for the one user, but between users it might matter less.

Interestingly, in A/B testing, as in traditional experiments, we are concerned that our treatment and control groups are the same, but for the treatment. In the case of traditional experiments, we satisfy ourselves of this by conducting analysis on the basis of the data that we have after the experiment is conducted. That is usually all we can do because it would be weird to treat or control both groups. But in the case of A/B testing, the pace of experimentation allows us to randomly create the treatment and control groups, and then check, before we subject the treatment group to the treatment, that the groups are the same. For instance, if we were to show each group the same website, then we would expect the same outcomes across the two groups. If we found different outcomes then we would know that we may have a randomization issue (Taddy 2019, 129).
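One way to picture this kind of check, sometimes called an A/A test, is to simulate it. The sketch below is built on assumptions: the metric (minutes on site), its distribution, and the group sizes are invented for illustration. Both groups see the same website, so any difference between them is due to chance, and roughly 5 per cent of such tests should appear "significant" at the 5 per cent level.

```r
set.seed(853) # arbitrary seed so that the sketch is reproducible

n_users <- 2000
n_simulations <- 1000

# Each simulated A/A test splits users at random and shows both groups the
# same website, so the metric is generated without reference to the group
p_values <- replicate(n_simulations, {
  group <- sample(c("A1", "A2"), size = n_users, replace = TRUE)
  # Hypothetical metric, for instance minutes on site
  minutes_on_site <- rexp(n_users, rate = 1 / 5)
  t.test(
    minutes_on_site[group == "A1"],
    minutes_on_site[group == "A2"]
  )$p.value
})

# Share of A/A tests that "detect" a difference at the 5 per cent level
mean(p_values < 0.05)
```

If the same exercise run through a real assignment mechanism produced a share far from 5 per cent, that would be a hint that the groups are not being formed at random.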

One of the interesting aspects of A/B testing is that we are usually running them not because we desperately care about the specific outcome, but because that feeds into some other measure that we care about. For instance, do we care whether the website is quite-dark-blue or slightly-dark-blue? Probably not, but we probably care a lot about the company share price. But then what if picking the best blue comes at a cost to the share price?

To illustrate this, pretend that we work at a food delivery app and we are concerned with driver retention. Say we do some A/B tests and find that drivers are always more likely to be retained when they can deliver food to the customer faster. Our finding is that faster is better, for driver retention, always. But one way to achieve faster deliveries is for them to not put the food into a hot box that would maintain the food’s temperature. Something like that might save 30 seconds, which is significant on a 10-15 minute delivery. Unfortunately, although we would decide to encourage that on the basis of A/B tests designed to optimize driver-retention, such a decision would likely make the customer experience worse. If customers receive cold food when it is meant to be hot, then they may stop using the app, which would ultimately be very bad for the business. C. et al. (2022) describe how they found a similar situation at Facebook in terms of notifications – although reducing the number of notifications reduced user engagement in the short-term, over the long-term it increased both user satisfaction and app usage.

This trade-off may be obvious when we run the driver experiment if we were to look at customer complaints. It is possible that on a small team we would be exposed to those tickets, but on a larger team we may not be. Ensuring that A/B tests are not resulting in false optimization is especially important, and it is not something that we typically have to worry about in normal experiments. As an example of this in a real setting Aprameya (2020) describes testing a feature of Duolingo, a language learning application. The feature was found to be positive for Duolingo’s revenue, but negative for customer learning habits and so was not rolled out. Presumably enough customer negativity would eventually have resulted in the feature having a negative effect on revenue. Related to this, we want to think carefully about the nature of the result that we expect. For instance, in the shades of blues example, we are unlikely to find substantial surprises, and so it might be sufficient to try a small range of blues. But what if we considered a wider variety of colors? If we are concerned that there may be fat tails on the distribution, then we need to consider a wider range of experiments than we otherwise would (Azevedo et al. 2020).

Shoulders of giants

Dr Susan Athey is the Economics of Technology Professor at Stanford University. After earning a PhD in Economics from Stanford in 1995, she joined MIT as an assistant professor, being promoted to full professor in 2006 when she moved to Harvard. One area of her research is applied economics, and one particularly important paper is Abadie et al. (2017), which considers when standard errors need to be clustered, and another is Athey and Imbens (2017a), which considers how to analyze randomized experiments. In addition to her academic appointments, she has worked at Microsoft and other tech firms and been extensively involved in running experiments in this context. She was awarded the John Bates Clark Medal in 2007.

The trouble with much of A/B testing is that it is done by firms and so we typically do not have datasets that we can use. But Matias et al. (2021) provide access to a dataset of A/B tests from Upworthy, a clickbait media website that used A/B testing to optimize their content. Fitts (2014) provides more background information about Upworthy. The datasets of A/B tests are available through OSF (osf.io).

We can look at what the dataset looks like, and get a sense for it by looking at the names and an extract.

library(tidyverse)
upworthy <- read_csv("https://osf.io/vy8mj/download")
upworthy |>
  names()
 [1] "...1"                 "created_at"           "updated_at"          
 [4] "clickability_test_id" "excerpt"              "headline"            
 [7] "lede"                 "slug"                 "eyecatcher_id"       
[10] "impressions"          "clicks"               "significance"        
[13] "first_place"          "winner"               "share_text"          
[16] "square"               "test_week"           
upworthy |>
  head()
# A tibble: 6 × 17
   ...1 created_at          updated_at          clickabi…¹ excerpt headl…² lede 
  <dbl> <dttm>              <dttm>              <chr>      <chr>   <chr>   <chr>
1     0 2014-11-20 06:43:16 2016-04-02 16:33:38 546d88fb8… Things… They'r… "<p>…
2     1 2014-11-20 06:43:44 2016-04-02 16:25:54 546d88fb8… Things… They'r… "<p>…
3     2 2014-11-20 06:44:59 2016-04-02 16:25:54 546d88fb8… Things… They'r… "<p>…
4     3 2014-11-20 06:54:36 2016-04-02 16:25:54 546d902c2… Things… This I… "<p>…
5     4 2014-11-20 06:54:57 2016-04-02 16:31:45 546d902c2… Things… This I… "<p>…
6     5 2014-11-20 06:55:07 2016-04-02 16:25:54 546d902c2… Things… This I… "<p>…
# … with 10 more variables: slug <chr>, eyecatcher_id <chr>, impressions <dbl>,
#   clicks <dbl>, significance <dbl>, first_place <lgl>, winner <lgl>,
#   share_text <chr>, square <chr>, test_week <dbl>, and abbreviated variable
#   names ¹​clickability_test_id, ²​headline

It is also useful to look at the documentation for the dataset. This describes the structure of the dataset, which is that there are packages within tests. A package is a collection of headlines and images that were shown randomly to different visitors to the website, as part of a test. A test can include many packages. Each row in the dataset is a package and the test that it is part of is specified by the ‘clickability_test_id’ column.

There are a variety of variables. We will focus on:

  • ‘created_at’,
  • ‘clickability_test_id’ so that we can create comparison groups,
  • ‘headline’,
  • ‘impressions’ which is the number of people that saw the package, and
  • ‘clicks’ which is the number that clicked on that package.

Within each batch of tests, we are interested in the effect of the varied headlines on impressions and clicks.

upworthy_restricted <-
  upworthy |>
  select(created_at, clickability_test_id, headline, impressions, clicks)
head(upworthy_restricted)

# A tibble: 6 × 5
  created_at          clickability_test_id     headline           impre…¹ clicks
  <dttm>              <chr>                    <chr>                <dbl>  <dbl>
1 2014-11-20 06:43:16 546d88fb84ad38b2ce000024 They're Being Cal…    3052    150
2 2014-11-20 06:43:44 546d88fb84ad38b2ce000024 They're Being Cal…    3033    122
3 2014-11-20 06:44:59 546d88fb84ad38b2ce000024 They're Being Cal…    3092    110
4 2014-11-20 06:54:36 546d902c26714c6c44000039 This Is What Sexi…    3526     90
5 2014-11-20 06:54:57 546d902c26714c6c44000039 This Is What Sexi…    3506    120
6 2014-11-20 06:55:07 546d902c26714c6c44000039 This Is What Sexi…    3380     98
# … with abbreviated variable name ¹​impressions
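Because impressions vary a little between packages, one option is to look at clicks as a share of impressions, not only raw click counts. The sketch below types in the six rows shown in the extract above by hand so that it runs on its own. The shortened `test_id` labels and the column names `click_rate` and `gap_to_best` are introduced here for illustration and are not part of the dataset.

```r
# The six rows shown in the extract above, typed in by hand;
# test IDs are shortened to their last six characters for readability
upworthy_extract <- data.frame(
  test_id = c(rep("000024", 3), rep("000039", 3)),
  impressions = c(3052, 3033, 3092, 3526, 3506, 3380),
  clicks = c(150, 122, 110, 90, 120, 98)
)

# Clicks as a share of impressions for each package
upworthy_extract$click_rate <-
  upworthy_extract$clicks / upworthy_extract$impressions

# Within each test, compare each package with that test's best performer
upworthy_extract$gap_to_best <-
  ave(upworthy_extract$click_rate, upworthy_extract$test_id, FUN = max) -
  upworthy_extract$click_rate

upworthy_extract
```

This is only one possible outcome measure. The analysis that follows works with the number of clicks.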

We will focus on the text contained in headlines, and look at whether headlines that asked a question got more clicks than those that did not. We want to remove the effect of different images and so will focus on those tests that have the same image. To identify whether a headline asks a question, we search for a question mark. Although there are more complicated constructions that we could use, this is enough to get started.

upworthy_restricted <-
  upworthy_restricted |>
  mutate(
    asks_question =
      str_detect(
        string = headline,
        pattern = "\\?"
      )
  )

upworthy_restricted |>
  count(asks_question)
# A tibble: 2 × 2
  asks_question     n
  <lgl>         <int>
1 FALSE         19130
2 TRUE           3536
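To see what searching for a question mark does and does not capture, here is a tiny sketch on made-up headlines, which are not drawn from the dataset. It uses base R's `grepl()`, which for this purpose behaves like the string detection used above. The question mark is escaped as `\\?` because an unescaped `?` is a quantifier in a regular expression.

```r
# Hypothetical headlines, not taken from the dataset
example_headlines <- c(
  "Is this a question?",
  "This is not a question",
  "What happened next? You decide"
)

# Flag each headline that contains a literal question mark anywhere
asks_question <- grepl(pattern = "\\?", x = example_headlines)

asks_question
# TRUE FALSE TRUE
```

The third headline is flagged even though it does not end with a question, and a question written without a question mark would be missed. These are the kinds of cases a more complicated construction would need to handle.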

For every test, and for every picture, we want to know whether asking a question affected the number of clicks.

to_question_or_not_to_question <-
  upworthy_restricted |>
  group_by(clickability_test_id, asks_question) |>
  summarize(ave_clicks = mean(clicks)) |>
  ungroup()

look_at_differences <-
  to_question_or_not_to_question |>
  pivot_wider(
    id_cols = clickability_test_id,
    names_from = asks_question,
    values_from = ave_clicks
  ) |>
  rename(
    ave_clicks_not_question = `FALSE`,
    ave_clicks_is_question = `TRUE`
  ) |>
  filter(!is.na(ave_clicks_not_question)) |>
  filter(!is.na(ave_clicks_is_question)) |>
  mutate(
    difference_in_clicks =
      ave_clicks_is_question - ave_clicks_not_question
  )

look_at_differences$difference_in_clicks |> mean()
[1] -4.890435

We could also consider a cross-tab (Table 8.4).


library(modelsummary)
datasummary(
  ave_clicks ~ Mean * asks_question,
  data = to_question_or_not_to_question
)
Table 8.4: Difference between the average number of clicks
asks_question  FALSE  TRUE
ave_clicks     57.06  43.86

We find that in general, having a question in the headline may slightly decrease the number of clicks on a headline, although if there is an effect it does not appear to be very large (Figure 8.3).

Figure 8.3: Comparison of the average number of clicks when a headline contains a question mark or not

8.6 Exercises


  1. (Plan) Consider the following scenario: A political candidate is interested in how two polling values change over the course of an election campaign: approval rating, and vote-share. The two are measured as percentages, and are somewhat correlated. There tend to be large changes when there is a debate between candidates. Please sketch what that dataset could look like and then sketch a graph that you could build to show all observations.
  2. (Simulate) Please further consider the scenario described and simulate the situation. Please include five tests based on the simulated data. Submit a link to a GitHub Gist that contains your code.
  3. (Acquire) Please identify and document a possible source of such a dataset.
  4. (Explore) Please use ggplot2 to build the graph that you sketched using the simulated data. Submit a link to a GitHub Gist that contains your code.
  5. (Communicate) Please write two paragraphs about what you did.


  1. In your own words, what is the fundamental problem of causal inference, being sure to include an example and references (write at least three paragraphs)?
  2. In your own words, what is a counterfactual, being sure to include an example and references (write at least three paragraphs)?
  3. In your own words, what is the role of randomization in constructing a counterfactual, be sure to include an example and references (write at least three paragraphs)?
  4. What is external validity (pick one)?
    1. Findings from an experiment hold in that setting.
    2. Findings from an experiment hold outside that setting.
    3. Findings from an experiment that has been repeated many times.
    4. Findings from an experiment for which code and data are available.
  5. What is internal validity (pick one)?
    1. Findings from an experiment hold in that setting.
    2. Findings from an experiment hold outside that setting.
    3. Findings from an experiment that has been repeated many times.
    4. Findings from an experiment for which code and data are available.
  6. Suppose we have a dataset named ‘netflix_data’, with the columns ‘person’, ‘tv_show’, and ‘hours’ (‘person’ is a character class unique ID for every person, ‘tv_show’ is a character class name of a TV show, and ‘hours’ is a double expressing the number of hours that person watched that TV show). Could you please write some code that would randomly assign people into one of two groups? The data looks like this:
netflix_data <-
  tibble(
    person = c(
      "Ian", "Ian", "Roger", "Roger", "Roger",
      "Patricia", "Patricia", "Helen"
    ),
    tv_show = c(
      "Broadchurch", "Duty-Shame", "Broadchurch",
      "Duty-Shame", "Shetland", "Broadchurch",
      "Shetland", "Duty-Shame"
    ),
    hours = c(6.8, 8.0, 0.8, 9.2, 3.2, 4.0, 0.2, 10.2)
  )
  7. In the context of randomization, what does stratification mean to you (write at least two paragraphs)?
  8. How could you check that your randomization had been done appropriately (write at least two paragraphs)?
  9. Identify a company that conducts A/B testing commercially and write at least three paragraphs about how they work and the trade-offs involved.
  10. Pretend that you work as a junior analyst for a large consulting firm. Further, pretend that your consulting firm has taken a contract to put together a facial recognition model for a government border security department. Write at least three paragraphs, with examples and references, discussing your thoughts, with regard to ethics, on this matter.
  11. What is an estimate (pick one)?
    1. A rule for calculating an estimate of a given quantity based on observed data.
    2. The object of inquiry.
    3. A result given a particular dataset and approach.
  12. What is an estimator (pick one)?
    1. A rule for calculating an estimate of a given quantity based on observed data.
    2. The object of inquiry.
    3. A result given a particular dataset and approach.
  13. What is an estimand (pick one)?
    1. A rule for calculating an estimate of a given quantity based on observed data.
    2. The object of inquiry.
    3. A result given a particular dataset and approach.
  14. Ware (1989, 298) mentions ‘a randomized play the winner design’. What is it?
  15. Ware (1989, 299) mentions ‘adaptive randomization’. What is it, in your own words?
  16. Ware (1989, 299) mentions ‘randomized-consent’ and continues that it was ‘attractive in this setting because a standard approach to informed consent would require that parents of infants near death be approached to give informed consent for an invasive surgical procedure that would then, in some instances, not be administered. Those familiar with the agonizing experience of having a child in a neonatal intensive care unit can appreciate that the process of obtaining informed consent would be both frightening and stressful to parents.’ To what extent do you agree with this position, especially given, as Ware (1989, 305) mentions, ‘the need to withhold information about the study from parents of infants receiving CMT’?
  17. Ware (1989, 300) mentions ‘equipoise’. In your own words, please define and discuss it, using an example from your own experience and include references (write at least three paragraphs).


Pick one of the following options. Use Quarto, and include an appropriate title, author, date, link to a GitHub repo, and citations. Submit a PDF.

Option 1

Please follow Appendix E to build a quick website about a hypothetical product. Add Google Analytics. Deploy it using Netlify. Change some aspect of the website, add a different tracker, and push it to a new branch. Then use Netlify to conduct an A/B test. Write a paper, of at least two pages, about what you did and what you found.

Option 2

Please consider the Special Virtual Issue on Nonresponse Rates and Nonresponse Adjustments of the Journal of Survey Statistics and Methodology. Focus on one aspect of the editorial, and with reference to relevant literature, please discuss it in at least two pages.


At about this point the Howrah Paper from Appendix D would be appropriate.