11  Clean and prepare

Required material

Key concepts and skills

Key packages and functions

11.1 Introduction

“Well, Lyndon, you may be right and they may be every bit as intelligent as you say,” said Rayburn, “but I’d feel a whole lot better about them if just one of them had run for sheriff once.”

Sam Rayburn’s reaction to Lyndon Johnson’s enthusiasm about Kennedy’s incoming cabinet, as quoted in The Best and the Brightest (Halberstam 1972, 41).

In this chapter we put in place more-formal approaches for data cleaning and preparation. These are centered around:

  1. validity;
  2. internal consistency; and
  3. external consistency.

Your model does not care whether you validated your data, but you should. Validity means that the values in the dataset are not obviously wrong. For instance, with few exceptions, currencies should not have letters in them, names should not have numbers, and velocities should not be faster than the speed of light. Internal consistency means the dataset does not contradict itself. For instance, that might mean that constituent columns add up to the total column. External consistency means that the dataset does not, in general, contradict outside sources, and is deliberate when it does. For instance, if our dataset purports to be about the population of cities, then we would expect those populations to roughly match those available from relevant censuses on Wikipedia.
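To make these three ideas concrete, the following sketch applies one check of each kind to a small, made-up dataset of city populations (the values and column names are hypothetical and chosen only for illustration).

library(tidyverse)

city_populations <-
  tibble(
    city = c("Toronto", "Montreal"),
    total = c(2794356, 1762949),
    male = c(1350000, 855000),
    female = c(1444356, 907949)
  )

# Validity: populations should be numeric and non-negative
stopifnot(is.numeric(city_populations$total), all(city_populations$total >= 0))

# Internal consistency: constituent columns should add up to the total column
stopifnot(all(city_populations$male + city_populations$female == city_populations$total))

# External consistency: totals should roughly agree with an outside source, say a census
census_total_toronto <- 2794356 # hypothetical census figure, for illustration
stopifnot(abs(city_populations$total[1] - census_total_toronto) / census_total_toronto < 0.05)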

SpaceX, the US rocket company, uses cycles of 10 or 50 Hertz (equivalent to 0.1 and 0.02 seconds, respectively) to control their rockets. Each cycle, the inputs from sensors, such as temperature and pressure, are read, processed, and used to make a decision, such as whether to adjust some setting (Martin and Popper 2021). We use a similar iterative approach of small adjustments for data cleaning and preparation. Rather than trying to make everything perfect from the start, we get started, and iterate using a process of small, continuous, improvements.

To a large extent, the role of data cleaning and preparation is so great that the only people who understand a dataset are those who have cleaned it. Yet the paradox of data cleaning is that those who do the cleaning and preparation often have the least trust in the resulting dataset. At some point in every data science workflow, those doing the modelling should get their hands dirty with data cleaning. Even though few want to do it (Sambasivan et al. 2021), it is a far more influential stage than modelling. To clean and prepare data is to make many decisions, some of which may have important effects on our results. For instance, Northcutt, Athalye, and Mueller (2021) find that the test sets of some popular datasets in computer science contain labels that are wrong in around three per cent of cases, on average. This has the potential to result in incorrect conclusions.

For a long time, data cleaning and preparation was largely overlooked. We now realize that was a mistake. It has been difficult to trust results in disciplines that apply statistics for some time. The reproducibility crisis, which started in psychology but has now extended to many other fields in the physical and social sciences, brought to light issues such as p-value “hacking”, researcher degrees of freedom, file-drawer issues, and even data and results fabrication (Gelman and Loken 2013). Steps are now being put in place to address these. But there has been relatively little focus on the data gathering, cleaning, and preparation aspects of applied statistics, despite evidence that decisions made during these steps greatly affect statistical results (Huntington-Klein et al. 2021). In this chapter we focus on these issues.

While the statistical practices that underpin data science are themselves correct and robust when applied to simulated datasets, data science is not typically conducted with these types of datasets. For instance, data scientists are interested in “messy, unfiltered, and possibly unclean data—tainted by heteroskedasticity, complex dependence and missingness patterns—that until recently were avoided in polite conversations between more traditional statisticians” (Craiu 2019). Big data does not resolve this issue and may even exacerbate it. For instance, population inference based on larger amounts of poor-quality data will just lead to more confidently wrong conclusions (Meng 2018). The problems that are found in much of applied statistics research are not necessarily associated with researcher quality, or their biases (Silberzahn et al. 2018). Instead, they are a result of the environment within which data science is conducted. This chapter provides an approach and tools to explicitly think about this work.

Gelman and Vehtari (2020), writing about the most important statistical ideas of the past 50 years, say that each of them enabled new ways of thinking about data analysis, and each brought into the tent of statistics approaches that “had been considered more a matter of taste or philosophy”. The focus on data cleaning and preparation in this chapter is analogous, insofar as it represents a codification, or bringing inside the tent, of aspects that are typically, and incorrectly, considered matters of taste rather than core statistical concerns.

The workflow for data cleaning and preparation that we advocate is:

  1. Save the raw data.
  2. Begin with an end in mind.
  3. Write tests and documentation.
  4. Execute the plan on a small sample.
  5. Iterate the plan.
  6. Generalize the execution.
  7. Update tests and documentation.

We will need a variety of skills to be effective, but this is the very stuff of data science. The approach needed is some combination of dogged and sensible. Perfect is very much the enemy of good enough when it comes to data cleaning. And to be specific, it is better to have 90 per cent of the data cleaned and prepared, and to start exploring that, before deciding whether it is worth the effort to clean and prepare the remaining 10 per cent. Because that remainder will likely take an awful lot of time and effort.

All data regardless of whether they were obtained from farming, gathering, or hunting, will have issues. It is critical that we have approaches that can deal with a variety of concerns, and more importantly, understand how they might affect our modelling (Van den Broeck et al. 2005). To clean data is to analyze data. This is because the process forces us to make choices about what we value in our results (Au 2020).

11.2 Workflow

11.2.1 Save a copy of the raw data

The first step is to save the raw data into a separate, local, folder. It is important to save this raw data, to the extent that is possible, because it establishes the foundation for reproducibility (Wilson et al. 2017). If we are obtaining our data from a third-party, such as a government website, then we have no control over whether they will continue to host that data, update it, or change the address at which it is available. Saving a local copy also reduces the burden that we impose on their servers.

Having locally saved the raw data we must maintain a copy of it in that state, and not modify it. As we begin to clean and prepare it, we instead create a copy of the dataset. Maintaining the initial, raw, state of the dataset, and using scripts to create the dataset that we are interested in analyzing, ensures that our entire workflow is reproducible.
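For instance, a minimal sketch of this workflow might look like the following, where the URL and file paths are hypothetical.

library(tidyverse)

# Download the raw data once, from a hypothetical third-party source
raw_population_data <-
  read_csv("https://example.com/populations.csv")

# Save an unedited copy into its own folder and never modify it
write_csv(raw_population_data, "inputs/data/raw_population_data.csv")

# All subsequent cleaning starts from a fresh read of that local copy
population_data <-
  read_csv("inputs/data/raw_population_data.csv")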

11.2.2 Begin with an end in mind

Planning the endpoint forces us to begin with an end in mind and is important for a variety of reasons. As with scraping data, introduced in Chapter 9, it helps us to be proactive about scope-creep. But with data cleaning it additionally forces us to really think about what we want the final dataset to look like.

The first step is to sketch the dataset that we are interested in. The key features of the sketch will be aspects such as the names of the columns, their class, and the possible range of values. For instance, we might be interested in the populations of US states. So, our sketch might look like Figure 11.1.

Figure 11.1: Planned dataset of US states and their populations

In this case, the sketch forces us to decide that we want full names rather than abbreviations for the state names, and the population to be measured in millions. The process of sketching this endpoint has forced us to make decisions early on and be clear about our desired endpoint.

We then implement that using code to simulate data. Again, this process forces us to think about what reasonable values look like in our dataset because we are literally forced to decide which functions to use. We need to think carefully about the unique values of each variable. For instance, if the variable is meant to be “gender” then unique values such as “male”, “female”, “other”, and “unknown” may be expected, but a number such as “1,000” would likely be wrong. It also forces us to be explicit about names because we must assign the output of those functions to a variable. For instance, we could simulate some population data for the US states.

library(tidyverse)

set.seed(853)

simulated_population <-
  tibble(
    state = state.name,
    population = runif(n = 50, min = 0, max = 50) |>
      round(digits = 2)
  )

simulated_population
# A tibble: 50 × 2
   state       population
   <chr>            <dbl>
 1 Alabama          18.0 
 2 Alaska            6.01
 3 Arizona          24.2 
 4 Arkansas         15.8 
 5 California        1.87
 6 Colorado         20.2 
 7 Connecticut       6.54
 8 Delaware         12.1 
 9 Florida           7.9 
10 Georgia           9.44
# … with 40 more rows
# ℹ Use `print(n = ...)` to see more rows

Our purpose, during data cleaning and preparation, is to then bring our raw data close to that plan. Ideally, we would plan so that the desired endpoint of our dataset is “tidy data”. This was introduced in Chapter 3, but briefly, it means that (Wickham and Grolemund 2022; Wickham 2014, 4):

  1. each variable is in its own column;
  2. each observation is in its own row; and
  3. each value is in its own cell.
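For instance, a small sketch of untidy and tidy versions of the same made-up state population data, using pivot_longer(), might look like this.

untidy_population <-
  tibble(
    state = c("Alabama", "Alaska"),
    `2020` = c(5.0, 0.7),
    `2021` = c(5.1, 0.7)
  )

tidy_population <-
  untidy_population |>
  pivot_longer(
    cols = c(`2020`, `2021`),
    names_to = "year",
    values_to = "population"
  )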

At this stage it is important to begin to think about validity and internal consistency. What are some of the features that we know these data should have? Note these down as you go through the process of simulating the dataset because we will draw on them to write tests.

11.2.3 Start small

Having thoroughly planned we can turn to the raw data that we are dealing with. Usually, regardless of what the raw data look like, we want to manipulate them into a rectangular dataset as quickly as possible. This allows us to use familiar tidyverse approaches. For instance, let us assume that we are starting with a .txt file.

The first step is to look for regularities in the dataset. We want to end up with tabular data, which means that we need some type of delimiter to distinguish different columns. Ideally this might be features such as a comma, a semicolon, a tab, a double space, or a line break. For instance, in the following case we would take advantage of the comma.

Alabama, 5
Alaska, 0.7
Arizona, 7
Arkansas, 3
California, 40

In more challenging cases there may be some regular feature of the dataset that we can take advantage of. For instance, sometimes various text is repeated, as in the following case with “State is” and “and population is”.

State is Alabama and population is 5 million.
State is Alaska and population is 0.7 million.
State is Arizona and population is 7 million.
State is Arkansas and population is 3 million.
State is California and population is 40 million.

In this case, although we do not have a traditional delimiter, we can use the regularity of “State is” and “and population is” to get what we need. A more difficult case is when we do not have line breaks. This final case is illustrative of that.

Alabama 5 Alaska 0.7 Arizona 7 Arkansas 3 California 40

One way to approach this is to take advantage of the different classes and values that we are looking for. For instance, in this case, we know that we are after US states, so there are only 50 possible options (setting D.C. to one side for the time being), and we could use the presence of these as a delimiter. We could also use the fact that population is a number, and so split based on a space followed by a number.

We will now convert this final case into tidy data using tidyr (Wickham 2021).

raw_data <-
  c("Alabama 5 Alaska 0.7 Arizona 7 Arkansas 3 California 40")

data_as_tibble <-
  tibble(raw = raw_data)

tidy_data <-
  data_as_tibble |>
  separate(
    col = raw,
    into = letters[1:5],
    sep = "(?<=[[:digit:]]) "
  ) |>
  pivot_longer(
    cols = letters[1:5],
    names_to = "drop_me",
    values_to = "separate_me"
  ) |>
  separate(
    col = separate_me,
    into = c("state", "population"),
    sep = " (?=[[:digit:]])"
  ) |>
  select(-drop_me)

tidy_data
# A tibble: 5 × 2
  state      population
  <chr>      <chr>     
1 Alabama    5         
2 Alaska     0.7       
3 Arizona    7         
4 Arkansas   3         
5 California 40        

11.2.4 Write tests and documentation

Having established a rectangular dataset, albeit a messy one, we should begin to look at the classes that we have. We do not necessarily want to fix the classes at this point, because that can result in lost data. But we look at the class to see what it is, and then compare it to our simulated dataset to see where it needs to get to. We note the columns where it is different.

Before changing the class and before going onto more bespoke issues, we should deal with some of the common issues in each class. Some common issues are:

  • Commas and other punctuation, such as denomination signs, in columns that should be numeric.
  • Inconsistent formatting of dates, such as “December” and “Dec” and “12”.
  • Unexpected character encoding, especially in Unicode, which may not display consistently. By way of background, character encoding is needed for computers, which are based on strings of 0s and 1s, to be able to consider symbols such as alphabets. One source of particularly annoying data cleaning issues is different character encoding, particularly when dealing with foreign languages or odd characters. In general, we use UTF-8. The encoding of a character vector can be found using Encoding().
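For instance, a minimal sketch of checking and converting the encoding of a hypothetical latin1 string might be the following.

# A made-up latin1-encoded string, for illustration
city <- "Montr\xe9al"
Encoding(city) <- "latin1"

Encoding(city)

# Convert to UTF-8, which is what we generally want
city_utf8 <- iconv(city, from = "latin1", to = "UTF-8")
Encoding(city_utf8)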

Typically, we want to fix anything immediately obvious. For instance, we should remove commas that have been used to group digits in currencies. However, the situation will typically quickly become dire and feel overwhelming. What we need to do is to look at the unique values in each variable, and then triage what we will fix. We make the decision of how to triage based on what is likely to have the largest impact. That usually means starting with counts of the observations, sorting them in descending order, and then dealing with each as they come.

When the tests of membership are passed, then finally we can change the class, and run all the tests again. We have adapted this idea from the software development approach of unit testing. Tests are crucial because they enable us to understand whether software (or in this case data) is fit for purpose (Wilson 2021).

Let us run through an example with a collection of strings, some of which are slightly wrong. This type of output is typical of OCR, which often gets most of the way there, but not quite.

messy_string <- paste(
  c("Patricia, Ptricia, PatricIa, Patncia, PatricIa"),
  c("Patricia, Patricia, Patric1a, Patricia , 8atricia"),
  sep = ", "
)

As before, we first want to get this into a rectangular dataset.

messy_dataset <-
  tibble(names = messy_string) |>
  separate_rows(names, sep = ", ")

messy_dataset
# A tibble: 10 × 1
   names      
   <chr>      
 1 "Patricia" 
 2 "Ptricia"  
 3 "PatricIa" 
 4 "Patncia"  
 5 "PatricIa" 
 6 "Patricia" 
 7 "Patricia" 
 8 "Patric1a" 
 9 "Patricia "
10 "8atricia" 

We now need to decide which of these errors we are going to fix. To help us decide which are most important, we could create a count.

messy_dataset |>
  count(names, sort = TRUE)
# A tibble: 7 × 2
  names           n
  <chr>       <int>
1 "Patricia"      3
2 "PatricIa"      2
3 "8atricia"      1
4 "Patncia"       1
5 "Patric1a"      1
6 "Patricia "     1
7 "Ptricia"       1

The most common unique observation is the correct one. The next one—“PatricIa”—looks like the “i” has been incorrectly capitalized, and the one after that—“8atricia”—is distinguished by an “8” instead of a “P”. We can fix these issues with str_replace_all() and then redo the count.

messy_dataset_fix_I_8 <-
  messy_dataset |>
  mutate(
    names = str_replace_all(names, "PatricIa", "Patricia"),
    names = str_replace_all(names, "8atricia", "Patricia")
  )

messy_dataset_fix_I_8 |>
  count(names, sort = TRUE)
# A tibble: 5 × 2
  names           n
  <chr>       <int>
1 "Patricia"      6
2 "Patncia"       1
3 "Patric1a"      1
4 "Patricia "     1
5 "Ptricia"       1

Already this is much better and 60 per cent of the values are correct, compared with earlier where it was 30 per cent. There are two more obvious errors—“Ptricia” and “Patncia”—with the first missing an “a” and the second having an “n” where the “ri” should be. Again, we can fix those.

messy_dataset_fix_a_n <-
  messy_dataset_fix_I_8 |>
  mutate(
    names = str_replace_all(names, "Ptricia", "Patricia"),
    names = str_replace_all(names, "Patncia", "Patricia")
  )

messy_dataset_fix_a_n |>
  count(names, sort = TRUE)
# A tibble: 3 × 2
  names           n
  <chr>       <int>
1 "Patricia"      8
2 "Patric1a"      1
3 "Patricia "     1

We have achieved an 80 per cent outcome with not too much effort. The two remaining issues are more subtle. The first has occurred because the “i” has been incorrectly coded as a “1”. In some fonts this will show up, but in others it will be more difficult to see. This is a common issue, especially with OCR, and something to be aware of. The second is similarly subtle and occurs because of a trailing space. Trailing and leading spaces are another common issue and we can address them with str_trim(). After we fix these two remaining issues then we will have all entries corrected.

cleaned_data <-
  messy_dataset_fix_a_n |>
  mutate(
    names = str_replace_all(names, "Patric1a", "Patricia"),
    names = str_trim(names, side = c("right"))
  )

cleaned_data |>
  count(names, sort = TRUE)
# A tibble: 1 × 2
  names        n
  <chr>    <int>
1 Patricia    10

We have been doing the tests in our head in this example. We know that we are hoping for “Patricia”. But we can start to document this test as well. One way is to look to see if values other than “Patricia” exist in the dataset.

check_me <-
  cleaned_data |>
  filter(names != "Patricia")

if (nrow(check_me) > 0) {
  print("Still have values that are not Patricia!")
}

We can make things a little more imposing by stopping our code execution if the condition is not met with stopifnot(). To use that we define a condition that we would like met. We could implement this type of check throughout our code. For instance, if we expected there to be a certain number of rows in the dataset, or for a certain column to have various properties, such as being an integer, or a factor.

stopifnot(nrow(check_me) == 0)

We can use stopifnot() to ensure that our script is working as expected as it goes through.

Another way to write tests for our dataset is to use testthat (Wickham 2011). Although developed for testing packages, we can use the same functionality to test our datasets. For instance, we can use expect_length() to check the length of a dataset and expect_equal() to check the content.

library(testthat)

expect_length(check_me, 1)
expect_equal(class(cleaned_data$names), "character")
expect_equal(unique(cleaned_data$names), "Patricia")

If the tests pass then nothing happens, but if the tests fail then the script will stop.

What do we test? It is a difficult problem, and we detail a range of more-specific tests in the next section. But broadly we test what we have, against what we expect. The engineers working on the software for the Apollo program in the 1960s initially considered writing tests to be “busy work” (Mindell 2008, 170). But they eventually came to realize that NASA would not have faith that software could be used to send men to the moon unless it was accompanied by a comprehensive suite of tests. And it is the same for data science.

Start with tests for validity. These will typically be aspects such as the class of the variables, and then their unique values. For instance, if we were using a recent dataset then columns that are years could be tested to ensure that all elements have four digits and start with a “2”. Baumgartner (2021) describes this as tests on the schema. After that turn to checks of internal consistency. For instance, if there is a column for different responses, then check that the sum of those is equal to the total column. Finally, turn to tests for external consistency. Here we want to use outside information to inform our tests. For instance, if we had a column of, say, NMR for Germany (this concept was introduced in Chapter 2), then we could look at the estimates from, say, the WHO, and ensure our NMR column is consistent with these. Experienced analysts do this all in their head. This is great, but the issue is that it does not scale, can be faulty and inconsistent, and overloads reputation. We return to this issue in Chapter 14 in the context of modelling.
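As a small sketch of the first two of these, with hypothetical column names and values, we might check the schema of a year column and the internal consistency of a total column.

survey_data <-
  tibble(
    year = c(2019, 2020, 2021),
    yes = c(10, 12, 16),
    no = c(5, 8, 4),
    total = c(15, 20, 20)
  )

# Validity: years should have four digits and start with a "2"
stopifnot(all(str_detect(as.character(survey_data$year), "^2[[:digit:]]{3}$")))

# Internal consistency: the response columns should add up to the total column
stopifnot(all(survey_data$yes + survey_data$no == survey_data$total))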

We write tests throughout our code, rather than only right at the end. In particular using stopifnot() statements on as many intermediate steps as possible ensures that the dataset is being cleaned in a way that we expect. For instance, when merging two datasets we could check that the column names in the datasets are unique, apart from the column/s to be used as the key. Or that the number of observations of each type is being carried through appropriately. And even that the dimensions of the dataset are not being unexpectedly changed.
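For instance, a sketch of such checks around a merge, using two hypothetical datasets, might look like the following.

left_data <- tibble(state = state.name, population = runif(n = 50))
right_data <- tibble(state = state.name, area = runif(n = 50))

# The column names should only overlap on the key
stopifnot(intersect(names(left_data), names(right_data)) == "state")

merged_data <-
  left_data |>
  left_join(right_data, by = "state")

# The number of rows should not change unexpectedly
stopifnot(nrow(merged_data) == nrow(left_data))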

11.2.5 Iterate, generalize, and update

We could now iterate the plan. In this most recent case, we started with 10 entries. There is no reason that we could not increase this to 100 or even 1,000. We may need to generalize the cleaning procedures and tests. But eventually we would start to bring the dataset into some sort of order.

11.3 Checking and testing

Robert Caro, the biographer of Lyndon Johnson introduced in Chapter 5, spent years tracking down everyone connected to the 36th President of the United States. Caro and his wife went so far as to live in Texas Hill Country for three years so that they could better understand where Johnson was from. When Caro heard that Johnson, as a senator, would run to the Senate from where he stayed in D.C., he ran that route multiple times himself to try to understand why Johnson was running. Caro eventually understood it only when he ran the route as the sun was rising—just as Johnson had done; it turns out that the sun hits the Senate Rotunda in a particularly inspiring way (Caro 2019, 156). This background work enabled him to uncover aspects that no one else knew. For instance, Johnson almost surely stole his first election win (Caro 2019, 116). We need to understand our data to this same extent. We must turn every page, and go to every extreme.

The idea of negative space is well established in design. It refers to that which surrounds the subject. Sometimes negative space is used as an effect, for instance the logo of FedEx, an American logistics company, has negative space between the E and x that creates an arrow. In a similar way, we want to be cognizant of the data that we have, and the data that we do not have (Hodgetts 2022). We are worried that the data that we do not have somehow has meaning, potentially even to the extent of changing our conclusions. When we are cleaning data, we are looking for anomalies. We are interested in values that are in there that should not be, but also the opposite situation—values that are missing that should not be. There are three tools that we use to identify these situations: graphs, counts, and tests.

We use these tools to ensure that we are not changing good data to bad. Especially when our cleaning and preparation requires many steps, it may be that fixes at one stage are un-done later. We use graphs, counts, and especially tests, to prevent this. The importance of these grows exponentially with dataset size: small and medium datasets are amenable to manual inspection and other approaches that rely on the analyst, while larger datasets especially require tests (Hand 2018).

11.3.1 Graphs

Graphs are an invaluable tool when cleaning data, because they show each observation in the dataset, in relation to the other observations. They are especially useful for identifying when a value does not belong. For instance, if a value is expected to be numerical, but it is still a character then it will not plot and a warning will be displayed. Graphs will be especially useful for numerical data, but are still useful for text and categorical data. Let us pretend that we have a situation where we are interested in a person’s age, for some youth survey. We have the following data:

youth_survey_data <-
  tibble(ages = c(
    15.9,
    14.9,
    16.6,
    15.8,
    16.7,
    17.9,
    12.6,
    11.5,
    16.2,
    19.5,
    150
  ))

youth_survey_data |>
  ggplot(aes(y = ages, x = 0)) +
  geom_point() +
  theme_minimal() +
  labs(
    x = "Position",
    y = "Simulated ages"
  )

Figure 11.2: The ages in the simulated youth survey dataset clearly identify a likely data issue

Figure 11.2 shows the unexpected value of 150. The most likely explanation is that the data were incorrectly entered, missing the decimal place, and should be 15.0. We could fix that, document it, and then redo the graph, which would show that everything seemed more valid.
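For instance, a sketch of that fix, under the assumption that 150 is a misplaced decimal and should be 15.0, might be:

youth_survey_data_fixed <-
  youth_survey_data |>
  mutate(ages = if_else(ages == 150, 15.0, ages))

# Re-checking the range (or redoing the graph) now shows plausible ages
range(youth_survey_data_fixed$ages)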

11.3.2 Counts

We want to focus on getting most of the data right. So, we are interested in the counts of unique values. Hopefully most of the data are concentrated in the most common counts. But it can also be useful to invert it and see what is especially uncommon. The extent to which we want to deal with these depends on what we need. Ultimately, each time we fix one we are getting very few additional observations, potentially even just one. Counts are especially useful with text or categorical data but can be helpful with numerical data as well.

Let us see an example of text data, each of which is meant to be “Australia”.

australian_names_data <-
  tibble(
    country = c(
      "Australie",
      "Austrelia",
      "Australie",
      "Australie",
      "Aeustralia",
      "Austraia",
      "Australia",
      "Australia",
      "Australia",
      "Australia"
    )
  )

australian_names_data |>
  count(country, sort = TRUE)
# A tibble: 5 × 2
  country        n
  <chr>      <int>
1 Australia      4
2 Australie      3
3 Aeustralia     1
4 Austraia       1
5 Austrelia      1

The use of this count clearly identifies where we should spend our time: changing “Australie” to “Australia” would almost double the amount of usable data.
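For instance, making that single fix and redoing the count shows the effect.

australian_names_data |>
  mutate(country = str_replace_all(country, "Australie", "Australia")) |>
  count(country, sort = TRUE)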

11.3.3 Tests

As we said in Chapter 4, if you write code, then you are a programmer, but there is a difference between a student messing around for fun, and an astronomer writing the code that runs the James Webb Telescope. Following Weinberg (1971, 122), we distinguish between amateurs and professionals based on the end-user. When we first start out coding, we typically write code that only we write, read, and rely on. For instance, we may write some code for a paper that we submit for a class. After we get a grade, then in most cases, everything can be forgotten about. In contrast, a professional writes code for, and often with, other people. For instance, much of the academic research in the social sciences relies on code. If that research is to contribute to lasting knowledge, then partly that means the code that underpins it is being written for others and must work for others well after the researcher has moved onto other projects. A professional places appropriate care on tasks that ensure code can be considered by others. A large part of that is tests. Code without tests can only ever be an amateurish affair.

Some things are so important that we require that the cleaned dataset have them. These are conditions that we should check. They would typically come out of experience, expert knowledge, or the planning and simulation stages. For instance, there should be no negative numbers in an age variable, and no ages above 140. For these we could specifically require that the condition is met. Another example is when doing cross-country analysis, where a list of country names that we know should be in our dataset would be useful. Our tests would then check whether: 1) there were values in our dataset that were not in that list; and 2) there were values in that list that were not in our dataset.

To have a concrete example, let us consider if we were doing some analysis about the five largest counties in Kenya. From talking with experts, we find these are: “Nairobi”, “Kiambu”, “Nakuru”, “Kakamega”, and “Bungoma”. Let us create that variable first.

correct_kenya_counties <-
  c(
    "Nairobi",
    "Kiambu",
    "Nakuru",
    "Kakamega",
    "Bungoma"
  )

We have the following dataset, which contains errors.

top_five_kenya <-
  tibble(county = c(
    "Nairobi",
    "Nairob1",
    "Nakuru",
    "Kakamega",
    "Nakuru",
    "Kiambu",
    "Kiambru",
    "Kabamega",
    "Bun8oma",
    "Bungoma"
  ))

top_five_kenya |>
  count(county, sort = TRUE)
# A tibble: 9 × 2
  county       n
  <chr>    <int>
1 Nakuru       2
2 Bun8oma      1
3 Bungoma      1
4 Kabamega     1
5 Kakamega     1
6 Kiambru      1
7 Kiambu       1
8 Nairob1      1
9 Nairobi      1

Based on the count we know that we have to fix some of them. There are two with numbers in the names.

top_five_kenya_fixed_1_8 <-
  top_five_kenya |>
  mutate(
    county = str_replace_all(county, "Nairob1", "Nairobi"),
    county = str_replace_all(county, "Bun8oma", "Bungoma")
  )

top_five_kenya_fixed_1_8 |>
  count(county, sort = TRUE)
# A tibble: 7 × 2
  county       n
  <chr>    <int>
1 Bungoma      2
2 Nairobi      2
3 Nakuru       2
4 Kabamega     1
5 Kakamega     1
6 Kiambru      1
7 Kiambu       1

At this point we can compare this with our known correct variable. We check both ways, i.e. whether there is anything in the correct variable that is not in our dataset, and whether there is anything in the dataset that is not in our correct variable. We use our check conditions to decide whether we are finished or not.

top_five_kenya_fixed_1_8$county |> unique()
[1] "Nairobi"  "Nakuru"   "Kakamega" "Kiambu"   "Kiambru"  "Kabamega" "Bungoma" 
if (all(top_five_kenya_fixed_1_8$county |>
  unique() %in% correct_kenya_counties)) {
  "The cleaned counties match the expected counties"
} else {
  "Not all of the counties have been cleaned completely"
}
[1] "Not all of the counties have been cleaned completely"
if (all(correct_kenya_counties %in% top_five_kenya_fixed_1_8$county |>
  unique())) {
  "The expected counties are in the cleaned counties"
} else {
  "Not all the expected counties are in the cleaned counties"
}
[1] "The expected counties are in the cleaned counties"

And so it is clear that we still have cleaning to do because not all the counties match what we were expecting.
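For instance, a sketch of fixing the two remaining errors and then re-running the membership checks as explicit tests might be:

top_five_kenya_cleaned <-
  top_five_kenya_fixed_1_8 |>
  mutate(
    county = str_replace_all(county, "Kiambru", "Kiambu"),
    county = str_replace_all(county, "Kabamega", "Kakamega")
  )

stopifnot(
  all(unique(top_five_kenya_cleaned$county) %in% correct_kenya_counties),
  all(correct_kenya_counties %in% unique(top_five_kenya_cleaned$county))
)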

We will talk about explicit tests for class and dates, given their outsized importance, and how common it is for them to go wrong. But other aspects to explicitly consider testing include:

  • Variables of monetary values should be tested for reasonable bounds given the situation. In some cases negative values will not be possible. Sometimes a top bound can be identified. Monetary variables should be numeric. They should not have commas or other separators. They should not contain symbols such as currency signs or semicolons.
  • Variables of population values should likely not be negative. Populations of cities should likely be somewhere between 100,000 and 50,000,000. They again should be numeric, and contain only numbers, no symbols.
  • Names should be character variables. They likely do not contain numbers. They may contain some limited set of symbols and this would be context specific.

More generally, work with experts and draw on prior knowledge to work out some reasonable features for the variables of interest and then implement these.

We can use validate (Loo and Jonge 2021) to set up a series of tests. For instance, here we will simulate some data with clear issues.

library(tidyverse)

set.seed(853)

dataset_with_issues <-
  tibble(
    age = c(
      runif(n = 9, min = 0, max = 100) |> round(),
      1000
    ),
    gender = c(
      sample(
        x = c("female", "male", "other", "prefer not to disclose"),
        size = 9,
        replace = TRUE,
        prob = c(0.4, 0.4, 0.1, 0.1)
      ),
      "elephant"
    ),
    income = rexp(n = 10, rate = 0.10) |> round() |> as.character()
  )

dataset_with_issues
# A tibble: 10 × 3
     age gender                 income
   <dbl> <chr>                  <chr> 
 1    36 female                 20    
 2    12 prefer not to disclose 16    
 3    48 male                   0     
 4    32 female                 2     
 5     4 female                 1     
 6    40 female                 13    
 7    13 female                 13    
 8    24 female                 7     
 9    16 male                   3     
10  1000 elephant               2     

In this case, there is an impossible age, one observation in the gender variable that should not be there, and finally, income is a character variable instead of a numeric. We use validator() to establish rules we expect the data to satisfy and confront() to determine whether it does.

library(validate)

rules <- validator(
  is.numeric(age),
  is.character(gender),
  is.numeric(income),
  age < 120,
  gender %in% c("female", "male", "other", "prefer not to disclose")
)

out <-
  confront(dataset_with_issues, rules)

summary(out)
  name items passes fails nNA error warning
1   V1     1      1     0   0 FALSE   FALSE
2   V2     1      1     0   0 FALSE   FALSE
3   V3     1      0     1   0 FALSE   FALSE
4   V4    10      9     1   0 FALSE   FALSE
5   V5    10      9     1   0 FALSE   FALSE
                                                           expression
1                                                     is.numeric(age)
2                                                is.character(gender)
3                                                  is.numeric(income)
4                                                           age < 120
5 gender %vin% c("female", "male", "other", "prefer not to disclose")

In this case, we can see that there are issues with the final three rules that we established. More generally, van der Loo (2022) provides many example tests that can be used.

11.3.4 Class

It is sometimes said that Americans are obsessed with money, while the English are obsessed with class. In the case of data cleaning and preparation we need to be English. Class is critical and worthy of special attention. We introduced class in Chapter 3 and here we focus on “numeric”, “character”, and “factor”. Explicit checks of the class of variables are essential. Accidentally assigning the wrong class to a variable can have a large effect on subsequent analysis. In particular:

  • check whether some value should be a number or a factor; and
  • check that values are numbers not characters.

To understand why it is important to be clear about whether a value is a number or a factor, consider the following situation:

simulated_class_data <-
  tibble(
    response = c(1, 1, 0, 1, 0, 1, 1, 0, 0),
    group = c(1, 2, 1, 1, 2, 3, 1, 2, 3)
  ) |>
  mutate(
    group_as_integer = as.integer(group),
    group_as_factor = as.factor(group),
  )

Let us start with “group” as an integer and fit a linear regression of “response” on it (we cover regression in more detail in Chapter 14). We can then try it as a factor. Table 11.1 clearly shows how different the results are and highlights the absolute necessity of getting the class of variables used in regression right. The interpretation of the variable is completely different.

library(modelsummary)

models <- list(
  "Group as integer" = lm(
    response ~ group_as_integer,
    data = simulated_class_data
  ),
  "Group as factor" = lm(
    response ~ group_as_factor,
    data = simulated_class_data
  )
)

modelsummary(models)
Table 11.1: Examining the effect of class on regression results

                  Group as integer   Group as factor
(Intercept)        0.840              0.750
                  (0.450)            (0.283)
group_as_integer  −0.160
                  (0.231)
group_as_factor2                     −0.417
                                     (0.432)
group_as_factor3                     −0.250
                                     (0.489)
Num.Obs.           9                  9
R2                 0.064              0.138
R2 Adj.           −0.070             −0.150
AIC               18.4               19.6
BIC               18.9               20.4
Log.Lik.          −6.179             −5.811
F                  0.479              0.478
RMSE               0.48               0.46

Class is so important, subtle, and can have such a pernicious effect on analysis, that it is prima facie difficult to believe any analysis that does not have a suite of tests that check class. This is especially the case just before modelling, but it is worthwhile setting this up as part of data cleaning and preparation. One reason that Jane Street, the US proprietary trading firm, uses a particular programming language, OCaml, is that its type system makes it more reliable (Somers 2015). When code matters, class is of vital concern.
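For instance, a couple of stopifnot() checks of class, using the simulated dataset from above, might look like the following.

stopifnot(
  is.factor(simulated_class_data$group_as_factor),
  is.numeric(simulated_class_data$response)
)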

11.3.5 Dates

A shibboleth for whether someone has worked with dates before is their reaction when you tell them you are going to be working with dates. If they immediately involuntarily shudder and share a horror story, then they have worked with dates before, and any other reaction means they have not.

Extensive checking of dates is critical. Ideally, we would like dates to be in the following format: YYYY-MM-DD. There are differences of opinion as to what is an appropriate date format in the broader world, and reasonable people can differ on whether 1 July 2022 or July 1, 2022, is better, but YYYY-MM-DD is the only acceptable format for datasets.
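For instance, lubridate, which we rely on later in this chapter, parses each of these written forms into that single format.

library(lubridate)

ymd("2022-07-01")
dmy("1 July 2022")
mdy("July 1, 2022")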

A few tests that could be useful include:

  • If a column is days of the week, then test that the only components are Monday, Tuesday, … Sunday. Further, test that all seven days are present. Similarly, for month.
  • Test that the numbers of days are appropriate for each month, for instance, check that September has 30 days, etc.
  • Check whether the dates are in order in the dataset.
  • Check that the years are complete and appropriate to the analysis period.
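A sketch of a couple of these tests, using a made-up column of dates covering 2022, might be:

library(lubridate)

dates_data <-
  tibble(dates = seq.Date(from = as.Date("2022-01-01"), to = as.Date("2022-12-31"), by = "day"))

# All twelve months should be present
stopifnot(all(1:12 %in% month(dates_data$dates)))

# The dates should be in order
stopifnot(all(dates_data$dates == sort(dates_data$dates)))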

In Chapter 2 we introduced a dataset of homeless shelter usage in Toronto in 2021 using opendatatoronto (Gelfand 2020). Here we use that same dataset, but for 2017 to 2020, to illustrate one process for checking dates. We first need to download the data.

library(opendatatoronto)
library(tidyverse)

earlier_toronto_shelters <-
  search_packages("Daily Shelter Occupancy") |>
  list_package_resources() |>
  filter(
    name %in% c(
      "Daily shelter occupancy 2017.csv",
      "Daily shelter occupancy 2018.csv",
      "Daily shelter occupancy 2019.csv",
      "Daily shelter occupancy 2020.csv"
    )
  ) |>
  group_split(name) |>
  map_dfr(get_resource, .id = "file")

write_csv(
  x = earlier_toronto_shelters,
  file = "earlier_toronto_shelters.csv"
)
# A tibble: 6 × 14
   file CAPACITY FACILI…¹ OCCUP…² OCCUP…³ ORGAN…⁴ PROGR…⁵ SECTOR SHELT…⁶ SHELT…⁷
  <dbl>    <dbl> <chr>      <dbl> <chr>   <chr>   <chr>   <chr>  <chr>   <chr>  
1     1       16 COSTI R…      16 2017-0… COSTI … COSTI … Co-ed  100 Li… Toronto
2     1       17 Christi…      13 2017-0… Christ… Christ… Men    973 La… Toronto
3     1       63 Christi…      63 2017-0… Christ… Christ… Men    973 La… Toronto
4     1       70 Christi…      66 2017-0… Christ… Christ… Famil… 43 Chr… Toronto
5     1       60 Birchmo…      58 2017-0… City o… Birchm… Men    1673 K… Toronto
6     1      160 Birkdal…     168 2017-0… City o… Birkda… Famil… 1229 E… Toronto
# … with 4 more variables: SHELTER_NAME <chr>, SHELTER_POSTAL_CODE <chr>,
#   SHELTER_PROVINCE <chr>, X_id <dbl>, and abbreviated variable names
#   ¹​FACILITY_NAME, ²​OCCUPANCY, ³​OCCUPANCY_DATE, ⁴​ORGANIZATION_NAME,
#   ⁵​PROGRAM_NAME, ⁶​SHELTER_ADDRESS, ⁷​SHELTER_CITY
# ℹ Use `colnames()` to see all variable names

We need to make the names easier to type, specify the year based on the file, and only keep relevant columns.

library(janitor)

earlier_toronto_shelters_clean_years <-
  earlier_toronto_shelters |>
  clean_names() |>
  mutate(
    file =
      case_when(
        file == "1" ~ 2017,
        file == "2" ~ 2018,
        file == "3" ~ 2019,
        file == "4" ~ 2020,
        TRUE ~ -1
      )
  ) |>
  select(
    -x_id,
    -facility_name,
    -organization_name,
    -program_name,
    -shelter_address,
    -shelter_city
  )

The main issue with this dataset will be the dates. For 2017-2019 (inclusive) the dates appear to be year-month-day, but for 2020 they appear to be month-day-year. The separator is also inconsistent, changing from “-” to “/”. We first fix that, check our guesses, and then get to a more pernicious and subtle issue. When working with dates, we draw heavily on lubridate (Grolemund and Wickham 2011).

library(lubridate)

earlier_toronto_shelters_test_date <-
  earlier_toronto_shelters_clean_years |>
  mutate(
    occupancy_date =
      str_remove(
        # remove times
        occupancy_date,
        "T[:digit:]{2}:[:digit:]{2}:[:digit:]{2}"
      ),
    occupancy_date =
      str_replace_all(
        # make separation consistent
        occupancy_date,
        "/",
        "-"
      )
  ) |>
  # Parsing differs between 2017-2019 and 2020.
  mutate(date = case_when(
    file == 2020 ~ mdy(occupancy_date, quiet = TRUE),
    file %in% c(2017, 2018, 2019) ~ ymd(occupancy_date, quiet = TRUE),
    TRUE ~ NA_Date_
  )) |>
  rename(year_from_file = file) |>
  select(
    year_from_file,
    date,
    occupancy_date,
    sector,
    occupancy,
    capacity
  )

earlier_toronto_shelters_test_date
# A tibble: 156,977 × 6
   year_from_file date       occupancy_date sector   occupancy capacity
            <dbl> <date>     <chr>          <chr>        <dbl>    <dbl>
 1           2017 2017-01-01 2017-01-01     Co-ed           16       16
 2           2017 2017-01-01 2017-01-01     Men             13       17
 3           2017 2017-01-01 2017-01-01     Men             63       63
 4           2017 2017-01-01 2017-01-01     Families        66       70
 5           2017 2017-01-01 2017-01-01     Men             58       60
 6           2017 2017-01-01 2017-01-01     Families       168      160
 7           2017 2017-01-01 2017-01-01     Families       119      150
 8           2017 2017-01-01 2017-01-01     Men             23       28
 9           2017 2017-01-01 2017-01-01     Families         8        0
10           2017 2017-01-01 2017-01-01     Co-ed           14       40
# … with 156,967 more rows
# ℹ Use `print(n = ...)` to see more rows

We can check whether the guess of the date orderings was at least plausible by looking at the distribution of year (Table 11.2), month (Figure 11.3), and day components (Figure 11.4). We are interested in differences between the three years 2017-2019 and 2020.

earlier_toronto_shelters_test_date <-
  earlier_toronto_shelters_test_date |>
  separate(
    occupancy_date,
    into = c("one", "two", "three"),
    sep = "-",
    remove = FALSE
  )
earlier_toronto_shelters_test_date |>
  count(one) |>
  rename(`First component` = one, Number = n) |>
  knitr::kable(
    col.names = c("First component", "Number"),
    digits = 2,
    booktabs = TRUE,
    linesep = "",
    format.args = list(big.mark = ",")
  )
Table 11.2: Counts, by the first component of occupancy date, for 2017-2020
First component Number
01 3,503
02 3,277
03 3,555
04 3,562
05 3,671
06 3,534
07 3,601
08 3,355
09 3,210
10 3,317
11 3,195
12 3,281
2017 38,700
2018 37,770
2019 39,446
earlier_toronto_shelters_test_date |>
  count(year_from_file, two) |>
  ggplot(aes(x = two, y = n)) +
  geom_point() +
  theme_minimal() +
  labs(
    x = "Second component of occupancy date",
    y = "Number"
  ) +
  facet_wrap(
    vars(year_from_file),
    scales = "free",
    nrow = 2,
    ncol = 2
  )

Figure 11.3: Counts, by the second component of occupancy date, for 2017-2020

earlier_toronto_shelters_test_date |>
  count(year_from_file, three) |>
  ggplot(aes(x = three, y = n)) +
  geom_point() +
  theme_minimal() +
  labs(
    x = "Third component of occupancy date",
    y = "Number"
  ) +
  facet_wrap(
    vars(year_from_file),
    scales = "free",
    nrow = 2,
    ncol = 2
  )

Figure 11.4: Counts, by third component of occupancy date, of entries for 2017-2020

It is increasingly clear that our guess that the order was swapped around for 2020 seems right. We would be especially concerned if the distribution of the days was not roughly uniform, or if we had values other than [1-12] in the month.
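We could also make that concern an explicit test. Under our guesses about the orderings, for 2017-2019 the second component should be the month and so should only contain values from “01” to “12”. The following is a sketch that relies on those guesses.

month_component <-
  earlier_toronto_shelters_test_date |>
  filter(year_from_file != 2020) |>
  pull(two)

stopifnot(all(month_component %in% sprintf("%02d", 1:12)))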

One graph that is especially useful when cleaning a dataset is the order the observations appear in the dataset. For instance, we would generally expect that there would be a rough ordering in terms of date. To examine whether this is the case, we can graph the date variable in the order it appears in the dataset (Figure 11.5).

earlier_toronto_shelters_test_date |>
  mutate(row_number = seq_len(nrow(earlier_toronto_shelters_test_date))) |>
  ggplot(aes(x = row_number, y = date), alpha = 0.1) +
  geom_point() +
  theme_minimal() +
  labs(
    x = "Row number",
    y = "Date"
  )

Figure 11.5: Comparison of row number with date

While this is just a quick graph it illustrates the point—the data are not in order of date in the dataset. If they were in order, then we would expect them to be along the diagonal. It is odd that the data are not in order, especially as there appears to be something systematic initially. We can summarize the data to get a count of occupancy by day.

library(tidyr)

# Based on code by Lisa Lendway
toronto_shelters_by_day <-
  earlier_toronto_shelters_test_date |>
  drop_na(occupancy, capacity) |>
  group_by(date) |>
  summarize(
    occupancy = sum(occupancy),
    capacity = sum(capacity),
    usage = occupancy / capacity,
    .groups = "drop"
  )

We are interested in the availability of shelter spots in Toronto for each day (Figure 11.6 (a)). And we can focus on 2017, as that is where the biggest issue is and facet by month (Figure 11.6 (b)).

toronto_shelters_by_day |>
  ggplot(aes(x = date, y = occupancy)) +
  geom_point(alpha = 0.3) +
  scale_y_continuous(limits = c(0, NA)) +
  labs(
    color = "Type",
    x = "Date",
    y = "Occupancy (number)"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")

toronto_shelters_by_day |>
  filter(year(date) == 2017) |>
  ggplot(aes(x = day(date), y = occupancy)) +
  geom_point(alpha = 0.3) +
  scale_y_continuous(limits = c(0, NA)) +
  labs(
    color = "Type",
    x = "Day",
    y = "Occupancy (number)"
  ) +
  facet_wrap(
    vars(month(date, label = TRUE)),
    scales = "free_x"
  ) +
  theme_minimal() +
  scale_color_brewer(palette = "Set1")

(a) In 2017-2020

(b) Only in 2017

Figure 11.6: Occupancy per day in Toronto shelters

It is clear there is an issue with the first twelve days of the month. We noted that when we look at the data it is a bit odd that it is not in order. We can take another look at that by going back to the data as it was given to us (as opposed to the counts by day that we have been using). Although there is no rule that says the dataset must be in order of the date, if it were, then all the points would lie on the diagonal line. We have a lot of deviation from that. To get a sense of what we are expecting let us look at all four years (Figure 11.7).

earlier_toronto_shelters_test_date |>
  mutate(counter = seq_len(nrow(earlier_toronto_shelters_test_date))) |>
  ggplot(aes(x = counter, y = date)) +
  geom_point(alpha = 0.3) +
  facet_wrap(
    vars(year(date)),
    scales = "free"
  ) +
  labs(
    x = "Row in the dataset",
    y = "Date of that row"
  ) +
  theme_minimal()

Figure 11.7: Date of each row in order (2017-2020)

It looks like 2020 is roughly as we would expect, although with some potential weirdness at the start. 2019 has a few odd situations, but not too many. 2018 has a small cluster early in the dataset and then possibly something systematic toward the middle. But 2017 has a large number of systematic issues. These affect a lot of observations, so we prioritize and focus our attention here.

In general, it seems that it might be the case that in 2017 the first 12 days are the wrong way around, i.e. we think it is year-month-day, but it is actually year-day-month, but there are exceptions. As a first pass, we can flip those first 12 days of each month and see if that helps. It will be fairly blunt, but hopefully gets us somewhere.

list_of_dates_to_flip <- c(
"2017-02-01", "2017-03-01", "2017-04-01", "2017-05-01",
"2017-06-01", "2017-07-01", "2017-08-01", "2017-09-01",
"2017-10-01", "2017-11-01", "2017-12-01", "2017-01-02",
"2017-03-02", "2017-04-02", "2017-05-02", "2017-06-02",
"2017-07-02", "2017-08-02", "2017-09-02", "2017-10-02",
"2017-11-02", "2017-12-02", "2017-01-03", "2017-02-03",
"2017-04-03", "2017-05-03", "2017-06-03", "2017-07-03",
"2017-08-03", "2017-09-03", "2017-10-03", "2017-11-03",
"2017-12-03", "2017-01-04", "2017-02-04", "2017-03-04",
"2017-05-04", "2017-06-04", "2017-07-04", "2017-08-04",
"2017-09-04", "2017-10-04", "2017-11-04", "2017-12-04",
"2017-01-05", "2017-02-05", "2017-03-05", "2017-04-05",
"2017-06-05", "2017-07-05", "2017-08-05", "2017-09-05",
"2017-10-05", "2017-11-05", "2017-12-05", "2017-01-06",
"2017-02-06", "2017-03-06", "2017-04-06", "2017-05-06",
"2017-07-06", "2017-08-06", "2017-09-06", "2017-10-06",
"2017-11-06", "2017-12-06", "2017-01-07", "2017-02-07",
"2017-03-07", "2017-04-07", "2017-05-07", "2017-06-07",
"2017-08-07", "2017-09-07", "2017-10-07", "2017-11-07",
"2017-12-07", "2017-01-08", "2017-02-08", "2017-03-08",
"2017-04-08", "2017-05-08", "2017-06-08", "2017-07-08",
"2017-09-08", "2017-10-08", "2017-11-08", "2017-12-08",
"2017-01-09", "2017-02-09", "2017-03-09", "2017-04-09",
"2017-05-09", "2017-06-09", "2017-07-09", "2017-08-09",
"2017-10-09", "2017-11-09", "2017-12-09", "2017-01-10",
"2017-02-10", "2017-03-10", "2017-04-10", "2017-05-10",
"2017-06-10", "2017-07-10", "2017-08-10", "2017-09-10",
"2017-11-10", "2017-12-10", "2017-01-11", "2017-02-11",
"2017-03-11", "2017-04-11", "2017-05-11", "2017-06-11",
"2017-07-11", "2017-08-11", "2017-09-11", "2017-10-11",
"2017-12-11", "2017-01-12", "2017-02-12", "2017-03-12",
"2017-04-12", "2017-05-12", "2017-06-12", "2017-07-12",
"2017-08-12", "2017-09-12", "2017-10-12", "2017-11-12"
)

earlier_toronto_shelters_2017_flipped <-
  earlier_toronto_shelters_test_date |>
  mutate(
    year = year(date),
    month = month(date),
    day = day(date),
    date = as.character(date),
    changed_date = if_else(
      date %in% list_of_dates_to_flip,
      paste(year, day, month, sep = "-"),
      paste(year, month, day, sep = "-"),
    ),
    changed_date = ymd(changed_date)
  ) |>
  select(-year, -month, -day)

Now let us take a look (Figure 11.8). It is probably a little blunt. For instance, notice there are now no entries below the diagonal (Figure 11.8 (a)). But we can see that has almost entirely taken care of the systematic differences (Figure 11.8 (b)).

library(tidyr)

earlier_toronto_shelters_2017_flipped |>
  mutate(counter = seq_len(nrow(earlier_toronto_shelters_2017_flipped))) |>
  filter(year(date) == 2017) |>
  ggplot(aes(x = counter, y = changed_date)) +
  geom_point(alpha = 0.3) +
  labs(
    x = "Row in the dataset",
    y = "Date of that row"
  ) +
  theme_minimal()

earlier_toronto_shelters_2017_flipped |>
  drop_na(occupancy, capacity) |>
  group_by(changed_date) |>
  summarise(occupancy = sum(occupancy), .groups = "drop") |>
  filter(year(changed_date) == 2017) |>
  ggplot(aes(x = day(changed_date), y = occupancy)) +
  geom_point(alpha = 0.3) +
  scale_y_continuous(limits = c(0, NA)) +
  labs(
    color = "Type",
    x = "Changed day",
    y = "Occupancy (number)"
  ) +
  facet_wrap(
    vars(month(changed_date, label = TRUE)),
    scales = "free_x"
  ) +
  theme_minimal()

(a) Date of each row in order in 2017 after adjustment

(b) Toronto shelters daily occupancy in 2017 after adjustment

Figure 11.8: Adjusted dates, occupancy in Toronto shelters

11.4 Names

An improved scanning software we developed identified gene name errors in 30.9% (3,436/11,117) of articles with supplementary Excel gene lists; a figure significantly higher than previously estimated. This is due to gene names being converted not just to dates and floating-point numbers, but also to internal date format (five-digit numbers).

Abeysooriya et al. (2021)

Names matter. The land on which much of this book was written is today named Canada, but for a long time was known as Turtle Island. Similarly, there is a big rock in the center of Australia. For a long time, it was called Uluru, then it was known as Ayers Rock. Today it has a dual name that combines both. And in parts of the US South, including signage surrounding the South Carolina State House, the US Civil War is referred to as the War of Northern Aggression. In these examples, the name that is used conveys information, not only about the user, but about the circumstances. Even the British Royal Family recognize the power of names. In 1917 they changed from the House of Saxe-Coburg and Gotha to the House of Windsor. It was felt that the former was too Germanic given World War I was ongoing. Names matter in everyday life. And they matter in our code too.

Names impart meaning (Kimmerer 2013, 34). By ignoring existing names, we ignore what has come before us. The importance of names, and of how existing relationships were ignored by re-naming, was clear in those cases mentioned above, but we see it in data science as well. We need to be careful when we name our datasets, variables, and functions, even our statistical methods! Tukey (1962) essentially defined what we today call data science, but it was popularized by folks in computer science in the late 2000s who ignored, either deliberately or through ignorance, what came before them. With that came the renaming of concepts that were well-established in the fields that computer science had expanded into. For instance, the use of binary variables in regression, sometimes called “dummy variables”, is often called one-hot encoding in computer science. The social sciences went through a similar experience, starting in the 1980s, with economics expanding in a similarly (and self-described) imperialistic way (Lazear 2000). The social sciences now recognize the cost of this, and data science should try to avoid it. No area of study is ever actually without existing claims. And recognizing, adopting, and using existing names and practices is important.

Names are critical and worthy of special attention because (Hermans 2021):

  1. they help document our code as they are, by definition, contained in the code;
  2. they make up a large proportion of any script;
  3. they are referred to a lot by others; and
  4. they help the reader understand what is happening in the code.

In addition to respecting the nature of the data, names need to satisfy two additional considerations:

  1. they need to be machine readable, and
  2. they need to be human readable.

Machine-readable names are an easier standard to meet, and usually mean avoiding spaces and special characters. A space can be replaced with an underscore. For instance, we prefer “my_data” to “my data”. Avoiding spaces enables tab-completion, which makes us more efficient. It also helps with reproducibility because spaces are handled differently by different operating systems.

Usually, special characters should be removed because they can be inconsistent between different computers and languages. This is especially the case with slash and backslash, asterisk, and single and double quotation marks, none of which should, almost, ever be used in names.

Names should also be unique within a dataset, and unique within a collection of datasets, unless that particular column is being deliberately used as a key to join different datasets. This usually means that the domain is critical for effective names, and when working as part of a team this all gets much more difficult (Hermans 2017). Names need to not only be unique, but notably different when there is a potential for confusion. For instance, for many years, the language PHP had both mysql_escape_string and mysql_real_escape_string (Somers 2015). It is easy to see how programmers may have accidentally written one when they meant the other.

An especially useful function to use to get closer to machine readable names is clean_names() from janitor (Firke 2020). This deals with those issues mentioned above as well as a few others.

library(janitor)

some_bad_names <-
  tibble(
    "First" = c(1),
    "second name has spaces" = c(1),
    "weird#symbol" = c(1),
    "InCoNsIsTaNtCaPs" = c(1)
  )

bad_names_made_better <-
  some_bad_names |>
  clean_names()

some_bad_names
# A tibble: 1 × 4
  First `second name has spaces` `weird#symbol` InCoNsIsTaNtCaPs
  <dbl>                    <dbl>          <dbl>            <dbl>
1     1                        1              1                1
bad_names_made_better
# A tibble: 1 × 4
  first second_name_has_spaces weird_number_symbol in_co_ns_is_ta_nt_ca_ps
  <dbl>                  <dbl>               <dbl>                   <dbl>
1     1                      1                   1                       1

“Programs must be written for people to read, and only incidentally for machines to execute” (Abelson and Sussman 1996). In the same way that we emphasized in Chapter 5 that we write papers for the reader, here we emphasize that we write code for the reader. Human readable names require considerably more consideration than machine readable ones. For instance, following Lockheed Martin (2005, 25), we should avoid names that differ only by the use of the letter “O” instead of the number “0” or the letter “D”, and similarly the letter “S” instead of the number “5”. We need to consider other cultures and how they may interpret some of the names that we are using. We also need to consider the different experience levels that subsequent users of the dataset may have, both in terms of experience with data science and experience with similar datasets. For instance, a column called “flag” is often used to signal that a column contains data that needs to be followed up with or treated carefully in some way. An experienced analyst will know this, but a beginner will not. Try to use meaningful names wherever possible (Lin, Ali, and Wilson 2021). It has been found that shorter names may take longer to comprehend (Hofmeister, Siegmund, and Holt 2017), and so it is often useful to avoid uncommon abbreviations where possible.
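
For instance, here is a small sketch, with hypothetical column names, of renaming so that a newcomer to the dataset does not need to know conventions such as “flag” or abbreviations such as “dob”.

library(tidyverse)

cryptic_names <-
  tibble(
    dob = c("1990-01-01", "1995-07-31"),
    flag = c(0, 1)
  )

meaningful_names <-
  cryptic_names |>
  rename(
    date_of_birth = dob,
    needs_follow_up = flag
  )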

Bryan (2015) additionally recommends that file names, in particular, should consider the default ordering that a file manager will impose. This might mean adding prefixes such as “00-”, “01-”, and so on, to file names, which might involve left-padding with zeros depending on the number of files. Critically it means using ISO 8601 for dates, which means that 2 December 2022 would be written “2022-12-02”. The reason for using such file names is to provide information to other people about the order of the files.
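
As a small sketch of what this might look like, with made-up file names, we could combine a zero-padded numeric prefix with an ISO 8601 date.

library(tidyverse)

# Hypothetical script names, ordered by a zero-padded prefix and an ISO 8601 date
file_dates <- as.Date(c("2022-12-02", "2022-12-11"))

file_names <-
  paste0(
    str_pad(seq_along(file_dates), width = 2, pad = "0"),
    "-",
    file_dates,
    "_clean-tanzania-dhs.R"
  )

file_names
[1] "01-2022-12-02_clean-tanzania-dhs.R" "02-2022-12-11_clean-tanzania-dhs.R"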

One interesting feature of R is that in certain cases partial matching on names is possible. For instance:

never_use_partial_matching <-
  data.frame(
    my_first_name = c(1, 2),
    another_name = c("wow", "great")
  )

never_use_partial_matching$my_first_name
[1] 1 2
never_use_partial_matching$my
[1] 1 2

Correctly, this behavior is not possible within the tidyverse (for instance if data.frame were replaced with tibble in the above code). Partial matching should almost never be used. It makes it more difficult to understand code after a break, and for others to come to it fresh. This is especially the case with R, where it is not commonly used.
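
If we do need to work with base R data frames, one option, as a small sketch, is to set a session-level option that asks R to warn whenever partial matching with $ occurs.

# Ask R to warn about partial matching with $ (a session-level setting)
options(warnPartialMatchDollar = TRUE)

never_use_partial_matching$my

With that option set, R still returns the column, but it also issues a warning about the partial match of “my” to “my_first_name”.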

R, Python, and many of the other languages that are commonly used for data science are dynamically typed, as opposed to statically typed. This means that the class of a variable does not need to be declared before the variable is used. One interesting area of data science research is moving partway toward static typing by including the class in the name. In computer science, a variant of this is known as Hungarian notation, and it was used at Microsoft for Word and Excel (Hermans 2021, 74). While Hungarian notation fell out of favor in computer science, in data science it may bring considerable benefits, and best practice is an area of open research. Indeed, Python enabled type hints in 2014 (Boykis 2019). While not required, this goes some way toward being more explicit about types.

Riederer (2020) advises using column names as contracts, through establishing a controlled vocabulary for column names. In this way, we would define a set of words that we can use in column names. In the controlled vocabulary of Riederer (2020) a column could start with an abbreviation for its class, then something specific to what it pertains to, and then various details.

For instance, we could consider column names of “age” and “sex”. Following Riederer (2020) we may change these to be more informative of the class and other information.

some_names <-
  tibble(
    age = as.integer(c(1, 3, 35, 36)),
    sex = factor(c("male", "male", "female", "male"))
  )

riederer_names <-
  some_names |>
  rename(
    integer_age_respondent = age,
    factor_sex_respondent = sex
  )

some_names
# A tibble: 4 × 2
    age sex   
  <int> <fct> 
1     1 male  
2     3 male  
3    35 female
4    36 male  
riederer_names
# A tibble: 4 × 2
  integer_age_respondent factor_sex_respondent
                   <int> <fct>                
1                      1 male                 
2                      3 male                 
3                     35 female               
4                     36 male                 

11.5 1996 Tanzanian DHS

As introduced in Chapter 9, the Demographic and Health Surveys play an important role in gathering data in areas where we may not have other datasets. Here we will clean and prepare a DHS table about household populations in Tanzania in 1996.

We are interested in the distribution of age-groups, gender, and urban/rural. A quick sketch might look like Figure 11.9.

Figure 11.9: Quick sketch of a dataset that we might be interested in

We can then simulate a dataset.

library(tidyverse)

simulated_tanzania_dataset <-
  tibble(
    age_group = rep(x = c("0-4", "5-9")),
    urban_male = rnorm(n = 2, mean = 10, sd = 1) |> round(digits = 0),
    urban_female = rnorm(n = 2, mean = 10, sd = 1) |> round(digits = 0),
    urban_total = urban_male + urban_female,
    rural_male = rnorm(n = 2, mean = 10, sd = 1) |> round(digits = 0),
    rural_female = rnorm(n = 2, mean = 10, sd = 1) |> round(digits = 0),
    rural_total = rural_male + rural_female,
    total_male = rnorm(n = 2, mean = 10, sd = 1) |> round(digits = 0),
    total_female = rnorm(n = 2, mean = 10, sd = 1) |> round(digits = 0),
    total_total = total_male + total_female,
  )

simulated_tanzania_dataset <-
  rbind(
    simulated_tanzania_dataset,
    c("Total", 100, 100, 100, 100, 100, 100, 100, 100, 100, 100),
    c("Number", 5000, 5500, 10500, 4000, 4500, 8500, 9000, 10000, 19000)
  )
simulated_tanzania_dataset
# A tibble: 4 × 10
  age_group urban_male urban_f…¹ urban…² rural…³ rural…⁴ rural…⁵ total…⁶ total…⁷
  <chr>     <chr>      <chr>     <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
1 0-4       10         11        21      10      11      21      8       8      
2 5-9       10         10        20      9       11      20      9       11     
3 Total     100        100       100     100     100     100     100     100    
4 Number    5000       5500      10500   4000    4500    8500    9000    10000  
# … with 1 more variable: total_total <chr>, and abbreviated variable names
#   ¹​urban_female, ²​urban_total, ³​rural_male, ⁴​rural_female, ⁵​rural_total,
#   ⁶​total_male, ⁷​total_female
# ℹ Use `colnames()` to see all variable names

Based on this simulation we are interested in testing:

  1. Whether the columns, other than age-group, contain only numbers.
  2. Whether the sum of urban and rural matches the total column.
  3. Whether the sum of the age-groups matches the total.

We return to these checks after the data have been cleaned.

We begin by downloading the data.2

download.file(
  url = "https://dhsprogram.com/pubs/pdf/FR83/FR83.pdf",
  destfile = "1996_Tanzania_DHS.pdf",
  mode = "wb"
)

When we have a PDF and want to read the content into R, then pdf_text() from pdftools (Ooms 2019) is useful. It works well for many recently produced PDFs because the content is text which it can extract. But if the PDF is an image, then pdf_text() will not work. Instead, the PDF will first need to go through OCR, which was introduced in Chapter 9.

library(pdftools)

tanzania_dhs <-
  pdf_text(
    pdf = "1996_Tanzania_DHS.pdf"
  )

In this case we are interested in Table 2.1, which is on the 33rd page of the PDF. That page looks like Figure 11.10.

Figure 11.10: The page of interest in the 1996 Tanzanian DHS

We use stri_split_lines() from stringi (Gagolewski 2020) to focus on that particular page.

library(stringi)

# Based on Bob Rudis: https://stackoverflow.com/a/47793617
tanzania_dhs_page_33 <- stri_split_lines(tanzania_dhs[[33]])[[1]]

We first want to remove all the written content and focus on the table. We then want to convert that into a tibble so that we can use our familiar tidyverse approaches.

library(tidyverse)

tanzania_dhs_page_33_only_data <- tanzania_dhs_page_33[31:55]

tanzania_dhs_raw <- tibble(all = tanzania_dhs_page_33_only_data)

tanzania_dhs_raw
# A tibble: 25 × 1
   all                                                                          
   <chr>                                                                        
 1 "                                  Urban                              Rural …
 2 ""                                                                           
 3 " Age group             Male      Female       Total          Male   Female …
 4 ""                                                                           
 5 ""                                                                           
 6 " 0-4                   16.4        13.8        15.1          18.1     17.1 …
 7 " 5-9                   13.5        13.0        13.2          17.5     16,0 …
 8 " 10-14                 12.6        13.1        12.8          15.3     13.5 …
 9 " 15-19                 10.8        11.3        11.1           9.8      8.8 …
10 " 20-~                   9.4        12.2        10,8           5.9      8.2 …
# … with 15 more rows
# ℹ Use `print(n = ...)` to see more rows

All the columns have been collapsed into one, so we need to separate them. We will do this based on the existence of a space, which means we first need to change “Age group” to “Age-group” because we do not want that separated.

# Separate columns
tanzania_dhs_separated <-
  tanzania_dhs_raw |>
  mutate(all = str_squish(all)) |>
  mutate(all = str_replace(all, "Age group", "Age-group")) |>
  separate(
    col = all,
    into = c(
      "age_group",
      "male_urban",
      "female_urban",
      "total_urban",
      "male_rural",
      "female_rural",
      "total_rural",
      "male_total",
      "female_total",
      "total_total"
    ),
    sep = " ",
    remove = TRUE,
    fill = "right",
    extra = "drop"
  )

tanzania_dhs_separated
# A tibble: 25 × 10
   age_group   male_ur…¹ femal…² total…³ male_…⁴ femal…⁵ total…⁶ male_…⁷ femal…⁸
   <chr>       <chr>     <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
 1 "Urban"     Rural     Total   <NA>    <NA>    <NA>    <NA>    <NA>    <NA>   
 2 ""          <NA>      <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>   
 3 "Age-group" Male      Female  Total   Male    Female  Total   Male    Female 
 4 ""          <NA>      <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>   
 5 ""          <NA>      <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>   
 6 "0-4"       16.4      13.8    15.1    18.1    17.1    17.6    17.8    16.4   
 7 "5-9"       13.5      13.0    13.2    17.5    16,0    16.7    16.7    15.4   
 8 "10-14"     12.6      13.1    12.8    15.3    13.5    14.4    14.8    13.4   
 9 "15-19"     10.8      11.3    11.1    9.8     8.8     9.3     10.0    9.3    
10 "20-~"      9.4       12.2    10,8    5.9     8.2     7.1     6.6     9.0    
# … with 15 more rows, 1 more variable: total_total <chr>, and abbreviated
#   variable names ¹​male_urban, ²​female_urban, ³​total_urban, ⁴​male_rural,
#   ⁵​female_rural, ⁶​total_rural, ⁷​male_total, ⁸​female_total
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names

Now we need to clean up the rows and columns. One helpful approach to work out what we need to remove is to look at what is left if we temporarily remove everything that we know we want. Whatever is left is then a candidate for being removed. In this case we know that we want the columns to contain numbers.

tanzania_dhs_separated |>
  mutate(across(everything(), ~ str_remove_all(., "[:digit:]"))) |>
  distinct()
# A tibble: 15 × 10
   age_group   male_ur…¹ femal…² total…³ male_…⁴ femal…⁵ total…⁶ male_…⁷ femal…⁸
   <chr>       <chr>     <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <chr>  
 1 "Urban"     Rural     Total   <NA>    <NA>    <NA>    <NA>    <NA>    <NA>   
 2 ""          <NA>      <NA>    <NA>    <NA>    <NA>    <NA>    <NA>    <NA>   
 3 "Age-group" Male      Female  Total   Male    Female  Total   Male    Female 
 4 "-"         .         .       .       .       .       .       .       .      
 5 "-"         .         .       .       .       ,       .       .       .      
 6 "-"         .         .       .       .       .       .       .       .      
 7 "-~"        .         .       ,       .       .       .       .       .      
 8 "-"         .         .       ,       ,       ,       ,       ,       .      
 9 "-"         ,         .       .       .       .       .       .       .      
10 "-"         .         .       .       .       .       .       .       ,      
11 "-"         ,         .       .       ;       .       .       .       .      
12 "-"         .         .       .       ,       .       .       .       .      
13 "+"         .         .       .       .       .       .       .       .      
14 "Total"     .         .       .       .       .       .       .       .      
15 "Number"    ,         ,       ,       .       ,       ,       ,       ,      
# … with 1 more variable: total_total <chr>, and abbreviated variable names
#   ¹​male_urban, ²​female_urban, ³​total_urban, ⁴​male_rural, ⁵​female_rural,
#   ⁶​total_rural, ⁷​male_total, ⁸​female_total
# ℹ Use `colnames()` to see all variable names

In this case we can see that some commas and semicolons have been used where we would expect decimal points. Also, some tildes and blank lines need to be removed. After that we can impose the correct class.

tanzania_dhs_cleaned <-
  tanzania_dhs_separated |>
  slice(6:22, 24, 25) |>
  mutate(
    across(everything(), ~ str_replace_all(., ",", ".")),
    across(everything(), ~ str_replace_all(., ";", "."))
  ) |>
  mutate(
    age_group = str_replace(age_group, "20-~", "20-24"),
    age_group = str_replace(age_group, "40-~", "40-44"),
    male_rural = str_replace(male_rural, "14.775", "14775")
  ) |>
  mutate(across(starts_with(c(
    "male",
    "female",
    "total"
  )), ~ as.numeric(.)))

tanzania_dhs_cleaned
# A tibble: 19 × 10
   age_group male_urban female…¹ total…² male_…³ femal…⁴ total…⁵ male_…⁶ femal…⁷
   <chr>          <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 0-4            16.4     13.8    15.1     18.1    17.1    17.6    17.8    16.4
 2 5-9            13.5     13      13.2     17.5    16      16.7    16.7    15.4
 3 10-14          12.6     13.1    12.8     15.3    13.5    14.4    14.8    13.4
 4 15-19          10.8     11.3    11.1      9.8     8.8     9.3    10       9.3
 5 20-24           9.4     12.2    10.8      5.9     8.2     7.1     6.6     9  
 6 25-29           8.4      9.8     9.1      5.6     7.1     6.4     6.2     7.6
 7 30-34           6.6      6.3     6.4      5.2     5.6     5.4     5.5     5.8
 8 35-39           5.8      5.9     5.8      4       4.5     4.3     4.4     4.8
 9 40-44           4.4      3.5     3.9      3.3     3.5     3.4     3.5     3.5
10 45-49           3.2      2.3     2.7      3.2     3.3     3.2     3.2     3.1
11 50-54           2        2.4     2.2      2.2     3.4     2.9     2.2     3.2
12 55-59           1.8      1.8     1.8      2.1     2.9     2.5     2       2.7
13 60-64           2.1      1.7     1.9      2.4     2       2.2     2.3     2  
14 65-69           1.3      1.3     1.3      2.2     1.6     1.9     2       1.5
15 70-74           0.9      0.7     0.8      1.3     1.2     1.2     1.2     1.1
16 75-79           0.3      0.4     0.4      0.8     0.6     0.7     0.7     0.6
17 80+             0.3      0.5     0.4      0.9     0.7     0.8     0.8     0.7
18 Total         100      100     100      100     100     100     100     100  
19 Number          3.69     3.88    7.57 14775      15.9    30.7    18.5    19.8
# … with 1 more variable: total_total <dbl>, and abbreviated variable names
#   ¹​female_urban, ²​total_urban, ³​male_rural, ⁴​female_rural, ⁵​total_rural,
#   ⁶​male_total, ⁷​female_total
# ℹ Use `colnames()` to see all variable names
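
To follow up on the checks that we listed when simulating this dataset, we could sketch them using testthat. This is only a sketch: we allow a tolerance of one percentage point for rounding, and because the urban, rural, and total columns are each their own percentage distribution, the second check, that urban and rural sum to the total, would apply to the “Number” row of counts rather than to the percentages, so we leave it aside here.

library(testthat)

tanzania_dhs_age_groups <-
  tanzania_dhs_cleaned |>
  filter(!age_group %in% c("Total", "Number"))

# The columns, apart from age_group, should contain only numbers, so the
# conversion to numeric should not have produced any NAs
test_that("columns contain only numbers", {
  expect_false(
    tanzania_dhs_cleaned |>
      select(-age_group) |>
      is.na() |>
      any()
  )
})

# Within each column, the age-group percentages should sum to the "Total"
# row, which is 100, allowing a little for rounding
test_that("age-groups sum to the total", {
  column_sums <-
    tanzania_dhs_age_groups |>
    summarise(across(-age_group, sum)) |>
    unlist()
  expect_true(all(abs(column_sums - 100) < 1))
})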

11.6 2019 Kenyan Census

11.6.1 Gather and clean data

Finally, let us consider a more extensive example and gather, clean, and prepare some data from the 2019 Kenyan census. The distribution of population by age, sex, and administrative unit from the 2019 Kenyan census can be downloaded from the Kenya National Bureau of Statistics website. While the PDF format makes it easy to look up a particular result, it is not overly useful if we want to model the data. In order to do that, we need to convert the PDF of Kenyan census results of counts, by age and sex, by county and sub-county, into a tidy dataset that can be analyzed. We use janitor (Firke 2020), pdftools (Ooms 2019), purrr (Henry and Wickham 2020), tidyverse (Wickham et al. 2019), and stringi (Gagolewski 2020).

library(janitor)
library(pdftools)
library(purrr)
library(tidyverse)
library(stringi)

We first need to download and read in the PDF of the 2019 Kenyan census.3

census_url <-
  paste0(
    "https://www.knbs.or.ke/download/2019-kenya-population-and-housing",
    "-census-volume-iii-distribution-of-population-by-age-sex-and-",
    "administrative-units/?wpdmdl=5729&refresh=620561f1ce3ad1644519921"
  )

download.file(
  url = census_url,
  destfile = "2019_Kenya_census.pdf",
  mode = "wb"
)

We can use pdf_text() from pdftools (Ooms 2019) again here.

kenya_census <-
  pdf_text(
    pdf = "2019_Kenya_census.pdf"
  )

In this example we will need to parse many pages. An example page of the PDF of the 2019 Kenyan census is shown in Figure 11.11.

Figure 11.11: Example page from the 2019 Kenyan census

11.6.1.1 Make rectangular

In the Tanzanian example, we were interested in only one page, so we made our modifications directly to that page. Here we want to consider many pages, so we instead write a function, as introduced in Chapter 3, and then apply it to many pages. The first challenge is to get the dataset into a format that we can more easily manipulate. We will consider each page of the PDF and extract the relevant parts. To do this, we first write a function, and then apply it to each page.

# The function is going to take an input of a page number
get_data <- function(i) {
  # Focus on the page of interest
  just_page_i <- stri_split_lines(kenya_census[[i]])[[1]]

  # Remove blank lines
  just_page_i <- just_page_i[just_page_i != ""]

  # Get and format the location
  area <- just_page_i[3] |> str_squish()
  area <- str_to_title(area)

  # Get the type of table
  type_of_table <- just_page_i[2] |> str_squish()

  # Remove titles, headings and other content at the top of the page
  just_page_i_no_header <- just_page_i[5:length(just_page_i)]

  # Remove page numbers and other content at the bottom of the page
  just_page_i_no_header_no_footer <- just_page_i_no_header[1:62]

  # Convert into a tibble
  demography_data <- tibble(all = just_page_i_no_header_no_footer)

  # Separate columns
  demography_data <-
    demography_data |>
    mutate(all = str_squish(all)) |>
    mutate(all = str_replace(all, "10 -14", "10-14")) |>
    mutate(all = str_replace(all, "Not Stated", "NotStated")) |>
    separate(
      col = all,
      into = c(
        "age",
        "male",
        "female",
        "total",
        "age_2",
        "male_2",
        "female_2",
        "total_2"
      ),
      sep = " ",
      remove = TRUE,
      fill = "right",
      extra = "drop"
    )

  # They are side by side at the moment, need to append to bottom
  demography_data_long <-
    rbind(
      demography_data |> select(age, male, female, total),
      demography_data |>
        select(age_2, male_2, female_2, total_2) |>
        rename(
          age = age_2,
          male = male_2,
          female = female_2,
          total = total_2
        )
    )

  # There is one row of NAs, so remove it
  demography_data_long <-
    demography_data_long |>
    remove_empty(which = c("rows"))

  # Add the area and the page as variables
  demography_data_long$area <- area
  demography_data_long$table <- type_of_table
  demography_data_long$page <- i

  rm(
    just_page_i,
    i,
    area,
    type_of_table,
    just_page_i_no_header,
    just_page_i_no_header_no_footer,
    demography_data
  )

  return(demography_data_long)
}

We now have a function that does what we need to each page of the PDF. We use map_dfr() from purrr (Henry and Wickham 2020) to apply that function to each page, and then combine all the outputs into one tibble.

pages <- c(30:513)
all_tables <- map_dfr(pages, get_data)
rm(pages, get_data)
all_tables
# A tibble: 59,532 × 7
   age   male    female  total     area    table                            page
   <chr> <chr>   <chr>   <chr>     <chr>   <chr>                           <int>
 1 Total 610,257 598,046 1,208,303 Mombasa Table 2.3: Distribution of Pop…    30
 2 0     15,111  15,009  30,120    Mombasa Table 2.3: Distribution of Pop…    30
 3 1     15,805  15,308  31,113    Mombasa Table 2.3: Distribution of Pop…    30
 4 2     15,088  14,837  29,925    Mombasa Table 2.3: Distribution of Pop…    30
 5 3     14,660  14,031  28,691    Mombasa Table 2.3: Distribution of Pop…    30
 6 4     14,061  13,993  28,054    Mombasa Table 2.3: Distribution of Pop…    30
 7 0-4   74,725  73,178  147,903   Mombasa Table 2.3: Distribution of Pop…    30
 8 5     13,851  14,023  27,874    Mombasa Table 2.3: Distribution of Pop…    30
 9 6     12,889  13,216  26,105    Mombasa Table 2.3: Distribution of Pop…    30
10 7     13,268  13,203  26,471    Mombasa Table 2.3: Distribution of Pop…    30
# … with 59,522 more rows
# ℹ Use `print(n = ...)` to see more rows

Having got it into a rectangular format, we now need to clean the dataset to make it useful.

11.6.1.2 Validity

Attaining validity requires a number of steps. The first step is to make the numbers into actual numbers, rather than characters. Before we can convert the type, we need to remove anything that is not a number, otherwise that cell will be converted into an NA. We first identify any values that are not numbers so that we can remove them; distinct() is especially useful for this.

all_tables |>
  select(male, female, total) |>
  mutate(across(everything(), ~ str_remove_all(., "[:digit:]"))) |>
  distinct()
# A tibble: 16 × 3
   male   female total
   <chr>  <chr>  <chr>
 1 ","    ","    ",," 
 2 ","    ","    ","  
 3 ","    ""     ","  
 4 ""     ""     ","  
 5 ""     ""     ""   
 6 ""     ","    ","  
 7 "-"    ""     ""   
 8 ""     "-"    ""   
 9 "-"    "-"    "-"  
10 "_"    "_"    "_"  
11 "-"    "-"    ""   
12 "-Aug" ","    ","  
13 "-Jun" ","    ","  
14 ","    ","    ""   
15 ""     ","    ""   
16 ",,"   ",,"   ",," 

We clearly need to remove commas, underscores, and hyphens. While we could use janitor here, it is worthwhile at least first looking at what is going on, because sometimes there are odd issues that janitor (and other packages) will not deal with in the way that we want.

We also have an odd situation with some months in what should be a numerical variable. If we look at these issues, which are on page 185, then we see that in this case it seems like the Kenyan government likely used Excel or similar, and this has converted two entries into dates. If we just took the numbers from the variable then we would have 23 and 15 here, but by inspecting the column we can use Excel to reverse the process and enter the correct values of 4,923 and 4,611, respectively.

all_tables <-
  all_tables |>
  mutate(
    male = if_else(male == "23-Jun", "4923", male),
    male = if_else(male == "15-Aug", "4611", male)
  )

all_tables
# A tibble: 59,532 × 7
   age   male    female  total     area    table                            page
   <chr> <chr>   <chr>   <chr>     <chr>   <chr>                           <int>
 1 Total 610,257 598,046 1,208,303 Mombasa Table 2.3: Distribution of Pop…    30
 2 0     15,111  15,009  30,120    Mombasa Table 2.3: Distribution of Pop…    30
 3 1     15,805  15,308  31,113    Mombasa Table 2.3: Distribution of Pop…    30
 4 2     15,088  14,837  29,925    Mombasa Table 2.3: Distribution of Pop…    30
 5 3     14,660  14,031  28,691    Mombasa Table 2.3: Distribution of Pop…    30
 6 4     14,061  13,993  28,054    Mombasa Table 2.3: Distribution of Pop…    30
 7 0-4   74,725  73,178  147,903   Mombasa Table 2.3: Distribution of Pop…    30
 8 5     13,851  14,023  27,874    Mombasa Table 2.3: Distribution of Pop…    30
 9 6     12,889  13,216  26,105    Mombasa Table 2.3: Distribution of Pop…    30
10 7     13,268  13,203  26,471    Mombasa Table 2.3: Distribution of Pop…    30
# … with 59,522 more rows
# ℹ Use `print(n = ...)` to see more rows

Having identified everything that needs to be removed, we can do the actual removal and convert the character columns of numbers to integers.

all_tables <-
  all_tables |>
  mutate(across(c(male, female, total), ~ str_remove_all(., ","))) |>
  mutate(across(c(male, female, total), ~ str_replace(., "_", "0"))) |>
  mutate(across(c(male, female, total), ~ str_replace(., "-", "0"))) |>
  mutate(across(c(male, female, total), ~ as.integer(.)))

all_tables
# A tibble: 59,532 × 7
   age     male female   total area    table                                page
   <chr>  <int>  <int>   <int> <chr>   <chr>                               <int>
 1 Total 610257 598046 1208303 Mombasa Table 2.3: Distribution of Populat…    30
 2 0      15111  15009   30120 Mombasa Table 2.3: Distribution of Populat…    30
 3 1      15805  15308   31113 Mombasa Table 2.3: Distribution of Populat…    30
 4 2      15088  14837   29925 Mombasa Table 2.3: Distribution of Populat…    30
 5 3      14660  14031   28691 Mombasa Table 2.3: Distribution of Populat…    30
 6 4      14061  13993   28054 Mombasa Table 2.3: Distribution of Populat…    30
 7 0-4    74725  73178  147903 Mombasa Table 2.3: Distribution of Populat…    30
 8 5      13851  14023   27874 Mombasa Table 2.3: Distribution of Populat…    30
 9 6      12889  13216   26105 Mombasa Table 2.3: Distribution of Populat…    30
10 7      13268  13203   26471 Mombasa Table 2.3: Distribution of Populat…    30
# … with 59,522 more rows
# ℹ Use `print(n = ...)` to see more rows

11.6.1.3 Internal consistency

The next thing to clean is the areas. We know that there are 47 counties in Kenya, and a large number of sub-counties. The Kenyan government purports to provide a list on pages 19 to 22 of the PDF (document pages 7 to 10). But this list is not complete, and there are a few minor issues that we will deal with later. In any case, we first need to fix a few inconsistencies.

# Fix some area names
all_tables <-
  all_tables |>
  mutate(
    area = if_else(area == "Taita/ Taveta", "Taita/Taveta", area),
    area = if_else(area == "Elgeyo/ Marakwet", "Elgeyo/Marakwet", area),
    area = if_else(area == "Nairobi City", "Nairobi", area),
  )

Kenya has 47 counties, each of which has sub-counties. The PDF has them arranged as the county data then the sub-counties, without designating which is which. We can use the names, to a certain extent, but in a handful of cases, there is a sub-county that has the same name as a county, so we need to first fix that.

The PDF is made up of three long tables. So, we can first get the names of the counties from the two county-level tables (2.4a and 2.4b) and then reconcile them to get a list of the counties.

list_counties <-
  all_tables |>
  filter(table %in% c(
    "Table 2.4a: Distribution of Rural Population by Age, Sex* and County",
    "Table 2.4b: Distribution of Urban Population by Age, Sex* and County"
  )) |>
  select(area) |>
  distinct()

list_counties
# A tibble: 47 × 1
   area        
   <chr>       
 1 Kwale       
 2 Kilifi      
 3 Tana River  
 4 Lamu        
 5 Taita/Taveta
 6 Garissa     
 7 Wajir       
 8 Mandera     
 9 Marsabit    
10 Isiolo      
# … with 37 more rows
# ℹ Use `print(n = ...)` to see more rows

As we hoped, there are 47 of them. But before we can add a flag based on those names, we need to deal with the sub-counties that share a name with a county. We will do this based on the page number, looking up each page and deciding which is the county page and which is the sub-county page.

all_tables |>
  filter(table ==
    "Table 2.3: Distribution of Population by Age, Sex*, County and Sub- County") |>
  filter(area %in% c(
    "Busia",
    "Garissa",
    "Homa Bay",
    "Isiolo",
    "Kiambu",
    "Machakos",
    "Makueni",
    "Samburu",
    "Siaya",
    "Tana River",
    "Vihiga",
    "West Pokot"
  )) |>
  select(area, page) |>
  distinct()
# A tibble: 24 × 2
   area        page
   <chr>      <int>
 1 Samburu       42
 2 Tana River    53
 3 Tana River    56
 4 Garissa       65
 5 Garissa       69
 6 Isiolo        98
 7 Isiolo       100
 8 Machakos     149
 9 Machakos     154
10 Makueni      159
# … with 14 more rows
# ℹ Use `print(n = ...)` to see more rows

Now we can add the flag for whether the area is a county, and adjust for the ones that are troublesome.

all_tables <-
  all_tables |>
  mutate(area_type = if_else(
    area %in% list_counties$area,
    "county",
    "sub-county"
  ))

all_tables <-
  all_tables |>
  mutate(area_type = case_when(
    area == "Samburu" & page == 42 ~ "sub-county",
    area == "Tana River" & page == 56 ~ "sub-county",
    area == "Garissa" & page == 69 ~ "sub-county",
    area == "Isiolo" & page == 100 ~ "sub-county",
    area == "Machakos" & page == 154 ~ "sub-county",
    area == "Makueni" & page == 164 ~ "sub-county",
    area == "Kiambu" & page == 213 ~ "sub-county",
    area == "West Pokot" & page == 233 ~ "sub-county",
    area == "Vihiga" & page == 333 ~ "sub-county",
    area == "Busia" & page == 353 ~ "sub-county",
    area == "Siaya" & page == 360 ~ "sub-county",
    area == "Homa Bay" & page == 375 ~ "sub-county",
    TRUE ~ area_type
  ))

rm(list_counties)

all_tables
# A tibble: 59,532 × 8
   age     male female   total area    table                        page area_…¹
   <chr>  <int>  <int>   <int> <chr>   <chr>                       <int> <chr>  
 1 Total 610257 598046 1208303 Mombasa Table 2.3: Distribution of…    30 county 
 2 0      15111  15009   30120 Mombasa Table 2.3: Distribution of…    30 county 
 3 1      15805  15308   31113 Mombasa Table 2.3: Distribution of…    30 county 
 4 2      15088  14837   29925 Mombasa Table 2.3: Distribution of…    30 county 
 5 3      14660  14031   28691 Mombasa Table 2.3: Distribution of…    30 county 
 6 4      14061  13993   28054 Mombasa Table 2.3: Distribution of…    30 county 
 7 0-4    74725  73178  147903 Mombasa Table 2.3: Distribution of…    30 county 
 8 5      13851  14023   27874 Mombasa Table 2.3: Distribution of…    30 county 
 9 6      12889  13216   26105 Mombasa Table 2.3: Distribution of…    30 county 
10 7      13268  13203   26471 Mombasa Table 2.3: Distribution of…    30 county 
# … with 59,522 more rows, and abbreviated variable name ¹​area_type
# ℹ Use `print(n = ...)` to see more rows

Having dealt with the areas, we can deal with the ages. First, we need to fix some clear errors.

table(all_tables$age) |> head()

    0   0-4     1    10 10-14 10-19 
  484   484   484   484   482     1 
unique(all_tables$age) |> head()
[1] "Total" "0"     "1"     "2"     "3"     "4"    
all_tables <-
  all_tables |>
  mutate(
    age = if_else(age == "NotStated", "Not Stated", age),
    age = if_else(age == "43594", "5-9", age),
    age = if_else(age == "43752", "10-14", age),
    age = if_else(age == "9-14", "5-9", age),
    age = if_else(age == "10-19", "10-14", age),
  )

The census has done some of the work of putting together age-groups for us, but we want to make it easy to focus on the counts by single-year age. As such we will add a flag for the type of age it is: an age-group, such as “0-4”, or a single age, such as “1”.

all_tables <-
  all_tables |>
  mutate(
    age_type = if_else(str_detect(age, "-"), "age-group", "single-year"),
    age_type = if_else(str_detect(age, "Total"), "age-group", age_type)
  )

At the moment, age is a character variable. We have a decision to make here. We do not want it to be a character variable (because it will not graph properly), but we do not want it to be numeric either, because there are values such as “Total” and “100+” in there. For now, we will just make it into a factor, and at least that can be graphed nicely.

all_tables <-
  all_tables |>
  mutate(
    age = as_factor(age)
  )

11.6.2 Check data

Having gathered and cleaned the data, we would like to run a few checks. Given the format of the data, we can check that “total” is the sum of “male” and “female”. (While we would prefer to use different groupings, this is what the Kenyan government collected and makes available.)

all_tables |>
  mutate(
    check_sum = male + female,
    totals_match = if_else(total == check_sum, 1, 0)
  ) |>
  filter(totals_match == 0)
# A tibble: 1 × 11
  age    male female total area      table  page area_…¹ age_t…² check…³ total…⁴
  <fct> <int>  <int> <int> <chr>     <chr> <int> <chr>   <chr>     <int>   <dbl>
1 10        0      1     2 Mt. Keny… Tabl…   187 sub-co… single…       1       0
# … with abbreviated variable names ¹​area_type, ²​age_type, ³​check_sum,
#   ⁴​totals_match

And we can adjust the one that looks to be wrong.

all_tables <-
  all_tables |>
  mutate(
    male = if_else(age == "10" & page == 187, as.integer(1), male)
  )

The Kenyan census provides different tables for the total of each county and sub-county; and then, within each county, for the number in an urban area of that county and the number in a rural area of that county. Some counties only have an urban count, but we would like to make sure that the sum of rural and urban counts equals the total count. This requires reorganizing the data so that, for each area and age, the total, rural, and urban counts are in the same row.

First, we construct different tables for each of the three.

# Table 2.3
table_2_3 <- all_tables |>
  filter(table ==
    "Table 2.3: Distribution of Population by Age, Sex*, County and Sub- County")
table_2_4a <- all_tables |>
  filter(table ==
    "Table 2.4a: Distribution of Rural Population by Age, Sex* and County")
table_2_4b <- all_tables |>
  filter(table ==
    "Table 2.4b: Distribution of Urban Population by Age, Sex* and County")

Then having constructed the constituent parts, we can join them based on age, area, and whether it is a county.

both_2_4s <-
  full_join(
    table_2_4a,
    table_2_4b,
    by = c("age", "area", "area_type"),
    suffix = c("_rural", "_urban")
  )

all <-
  full_join(
    table_2_3,
    both_2_4s,
    by = c("age", "area", "area_type"),
    suffix = c("_all", "_")
  )

all <-
  all |>
  mutate(
    page = glue::glue(
      "Total from p. {page}, rural from p. {page_rural}, urban from p. {page_urban}"
    )
  ) |>
  select(
    -page,
    -page_rural,
    -page_urban,
    -table,
    -table_rural,
    -table_urban,
    -age_type_rural,
    -age_type_urban
  )

rm(both_2_4s, table_2_3, table_2_4a, table_2_4b)

We can now check that the sum of rural and urban is the same as the total.

follow_up <-
  all |>
  mutate(
    total_from_bits = total_rural + total_urban,
    check_total_is_rural_plus_urban =
      if_else(total == total_from_bits, 1, 0),
    total_from_bits - total
  ) |>
  filter(check_total_is_rural_plus_urban == 0)

head(follow_up)
# A tibble: 3 × 16
  age          male female  total area   area_…¹ age_t…² male_…³ femal…⁴ total…⁵
  <fct>       <int>  <int>  <int> <chr>  <chr>   <chr>     <int>   <int>   <int>
1 Not Stated     31     10     41 Nakuru county  single…       8       6      14
2 Total      434287 441379 875666 Bomet  county  age-gr…  420119  427576  847695
3 Not Stated      3      2      5 Bomet  county  single…       2       1       3
# … with 6 more variables: male_urban <int>, female_urban <int>,
#   total_urban <int>, total_from_bits <int>,
#   check_total_is_rural_plus_urban <dbl>, `total_from_bits - total` <int>, and
#   abbreviated variable names ¹​area_type, ²​age_type, ³​male_rural,
#   ⁴​female_rural, ⁵​total_rural
# ℹ Use `colnames()` to see all variable names
rm(follow_up)

There are just a few, but as they only have a difference of 1, we will just move on.

Finally, we want to check that the single age counts sum to the age-groups.

follow_up <-
  all |>
  mutate(groups = case_when(
    age %in% c("0", "1", "2", "3", "4", "0-4") ~ "0-4",
    age %in% c("5", "6", "7", "8", "9", "5-9") ~ "5-9",
    age %in% c("10", "11", "12", "13", "14", "10-14") ~ "10-14",
    age %in% c("15", "16", "17", "18", "19", "15-19") ~ "15-19",
    age %in% c("20", "21", "22", "23", "24", "20-24") ~ "20-24",
    age %in% c("25", "26", "27", "28", "29", "25-29") ~ "25-29",
    age %in% c("30", "31", "32", "33", "34", "30-34") ~ "30-34",
    age %in% c("35", "36", "37", "38", "39", "35-39") ~ "35-39",
    age %in% c("40", "41", "42", "43", "44", "40-44") ~ "40-44",
    age %in% c("45", "46", "47", "48", "49", "45-49") ~ "45-49",
    age %in% c("50", "51", "52", "53", "54", "50-54") ~ "50-54",
    age %in% c("55", "56", "57", "58", "59", "55-59") ~ "55-59",
    age %in% c("60", "61", "62", "63", "64", "60-64") ~ "60-64",
    age %in% c("65", "66", "67", "68", "69", "65-69") ~ "65-69",
    age %in% c("70", "71", "72", "73", "74", "70-74") ~ "70-74",
    age %in% c("75", "76", "77", "78", "79", "75-79") ~ "75-79",
    age %in% c("80", "81", "82", "83", "84", "80-84") ~ "80-84",
    age %in% c("85", "86", "87", "88", "89", "85-89") ~ "85-89",
    age %in% c("90", "91", "92", "93", "94", "90-94") ~ "90-94",
    age %in% c("95", "96", "97", "98", "99", "95-99") ~ "95-99",
    TRUE ~ "Other"
  )) |>
  group_by(area_type, area, groups) |>
  mutate(
    group_sum = sum(total, na.rm = FALSE),
    group_sum = group_sum / 2,
    difference = total - group_sum
  ) |>
  ungroup() |>
  filter(age == groups) |>
  filter(total != group_sum)

head(follow_up)
# A tibble: 6 × 16
  age    male female total area  area_…¹ age_t…² male_…³ femal…⁴ total…⁵ male_…⁶
  <fct> <int>  <int> <int> <chr> <chr>   <chr>     <int>   <int>   <int>   <int>
1 0-4       1      5     6 Mt. … sub-co… age-gr…      NA      NA      NA      NA
2 5-9       1      2     3 Mt. … sub-co… age-gr…      NA      NA      NA      NA
3 10-14     6      0     6 Mt. … sub-co… age-gr…      NA      NA      NA      NA
4 15-19     9      1    10 Mt. … sub-co… age-gr…      NA      NA      NA      NA
5 20-24    21      4    25 Mt. … sub-co… age-gr…      NA      NA      NA      NA
6 25-29    59      9    68 Mt. … sub-co… age-gr…      NA      NA      NA      NA
# … with 5 more variables: female_urban <int>, total_urban <int>, groups <chr>,
#   group_sum <dbl>, difference <dbl>, and abbreviated variable names
#   ¹​area_type, ²​age_type, ³​male_rural, ⁴​female_rural, ⁵​total_rural,
#   ⁶​male_urban
# ℹ Use `colnames()` to see all variable names
rm(follow_up)

Mt. Kenya Forest, Aberdare Forest, and Kakamega Forest are all slightly dodgy. It does not seem to be in the documentation, but it looks like the Kenyan government has apportioned these between various counties. This is understandable, and unlikely to be a big deal, so, again, we will just move on.

11.6.3 Tidy-up

Now that we are confident that everything is looking good, we can just convert it to tidy format. This will make it easier to work with.

all <-
  all |>
  rename(
    male_total = male,
    female_total = female,
    total_total = total
  ) |>
  pivot_longer(
    cols = c(
      male_total,
      female_total,
      total_total,
      male_rural,
      female_rural,
      total_rural,
      male_urban,
      female_urban,
      total_urban
    ),
    names_to = "type",
    values_to = "number"
  ) |>
  separate(
    col = type,
    into = c("gender", "part_of_area"),
    sep = "_"
  ) |>
  select(area, area_type, part_of_area, age, age_type, gender, number)

head(all)
# A tibble: 6 × 7
  area    area_type part_of_area age   age_type  gender  number
  <chr>   <chr>     <chr>        <fct> <chr>     <chr>    <int>
1 Mombasa county    total        Total age-group male    610257
2 Mombasa county    total        Total age-group female  598046
3 Mombasa county    total        Total age-group total  1208303
4 Mombasa county    rural        Total age-group male        NA
5 Mombasa county    rural        Total age-group female      NA
6 Mombasa county    rural        Total age-group total       NA

The original purpose of cleaning this dataset was to make a table that is used by Alexander and Alkema (2021). Just to bring this all together, we could make a graph of single-year counts, by gender, for Nairobi (Figure 11.12).

all |>
  filter(area_type == "county") |>
  filter(part_of_area == "total") |>
  filter(age_type == "single-year") |>
  select(area, age, gender, number) |>
  filter(area == "Nairobi") |>
  filter(gender != "total") |>
  ggplot(aes(x = age, y = number, fill = gender)) +
  geom_col(position = "dodge") +
  scale_y_continuous(labels = scales::comma) +
  scale_x_discrete(breaks = c(seq(from = 0, to = 99, by = 5), "100+")) +
  theme_classic() +
  scale_fill_brewer(palette = "Set1") +
  labs(
    y = "Number",
    x = "Age",
    fill = "Gender",
    caption = "Data source: 2019 Kenya Census"
  )

Figure 11.12: Distribution of age and gender in Nairobi in 2019, based on Kenyan census

A variety of features are clear from Figure 11.12, including age-heaping, a slight difference in the ratio of male and female births, and a substantial difference between the numbers of males and females between roughly ages 15 and 25.

Finally, we may wish to use more informative names. For instance, the Kenyan dataset that we just prepared has the following column names: “area”, “age”, “gender”, and “number”. If we were to use our column names as contracts, then these could be: “chr_area”, “fctr_group_age”, “chr_group_gender”, and “int_group_count”.

column_names_as_contracts <-
  all |>
  filter(area_type == "county") |>
  filter(part_of_area == "total") |>
  filter(age_type == "single-year") |>
  select(area, age, gender, number) |>
  rename(
    "chr_area" = "area",
    "fctr_group_age" = "age",
    "chr_group_gender" = "gender",
    "int_group_count" = "number"
  )

We can then use pointblank (Iannone and Vargas 2022) to set up tests for us.

library(pointblank)

agent <-
  create_agent(tbl = column_names_as_contracts) |>
  col_is_character(columns = vars(chr_area, chr_group_gender)) |>
  col_is_factor(columns = vars(fctr_group_age)) |>
  col_is_integer(columns = vars(int_group_count)) |>
  col_vals_in_set(
    columns = chr_group_gender,
    set = c("male", "female", "total")
  ) |>
  interrogate()

agent
Pointblank Validation
[2022-08-08|07:02:50]

tibble: column_names_as_contracts

STEP  VALIDATION           COLUMNS           VALUES               UNITS  PASS  FAIL
1     col_is_character()   chr_area                                   1     1     0
2     col_is_character()   chr_group_gender                           1     1     0
3     col_is_factor()      fctr_group_age                             1     1     0
4     col_is_integer()     int_group_count                            1     1     0
5     col_vals_in_set()    chr_group_gender  male, female, total    14K   14K     0

2022-08-08 07:02:50 EDT < 1 s 2022-08-08 07:02:50 EDT

11.7 Exercises and tutorial

Exercises

  1. (Plan) Consider the following scenario: You manage a bar with two bartenders and are interested in modelling their efficiency. The bar opens at 8pm and closes at 2am. The efficiency of the bartenders is mildly correlated and defined by the number of drinks that they prepare. Be clear about whether you assume a negative or positive correlation. Please sketch what that dataset could look like and then sketch a graph that you could build to show all observations.
  2. (Simulate) Please further consider the scenario described and simulate the situation.
  3. (Acquire) Please describe a possible source of such a dataset.
  4. (Explore) Please use ggplot2 to build the graph that you sketched.
  5. (Communicate) Please write two paragraphs about what you did.
  6. Where would you save a copy of the raw data?
    1. “inputs”
    2. “outputs”
    3. “scripts”
  7. Which of the following provides a random sample of 10 U.S. states?
    1. sample(x = states, size = 10)
    2. sample(x = us.state.name, size = 10)
    3. sample(x = state.name, size = 10)
    4. sample(x = us.state, size = 10)
  8. If we had a character variable “some_words” with one observation "You know what" within a dataset called “sayings”, then which of the following would split it into its constituent words?
    1. separate(data = sayings, col = some_words, into = c("one", "two", "three"), sep = " ")
    2. split(data = sayings, col = some_words, into = c("one", "two", "three"), sep = " ")
    3. divide(data = sayings, col = some_words, into = c("one", "two", "three"), sep = " ")
    4. part(data = sayings, col = some_words, into = c("one", "two", "three"), sep = " ")
    5. unattach(data = sayings, col = some_words, into = c("one", "two", "three"), sep = " ")
  9. Is the following an example of tidy data?
tibble(
  name = c("Anne", "Bethany", "Stephen", "William"),
  age_group = c("18-29", "30-44", "45-60", "60+"),
)
# A tibble: 4 × 2
  name    age_group
  <chr>   <chr>    
1 Anne    18-29    
2 Bethany 30-44    
3 Stephen 45-60    
4 William 60+      
  10. Which function would change “lemons” into “lemonade”?
    1. str_replace(string = "lemons", pattern = "lemons", replacement = "lemonade")
    2. chr_replace(string = "lemons", pattern = "lemons", replacement = "lemonade")
    3. str_change(string = "lemons", pattern = "lemons", replacement = "lemonade")
    4. chr_change(string = "lemons", pattern = "lemons", replacement = "lemonade")
  11. When dealing with ages what is the most likely class for the variable (select all that apply)?
    1. integer
    2. matrix
    3. numeric
    4. factor
  12. Please consider the cities in Germany. Use testthat to define three tests that could apply.
  13. Which is the only acceptable format for dates in data science?
    1. YYYY-DD-MM
    2. YYYY-MM-DD
    3. DD-MM-YYYY
    4. MM-MM-YYYY
  14. Which of the following does not belong? c(15.9, 14.9, 16.6, 15.8, 16.7, 17.9, I2.6, 11.5, 16.2, 19.5, 15.0)
  15. With regard to “AV Rule 48” from Lockheed Martin (2005, 25), by which of the following are identifiers not allowed to differ (select all that apply)?
    1. Only a mixture of case
    2. The presence/absence of the underscore character
    3. The interchange of the letter “O”, with the number “0” or the letter “D”
    4. The interchange of the letter “I”, with the number “1” or the letter “l”
    5. The interchange of the letter “S” with the number “5”
    6. The interchange of the letter “Z” with the number 2
    7. The interchange of the letter “n” with the letter “h”

Tutorial

This is a variant of a “think-pair-share” exercise (Lyman 1981). With regard to Jordan (2019), D’Ignazio and Klein (2020, chap. 6), Au (2020), and other relevant work, to what extent do you think we should let the data speak for themselves? Please write at least two pages. Following this, please pair with another student and exchange your written work. Update it if needed. Finally, please share your response with the rest of the class.


  1. If this does not work, then the City of Toronto government may have moved the datasets. Instead use: earlier_toronto_shelters <- read_csv("https://www.tellingstorieswithdata.com/inputs/data/earlier_toronto_shelters.csv", show_col_types = FALSE).↩︎

  2. If the DHS link breaks then replace their URL with: https://www.tellingstorieswithdata.com/inputs/pdfs/1996_Tanzania_DHS.pdf.↩︎

  3. If the Kenyan government link breaks then replace their URL with: https://www.tellingstorieswithdata.com/inputs/pdfs/2019_Kenya_census.pdf.↩︎