13  Exploratory data analysis

Required material

Key concepts and skills

Key packages and functions

Key functions

13.1 Introduction

The future of data analysis can involve great progress, the overcoming of real difficulties, and the provision of a great service to all fields of science and technology. Will it? That remains to us, to our willingness to take up the rocky road of real problems in preference to the smooth road of unreal assumptions, arbitrary criteria, and abstract results without real attachments. Who is for the challenge?

Tukey (1962, 64).

Exploratory data analysis is never finished, you just die. It is the active process of exploring and becoming familiar with our data. Like a farmer with their hands in the earth, we need to know every contour and aspect of our data. We need to know how it changes, what it shows, hides, and what are its limits. Exploratory data analysis (EDA) is the unstructured process of doing this.

EDA is a means to an end. While it will inform the entire paper, especially the data section, it is not typically something that ends up in the final paper. The way to proceed is to make a separate R Markdown file, and add code as well as brief notes on-the-go. Do not delete previous code, just add to it. By the end of it we will have created a useful notebook that captures our exploration of the dataset. This is a document that will guide the subsequent analysis and modelling.

EDA draws on a variety of skills and there are a lot of options for EDA (Staniak and Biecek 2019). Every tool should be considered. Look at the data and scroll through it. Make tables, plots, summary statistics, even some models. The key is to iterate, move quickly rather than perfectly, and come to a thorough understanding of the data.

In this chapter we will go through various examples of EDA including TTC subway delays, and Airbnb.

Shoulders of giants

John Tukey

13.2 Case study: TTC subway delays

We can use opendatatoronto (Gelfand 2020) and tidyverse (Wickham et al. 2019) to obtain data about the Toronto subway system, and especially the delays that have occurred. The idea for this case study comes from Monica Alexander.

To begin, we download the data on Toronto Transit Commission (TTC) subway delays in 2020. The data are available as a separate dataset for each month. We are interested in 2020, so we create a column that of the year, and then filter the resources to just those months that were in 2020. We download them using get_resource(), iterating through each month using map_dfr from purrr (Henry and Wickham 2020) which brings each of the twelve datasets together, and then save them. then save them.


# We know this unique key by looking the 'id' of the interest.
ttc_resources <- 
  list_package_resources("996cfe8d-fb35-40ce-b569-698d51fc683b") |>
  mutate(year = str_extract(name, "20..?")) |>  
  filter(year == 2020)

all_2020_ttc_data <- 
  map_dfr(ttc_resources$id, get_resource)

all_2020_ttc_data <- clean_names(all_2020_ttc_data)

write_csv(all_2020_ttc_data, "all_2020_ttc_data.csv")


The dataset has a variety of columns, and we can find out more about each of them by downloading the codebook. The reason for the delay is coded, and so we can also download the explanations. One particular variable of interest appears to be ‘min_delay’, which gives the extent of the delay in minutes.

# Data codebook
delay_data_codebook <- get_resource("54247e39-5a7d-40db-a137-82b2a9ab0708")
delay_data_codebook <- clean_names(delay_data_codebook)
write_csv(delay_data_codebook, "delay_data_codebook.csv")

# Explanation for delay codes
delay_codes <- get_resource("fece136b-224a-412a-b191-8d31eb00491e")
delay_codes <- clean_names(delay_codes)
write_csv(delay_codes, "delay_codes.csv")

There is no one way to explore a dataset while conducting EDA, but we are usually especially interested in:

  • What should the variables look like? For instance, what is their type, what are the values, and what does the distribution of these look like?
  • What aspects are surprising, both in terms of data that are there that we do not expect, such as outliers, but also in terms of data that we may expect here but do not have such as missing data.
  • Developing a goal for our analysis. For instance, in this case, it might be understanding the factors such as stations and the time of day, that are associated with delays.

It is important to document all aspects as we go through and make note of anything surprising. We are looking to create a record of the steps and assumptions that we made as we were going because these will be important when we come to modelling.

13.2.1 Checking data

We should check that the variables are what they say they are. If they are not, then we need to work out what to do, for instance, should we recode them, or even remove them? It is also important to ensure that the class of the variables is as we expect, for instance variables that should be a factor are a factor and those that should be a character are a character. And also that we do not accidentally have, say, factors as numbers, or vice versa. One way to do this is to use unique(), and another is to use table().

[1] "Wednesday" "Thursday"  "Friday"    "Saturday"  "Sunday"    "Monday"   
[7] "Tuesday"  
 [1] "SRT"                    "YU"                     "BD"                    
 [4] "SHP"                    "YU/BD"                  "YU / BD"               
 [7] "999"                    NA                       "29 DUFFERIN"           
[10] "95 YORK MILLS"          "35 JANE"                "YU-BD"                 
[13] "BLOOR - DANFORTH"       "YU/BD LINE"             "YUS"                   
[16] "YUS/BD"                 "40 JUNCTION-DUNDAS WES" "71 RUNNYMEDE"          
[19] "BD/YU"                  "102 MARKHAM ROAD"       "YUS/DB"                
[22] "YU & BD"                "SHEP"                  

   Friday    Monday  Saturday    Sunday  Thursday   Tuesday Wednesday 
     2174      2222      1867      1647      2353      2190      2329 

      102 MARKHAM ROAD            29 DUFFERIN                35 JANE 
                     1                      1                      1 
                     1                      1                      1 
                   999                     BD                  BD/YU 
                     2                   5473                      1 
      BLOOR - DANFORTH                   SHEP                    SHP 
                     1                      1                    619 
                   SRT                     YU                YU / BD 
                   644                   7620                     22 
               YU & BD                  YU-BD                  YU/BD 
                     2                      1                    338 
            YU/BD LINE                    YUS                 YUS/BD 
                     1                      2                      1 

It is clear that we have likely issues in terms of the lines. Some of them have a clear re-code, but not all. One option would be to drop them, but we would need to think about whether these errors might be correlated with something that is of interest, because if they were then we may be dropping important information. There is usually no one right answer, because it will usually depend on what we are using the data for. We would note the issue, as we continued with EDA and then decide later about what to do. For now, we will remove all the lines that are not the ones that we know to be correct.

all_2020_ttc_data <- 
  all_2020_ttc_data |> 
  filter(line %in% c("BD", "YU", "SHP", "SRT"))

Exploring missing data could be a course in itself, but the presence, or lack, of missing values can haunt an analysis. To get started we could look at known-unknowns, which are the NAs for each variable. And vis_dat() and vis_miss() from visdat (Tierney 2017) can be useful to get a feel for how the missing values are distributed.


all_2020_ttc_data |>
  summarise_all(list( ~ sum(is.na(.))))
# A tibble: 1 × 10
   date  time   day station  code min_delay min_gap bound  line vehicle
  <int> <int> <int>   <int> <int>     <int>   <int> <int> <int>   <int>
1     0     0     0       0     0         0       0  3272     0       0
vis_dat(x = all_2020_ttc_data,
         palette = "cb_safe"


In this case we have many missing values in ‘bound’ and two in ‘line’. For these known-unknowns, we are interested in whether the they are missing at random. We want to, ideally, show that data happened to just drop out. This is unlikely, and so we are usually trying to look at what is systematic about how our data are missing.

Sometime data happen to be duplicated. If we did not notice this then our analysis would be wrong in ways that we’d not be able to consistently expect. There are a variety of ways to look for duplicated rows, but get_dupes() from janitor (Firke 2020) is especially useful.

No variable names specified - using all columns.
# A tibble: 37 × 11
   date                time   day    station code  min_delay min_gap bound line 
   <dttm>              <time> <chr>  <chr>   <chr>     <dbl>   <dbl> <chr> <chr>
 1 2020-02-10 00:00:00 06:00  Monday TORONT… MRO           0       0 <NA>  SRT  
 2 2020-02-10 00:00:00 06:00  Monday TORONT… MRO           0       0 <NA>  SRT  
 3 2020-02-10 00:00:00 06:00  Monday TORONT… MUO           0       0 <NA>  SHP  
 4 2020-02-10 00:00:00 06:00  Monday TORONT… MUO           0       0 <NA>  SHP  
 5 2020-03-10 00:00:00 23:00  Tuesd… YORK M… MUO           0       0 <NA>  YU   
 6 2020-03-10 00:00:00 23:00  Tuesd… YORK M… MUO           0       0 <NA>  YU   
 7 2020-03-26 00:00:00 13:20  Thurs… VAUGHA… MUNOA         3       6 S     YU   
 8 2020-03-26 00:00:00 13:20  Thurs… VAUGHA… MUNOA         3       6 S     YU   
 9 2020-03-26 00:00:00 18:32  Thurs… VAUGHA… MUNOA         3       6 S     YU   
10 2020-03-26 00:00:00 18:32  Thurs… VAUGHA… MUNOA         3       6 S     YU   
# … with 27 more rows, and 2 more variables: vehicle <dbl>, dupe_count <int>

This dataset has many duplicates. Again, we are interested in whether there is something systematic going on. Remembering that during EDA we are trying to quickly come to terms with a dataset, one way forward is to flag this as an issue to come back to and explore later, and to just remove duplicates for now using distinct().

all_2020_ttc_data <- 
  all_2020_ttc_data |> 

The station names are a mess. We could try to quickly bring a little order to the chaos by just taking just the first word (or, the first two if it starts with ‘ST’).

all_2020_ttc_data <-
  all_2020_ttc_data |>
  mutate(station_clean = if_else(str_starts(station, "ST"), 
                                 word(station, 1, 2), 
                                 word(station, 1)))

13.2.2 Visualizing data

We need to see the data in its original state to understand it and we use bar charts, scatterplots, line plots and histograms extensively for this. During EDA we are not as concerned with whether the graph is aesthetically pleasing, but are instead trying to acquire a sense of the data as quickly as possible. We can start by looking at the distribution of ‘min_delay’, which is one outcome of interest.

all_2020_ttc_data |>
  ggplot(aes(x = min_delay)) + 

The largely empty graph suggests the presence of outliers. There are a variety of ways to try to understand what could be going on, but one quick way to proceed it to use a log, remembering that we would expect values of 0 to drop away.

all_2020_ttc_data |>
  ggplot(aes(x = min_delay)) + 
  geom_bar() +

This initial exploration further hints at an issue that we might like to explore further. We will join this dataset with ‘delay_codes’ to understand what is going on.

all_2020_ttc_data <-
  all_2020_ttc_data |>
    delay_codes |>
      rename(code = sub_rmenu_code, 
             code_desc = code_description_3) |>
      select(code, code_desc),
    by = "code"

all_2020_ttc_data <-
  all_2020_ttc_data |>
  mutate(code_srt = ifelse(line == "SRT", code, "NA")) |>
    delay_codes |>
      rename(code_srt = sub_rmenu_code, code_desc_srt = code_description_7) |>
      select(code_srt, code_desc_srt),
    by = "code_srt"
  ) |>
    code = ifelse(code_srt == "NA", code, code_srt),
    code_desc = ifelse(is.na(code_desc_srt), code_desc, code_desc_srt)
  ) |>

And so we can see that the 450 minute delay was due to ‘Transit Control Related Problems’, the 446 minute delay was due to ‘Miscellaneous Other’, they seem to be outliers, even among the outliers.

all_2020_ttc_data |> 
  left_join(delay_codes |> 
              rename(code = sub_rmenu_code, code_desc = code_description_3) |>
              select(code, code_desc)) |> 
  arrange(-min_delay) |> 
  select(date, time, station, line, min_delay, code, code_desc)
Joining, by = c("code", "code_desc")
# A tibble: 14,335 × 7
   date                time   station            line  min_delay code  code_desc
   <dttm>              <time> <chr>              <chr>     <dbl> <chr> <chr>    
 1 2020-02-13 00:00:00 05:30  ST GEORGE YUS STA… YU          450 TUCC  Transit …
 2 2020-05-08 00:00:00 16:16  ST CLAIR STATION   YU          446 MUO   Miscella…
 3 2020-01-22 00:00:00 05:57  KEELE STATION      BD          258 EUTR  Trucks   
 4 2020-03-19 00:00:00 11:26  ROYAL YORK STATION BD          221 MUPR1 Priority…
 5 2020-11-12 00:00:00 23:10  SHEPPARD STATION   YU          197 PUCSC Signal C…
 6 2020-12-13 00:00:00 21:37  MCCOWAN STATION    SRT         167 PRSP  <NA>     
 7 2020-12-04 00:00:00 16:23  MCCOWAN STATION    SRT         165 PRSP  <NA>     
 8 2020-01-18 00:00:00 05:48  SCARBOROUGH RAPID… SRT         162 PRSL  <NA>     
 9 2020-02-22 00:00:00 05:16  SPADINA TO OSGOODE YU          159 PUSWZ Work Zon…
10 2020-09-03 00:00:00 14:35  CASTLE FRANK STAT… BD          150 MUPR1 Priority…
# … with 14,325 more rows

13.2.3 Groups of small counts

Another thing that we are looking for is various groupings of the data, especially where sub-groups may end up with only a small numbers of observations in them. This is because any analysis could be especially influenced by them. One quick way to do this is to group the data by a variable that is of interest, for instance ‘line’, using color.

ggplot(data = all_2020_ttc_data) + 
  geom_histogram(aes(x = min_delay, y = ..density.., fill = line), 
                 position = 'dodge', 
                 bins = 10) + 

Figure 13.1 uses density so that we can look at the the distributions more comparably, but we should also be aware of differences in frequency (Figure 13.2)). In this case, we will see that ‘SHP’ and ‘SRT’ have much smaller counts.

ggplot(data = all_2020_ttc_data) + 
  geom_histogram(aes(x = min_delay, fill = line), 
                 position = 'dodge', 
                 bins = 10) + 

To group by another variable we can add facets (Figure 13.3).

ggplot(data = all_2020_ttc_data) + 
  geom_density(aes(x = min_delay, color = line), 
               bw = .08) + 
  scale_x_log10() + 

We can now plot the top five stations by mean delay.

all_2020_ttc_data |> 
  group_by(line, station_clean) |> 
  summarise(mean_delay = mean(min_delay), n_obs = n()) |> 
  filter(n_obs>1) |> 
  arrange(line, -mean_delay) |> 
  slice(1:5) |> 
  ggplot(aes(station_clean, mean_delay)) + 
    geom_col() + 
    coord_flip() + 
               scales = "free_y")
`summarise()` has grouped output by 'line'. You can override using the
`.groups` argument.

13.2.4 Dates

Dates are often difficult to work with because they are so prone to having issues. For this reason, it is especially important to consider them during EDA. We could create a graph by week, to see if there is any seasonality. When using dates, lubridate (Grolemund and Wickham 2011) is especially useful. For instance, we can look at the average delay, of those that were delayed, by week drawing on week() to construct the weeks (Figure 13.4).


all_2020_ttc_data |>
  filter(min_delay > 0) |>
  mutate(week = week(date)) |>
  group_by(week, line) |>
  summarise(mean_delay = mean(min_delay)) |>
  ggplot(aes(week, mean_delay, color = line)) +
  geom_point() +
  geom_smooth() +
              scales = "free_y"

Now let us look at the proportion of delays that were greater than 10 minutes (Figure 13.5).

all_2020_ttc_data |> 
  mutate(week = week(date)) |> 
  group_by(week, line) |> 
  summarise(prop_delay = sum(min_delay>10)/n()) |> 
  ggplot(aes(week, prop_delay, color = line)) + 
    geom_point() + 
    geom_smooth() + 
              scales = "free_y"

These figures, tables, and analysis have no place in a final paper. Instead, they allow us to become comfortable with the data. We note aspects about each that stand out, as well as the warnings and any implications or aspects to return to.

13.2.5 Relationships

We are also interested in looking at the relationship between two variables. Scatter plots are especially useful here for continuous variables, and are a good precursor to modeling. For instance, we may be interested in the relationship between the delay and the gap (Figure 13.6).

all_2020_ttc_data |>
  ggplot(aes(x = min_delay, y = min_gap)) +
  geom_point() +
  scale_x_log10() +

The relationship between categorical variables takes more work, but we could also, for instance, look at the top five reasons for delay by station. In particular, we may be interested in whether they differ, and how any difference could be modeled (Figure 13.7).

all_2020_ttc_data |>
  group_by(line, code_desc) |>
  summarise(mean_delay = mean(min_delay)) |>
  arrange(-mean_delay) |>
  slice(1:5) |>
  ggplot(aes(x = code_desc,
             y = mean_delay)) +
  geom_col() + 
             scales = "free_y",
             nrow = 4) +

Principal components analysis (PCA) is another powerful exploratory tool. It allows us to pick up potential clusters and outliers that can help to inform modeling. To see this, we can look at the types of delay by station. The delay categories are messy and there a lot of them, but as we are trying to come to terms with the dataset, we will just take the first word.

all_2020_ttc_data <-
  all_2020_ttc_data |>
  mutate(code_red = case_when(
    str_starts(code_desc, "No") ~ word(code_desc, 1, 2),
    str_starts(code_desc, "Operator") ~ word(code_desc, 1, 2),
    TRUE ~ word(code_desc, 1)

Let us also just restrict the analysis to causes that happen at least 50 times over 2019. To do the PCA, the dataframe also needs to be switched to wide format.

dwide <-
  all_2020_ttc_data |>
  group_by(line, station_clean) |>
  mutate(n_obs = n()) |>
  filter(n_obs > 1) |>
  group_by(code_red) |>
  mutate(tot_delay = n()) |>
  arrange(tot_delay) |>
  filter(tot_delay > 50) |>
  group_by(line, station_clean, code_red) |>
  summarise(n_delay = n()) |>
  pivot_wider(names_from = code_red, values_from = n_delay) |>
  mutate_all(.funs = funs(ifelse(is.na(.), 0, .)))
`summarise()` has grouped output by 'line', 'station_clean'. You can override
using the `.groups` argument.
`mutate_all()` ignored the following grouping variables:
Warning: `funs()` was deprecated in dplyr 0.8.0.
Please use a list of either functions or lambdas: 

  # Simple named list: 
  list(mean = mean, median = median)

  # Auto named with `tibble::lst()`: 
  tibble::lst(mean, median)

  # Using lambdas
  list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Now we can quickly do some PCA.

delay_pca <- prcomp(dwide[,3:ncol(dwide)])

df_out <- as_tibble(delay_pca$x)
df_out <- bind_cols(dwide |> select(line, station_clean), df_out)
# A tibble: 6 × 32
# Groups:   line, station_clean [6]
  line  station_clean    PC1    PC2     PC3   PC4    PC5   PC6    PC7   PC8
  <chr> <chr>          <dbl>  <dbl>   <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
1 BD    BATHURST       10.9   17.2   2.04   12.9   4.60  -2.98  2.31   5.02
2 BD    BAY            14.6    6.54  4.76   14.4   0.406  3.09 -0.144  4.68
3 BD    BLOOR          22.8  -18.6  19.7    -7.37 -1.54  -8.60 -1.36   1.08
4 BD    BLOOR-DANFORTH 23.4  -20.2  20.4    -4.85 -0.429 -6.77 -0.562  1.19
5 BD    BROADVIEW       9.29  22.0  -0.0365  6.72  4.31   1.73  0.304  4.71
6 BD    CASTLE         15.1    5.21  7.62   11.6  -1.17   2.77 -1.71   4.93
# … with 22 more variables: PC9 <dbl>, PC10 <dbl>, PC11 <dbl>, PC12 <dbl>,
#   PC13 <dbl>, PC14 <dbl>, PC15 <dbl>, PC16 <dbl>, PC17 <dbl>, PC18 <dbl>,
#   PC19 <dbl>, PC20 <dbl>, PC21 <dbl>, PC22 <dbl>, PC23 <dbl>, PC24 <dbl>,
#   PC25 <dbl>, PC26 <dbl>, PC27 <dbl>, PC28 <dbl>, PC29 <dbl>, PC30 <dbl>

We can plot the first two principal components, and add labels to some of the outlying stations.

ggplot(df_out,aes(x=PC1,y=PC2,color=line )) + 
  geom_point() + 
  geom_text_repel(data = df_out |> filter(PC2>100|PC1<100*-1), 
                  aes(label = station_clean)

We could also plot the factor loadings. We see some evidence that perhaps one is to do with the public, compared with another to do with the operator.

df_out_r <- as_tibble(delay_pca$rotation)
df_out_r$feature <- colnames(dwide[,3:ncol(dwide)])

# A tibble: 30 × 31
        PC1    PC2      PC3      PC4      PC5     PC6      PC7      PC8     PC9
      <dbl>  <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>    <dbl>   <dbl>
 1 -0.0279  0.125  -0.0576   0.0679  -0.0133   0.0307 -0.00449  0.134   -0.0472
 2 -0.108   0.343  -0.150    0.205   -0.100    0.317  -0.116    0.252   -0.407 
 3 -0.0196  0.0604 -0.0207   0.0195   0.0189   0.0574 -0.0803  -0.0467  -0.146 
 4 -0.0244  0.0752 -0.0325  -0.0237   0.00121 -0.0263 -0.0137   0.0251   0.104 
 5 -0.0128  0.0113 -0.00340 -0.00977 -0.0255   0.0186 -0.0645  -0.0552  -0.0302
 6 -0.231   0.650  -0.309    0.245    0.222   -0.309   0.172   -0.0711   0.363 
 7 -0.0871  0.233  -0.0904  -0.692   -0.311   -0.414  -0.216    0.00703 -0.116 
 8 -0.00377 0.0193 -0.00201 -0.0140  -0.0424   0.0751 -0.146   -0.0712  -0.0203
 9 -0.0167  0.120  -0.0367  -0.578    0.336    0.563   0.207    0.226    0.289 
10 -0.0708  0.276  -0.118    0.116   -0.368    0.435  -0.178    0.0583  -0.0173
# … with 20 more rows, and 22 more variables: PC10 <dbl>, PC11 <dbl>,
#   PC12 <dbl>, PC13 <dbl>, PC14 <dbl>, PC15 <dbl>, PC16 <dbl>, PC17 <dbl>,
#   PC18 <dbl>, PC19 <dbl>, PC20 <dbl>, PC21 <dbl>, PC22 <dbl>, PC23 <dbl>,
#   PC24 <dbl>, PC25 <dbl>, PC26 <dbl>, PC27 <dbl>, PC28 <dbl>, PC29 <dbl>,
#   PC30 <dbl>, feature <chr>
ggplot(df_out_r,aes(x=PC1,y=PC2,label=feature )) + geom_text_repel()
Warning: ggrepel: 22 unlabeled data points (too many overlaps). Consider
increasing max.overlaps

13.3 Case study: Airbnb listing in Toronto

In this case study we look at Airbnb listings in Toronto. The dataset is from Inside Airbnb (Cox 2021) and we will read it from their website, and then save a local copy. We can give read_csv() a link to where the dataset is and it will download it. This helps with reproducibility because the source is clear. But, as that link could change at any time, longer-term reproducibility, as well as wanting to minimize the effect on the Inside Airbnb servers, suggest that we should also save a local copy of the data and then use that. As the original data is not ours, we should not make that public without first getting written permission. The ‘guess_max’ option in read_csv helps us avoid having to specify the column types. Usually read_csv() takes a best guess at the column types based on the first few rows. But sometimes those first ones are misleading and so ‘guess_max’ forces it to look at a larger number of rows to try to work out what is going on.


airbnb_data <- 
           guess_max = 20000)

write_csv(airbnb_data, "airbnb_data.csv")


13.3.1 Explore individual variables

There are a large number of columns, so we will just select a few.

names(airbnb_data) |> 
[1] 74
airbnb_data_selected <- 
  airbnb_data |> 

# A tibble: 18,265 × 21
   host_id host_since host_response_time host_is_superhost host_listings_count
     <dbl> <date>     <chr>              <lgl>                           <dbl>
 1    1565 2008-08-08 N/A                FALSE                               1
 2   22795 2009-06-22 N/A                FALSE                               2
 3   48239 2009-10-25 N/A                FALSE                               1
 4   93825 2010-03-15 N/A                FALSE                               2
 5  118124 2010-05-04 within a day       FALSE                               1
 6   22795 2009-06-22 N/A                FALSE                               2
 7  174063 2010-07-20 within an hour     TRUE                                3
 8  183071 2010-07-28 within an hour     TRUE                                2
 9  187320 2010-08-01 within a few hours TRUE                               13
10  192364 2010-08-05 N/A                FALSE                               1
# … with 18,255 more rows, and 16 more variables:
#   host_total_listings_count <dbl>, host_neighbourhood <chr>,
#   neighbourhood_cleansed <chr>, room_type <chr>, bathrooms <lgl>,
#   bedrooms <dbl>, price <chr>, number_of_reviews <dbl>,
#   has_availability <lgl>, review_scores_rating <dbl>,
#   review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>,
#   review_scores_checkin <dbl>, review_scores_communication <dbl>, …

First we might be interested in price. It is a character at the moment and so we need to convert it to a numeric. This is a common problem, and we need to be a little careful that it does not all just convert to NAs. In our case if we just force the price data to be a numeric then it will go to NA because there are a lot of characters where it is unclear what the numeric equivalent is, such as ‘$’. So we need to remove those characters first.

airbnb_data_selected$price |> head()
[1] "$469.00" "$96.00"  "$64.00"  "$70.00"  "$45.00"  "$127.00"
airbnb_data_selected$price |> str_split("") |> unlist() |> unique()
 [1] "$" "4" "6" "9" "." "0" "7" "5" "1" "2" "3" "8" ","
airbnb_data_selected |> 
  select(price) |> 
  filter(str_detect(price, ","))
# A tibble: 145 × 1
 1 $1,724.00
 2 $1,000.00
 3 $1,100.00
 4 $1,450.00
 5 $1,019.00
 6 $1,000.00
 7 $1,300.00
 8 $2,142.00
 9 $2,000.00
10 $1,200.00
# … with 135 more rows
airbnb_data_selected <- 
  airbnb_data_selected |> 
  mutate(price = str_remove(price, "\\$"),
         price = str_remove(price, ","),
         price = as.integer(price)

Now we can have a look at the distribution of prices (Figure 13.8).

airbnb_data_selected |>
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 10) +
  theme_classic() +
  labs(x = "Price per night",
       y = "Number of properties")

It is clear that there are outliers, so again we might like to consider it on the log scale (Figure 13.9).

airbnb_data_selected |>
  filter(price > 1000) |> 
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 10) +
  theme_classic() +
  labs(x = "Price per night",
       y = "Number of properties") +

If we focus on prices that are less than $1,000 then we see that the majority of properties have a nightly price less than $250. Interestingly it looks like there is some bunching of prices, possible around numbers ending in zero or nine. Let us just zoom in on prices between $90 and $210, out of interest, but change the bins to be smaller (?fig-airbnbpricesbunch).

airbnb_data_selected |>
  filter(price < 1000) |> 
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 10) +
  theme_classic() +
  labs(x = "Price per night",
       y = "Number of properties")

airbnb_data_selected |>
  filter(price > 90) |> 
  filter(price < 210) |> 
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 1) +
  theme_classic() +
  labs(x = "Price per night",
       y = "Number of properties")

For now, we will just remove all prices that are more than $999.

airbnb_data_selected <- 
  airbnb_data_selected |> 
  filter(price < 1000)

Superhosts are especially experienced Airbnb hosts, and we might be interested to learn more about them. For instance, a host either is or is not a superhost, and so we would not expect any NAs. But we can see that there are. It might be that the host removed a listing or similar. For now, we will remove them.

airbnb_data_selected |>
# A tibble: 11 × 21
     host_id host_since host_response_time host_is_superhost host_listings_count
       <dbl> <date>     <chr>              <lgl>                           <dbl>
 1  23472830 NA         <NA>               NA                                 NA
 2  31675651 NA         <NA>               NA                                 NA
 3  75779190 NA         <NA>               NA                                 NA
 4  47614482 NA         <NA>               NA                                 NA
 5 201103629 NA         <NA>               NA                                 NA
 6 111820893 NA         <NA>               NA                                 NA
 7  23472830 NA         <NA>               NA                                 NA
 8 196269219 NA         <NA>               NA                                 NA
 9  23472830 NA         <NA>               NA                                 NA
10 266594170 NA         <NA>               NA                                 NA
11 118516038 NA         <NA>               NA                                 NA
# … with 16 more variables: host_total_listings_count <dbl>,
#   host_neighbourhood <chr>, neighbourhood_cleansed <chr>, room_type <chr>,
#   bathrooms <lgl>, bedrooms <dbl>, price <int>, number_of_reviews <dbl>,
#   has_availability <lgl>, review_scores_rating <dbl>,
#   review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>,
#   review_scores_checkin <dbl>, review_scores_communication <dbl>,
#   review_scores_location <dbl>, review_scores_value <dbl>

We will also want to create a binary variable from this. It is true/false at the moment, which is fine for the modelling, but there are a handful of situations where it will be easier if we have a 0/1. And for now we will just remove anyone with an NA for whether they are a super host.

airbnb_data_selected <- 
  airbnb_data_selected |>
  filter(!is.na(host_is_superhost)) |>
  mutate(host_is_superhost_binary = if_else(
    host_is_superhost == TRUE, 1, 0)

On Airbnb, guests can give 1-5 star ratings across a variety of different aspects, including cleanliness, accuracy, value, and others. But when we look at the reviews in our dataset, it is less clear how this is being constructed, because it appears to be a rating between 20 and 100 and there are also NA values (Figure 13.12).

airbnb_data_selected |>
  ggplot(aes(x = review_scores_rating)) +
  geom_bar() +
  theme_classic() +
  labs(x = "Reviews scores rating",
       y = "Number of properties")

We would like to deal with the NAs in ‘review_scores_rating’, but this is more complicated as there are a lot of them. It may be that this is just because they do not have any reviews.

airbnb_data_selected |>
  filter(is.na(review_scores_rating)) |>
[1] 4308
airbnb_data_selected |>
  filter(is.na(review_scores_rating)) |> 
  select(number_of_reviews) |> 

   0    1    2    3    4 
4046  226   23   10    3 

So these properties do not have a review rating yet because they do not have enough reviews. It is a large proportion of the total, at almost a fifth of them so we might like to look at this in more detail using vis_miss() from visdat (Tierney 2017). We are interested to see whether there is something systematic happening with these properties. For instance, if the NAs were being driven by, say, some requirement of a minimum number of reviews, then we would expect they would all be missing.

airbnb_data_selected |> 
         review_scores_value) |> 

Given it looks convincing that in almost all cases, the different types of reviews are missing for the same observation. One approach would be to just focus on those that are not missing and the main review score. It is clear that almost all the reviews are more than 80. Let us just zoom in on that 60 to 80 range to check what the distribution looks like in that range (Figure 13.13).

airbnb_data_selected |>
  filter(!is.na(review_scores_rating)) |> 
  filter(review_scores_rating > 60) |>
  filter(review_scores_rating < 80) |> 
  ggplot(aes(x = review_scores_rating)) +
  geom_histogram(binwidth = 1) +
  theme_classic() +
  labs(x = "Average review score",
       y = "Number of properties")

For now, we will remove anyone with an NA in their main review score, even though this will remove roughly 20 per cent of observations. And we will also focus only on those hosts with a main review score of at least 70. If we ended up using this dataset for actual analysis, then we would want to justify this decision in an appendix or similar.

airbnb_data_selected <- 
  airbnb_data_selected |>
  filter(!is.na(review_scores_rating)) |>
  filter(review_scores_rating >= 70)

Another important factor is how quickly a host responds to an enquiry. Airbnb allows hosts up to 24 hours to respond, but encourages responses within an hour.

airbnb_data_selected |>
# A tibble: 5 × 2
  host_response_time     n
  <chr>              <int>
1 a few days or more   494
2 N/A                 5708
3 within a day         952
4 within a few hours  1649
5 within an hour      4668

It is unclear a host could have a response time of NA. It may be that they are related to some other variable. Interestingly it seems like what looks like ‘NAs’ in ‘host_response_time’ variable are not coded as proper NAs, but are instead being treated as another category. We will recode them to be actual NAs and also change the variable to be a factor.

airbnb_data_selected <- 
  airbnb_data_selected |>
  mutate(host_response_time = if_else(host_response_time == "N/A", NA_character_, host_response_time),
         host_response_time = factor(host_response_time)

There is clearly an issue with NAs as there are a lot of them. For instance, we might be interested to see if there is a relationship with the review score (Figure 13.14). There are a lot that have an overall review of 100.

airbnb_data_selected |> 
  filter(is.na(host_response_time)) |> 
  ggplot(aes(x = review_scores_rating)) +
  geom_histogram(binwidth = 1) +
  theme_classic() +
  labs(x = "Average review score",
       y = "Number of properties")

For now, we will remove anyone with a NA in their response time. This will again removes roughly another 20 per cent of the observations.

airbnb_data_selected <- 
  airbnb_data_selected |> 

There are two versions of a variable that suggest how many properties a host has on Airbnb. We might be interested to know whether there is a difference between them.

airbnb_data_selected |> 
  mutate(listings_count_is_same = if_else(host_listings_count == host_total_listings_count, 1, 0)) |> 
  filter(listings_count_is_same == 0)
# A tibble: 0 × 23
# … with 23 variables: host_id <dbl>, host_since <date>,
#   host_response_time <fct>, host_is_superhost <lgl>,
#   host_listings_count <dbl>, host_total_listings_count <dbl>,
#   host_neighbourhood <chr>, neighbourhood_cleansed <chr>, room_type <chr>,
#   bathrooms <lgl>, bedrooms <dbl>, price <int>, number_of_reviews <dbl>,
#   has_availability <lgl>, review_scores_rating <dbl>,
#   review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>, …

As there are no differences in this dataset, we can just remove one variable for now and have a look at the other one (Figure 13.15).

airbnb_data_selected <- 
  airbnb_data_selected |> 

airbnb_data_selected |> 
  ggplot(aes(x = host_total_listings_count)) +
  geom_bar() +

There are a large number who have somewhere in the 2-10 properties range, but the usual long tail. The number with 0 listings is unexpected and worth following up on. And there are a bunch with NA that we will need to deal with.

airbnb_data_selected |> 
  filter(host_total_listings_count == 0) |> 
# A tibble: 6 × 21
  host_id host_since host_response_time host_is_superhost host_total_listings_c…
    <dbl> <date>     <fct>              <lgl>                              <dbl>
1 3783106 2012-10-06 within an hour     FALSE                                  0
2 3814089 2012-10-09 within an hour     FALSE                                  0
3 3827668 2012-10-10 within a day       FALSE                                  0
4 2499198 2012-05-30 within a day       FALSE                                  0
5 3268493 2012-08-15 within a day       FALSE                                  0
6 8675040 2013-09-06 within an hour     TRUE                                   0
# … with 16 more variables: host_neighbourhood <chr>,
#   neighbourhood_cleansed <chr>, room_type <chr>, bathrooms <lgl>,
#   bedrooms <dbl>, price <int>, number_of_reviews <dbl>,
#   has_availability <lgl>, review_scores_rating <dbl>,
#   review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>,
#   review_scores_checkin <dbl>, review_scores_communication <dbl>,
#   review_scores_location <dbl>, review_scores_value <dbl>, …

There is nothing that immediately jumps out as odd about the people with zero listings, but there must be something going on. For now, we will focus on only those with one property.

airbnb_data_selected <- 
  airbnb_data_selected |> 
  add_count(host_id) |> 
  filter(n == 1) |> 

13.3.2 Explore relationships between variables

We might like to make some graphs to see if there are any relationships that become clear. Some aspects that come to mind is looking at prices and reviews and super hosts, and number of properties and neighborhood.

Look at the relationship between price and reviews, and whether they are a super-host (Figure 13.16).

airbnb_data_selected |>
  ggplot(aes(x = price, y = review_scores_rating, color = host_is_superhost)) +
  geom_point(size = 1, alpha = 0.1) + 
  theme_classic() +
  labs(x = "Price per night",
       y = "Average review score",
       color = "Super host") +
  scale_color_brewer(palette = "Set1")

One of the aspects that may make someone a super host is how quickly they respond to inquiries. One could imagine that being a superhost involves quickly saying yes or no to inquiries. Let us look at the data. First, we want to look at the possible values of superhost by their response times.

airbnb_data_selected |> 
  count(host_is_superhost) |>
  mutate(proportion = n / sum(n),
         proportion = round(proportion, digits = 2))
# A tibble: 2 × 3
  host_is_superhost     n proportion
  <lgl>             <int>      <dbl>
1 FALSE              1677       0.58
2 TRUE               1234       0.42

Fortunately, it looks like when we removed the reviews rows we removed any NAs from whether they were a super host, but if we go back and look into that we may need to check again. We could build a table that looks at a hosts response time by whether they are a superhost using tabyl() from janitor (Firke 2020). It is clear that is a host does not respond within an hour then it is unlikely that they are a super host.

airbnb_data_selected |> 
  tabyl(host_response_time, host_is_superhost) |> 
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 0) |>
  adorn_ns() |> 
 host_response_time             FALSE      TRUE
 a few days or more         90% (223) 10%  (26)
       within a day         70% (348) 30% (149)
 within a few hours         55% (354) 45% (284)
     within an hour         49% (752) 51% (775)

Finally, we could look at neighborhood. The data provider has attempted to clean the neighborhood variable for us, so will just use that variable for now. Although if we ended up using this variable for our actual analysis we would want to check they had not made any errors.

airbnb_data_selected |> 
  tabyl(neighbourhood_cleansed) |> 
  adorn_totals("row") |> 
  adorn_pct_formatting() |> 
[1] 140
airbnb_data_selected |> 
  tabyl(neighbourhood_cleansed) |> 
  adorn_pct_formatting() |> 
  arrange(-n) |> 
  filter(n > 100) |> 
  adorn_totals("row") |> 
            neighbourhood_cleansed   n percent
 Waterfront Communities-The Island 488   16.8%
                           Niagara 129    4.4%
                             Annex 102    3.5%
                             Total 719       -

13.3.3 Explore multiple relationships with a model

We will now run some models on our dataset. We will cover modeling in more detail in Chapter @ref(ijalm), but we can use models during EDA to help get a better sense of relationships that may exist between multiple variables in a dataset. For instance, we may like to see whether we can forecast whether someone is a super host, and the factors that go into explaining that. As the dependent variable is binary, this is a good opportunity to use logistic regression. We expect that better reviews will be associated with faster responses and higher reviews. Specifically, the model that we estimate is:

\[\mbox{Prob(Is super host} = 1) = \beta_0 + \beta_1 \mbox{Response time} + \beta_2 \mbox{Reviews} + \epsilon\]

We estimate the model using glm in the R language (R Core Team 2021).

logistic_reg_superhost_response_review <- glm(host_is_superhost ~ 
                                                host_response_time + 
                                              data = airbnb_data_selected,
                                              family = binomial

We can have a quick look at the results using modelsummary() from modelsummary (Arel-Bundock 2021)

Model 1
(Intercept) −18.931
host_response_timewithin a day 1.334
host_response_timewithin a few hours 1.932
host_response_timewithin an hour 2.213
review_scores_rating 0.173
Num.Obs. 2911
AIC 3521.4
BIC 3551.3
Log.Lik. −1755.698
F 75.649

13.4 Exercises and tutorial

13.4.1 Exercises

  1. In your own words what is exploratory data analysis (this will be difficult, but please write only one nuanced paragraph)?
  2. In Tukey’s words, what is exploratory data analysis (please write one paragraph)?
  3. Who was Tukey (please write a paragraph or two)?
  4. If you have a dataset called ‘my_data’, which has two columns: ‘first_col’ and ‘second_col’, then could you please write some rough R code that would generate a graph (the type of graph does not matter).
  5. Consider a dataset that has 500 rows and 3 columns, so there are 1,500 cells. If 100 of the cells are missing data for at least one of the columns, then would you remove the whole row your dataset or try to run your analysis on the data as is, or some other procedure? What if your dataset had 10,000 rows instead, but the same number of missing cells?
  6. Please note three ways of identifying unusual values.
  7. What is the difference between a categorical and continuous variable?
  8. What is the difference between a factor and an integer variable?
  9. How can we think about who is systematically excluded from a dataset?
  10. Using the opendatatoronto package, download the data on mayoral campaign contributions for 2014. (note: the 2014 file you will get from get_resource, so just keep the sheet that relates to the Mayor election).
    1. Clean up the data format (fixing the parsing issue and standardizing the column names using janitor)
    2. Summarize the variables in the dataset. Are there missing values, and if so, should we be worried about them? Is every variable in the format it should be? If not, create new variable(s) that are in the right format.
    3. Visually explore the distribution of values of the contributions. What contributions are notable outliers? Do they share a similar characteristic(s)? It may be useful to plot the distribution of contributions without these outliers to get a better sense of the majority of the data.
    4. List the top five candidates in each of these categories: 1) total contributions; 2) mean contribution; and 3) number of contributions.
    5. Repeat that process, but without contributions from the candidates themselves.
    6. How many contributors gave money to more than one candidate?
  11. Name three geoms that produce graphs that have bars on them ggplot().
  12. Consider a dataset 10,000 observations and 27 variables. For each observation, there is at least one missing variable. Please discuss, in a paragraph or two, the steps that you would take to understand what is going on.
  13. Known missing data, are those that leave holes in your dataset. But what about data that were never collected? Please look at McClelland (2019) and Luscombe and McClelland (2020). Look into how they gathered their dataset and what it took to put this together. What is in the dataset and why? What is missing and why? How could this affect the results? How might similar biases enter into other datasets that you have used or read about?

13.4.2 Tutorial

Redo the Airbnb EDA but for Paris. Please submit a PDF.