13 Exploratory data analysis
Required material
- Read The Future of Data Analysis, Part 1 ‘General Considerations’, (Tukey 1962).
Key concepts and skills
- Quickly coming to terms with a new dataset by constructing graphs and tables.
- Understanding the issues and features of the dataset and how this may affect analysis decisions.
- Thinking about missing values and outliers.
Key packages and functions
broom
ggrepel
here
janitor
lubridate
opendatatoronto
tidymodels
tidyverse
visdat
Key functions
augment()
clean_names()
coord_flip()
count()
distinct()
facet_grid()
facet_wrap()
geom_bar()
geom_col()
geom_density()
geom_histogram()
geom_line()
geom_point()
geom_smooth()
geom_text_repel()
get_dupes()
glance()
if_else()
ifelse()
initial_split()
left_join()
mutate()
mutate_all()
names()
ncol()
nrow()
pivot_wider()
scale_color_brewer()
scale_fill_brewer()
scale_x_log10()
scale_y_log10()
str_detect()
str_extract()
str_remove()
str_split()
str_starts()
summarise()
summarise_all()
theme_classic()
theme_minimal()
vis_dat()
vis_miss()
13.1 Introduction
The future of data analysis can involve great progress, the overcoming of real difficulties, and the provision of a great service to all fields of science and technology. Will it? That remains to us, to our willingness to take up the rocky road of real problems in preference to the smooth road of unreal assumptions, arbitrary criteria, and abstract results without real attachments. Who is for the challenge?
Tukey (1962, 64).
Exploratory data analysis is never finished, you just die. It is the active process of exploring and becoming familiar with our data. Like a farmer with their hands in the earth, we need to know every contour and aspect of our data. We need to know how it changes, what it shows, hides, and what are its limits. Exploratory data analysis (EDA) is the unstructured process of doing this.
EDA is a means to an end. While it will inform the entire paper, especially the data section, it is not typically something that ends up in the final paper. The way to proceed is to make a separate R Markdown file, and add code as well as brief notes on-the-go. Do not delete previous code, just add to it. By the end of it we will have created a useful notebook that captures our exploration of the dataset. This is a document that will guide the subsequent analysis and modelling.
EDA draws on a variety of skills and there are a lot of options for EDA (Staniak and Biecek 2019). Every tool should be considered. Look at the data and scroll through it. Make tables, plots, summary statistics, even some models. The key is to iterate, move quickly rather than perfectly, and come to a thorough understanding of the data.
In this chapter we will go through various examples of EDA including TTC subway delays, and Airbnb.
13.2 Case study: TTC subway delays
We can use opendatatoronto (Gelfand 2020) and tidyverse (Wickham et al. 2019) to obtain data about the Toronto subway system, and especially the delays that have occurred. The idea for this case study comes from Monica Alexander.
To begin, we download the data on Toronto Transit Commission (TTC) subway delays in 2020. The data are available as a separate dataset for each month. We are interested in 2020, so we create a column of the year, and then filter the resources to just those months that were in 2020. We download them using get_resource(), iterating through each month using map_dfr() from purrr (Henry and Wickham 2020), which brings each of the twelve datasets together, and then save them.
library(janitor)
library(opendatatoronto)
library(tidyverse)

# We know this unique key by looking at the 'id' of the dataset of interest.
ttc_resources <-
  list_package_resources("996cfe8d-fb35-40ce-b569-698d51fc683b") |>
  mutate(year = str_extract(name, "20..?")) |>
  filter(year == 2020)

all_2020_ttc_data <-
  map_dfr(ttc_resources$id, get_resource)

all_2020_ttc_data <- clean_names(all_2020_ttc_data)

write_csv(all_2020_ttc_data, "all_2020_ttc_data.csv")

all_2020_ttc_data
The dataset has a variety of columns, and we can find out more about each of them by downloading the codebook. The reason for the delay is coded, and so we can also download the explanations. One particular variable of interest appears to be ‘min_delay’, which gives the extent of the delay in minutes.
# Data codebook
delay_data_codebook <- get_resource("54247e39-5a7d-40db-a137-82b2a9ab0708")
delay_data_codebook <- clean_names(delay_data_codebook)
write_csv(delay_data_codebook, "delay_data_codebook.csv")

# Explanation for delay codes
delay_codes <- get_resource("fece136b-224a-412a-b191-8d31eb00491e")
delay_codes <- clean_names(delay_codes)
write_csv(delay_codes, "delay_codes.csv")
There is no one way to explore a dataset while conducting EDA, but we are usually especially interested in:
- What should the variables look like? For instance, what is their type, what are the values, and what does the distribution of these look like?
- What aspects are surprising, both in terms of data that are there that we do not expect, such as outliers, but also in terms of data that we may expect to be there but are not, such as missing data.
- Developing a goal for our analysis. For instance, in this case, it might be understanding the factors, such as station and time of day, that are associated with delays.
It is important to document all aspects as we go through and make note of anything surprising. We are looking to create a record of the steps and assumptions that we made as we were going because these will be important when we come to modelling.
13.2.1 Checking data
We should check that the variables are what they say they are. If they are not, then we need to work out what to do, for instance, should we recode them, or even remove them? It is also important to ensure that the class of the variables is as we expect, for instance variables that should be a factor are a factor and those that should be a character are a character. And also that we do not accidentally have, say, factors as numbers, or vice versa. One way to do this is to use unique(), and another is to use table().
unique(all_2020_ttc_data$day)
[1] "Wednesday" "Thursday" "Friday" "Saturday" "Sunday" "Monday"
[7] "Tuesday"
unique(all_2020_ttc_data$line)
[1] "SRT" "YU" "BD"
[4] "SHP" "YU/BD" "YU / BD"
[7] "999" NA "29 DUFFERIN"
[10] "95 YORK MILLS" "35 JANE" "YU-BD"
[13] "BLOOR - DANFORTH" "YU/BD LINE" "YUS"
[16] "YUS/BD" "40 JUNCTION-DUNDAS WES" "71 RUNNYMEDE"
[19] "BD/YU" "102 MARKHAM ROAD" "YUS/DB"
[22] "YU & BD" "SHEP"
table(all_2020_ttc_data$day)
Friday Monday Saturday Sunday Thursday Tuesday Wednesday
2174 2222 1867 1647 2353 2190 2329
table(all_2020_ttc_data$line)
102 MARKHAM ROAD 29 DUFFERIN 35 JANE
1 1 1
40 JUNCTION-DUNDAS WES 71 RUNNYMEDE 95 YORK MILLS
1 1 1
999 BD BD/YU
2 5473 1
BLOOR - DANFORTH SHEP SHP
1 1 619
SRT YU YU / BD
644 7620 22
YU & BD YU-BD YU/BD
2 1 338
YU/BD LINE YUS YUS/BD
1 2 1
YUS/DB
1
It is clear that we likely have issues in terms of the lines. Some of them have a clear re-code, but not all. One option would be to drop them, but we would need to think about whether these errors might be correlated with something that is of interest, because if they were then we may be dropping important information. There is usually no one right answer, because it will usually depend on what we are using the data for. We would note the issue, as we continued with EDA, and then decide later about what to do. For now, we will remove all the lines that are not the ones that we know to be correct.
all_2020_ttc_data <-
  all_2020_ttc_data |>
  filter(line %in% c("BD", "YU", "SHP", "SRT"))
Exploring missing data could be a course in itself, but the presence, or lack, of missing values can haunt an analysis. To get started we could look at known-unknowns, which are the NAs for each variable. And vis_dat() and vis_miss() from visdat (Tierney 2017) can be useful to get a feel for how the missing values are distributed.
library(visdat)
all_2020_ttc_data |>
  summarise_all(list(~ sum(is.na(.))))
# A tibble: 1 × 10
date time day station code min_delay min_gap bound line vehicle
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 0 0 0 0 0 0 0 3272 0 0
vis_dat(x = all_2020_ttc_data,
        palette = "cb_safe")
vis_miss(all_2020_ttc_data)
In this case we have many missing values in 'bound'. For these known-unknowns, we are interested in whether they are missing at random. We want to, ideally, show that data happened to just drop out. This is unlikely, and so we are usually trying to look at what is systematic about how our data are missing.
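One quick, rough check, added here as a sketch of our own rather than part of the original walk-through, is to see whether the missingness in 'bound' varies across lines:
# Does the proportion of missing 'bound' values differ by line?
all_2020_ttc_data |>
  mutate(bound_is_missing = is.na(bound)) |>
  count(line, bound_is_missing)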
Sometimes data happen to be duplicated. If we did not notice this then our analysis would be wrong in ways that we would not be able to consistently expect. There are a variety of ways to look for duplicated rows, but get_dupes() from janitor (Firke 2020) is especially useful.
get_dupes(all_2020_ttc_data)
No variable names specified - using all columns.
# A tibble: 37 × 11
date time day station code min_delay min_gap bound line
<dttm> <time> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 2020-02-10 00:00:00 06:00 Monday TORONT… MRO 0 0 <NA> SRT
2 2020-02-10 00:00:00 06:00 Monday TORONT… MRO 0 0 <NA> SRT
3 2020-02-10 00:00:00 06:00 Monday TORONT… MUO 0 0 <NA> SHP
4 2020-02-10 00:00:00 06:00 Monday TORONT… MUO 0 0 <NA> SHP
5 2020-03-10 00:00:00 23:00 Tuesd… YORK M… MUO 0 0 <NA> YU
6 2020-03-10 00:00:00 23:00 Tuesd… YORK M… MUO 0 0 <NA> YU
7 2020-03-26 00:00:00 13:20 Thurs… VAUGHA… MUNOA 3 6 S YU
8 2020-03-26 00:00:00 13:20 Thurs… VAUGHA… MUNOA 3 6 S YU
9 2020-03-26 00:00:00 18:32 Thurs… VAUGHA… MUNOA 3 6 S YU
10 2020-03-26 00:00:00 18:32 Thurs… VAUGHA… MUNOA 3 6 S YU
# … with 27 more rows, and 2 more variables: vehicle <dbl>, dupe_count <int>
This dataset has many duplicates. Again, we are interested in whether there is something systematic going on. Remembering that during EDA we are trying to quickly come to terms with a dataset, one way forward is to flag this as an issue to come back to and explore later, and to just remove duplicates for now using distinct().
all_2020_ttc_data <-
  all_2020_ttc_data |>
  distinct()
The station names are a mess. We could try to quickly bring a little order to the chaos by just taking the first word (or the first two words if it starts with 'ST').
all_2020_ttc_data <-
  all_2020_ttc_data |>
  mutate(station_clean = if_else(str_starts(station, "ST"),
                                 word(station, 1, 2),
                                 word(station, 1)))
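As a quick check of our own that this cleaning actually reduced the mess, we can count the cleaned station names and see whether the most common ones look sensible:
# How well did the cleaning work? Count the cleaned station names.
all_2020_ttc_data |>
  count(station_clean, sort = TRUE)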
13.2.2 Visualizing data
We need to see the data in its original state to understand it and we use bar charts, scatterplots, line plots and histograms extensively for this. During EDA we are not as concerned with whether the graph is aesthetically pleasing, but are instead trying to acquire a sense of the data as quickly as possible. We can start by looking at the distribution of ‘min_delay’, which is one outcome of interest.
all_2020_ttc_data |>
  ggplot(aes(x = min_delay)) +
  geom_bar()
The largely empty graph suggests the presence of outliers. There are a variety of ways to try to understand what could be going on, but one quick way to proceed is to use a log scale, remembering that we would expect values of 0 to drop away.
all_2020_ttc_data |>
  ggplot(aes(x = min_delay)) +
  geom_bar() +
  scale_x_log10()
This initial exploration further hints at an issue that we might like to explore further. We will join this dataset with ‘delay_codes’ to understand what is going on.
all_2020_ttc_data <-
  all_2020_ttc_data |>
  left_join(
    delay_codes |>
      rename(code = sub_rmenu_code,
             code_desc = code_description_3) |>
      select(code, code_desc),
    by = "code"
  )

all_2020_ttc_data <-
  all_2020_ttc_data |>
  mutate(code_srt = ifelse(line == "SRT", code, "NA")) |>
  left_join(
    delay_codes |>
      rename(code_srt = sub_rmenu_code,
             code_desc_srt = code_description_7) |>
      select(code_srt, code_desc_srt),
    by = "code_srt"
  ) |>
  mutate(
    code = ifelse(code_srt == "NA", code, code_srt),
    code_desc = ifelse(is.na(code_desc_srt), code_desc, code_desc_srt)
  ) |>
  select(-code_srt, -code_desc_srt)
And so we can see that the 450-minute delay was due to 'Transit Control Related Problems' and the 446-minute delay was due to 'Miscellaneous Other'; they seem to be outliers, even among the outliers.
all_2020_ttc_data |>
  left_join(delay_codes |>
              rename(code = sub_rmenu_code,
                     code_desc = code_description_3) |>
              select(code, code_desc)) |>
  arrange(-min_delay) |>
  select(date, time, station, line, min_delay, code, code_desc)
Joining, by = c("code", "code_desc")
# A tibble: 14,335 × 7
date time station line min_delay code code_desc
<dttm> <time> <chr> <chr> <dbl> <chr> <chr>
1 2020-02-13 00:00:00 05:30 ST GEORGE YUS STA… YU 450 TUCC Transit …
2 2020-05-08 00:00:00 16:16 ST CLAIR STATION YU 446 MUO Miscella…
3 2020-01-22 00:00:00 05:57 KEELE STATION BD 258 EUTR Trucks
4 2020-03-19 00:00:00 11:26 ROYAL YORK STATION BD 221 MUPR1 Priority…
5 2020-11-12 00:00:00 23:10 SHEPPARD STATION YU 197 PUCSC Signal C…
6 2020-12-13 00:00:00 21:37 MCCOWAN STATION SRT 167 PRSP <NA>
7 2020-12-04 00:00:00 16:23 MCCOWAN STATION SRT 165 PRSP <NA>
8 2020-01-18 00:00:00 05:48 SCARBOROUGH RAPID… SRT 162 PRSL <NA>
9 2020-02-22 00:00:00 05:16 SPADINA TO OSGOODE YU 159 PUSWZ Work Zon…
10 2020-09-03 00:00:00 14:35 CASTLE FRANK STAT… BD 150 MUPR1 Priority…
# … with 14,325 more rows
13.2.3 Groups of small counts
Another thing that we are looking for is various groupings of the data, especially where sub-groups may end up with only a small number of observations in them. This is because any analysis could be especially influenced by them. One quick way to do this is to group the data by a variable that is of interest, for instance 'line', using color.
ggplot(data = all_2020_ttc_data) +
  geom_histogram(aes(x = min_delay, y = ..density.., fill = line),
                 position = 'dodge',
                 bins = 10) +
  scale_x_log10()
Figure 13.1 uses density so that we can look at the distributions more comparably, but we should also be aware of differences in frequency (Figure 13.2). In this case, we can see that 'SHP' and 'SRT' have much smaller counts.
ggplot(data = all_2020_ttc_data) +
  geom_histogram(aes(x = min_delay, fill = line),
                 position = 'dodge',
                 bins = 10) +
  scale_x_log10()
To group by another variable we can add facets (Figure 13.3).
ggplot(data = all_2020_ttc_data) +
  geom_density(aes(x = min_delay, color = line),
               bw = .08) +
  scale_x_log10() +
  facet_wrap(vars(day))
We can now plot the top five stations by mean delay.
all_2020_ttc_data |>
  group_by(line, station_clean) |>
  summarise(mean_delay = mean(min_delay), n_obs = n()) |>
  filter(n_obs > 1) |>
  arrange(line, -mean_delay) |>
  slice(1:5) |>
  ggplot(aes(station_clean, mean_delay)) +
  geom_col() +
  coord_flip() +
  facet_wrap(vars(line),
             scales = "free_y")
`summarise()` has grouped output by 'line'. You can override using the
`.groups` argument.
13.2.4 Dates
Dates are often difficult to work with because they are so prone to having issues. For this reason, it is especially important to consider them during EDA. We could create a graph by week, to see if there is any seasonality. When using dates, lubridate (Grolemund and Wickham 2011) is especially useful. For instance, we can look at the average delay, of those that were delayed, by week, drawing on week() to construct the weeks (Figure 13.4).
library(lubridate)
all_2020_ttc_data |>
  filter(min_delay > 0) |>
  mutate(week = week(date)) |>
  group_by(week, line) |>
  summarise(mean_delay = mean(min_delay)) |>
  ggplot(aes(week, mean_delay, color = line)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(vars(line),
             scales = "free_y")
Now let us look at the proportion of delays that were greater than 10 minutes (Figure 13.5).
all_2020_ttc_data |>
  mutate(week = week(date)) |>
  group_by(week, line) |>
  summarise(prop_delay = sum(min_delay > 10) / n()) |>
  ggplot(aes(week, prop_delay, color = line)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(vars(line),
             scales = "free_y")
These figures, tables, and analysis have no place in a final paper. Instead, they allow us to become comfortable with the data. We note aspects about each that stand out, as well as the warnings and any implications or aspects to return to.
13.2.5 Relationships
We are also interested in looking at the relationship between two variables. Scatter plots are especially useful here for continuous variables, and are a good precursor to modeling. For instance, we may be interested in the relationship between the delay and the gap (Figure 13.6).
all_2020_ttc_data |>
  ggplot(aes(x = min_delay, y = min_gap)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()
The relationship between categorical variables takes more work, but we could also, for instance, look at the top five reasons for delay by line. In particular, we may be interested in whether they differ, and how any difference could be modeled (Figure 13.7).
all_2020_ttc_data |>
  group_by(line, code_desc) |>
  summarise(mean_delay = mean(min_delay)) |>
  arrange(-mean_delay) |>
  slice(1:5) |>
  ggplot(aes(x = code_desc,
             y = mean_delay)) +
  geom_col() +
  facet_wrap(vars(line),
             scales = "free_y",
             nrow = 4) +
  coord_flip()
Principal components analysis (PCA) is another powerful exploratory tool. It allows us to pick up potential clusters and outliers that can help to inform modeling. To see this, we can look at the types of delay by station. The delay categories are messy and there are a lot of them, but as we are trying to come to terms with the dataset, we will just take the first word.
all_2020_ttc_data <-
  all_2020_ttc_data |>
  mutate(code_red = case_when(
    str_starts(code_desc, "No") ~ word(code_desc, 1, 2),
    str_starts(code_desc, "Operator") ~ word(code_desc, 1, 2),
    TRUE ~ word(code_desc, 1)
  ))
Let us also just restrict the analysis to causes that happen at least 50 times over 2020. To do the PCA, the dataframe also needs to be switched to wide format.
dwide <-
  all_2020_ttc_data |>
  group_by(line, station_clean) |>
  mutate(n_obs = n()) |>
  filter(n_obs > 1) |>
  group_by(code_red) |>
  mutate(tot_delay = n()) |>
  arrange(tot_delay) |>
  filter(tot_delay > 50) |>
  group_by(line, station_clean, code_red) |>
  summarise(n_delay = n()) |>
  pivot_wider(names_from = code_red, values_from = n_delay) |>
  mutate_all(.funs = funs(ifelse(is.na(.), 0, .)))
`summarise()` has grouped output by 'line', 'station_clean'. You can override
using the `.groups` argument.
`mutate_all()` ignored the following grouping variables:
Warning: `funs()` was deprecated in dplyr 0.8.0.
Please use a list of either functions or lambdas:
# Simple named list:
list(mean = mean, median = median)
# Auto named with `tibble::lst()`:
tibble::lst(mean, median)
# Using lambdas
list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
Now we can quickly do some PCA.
delay_pca <- prcomp(dwide[, 3:ncol(dwide)])

df_out <- as_tibble(delay_pca$x)
df_out <- bind_cols(dwide |> select(line, station_clean), df_out)
head(df_out)
# A tibble: 6 × 32
# Groups: line, station_clean [6]
line station_clean PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BD BATHURST 10.9 17.2 2.04 12.9 4.60 -2.98 2.31 5.02
2 BD BAY 14.6 6.54 4.76 14.4 0.406 3.09 -0.144 4.68
3 BD BLOOR 22.8 -18.6 19.7 -7.37 -1.54 -8.60 -1.36 1.08
4 BD BLOOR-DANFORTH 23.4 -20.2 20.4 -4.85 -0.429 -6.77 -0.562 1.19
5 BD BROADVIEW 9.29 22.0 -0.0365 6.72 4.31 1.73 0.304 4.71
6 BD CASTLE 15.1 5.21 7.62 11.6 -1.17 2.77 -1.71 4.93
# … with 22 more variables: PC9 <dbl>, PC10 <dbl>, PC11 <dbl>, PC12 <dbl>,
# PC13 <dbl>, PC14 <dbl>, PC15 <dbl>, PC16 <dbl>, PC17 <dbl>, PC18 <dbl>,
# PC19 <dbl>, PC20 <dbl>, PC21 <dbl>, PC22 <dbl>, PC23 <dbl>, PC24 <dbl>,
# PC25 <dbl>, PC26 <dbl>, PC27 <dbl>, PC28 <dbl>, PC29 <dbl>, PC30 <dbl>
We can plot the first two principal components, and add labels to some of the outlying stations.
library(ggrepel)
ggplot(df_out, aes(x = PC1, y = PC2, color = line)) +
  geom_point() +
  geom_text_repel(data = df_out |> filter(PC2 > 100 | PC1 < 100 * -1),
                  aes(label = station_clean))
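Before reading too much into the first two components, it is worth checking how much of the variance they account for. This is a small addition of our own; summary() on a prcomp object reports the proportion of variance explained by each component.
# Proportion of variance explained by each principal component
summary(delay_pca)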
We could also plot the factor loadings. We see some evidence that perhaps one is to do with the public, compared with another to do with the operator.
df_out_r <- as_tibble(delay_pca$rotation)
df_out_r$feature <- colnames(dwide[, 3:ncol(dwide)])

df_out_r
# A tibble: 30 × 31
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.0279 0.125 -0.0576 0.0679 -0.0133 0.0307 -0.00449 0.134 -0.0472
2 -0.108 0.343 -0.150 0.205 -0.100 0.317 -0.116 0.252 -0.407
3 -0.0196 0.0604 -0.0207 0.0195 0.0189 0.0574 -0.0803 -0.0467 -0.146
4 -0.0244 0.0752 -0.0325 -0.0237 0.00121 -0.0263 -0.0137 0.0251 0.104
5 -0.0128 0.0113 -0.00340 -0.00977 -0.0255 0.0186 -0.0645 -0.0552 -0.0302
6 -0.231 0.650 -0.309 0.245 0.222 -0.309 0.172 -0.0711 0.363
7 -0.0871 0.233 -0.0904 -0.692 -0.311 -0.414 -0.216 0.00703 -0.116
8 -0.00377 0.0193 -0.00201 -0.0140 -0.0424 0.0751 -0.146 -0.0712 -0.0203
9 -0.0167 0.120 -0.0367 -0.578 0.336 0.563 0.207 0.226 0.289
10 -0.0708 0.276 -0.118 0.116 -0.368 0.435 -0.178 0.0583 -0.0173
# … with 20 more rows, and 22 more variables: PC10 <dbl>, PC11 <dbl>,
# PC12 <dbl>, PC13 <dbl>, PC14 <dbl>, PC15 <dbl>, PC16 <dbl>, PC17 <dbl>,
# PC18 <dbl>, PC19 <dbl>, PC20 <dbl>, PC21 <dbl>, PC22 <dbl>, PC23 <dbl>,
# PC24 <dbl>, PC25 <dbl>, PC26 <dbl>, PC27 <dbl>, PC28 <dbl>, PC29 <dbl>,
# PC30 <dbl>, feature <chr>
ggplot(df_out_r, aes(x = PC1, y = PC2, label = feature)) +
  geom_text_repel()
Warning: ggrepel: 22 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
13.3 Case study: Airbnb listings in Toronto
In this case study we look at Airbnb listings in Toronto. The dataset is from Inside Airbnb (Cox 2021) and we will read it from their website, and then save a local copy. We can give read_csv() a link to where the dataset is and it will download it. This helps with reproducibility because the source is clear. But, as that link could change at any time, longer-term reproducibility, as well as wanting to minimize the effect on the Inside Airbnb servers, suggests that we should also save a local copy of the data and then use that. As the original data are not ours, we should not make that public without first getting written permission. The 'guess_max' option in read_csv() helps us avoid having to specify the column types. Usually read_csv() takes a best guess at the column types based on the first few rows. But sometimes those first ones are misleading and so 'guess_max' forces it to look at a larger number of rows to try to work out what is going on.
library(tidyverse)
airbnb_data <-
  read_csv("http://data.insideairbnb.com/canada/on/toronto/2021-01-02/data/listings.csv.gz",
           guess_max = 20000)

write_csv(airbnb_data, "airbnb_data.csv")

airbnb_data
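In later sessions, to avoid downloading the data again and putting load on the Inside Airbnb servers, we could read the local copy created by the write_csv() call above. This is a sketch of our own; the file name simply echoes that call.
# Read the saved local copy rather than downloading again
airbnb_data <- read_csv("airbnb_data.csv", guess_max = 20000)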
13.3.1 Explore individual variables
There are a large number of columns, so we will just select a few.
names(airbnb_data) |>
length()
[1] 74
airbnb_data_selected <-
  airbnb_data |>
  select(host_id,
         host_since,
         host_response_time,
         host_is_superhost,
         host_listings_count,
         host_total_listings_count,
         host_neighbourhood,
         host_listings_count,
         neighbourhood_cleansed,
         room_type,
         bathrooms,
         bedrooms,
         price,
         number_of_reviews,
         has_availability,
         review_scores_rating,
         review_scores_accuracy,
         review_scores_cleanliness,
         review_scores_checkin,
         review_scores_communication,
         review_scores_location,
         review_scores_value)

airbnb_data_selected
# A tibble: 18,265 × 21
host_id host_since host_response_time host_is_superhost host_listings_count
<dbl> <date> <chr> <lgl> <dbl>
1 1565 2008-08-08 N/A FALSE 1
2 22795 2009-06-22 N/A FALSE 2
3 48239 2009-10-25 N/A FALSE 1
4 93825 2010-03-15 N/A FALSE 2
5 118124 2010-05-04 within a day FALSE 1
6 22795 2009-06-22 N/A FALSE 2
7 174063 2010-07-20 within an hour TRUE 3
8 183071 2010-07-28 within an hour TRUE 2
9 187320 2010-08-01 within a few hours TRUE 13
10 192364 2010-08-05 N/A FALSE 1
# … with 18,255 more rows, and 16 more variables:
# host_total_listings_count <dbl>, host_neighbourhood <chr>,
# neighbourhood_cleansed <chr>, room_type <chr>, bathrooms <lgl>,
# bedrooms <dbl>, price <chr>, number_of_reviews <dbl>,
# has_availability <lgl>, review_scores_rating <dbl>,
# review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>,
# review_scores_checkin <dbl>, review_scores_communication <dbl>, …
First we might be interested in price. It is a character at the moment and so we need to convert it to a numeric. This is a common problem, and we need to be a little careful that it does not all just convert to NAs. In our case if we just force the price data to be a numeric then it will go to NA because there are a lot of characters where it is unclear what the numeric equivalent is, such as ‘$’. So we need to remove those characters first.
airbnb_data_selected$price |> head()
[1] "$469.00" "$96.00" "$64.00" "$70.00" "$45.00" "$127.00"
airbnb_data_selected$price |> str_split("") |> unlist() |> unique()
[1] "$" "4" "6" "9" "." "0" "7" "5" "1" "2" "3" "8" ","
airbnb_data_selected |>
  select(price) |>
  filter(str_detect(price, ","))
# A tibble: 145 × 1
price
<chr>
1 $1,724.00
2 $1,000.00
3 $1,100.00
4 $1,450.00
5 $1,019.00
6 $1,000.00
7 $1,300.00
8 $2,142.00
9 $2,000.00
10 $1,200.00
# … with 135 more rows
airbnb_data_selected <-
  airbnb_data_selected |>
  mutate(price = str_remove(price, "\\$"),
         price = str_remove(price, ","),
         price = as.integer(price))
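Because forcing characters to numbers can silently produce NAs, a quick sanity check of our own is to count missing prices after the conversion and to glance at the range of values:
# Confirm the conversion did not introduce missing values
sum(is.na(airbnb_data_selected$price))

# And check that the values look like plausible nightly prices
range(airbnb_data_selected$price, na.rm = TRUE)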
Now we can have a look at the distribution of prices (Figure 13.8).
airbnb_data_selected |>
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 10) +
  theme_classic() +
  labs(x = "Price per night",
       y = "Number of properties")
It is clear that there are outliers, so again we might like to consider it on the log scale (Figure 13.9).
airbnb_data_selected |>
  filter(price > 1000) |>
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 10) +
  theme_classic() +
  labs(x = "Price per night",
       y = "Number of properties") +
  scale_y_log10()
If we focus on prices that are less than $1,000 then we see that the majority of properties have a nightly price less than $250. Interestingly it looks like there is some bunching of prices, possibly around numbers ending in zero or nine. Let us just zoom in on prices between $90 and $210, out of interest, but change the bins to be smaller (Figure 13.10 and Figure 13.11).
airbnb_data_selected |>
  filter(price < 1000) |>
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 10) +
  theme_classic() +
  labs(x = "Price per night",
       y = "Number of properties")
airbnb_data_selected |>
  filter(price > 90) |>
  filter(price < 210) |>
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 1) +
  theme_classic() +
  labs(x = "Price per night",
       y = "Number of properties")
For now, we will just remove all prices that are more than $999.
airbnb_data_selected <-
  airbnb_data_selected |>
  filter(price < 1000)
Superhosts are especially experienced Airbnb hosts, and we might be interested to learn more about them. For instance, a host either is or is not a superhost, and so we would not expect any NAs. But we can see that there are. It might be that the host removed a listing or similar. For now, we will remove them.
airbnb_data_selected |>
  filter(is.na(host_is_superhost))
# A tibble: 11 × 21
host_id host_since host_response_time host_is_superhost host_listings_count
<dbl> <date> <chr> <lgl> <dbl>
1 23472830 NA <NA> NA NA
2 31675651 NA <NA> NA NA
3 75779190 NA <NA> NA NA
4 47614482 NA <NA> NA NA
5 201103629 NA <NA> NA NA
6 111820893 NA <NA> NA NA
7 23472830 NA <NA> NA NA
8 196269219 NA <NA> NA NA
9 23472830 NA <NA> NA NA
10 266594170 NA <NA> NA NA
11 118516038 NA <NA> NA NA
# … with 16 more variables: host_total_listings_count <dbl>,
# host_neighbourhood <chr>, neighbourhood_cleansed <chr>, room_type <chr>,
# bathrooms <lgl>, bedrooms <dbl>, price <int>, number_of_reviews <dbl>,
# has_availability <lgl>, review_scores_rating <dbl>,
# review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>,
# review_scores_checkin <dbl>, review_scores_communication <dbl>,
# review_scores_location <dbl>, review_scores_value <dbl>
We will also want to create a binary variable from this. It is true/false at the moment, which is fine for the modelling, but there are a handful of situations where it will be easier if we have a 0/1. And for now we will just remove anyone with an NA for whether they are a super host.
airbnb_data_selected <-
  airbnb_data_selected |>
  filter(!is.na(host_is_superhost)) |>
  mutate(host_is_superhost_binary = if_else(host_is_superhost == TRUE, 1, 0))
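As a small sanity check of our own, we can confirm that the new 0/1 variable lines up with the original logical one:
# The binary variable should agree with the logical variable
airbnb_data_selected |>
  count(host_is_superhost, host_is_superhost_binary)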
On Airbnb, guests can give 1-5 star ratings across a variety of different aspects, including cleanliness, accuracy, value, and others. But when we look at the reviews in our dataset, it is less clear how this is being constructed, because it appears to be a rating between 20 and 100 and there are also NA values (Figure 13.12).
airbnb_data_selected |>
  ggplot(aes(x = review_scores_rating)) +
  geom_bar() +
  theme_classic() +
  labs(x = "Reviews scores rating",
       y = "Number of properties")
We would like to deal with the NAs in ‘review_scores_rating’, but this is more complicated as there are a lot of them. It may be that this is just because they do not have any reviews.
airbnb_data_selected |>
  filter(is.na(review_scores_rating)) |>
  nrow()
[1] 4308
airbnb_data_selected |>
  filter(is.na(review_scores_rating)) |>
  select(number_of_reviews) |>
  table()
0 1 2 3 4
4046 226 23 10 3
So these properties do not have a review rating yet because they do not have enough reviews. It is a large proportion of the total, at almost a fifth of them, so we might like to look at this in more detail using vis_miss() from visdat (Tierney 2017). We are interested to see whether there is something systematic happening with these properties. For instance, if the NAs were being driven by, say, some requirement of a minimum number of reviews, then we would expect they would all be missing.
library(visdat)
airbnb_data_selected |>
  select(review_scores_rating,
         review_scores_accuracy,
         review_scores_cleanliness,
         review_scores_checkin,
         review_scores_communication,
         review_scores_location,
         review_scores_value) |>
  vis_miss()
It looks convincing that, in almost all cases, the different types of reviews are missing for the same observations. One approach would be to just focus on those that are not missing and on the main review score. It is clear that almost all the reviews are more than 80. Let us just zoom in on the 60 to 80 range to check what the distribution looks like there (Figure 13.13).
airbnb_data_selected |>
  filter(!is.na(review_scores_rating)) |>
  filter(review_scores_rating > 60) |>
  filter(review_scores_rating < 80) |>
  ggplot(aes(x = review_scores_rating)) +
  geom_histogram(binwidth = 1) +
  theme_classic() +
  labs(x = "Average review score",
       y = "Number of properties")
For now, we will remove anyone with an NA in their main review score, even though this will remove roughly 20 per cent of observations. And we will also focus only on those hosts with a main review score of at least 70. If we ended up using this dataset for actual analysis, then we would want to justify this decision in an appendix or similar.
airbnb_data_selected <-
  airbnb_data_selected |>
  filter(!is.na(review_scores_rating)) |>
  filter(review_scores_rating >= 70)
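To keep track of how much data these decisions cost us, a quick check, added here as our own habit rather than part of the original text, is to look at how many observations remain:
# How many observations remain after these filters?
nrow(airbnb_data_selected)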
Another important factor is how quickly a host responds to an enquiry. Airbnb allows hosts up to 24 hours to respond, but encourages responses within an hour.
airbnb_data_selected |>
  count(host_response_time)
# A tibble: 5 × 2
host_response_time n
<chr> <int>
1 a few days or more 494
2 N/A 5708
3 within a day 952
4 within a few hours 1649
5 within an hour 4668
It is unclear how a host could have a response time of NA. It may be that this is related to some other variable. Interestingly, it seems like what look like 'NAs' in the 'host_response_time' variable are not coded as proper NAs, but are instead being treated as another category. We will recode them to be actual NAs and also change the variable to be a factor.
airbnb_data_selected <-
  airbnb_data_selected |>
  mutate(host_response_time = if_else(host_response_time == "N/A",
                                      NA_character_,
                                      host_response_time),
         host_response_time = factor(host_response_time))
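A quick check of our own that the recode behaved as intended is to look at the resulting factor levels; 'N/A' should no longer appear as a category, and true missing values should now be NA:
# "N/A" should no longer be a level of the factor
levels(airbnb_data_selected$host_response_time)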
There is clearly an issue with NAs as there are a lot of them. For instance, we might be interested to see if there is a relationship with the review score (Figure 13.14). There are a lot that have an overall review of 100.
airbnb_data_selected |>
  filter(is.na(host_response_time)) |>
  ggplot(aes(x = review_scores_rating)) +
  geom_histogram(binwidth = 1) +
  theme_classic() +
  labs(x = "Average review score",
       y = "Number of properties")
For now, we will remove anyone with an NA in their response time. This will again remove roughly another 20 per cent of the observations.
airbnb_data_selected <-
  airbnb_data_selected |>
  filter(!is.na(host_response_time))
There are two versions of a variable that suggest how many properties a host has on Airbnb. We might be interested to know whether there is a difference between them.
airbnb_data_selected |>
  mutate(listings_count_is_same =
           if_else(host_listings_count == host_total_listings_count, 1, 0)) |>
  filter(listings_count_is_same == 0)
# A tibble: 0 × 23
# … with 23 variables: host_id <dbl>, host_since <date>,
# host_response_time <fct>, host_is_superhost <lgl>,
# host_listings_count <dbl>, host_total_listings_count <dbl>,
# host_neighbourhood <chr>, neighbourhood_cleansed <chr>, room_type <chr>,
# bathrooms <lgl>, bedrooms <dbl>, price <int>, number_of_reviews <dbl>,
# has_availability <lgl>, review_scores_rating <dbl>,
# review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>, …
As there are no differences in this dataset, we can just remove one variable for now and have a look at the other one (Figure 13.15).
airbnb_data_selected <-
  airbnb_data_selected |>
  select(-host_listings_count)

airbnb_data_selected |>
  ggplot(aes(x = host_total_listings_count)) +
  geom_bar() +
  scale_x_log10()
There are a large number of hosts with somewhere in the 2-10 properties range, but there is the usual long tail. The number with zero listings is unexpected and worth following up on. And there are a bunch with NA that we will need to deal with.
airbnb_data_selected |>
  filter(host_total_listings_count == 0) |>
  head()
# A tibble: 6 × 21
host_id host_since host_response_time host_is_superhost host_total_listings_c…
<dbl> <date> <fct> <lgl> <dbl>
1 3783106 2012-10-06 within an hour FALSE 0
2 3814089 2012-10-09 within an hour FALSE 0
3 3827668 2012-10-10 within a day FALSE 0
4 2499198 2012-05-30 within a day FALSE 0
5 3268493 2012-08-15 within a day FALSE 0
6 8675040 2013-09-06 within an hour TRUE 0
# … with 16 more variables: host_neighbourhood <chr>,
# neighbourhood_cleansed <chr>, room_type <chr>, bathrooms <lgl>,
# bedrooms <dbl>, price <int>, number_of_reviews <dbl>,
# has_availability <lgl>, review_scores_rating <dbl>,
# review_scores_accuracy <dbl>, review_scores_cleanliness <dbl>,
# review_scores_checkin <dbl>, review_scores_communication <dbl>,
# review_scores_location <dbl>, review_scores_value <dbl>, …
There is nothing that immediately jumps out as odd about the people with zero listings, but there must be something going on. For now, we will focus on only those with one property.
airbnb_data_selected <-
  airbnb_data_selected |>
  add_count(host_id) |>
  filter(n == 1) |>
  select(-n)
13.3.2 Explore relationships between variables
We might like to make some graphs to see if there are any relationships that become clear. Some aspects that come to mind are prices, reviews, and superhost status, as well as the number of properties and neighborhood.
We can look at the relationship between price and reviews, and whether the host is a superhost (Figure 13.16).
airbnb_data_selected |>
  ggplot(aes(x = price, y = review_scores_rating, color = host_is_superhost)) +
  geom_point(size = 1, alpha = 0.1) +
  theme_classic() +
  labs(x = "Price per night",
       y = "Average review score",
       color = "Super host") +
  scale_color_brewer(palette = "Set1")
One of the aspects that may make someone a super host is how quickly they respond to inquiries. One could imagine that being a superhost involves quickly saying yes or no to inquiries. Let us look at the data. First, we want to look at the possible values of superhost by their response times.
airbnb_data_selected |>
  count(host_is_superhost) |>
  mutate(proportion = n / sum(n),
         proportion = round(proportion, digits = 2))
# A tibble: 2 × 3
host_is_superhost n proportion
<lgl> <int> <dbl>
1 FALSE 1677 0.58
2 TRUE 1234 0.42
Fortunately, it looks like when we removed the rows with missing review scores we also removed any NAs from whether they were a super host, but if we go back and change that step we may need to check again. We could build a table that looks at a host's response time by whether they are a superhost using tabyl() from janitor (Firke 2020). It is clear that if a host does not respond within an hour then it is unlikely that they are a super host.
airbnb_data_selected |>
  tabyl(host_response_time, host_is_superhost) |>
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 0) |>
  adorn_ns() |>
  adorn_title()
host_is_superhost
host_response_time FALSE TRUE
a few days or more 90% (223) 10% (26)
within a day 70% (348) 30% (149)
within a few hours 55% (354) 45% (284)
within an hour 49% (752) 51% (775)
Finally, we could look at neighborhood. The data provider has attempted to clean the neighborhood variable for us, so we will just use that variable for now. Although if we ended up using this variable for our actual analysis we would want to check they had not made any errors.
airbnb_data_selected |>
  tabyl(neighbourhood_cleansed) |>
  adorn_totals("row") |>
  adorn_pct_formatting() |>
  nrow()
[1] 140
airbnb_data_selected |>
  tabyl(neighbourhood_cleansed) |>
  adorn_pct_formatting() |>
  arrange(-n) |>
  filter(n > 100) |>
  adorn_totals("row") |>
  head()
neighbourhood_cleansed n percent
Waterfront Communities-The Island 488 16.8%
Niagara 129 4.4%
Annex 102 3.5%
Total 719 -
13.3.3 Explore multiple relationships with a model
We will now run some models on our dataset. We will cover modeling in more detail in Chapter @ref(ijalm), but we can use models during EDA to help get a better sense of relationships that may exist between multiple variables in a dataset. For instance, we may like to see whether we can forecast whether someone is a super host, and the factors that go into explaining that. As the dependent variable is binary, this is a good opportunity to use logistic regression. We expect that being a super host will be associated with faster responses and better reviews. Specifically, the model that we estimate is:
\[\mbox{Prob(Is super host} = 1) = \beta_0 + \beta_1 \mbox{Response time} + \beta_2 \mbox{Reviews} + \epsilon\]
We estimate the model using glm() in the R language (R Core Team 2021).
logistic_reg_superhost_response_review <-
  glm(host_is_superhost ~ host_response_time + review_scores_rating,
      data = airbnb_data_selected,
      family = binomial)
We can have a quick look at the results using modelsummary() from modelsummary (Arel-Bundock 2021).
library(modelsummary)
modelsummary(logistic_reg_superhost_response_review)
|                                      | Model 1   |
|--------------------------------------|-----------|
| (Intercept)                          | −18.931   |
|                                      | (1.273)   |
| host_response_timewithin a day       | 1.334     |
|                                      | (0.235)   |
| host_response_timewithin a few hours | 1.932     |
|                                      | (0.227)   |
| host_response_timewithin an hour     | 2.213     |
|                                      | (0.219)   |
| review_scores_rating                 | 0.173     |
|                                      | (0.013)   |
| Num.Obs.                             | 2911      |
| AIC                                  | 3521.4    |
| BIC                                  | 3551.3    |
| Log.Lik.                             | −1755.698 |
| F                                    | 75.649    |
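Since broom is listed among the key packages for this chapter, we could also look at the same fit with it. This is a sketch of our own rather than part of the original walk-through.
library(broom)

# Coefficient estimates as a tidy tibble
tidy(logistic_reg_superhost_response_review)

# One-row, model-level summary (AIC, deviance, and so on)
glance(logistic_reg_superhost_response_review)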
13.4 Exercises and tutorial
13.4.1 Exercises
- In your own words what is exploratory data analysis (this will be difficult, but please write only one nuanced paragraph)?
- In Tukey’s words, what is exploratory data analysis (please write one paragraph)?
- Who was Tukey (please write a paragraph or two)?
- If you have a dataset called ‘my_data’, which has two columns: ‘first_col’ and ‘second_col’, then could you please write some rough R code that would generate a graph (the type of graph does not matter).
- Consider a dataset that has 500 rows and 3 columns, so there are 1,500 cells. If 100 of the cells are missing data for at least one of the columns, then would you remove the whole row from your dataset, try to run your analysis on the data as is, or follow some other procedure? What if your dataset had 10,000 rows instead, but the same number of missing cells?
- Please note three ways of identifying unusual values.
- What is the difference between a categorical and continuous variable?
- What is the difference between a factor and an integer variable?
- How can we think about who is systematically excluded from a dataset?
- Using the opendatatoronto package, download the data on mayoral campaign contributions for 2014 (note: the 2014 file you will get from get_resource() contains several sheets, so just keep the sheet that relates to the Mayor election).
  - Clean up the data format (fixing the parsing issue and standardizing the column names using janitor).
  - Summarize the variables in the dataset. Are there missing values, and if so, should we be worried about them? Is every variable in the format it should be? If not, create new variable(s) that are in the right format.
  - Visually explore the distribution of values of the contributions. What contributions are notable outliers? Do they share a similar characteristic(s)? It may be useful to plot the distribution of contributions without these outliers to get a better sense of the majority of the data.
  - List the top five candidates in each of these categories: 1) total contributions; 2) mean contribution; and 3) number of contributions.
  - Repeat that process, but without contributions from the candidates themselves.
  - How many contributors gave money to more than one candidate?
- Name three geoms that produce graphs that have bars on them in ggplot().
- Consider a dataset of 10,000 observations and 27 variables. For each observation, there is at least one missing variable. Please discuss, in a paragraph or two, the steps that you would take to understand what is going on.
- Known missing data are those that leave holes in your dataset. But what about data that were never collected? Please look at McClelland (2019) and Luscombe and McClelland (2020). Look into how they gathered their dataset and what it took to put this together. What is in the dataset and why? What is missing and why? How could this affect the results? How might similar biases enter into other datasets that you have used or read about?
13.4.2 Tutorial
Redo the Airbnb EDA but for Paris. Please submit a PDF.