17  Text as data

Required material

Key concepts and skills

Key packages and functions

17.1 Introduction

Text can be considered an unwieldy, but in general similar, version of the datasets that we have used throughout this book. The main difference is that we will typically begin with very wide data, insofar as often each column is a word, or token more generally. Each entry is then often a count. We would then typically transform this into rather long data, with a column of the word and another column of the count.

The larger size of text datasets means that it is especially important to simulate, and start small, when it comes to their analysis. Using text as data is exciting because of the quantity and variety of text that is available to us. In general, dealing with text datasets is messy. There is a lot of cleaning and preparation that is typically required. Often text datasets are large. As such, having a workflow in place, in which you work in a reproducible way, simulating data first, and then clearly communicating your findings becomes critical, if only to keep everything organized in your own mind. Nonetheless, it is an exciting area.

In terms of next steps there are two, related, concerns: data and analysis.

In terms of data there are many places to get large amounts of text data relatively easily, from sources that we have already used, including:

  • Accessing the Twitter API using rtweet (Kearney 2019).
  • Using the Inside Airbnb, which provides text from reviews.
  • Getting the text of out-of-copyright books using gutenbergr (Robinson 2021).
  • And finally, scraping Wikipedia.

For instance text analysis of state-run newspapers in African countries can identify manipulation by governments (Hassan 2022).

In this chapter we first consider preparing text datasets. We then consider logistic and lasso regression. We finally consider topic models.

17.2 TF-IDF

Inspired by Gelfand (2019) and following Amaka and Thomas (2021), we will draw on the dataset they put together of makeup names and descriptions from Sephora and Ulta. We are interested in the counts of each word. We can read in the data using read_csv().


makeup <-
    file =

# A tibble: 3,117 × 9
   brand             product   name  speci…¹ light…² hex   light…³ numbers    id
   <chr>             <chr>     <chr> <chr>     <dbl> <chr> <lgl>     <dbl> <dbl>
 1 Makeup Revolution Conceal … <NA>  F0        0.949 #F2F… TRUE        0       1
 2 HOURGLASS         Veil Flu… Porc… No. 0     0.818 #F6D… TRUE        0       2
 3 TOM FORD          Traceles… Pearl 0.0       0.851 #F0D… TRUE        0       3
 4 Armani Beauty     Neo Nude… <NA>  0         0.912 #F0E… TRUE        0       4
 5 TOM FORD          Traceles… Pearl 0.0       0.912 #FDE… TRUE        0       5
 6 Charlotte Tilbury Magic Fo… <NA>  0         0.731 #D9A… TRUE        0       6
 7 Bobbi Brown       Skin Wei… Porc… 0         0.822 #F3C… TRUE        0       7
 8 Givenchy          Matissim… <NA>  N00       0.831 #F5D… TRUE        0       8
 9 Smashbox          Studio S… <NA>  0.1       0.814 #F8C… TRUE        0.1     9
10 Smashbox          Studio S… <NA>  0.1       0.910 #F9E… TRUE        0.1    10
# … with 3,107 more rows, and abbreviated variable names ¹​specific, ²​lightness,
#   ³​lightToDark
# ℹ Use `print(n = ...)` to see more rows

We will focus on “product”, which provides the name of the item, and “lightness” which is a value between 0 and 1. We are interested in whether products with lightness values that are less than 0.5, typically use different words to those with lightness values that are at least 0.5.

makeup <-
  makeup |>
  select(product, lightness) |>
  mutate(lightness_above_half = if_else(lightness >= 0.5, "Yes", "No"))


  No  Yes 
 702 2415 

In this example we are going to split everything into separate words. When we do this, it is just searching for a space. And so, it will be the case that more than just words will be considered “words”, for instance, numbers. We use unnest_tokens() from tidytext (Silge and Robinson 2016) to do this.


makeup_by_words <-
  makeup |>
    output = word,
    input = product,
    token = "words"

# A tibble: 6 × 3
  lightness lightness_above_half word      
      <dbl> <chr>                <chr>     
1     0.949 Yes                  conceal   
2     0.949 Yes                  define    
3     0.949 Yes                  full      
4     0.949 Yes                  coverage  
5     0.949 Yes                  foundation
6     0.818 Yes                  veil      

We now want to count the number of times each word is used by each of the lightness classifications.

makeup_by_words <-
  makeup_by_words |>
  count(lightness_above_half, word, sort = TRUE)

makeup_by_words |>
  filter(lightness_above_half == "Yes") |>
# A tibble: 5 × 3
  lightness_above_half word           n
  <chr>                <chr>      <int>
1 Yes                  foundation  2214
2 Yes                  skin         452
3 Yes                  spf          443
4 Yes                  matte        422
5 Yes                  powder       327
makeup_by_words |>
  filter(lightness_above_half == "No") |>
# A tibble: 5 × 3
  lightness_above_half word           n
  <chr>                <chr>      <int>
1 No                   foundation   674
2 No                   matte        106
3 No                   spf          103
4 No                   skin          98
5 No                   liquid        89

We can see that the most popular words appear to be similar between the two categories. At this point, we could use the data in a variety of ways. We might be interested to know which words characterize each group—that is to say, which words are commonly used only in each group. We can do that by first looking at a word’s term frequency (tf), which is how many times a word is used in the product name. The issue is that there are a lot of words that are commonly used regardless of context. As such, we may also like to look at the inverse document frequency (idf) in which we “penalize” words that occur in both groups. For instance, we have seen that “foundation” occurs in both products with high and low lightness values. And so, its idf would be lower than another word which only occurred in products that did not have a lightness value above half. The term frequency–inverse document frequency (tf-idf) is then the product of these.

We can create this value using bind_tf_idf() from tidytext. It will create a bunch of new columns, one for each word and star combination.

makeup_by_words_tf_idf <-
  makeup_by_words |>
    term = word,
    document = lightness_above_half,
    n = n
  ) |>

# A tibble: 505 × 6
   lightness_above_half word        n       tf   idf   tf_idf
   <chr>                <chr>   <int>    <dbl> <dbl>    <dbl>
 1 Yes                  cushion    25 0.00187  0.693 0.00130 
 2 Yes                  combo      16 0.00120  0.693 0.000831
 3 Yes                  custom     16 0.00120  0.693 0.000831
 4 Yes                  oily       16 0.00120  0.693 0.000831
 5 Yes                  perfect    16 0.00120  0.693 0.000831
 6 Yes                  refill     16 0.00120  0.693 0.000831
 7 Yes                  50         14 0.00105  0.693 0.000727
 8 Yes                  compact    13 0.000974 0.693 0.000675
 9 Yes                  lifting    12 0.000899 0.693 0.000623
10 Yes                  satte      12 0.000899 0.693 0.000623
# … with 495 more rows
# ℹ Use `print(n = ...)` to see more rows
makeup_by_words_tf_idf |>
  group_by(lightness_above_half) |>
# A tibble: 10 × 6
# Groups:   lightness_above_half [2]
   lightness_above_half word            n       tf   idf   tf_idf
   <chr>                <chr>       <int>    <dbl> <dbl>    <dbl>
 1 No                   able            2 0.000525 0.693 0.000364
 2 No                   concentrate     2 0.000525 0.693 0.000364
 3 No                   marc            2 0.000525 0.693 0.000364
 4 No                   re              2 0.000525 0.693 0.000364
 5 No                   look            1 0.000263 0.693 0.000182
 6 Yes                  cushion        25 0.00187  0.693 0.00130 
 7 Yes                  combo          16 0.00120  0.693 0.000831
 8 Yes                  custom         16 0.00120  0.693 0.000831
 9 Yes                  oily           16 0.00120  0.693 0.000831
10 Yes                  perfect        16 0.00120  0.693 0.000831

17.3 Lasso regression

One of the nice aspects of text is that we can adapt our existing methods to use it as an input. Here we are going to use logistic regression, along with text inputs, to forecast. Inspired by Silge (2018) we are going to have two different text inputs, train a model on a sample of text from each of them, and then try to use that model to forecast the text in a training set. Although this is a arbitrary example, we could imagine many real-world applications. For instance, we may be interested in whether some text was likely written by a bot or a human.

Shoulders of giants

Rob Tibshirani - link to lasso

First we need to get some data. We use books from Project Gutenberg using gutenberg_download() from gutenbergr (Robinson 2021). We will consider Jane Eyre (Brontë 1847) and Alice’s Adventures in Wonderland (Carroll 1865).


# The books that we are interested in have the keys of 1260 and 11, respectively.
alice_and_jane <-
    gutenberg_id = c(1260, 11),
    meta_fields = "title"

write_csv(alice_and_jane, "alice_and_jane.csv")


One of the great things about this is that the dataset is a tibble. So we can work with all our familiar skills. Each line of the book is read in as a different row in the dataset. Notice that we have downloaded two books here at once, and so we added the title. The two books are one after each other.

By looking at the number of lines in each, it looks like Jane Eyre is much longer than Alice in Wonderland. We will start by getting rid of blank lines using remove_empty() from janitor (Firke 2020).


Attaching package: 'janitor'
The following objects are masked from 'package:stats':

    chisq.test, fisher.test
alice_and_jane <-
  alice_and_jane |>
  mutate(blank_line = if_else(text == "", 1, 0)) |>
  filter(blank_line == 0) |>


Alice's Adventures in Wonderland      Jane Eyre: An Autobiography 
                            2481                            16395 

There is still an overwhelming amount of Jane Eyre, compared with Alice in Wonderland, so we will sample from Jane Eyre to make it more equal.


alice_and_jane$rows <- c(1:nrow(alice_and_jane))
sample_from_me <- alice_and_jane |> filter(title == "Jane Eyre: An Autobiography")
keep_me <- sample(x = sample_from_me$rows, size = 2481, replace = FALSE)

alice_and_jane <-
  alice_and_jane |>
  filter(title == "Alice's Adventures in Wonderland" | rows %in% keep_me) |>


Alice's Adventures in Wonderland      Jane Eyre: An Autobiography 
                            2481                             2481 

There are a variety of issues here, for instance, we have the whole of Alice, but we only have random bits of Jane, but nonetheless we will continue and add a counter with the line number for each book.

alice_and_jane <-
  alice_and_jane |>
  group_by(title) |>
  mutate(line_number = paste(gutenberg_id, row_number(), sep = "_")) |>

We now want to unnest the tokes. We will use unnest_tokens() from tidytext (Silge and Robinson 2016).


alice_and_jane_by_word <-
  alice_and_jane |>
  unnest_tokens(word, text) |>
  group_by(word) |>
  filter(n() > 10) |>

We remove any word that was not used more than 10 times. Nonetheless we still have more than 500 unique words. (If we did not require that the word be used by the author at least 10 times then we end up with more than 6,000 words.)

The reason this is relevant is because these are our independent variables. So where we may be used to having something less than 10 explanatory variables, in this case we are going to have 585 As such, we need a model that can handle this.

However, as mentioned before, we are going to have some rows that essentially just had one word. And so we filter for that also, which ensures that the model will have at least some words to work with.

alice_and_jane_by_word <-
  alice_and_jane_by_word |>
  group_by(title, line_number) |>
  mutate(number_of_words_in_line = n()) |>
  ungroup() |>
  filter(number_of_words_in_line > 2) |>

We will create a test/training split, and load in tidymodels.



alice_and_jane_by_word_split <-
  alice_and_jane_by_word |>
  select(title, line_number) |>
  distinct() |>
  initial_split(prop = 3 / 4, strata = title)

Then we can use cast_dtm() to create a document-term matrix. This provides a count of how many times each word is used.

alice_and_jane_dtm_training <-
  alice_and_jane_by_word |>
  count(line_number, word) |>
  inner_join(training(alice_and_jane_by_word_split) |> select(line_number)) |>
  cast_dtm(term = word, document = line_number, value = n)
Joining, by = "line_number"
[1] 3413  585

So we have our independent variables sorted, now we need our binary dependent variable, which is whether the book is Alice in Wonderland or Jane Eyre.

response <-
  data.frame(id = dimnames(alice_and_jane_dtm_training)[[1]]) |>
  separate(id, into = c("book", "line", sep = "_")) |>
  mutate(is_alice = if_else(book == 11, 1, 0))
Warning: Expected 3 pieces. Missing pieces filled with `NA` in 3413 rows [1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
predictor <- alice_and_jane_dtm_training[] |> as.matrix()

Now we can run our model.


model <- cv.glmnet(
  x = predictor,
  y = response$is_alice,
  family = "binomial",
  keep = TRUE

save(model, file = "alice_vs_jane.rda")
Loading required package: Matrix

Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':

    expand, pack, unpack
Loaded glmnet 4.1-4

coefs <- model$glmnet.fit |>
  tidy() |>
  filter(lambda == model$lambda.1se)

coefs |> head()
# A tibble: 6 × 5
  term         step estimate  lambda dev.ratio
  <chr>       <dbl>    <dbl>   <dbl>     <dbl>
1 (Intercept)    36 -0.335   0.00597     0.562
2 in             36 -0.144   0.00597     0.562
3 she            36  0.390   0.00597     0.562
4 so             36  0.00249 0.00597     0.562
5 a              36 -0.117   0.00597     0.562
6 about          36  0.279   0.00597     0.562
coefs |>
  group_by(estimate > 0) |>
  top_n(10, abs(estimate)) |>
  ungroup() |>
  ggplot(aes(fct_reorder(term, estimate), estimate, fill = estimate > 0)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  coord_flip() +
  theme_minimal() +
    x = "Coefficient",
    y = "Word"
  ) +
  scale_fill_brewer(palette = "Set1")

Perhaps unsurprisingly, if a line mentions Alice then it is likely to be a Alice in Wonderland and if it mention Jane then it is likely to be Jane Eyre.

17.4 Topic models

Sometimes we have a statement and we want to know what it is about. Sometimes this will be easy, but we do not always have titles for statements, and even when we do, sometimes we do not have titles that define topics in a well-defined and consistent way. One way to get consistent estimates of the topics of each statement is to use topic models. While there are many variants, one way is to use the latent Dirichlet allocation (LDA) method of Blei, Ng, and Jordan (2003), as implemented by topicmodels (Grün and Hornik 2011).

The key assumption behind the LDA method is that each statement, “a document”, is made by a person who decides the topics they would like to talk about in that document, and then chooses words, “terms”, that are appropriate to those topics. A topic could be thought of as a collection of terms, and a document as a collection of topics. The topics are not specified ex ante; they are an outcome of the method. Terms are not necessarily unique to a particular topic, and a document could be about more than one topic. This provides more flexibility than other approaches such as a strict word count method. The goal is to have the words found in documents group themselves to define topics.

LDA considers each statement to be a result of a process where a person first chooses the topics they want to speak about. After choosing the topics, the person then chooses appropriate words to use for each of those topics. More generally, the LDA topic model works by considering each document as having been generated by some probability distribution over topics. For instance, if there were five topics and two documents, then the first document may be comprised mostly of the first few topics; the other document may be mostly about the final few topics (Figure 17.1).

(a) Distribution for Document 1

(b) Distribution for Document 2

Figure 17.1: Probability distributions over topics

Similarly, each topic could be considered a probability distribution over terms. To choose the terms used in each document the speaker picks terms from each topic in the appropriate proportion. For instance, if there were ten terms, then one topic could be defined by giving more weight to terms related to immigration; and some other topic may give more weight to terms related to the economy (Figure 17.2).

(a) Distribution for Topic 1

(b) Distribution for Topic 2

Figure 17.2: Probability distributions over terms

Following Blei and Lafferty (2009), Blei (2012) and Griffiths and Steyvers (2004), the process by which a document is generated is more formally considered to be:

  1. There are \(1, 2, \dots, k, \dots, K\) topics and the vocabulary consists of \(1, 2, \dots, V\) terms. For each topic, decide the terms that the topic uses by randomly drawing distributions over the terms. The distribution over the terms for the \(k\)th topic is \(\beta_k\). Typically a topic would be a small number of terms and so the Dirichlet distribution with hyperparameter \(0<\eta<1\) is used: \(\beta_k \sim \mbox{Dirichlet}(\eta)\).[^Dirichletfootnote] Strictly, \(\eta\) is actually a vector of hyperparameters, one for each \(K\), but in practice they all tend to be the same value.
  2. Decide the topics that each document will cover by randomly drawing distributions over the \(K\) topics for each of the \(1, 2, \dots, d, \dots, D\) documents. The topic distributions for the \(d\)th document are \(\theta_d\), and \(\theta_{d,k}\) is the topic distribution for topic \(k\) in document \(d\). Again, the Dirichlet distribution with the hyperparameter \(0<\alpha<1\) is used here because usually a document would only cover a handful of topics: \(\theta_d \sim \mbox{Dirichlet}(\alpha)\). Again, strictly \(\alpha\) is vector of length \(K\) of hyperparameters, but in practice each is usually the same value.
  3. If there are \(1, 2, \dots, n, \dots, N\) terms in the \(d\)th document, then to choose the \(n\)th term, \(w_{d, n}\):
    1. Randomly choose a topic for that term \(n\), in that document \(d\), \(z_{d,n}\), from the multinomial distribution over topics in that document, \(z_{d,n} \sim \mbox{Multinomial}(\theta_d)\).
    2. Randomly choose a term from the relevant multinomial distribution over the terms for that topic, \(w_{d,n} \sim \mbox{Multinomial}(\beta_{z_{d,n}})\).

The Dirichlet distribution is a variation of the beta distribution that is commonly used as a prior for categorical and multinomial variables. If there are just two categories, then the Dirichlet and the beta distributions are the same. In the special case of a symmetric Dirichlet distribution, \(\eta=1\), it is equivalent to a uniform distribution. If \(\eta<1\), then the distribution is sparse and concentrated on a smaller number of the values, and this number decreases as \(\eta\) decreases. A hyperparameter is a parameter of a prior distribution.

Given this set-up, the joint distribution for the variables is (Blei (2012), p.6):

\[p(\beta_{1:K}, \theta_{1:D}, z_{1:D, 1:N}, w_{1:D, 1:N}) = \prod^{K}_{i=1}p(\beta_i) \prod^{D}_{d=1}p(\theta_d) \left(\prod^N_{n=1}p(z_{d,n}|\theta_d)p\left(w_{d,n}|\beta_{1:K},z_{d,n}\right) \right).\]

Based on this document generation process the analysis problem, discussed in the next section, is to compute a posterior over \(\beta_{1:K}\) and \(\theta_{1:D}\), given \(w_{1:D, 1:N}\). This is intractable directly, but can be approximated (Griffiths and Steyvers (2004) and Blei (2012)).

After the documents are created, they are all that we have to analyze. The term usage in each document, \(w_{1:D, 1:N}\), is observed, but the topics are hidden, or “latent”. We do not know the topics of each document, nor how terms defined the topics. That is, we do not know the probability distributions of Figures Figure 17.1 or Figure 17.2. In a sense we are trying to reverse the document generation process – we have the terms and we would like to discover the topics.

If the earlier process around how the documents were generated is assumed and we observe the terms in each document, then we can obtain estimates of the topics (Steyvers and Griffiths (2006)). The outcomes of the LDA process are probability distributions and these define the topics. Each term will be given a probability of being a member of a particular topic, and each document will be given a probability of being about a particular topic. That is, we are trying to calculate the posterior distribution of the topics given the terms observed in each document (Blei (2012), p.7):

\[p(\beta_{1:K}, \theta_{1:D}, z_{1:D, 1:N} | w_{1:D, 1:N}) = \frac{p\left(\beta_{1:K}, \theta_{1:D}, z_{1:D, 1:N}, w_{1:D, 1:N}\right)}{p(w_{1:D, 1:N})}.\]

The initial practical step when implementing LDA given a corpus of documents is to remove “stop words”. These are words that are common, but that do not typically help to define topics. There is a general list of stop words such as: “a”; “a’s”; “able”; “about”; “above”… We also remove punctuation and capitalization. The documents need to then be transformed into a document-term-matrix. This is essentially a table with a column of the number of times each term appears in each document.

After the dataset is ready, topicmodels (Grün and Hornik 2011) can be used to implement LDA and approximate the posterior. It does this using Gibbs sampling or the variational expectation-maximization algorithm. Following Steyvers and Griffiths (2006) and Darling (2011), the Gibbs sampling process attempts to find a topic for a particular term in a particular document, given the topics of all other terms for all other documents. Broadly, it does this by first assigning every term in every document to a random topic, specified by Dirichlet priors with \(\alpha = \frac{50}{K}\) and \(\eta = 0.1\) (Steyvers and Griffiths (2006) recommends \(\eta = 0.01\)), where \(\alpha\) refers to the distribution over topics and \(\eta\) refers to the distribution over terms (Grün and Hornik (2011), p.7). It then selects a particular term in a particular document and assigns it to a new topic based on the conditional distribution where the topics for all other terms in all documents are taken as given (Grün and Hornik (2011), p.6):

\[p(z_{d, n}=k | w_{1:D, 1:N}, z'_{d, n}) \propto \frac{\lambda'_{n\rightarrow k}+\eta}{\lambda'_{.\rightarrow k}+V\eta} \frac{\lambda'^{(d)}_{n\rightarrow k}+\alpha}{\lambda'^{(d)}_{-i}+K\alpha} \]

where \(z'_{d, n}\) refers to all other topic assignments; \(\lambda'_{n\rightarrow k}\) is a count of how many other times that term has been assigned to topic \(k\); \(\lambda'_{.\rightarrow k}\) is a count of how many other times that any term has been assigned to topic \(k\); \(\lambda'^{(d)}_{n\rightarrow k}\) is a count of how many other times that term has been assigned to topic \(k\) in that particular document; and \(\lambda'^{(d)}_{-i}\) is a count of how many other times that term has been assigned in that document. Once \(z_{d,n}\) has been estimated, then estimates for the distribution of words into topics and topics into documents can be backed out.

This conditional distribution assigns topics depending on how often a term has been assigned to that topic previously, and how common the topic is in that document (Steyvers and Griffiths (2006)). The initial random allocation of topics means that the results of early passes through the corpus of document are poor, but given enough time the algorithm converges to an appropriate estimate.

The choice of the number of topics, k, affects the results, and must be specified a priori. If there is a strong reason for a particular number, then this can be used. Otherwise, one way to choose an appropriate number is to use a test and training set process. Essentially, this means running the process on a variety of possible values for k and then picking an appropriate value that performs well.

One weakness of the LDA method is that it considers a “bag of words” where the order of those words does not matter (Blei (2012)). It is possible to extend the model to reduce the impact of the bag-of-words assumption and add conditionality to word order. Additionally, alternatives to the Dirichlet distribution can be used to extend the model to allow for correlation. For instance, in Hansard topics related the army may be expected to be more commonly found with topics related to the navy, but less commonly with topics related to banking.

17.5 Exercises and tutorial


  1. (Plan) Consider the following scenario: TBD. Please sketch out what that dataset could look like and then sketch a graph that you could build to show all observations.
  2. (Simulate) Please further consider the scenario described and simulate the situation with every variable independent of each other.
  3. (Acquire) Please describe three possible sources of such a dataset.
  4. (Explore) Please use ggplot2 to build the graph that you sketched.
  5. (Communicate) Please write two paragraphs about what you did.


Consider a situation where we run a news website and are trying to understand whether to allow anonymous comments. We decide to do an A/B test, where we will keep everything the same, but only allow anonymous comments on one version of the site. Please simulate some text data that we may obtain from our test. And then build a model that could test whether there is a difference between the situations.