16  Text as data

Required material

Key concepts and skills

Key packages and functions

16.1 Introduction

Text is all around us. In many cases text is the earliest type of data that we are exposed to. Increases in computational power, the development of new methods, and the enormous availability of text mean that there has been a great deal of interest in using text as data. And text data are an exciting option to have. For instance, text analysis of state-run newspapers in African countries can identify manipulation by governments (Hassan 2022). The analysis of notes in Electronic Health Records (EHR) can improve the efficiency of disease prediction (Gronsbell et al. 2019). And analysis of US congressional records indicates just how much women legislators are interrupted by men (Miller and Sutherland 2022).

Earlier approaches to the analysis of text tend to convert words into numbers, divorced from context. These numbers could then be analyzed using traditional approaches, such as variants of logistic regression. More recent methods try to take advantage of the structure inherent in text, which can bring additional meaning. The difference is perhaps akin to a child who can group similar colors, compared with a child who knows what objects are; although both crocodiles and trees are green, and you can do something with that knowledge, you can do more by knowing that a crocodile could eat you while a tree probably would not.

Text can be considered an unwieldy, but in general similar, version of the datasets that we have used throughout this book. The main difference is that we will typically begin with wide data, in which each variable is a word, or token more generally, and each entry is a count. We would then typically transform this into long data, with one variable of words and another of the counts. Considering text as data naturally requires some abstraction from its context. But it is important that it is not entirely separated, as this can perpetuate historical inequities. For instance, Koenecke et al. (2020) find that automated speech recognition systems perform much worse for Black compared with white speakers, and Davidson, Bhattacharya, and Weber (2019) find that tweets that use Black American English, which is a specifically defined technical term, are classified as hate speech at higher rates than similar tweets in Standard American English, which again is a technical term.

In this chapter we cover a variety of approaches that enable us to consider text as data. One exciting aspect of text data is that it is typically not generated for the purposes of our analysis. The trade-off is that we typically must do a bunch more work to get it into a form that we can work with. And there are a lot of decisions to be made in the data cleaning and preparation stages.

Using text as data is exciting because of the quantity and variety of text that is available to us. But in general, dealing with text datasets is messy: they are often large, and a lot of cleaning and preparation is typically required. This makes it especially important to simulate, and to start small, when it comes to their analysis. Having a workflow in place, in which you work in a reproducible way, simulate data first, and then clearly communicate your findings, becomes critical, if only to keep everything organized in your own mind.

In this chapter we first consider preparing text datasets. We then consider logistic and lasso regression. We finally consider topic models and word embedding.

16.2 Text cleaning and preparation

Text modelling is an exciting area of research. But, and this is true more generally, the cleaning and preparation aspect is at least as difficult as the modelling. We will cover some essentials and provide a foundation that can be built on.

The first step is to get some data. We discussed data gathering in Chapter 7 and mentioned in passing many sources including:

  • Accessing the Twitter API using rtweet (Kearney 2019).
  • Using Inside Airbnb, which provides text from reviews.
  • Getting the text from out-of-copyright books using gutenbergr (Johnston and Robinson 2022).
  • Scraping Wikipedia or other websites.

The workhorse packages that we need for text cleaning and preparation are stringr (Wickham 2022), which is part of the tidyverse (Wickham et al. 2019), and quanteda (Benoit et al. 2018).

For illustrative purposes we construct a corpus of the first sentence, or two, from three books: “Beloved”, “Don Quixote”, and “Jane Eyre”.


library(tidyverse)
library(quanteda)

# Each piece ends with a space, so paste0() joins them into one sentence
don_quixote <-
  paste0(
    "In a village of La Mancha, the name of which I have no desire to ",
    "call to mind, there lived not long since one of those gentlemen ",
    "that keep a lance in the lance-rack, an old buckler, a lean hack, ",
    "and a greyhound for coursing."
  )

beloved <- "124 was spiteful. Full of Baby's venom."

jane_eyre <- "There was no possibility of taking a walk that day."

bookshelf <-
  tibble(
    book = c("Don Quixote", "Beloved", "Jane Eyre"),
    first_sentence = c(don_quixote, beloved, jane_eyre)
  )

bookshelf
# A tibble: 3 × 2
  book        first_sentence                                                    
  <chr>       <chr>                                                             
1 Don Quixote In a village of La Mancha, the name of which I have no desire to …
2 Beloved     124 was spiteful. Full of Baby's venom.                           
3 Jane Eyre   There was no possibility of taking a walk that day.               

We typically want to construct a document-feature matrix, which has one document in each row, one word in each column, and a count for each combination, along with associated metadata. For instance, if our corpus was the text from Airbnb reviews, then each document may be a review, and typical features could include “The”, “Airbnb”, “was”, “great”. Notice here that the sentence has been split into different words. We typically talk of “tokens” to generalize away from words, because of the variety of aspects we may be interested in, but words are commonly used.

books_corpus <-
  corpus(bookshelf, docid_field = "book", text_field = "first_sentence")

books_corpus

Corpus consisting of 3 documents.
Don Quixote :
"In a village of La Mancha, the name of which I have no desir..."

Beloved :
"124 was spiteful. Full of Baby's venom."

Jane Eyre :
"There was no possibility of taking a walk that day."

We use the tokens in the corpus to construct a document-feature matrix using dfm() from quanteda (Benoit et al. 2018).

books_dfm <-
  books_corpus |>
  tokens() |>
  dfm()

books_dfm

Document-feature matrix of: 3 documents, 49 features (60.54% sparse) and 0 docvars.
docs          in a village of la mancha , the name which
  Don Quixote  2 4       1  3  1      1 5   2    1     1
  Beloved      0 0       0  1  0      0 0   0    0     0
  Jane Eyre    0 1       0  1  0      0 0   0    0     0
[ reached max_nfeat ... 39 more features ]

While this is relatively straightforward, there are many decisions that will need to be made as part of this process, which we now consider. There is no definitive right or wrong answer. Instead, we make those decisions based on what we will be using the dataset for.

16.2.1 Stop words

Stop words are words such as “the”, “and”, and “a”. For a long time, such words were thought to not convey much meaning, and there was often a memory or computation constraint. A common step of preparing a text dataset was to remove them. We now know that stop words can have a great deal of meaning (Schofield, Magnusson, and Mimno 2017). The decision to remove them is a nuanced one that depends on circumstances.

We can get a list of stop words using stopwords() from quanteda (Benoit et al. 2018), and then do a crude removal using str_replace_all().

stopwords(source = "snowball")[1:10]
 [1] "i"         "me"        "my"        "myself"    "we"        "our"      
 [7] "ours"      "ourselves" "you"       "your"     
stop_word_list <-
  paste(stopwords(source = "snowball"), collapse = " | ")

bookshelf |>
  mutate(extra_space = str_replace_all(
    string = first_sentence,
    pattern = "([ ,.])", # Spaces and punctuation marks present in text
    replacement = " \\1" # Add a space before any of the detected characters
  )) |>
  mutate(no_stops = str_replace_all(
    # Remove stop words
    string = extra_space,
    pattern = stop_word_list,
    replacement = ""
  )) |>
  mutate(no_stops = str_squish( # Remove extra spaces within string
    string = no_stops
  )) |>
  select(no_stops, first_sentence)
# A tibble: 3 × 2
  no_stops                                                               first…¹
  <chr>                                                                  <chr>  
1 In village La Mancha , name I desire call mind , lived long since one… In a v…
2 124 spiteful . Full Baby's venom .                                     124 wa…
3 There possibility taking walk day .                                    There …
# … with abbreviated variable name ¹​first_sentence

There are many different lists of stop words that have been put together by others. For instance, stopwords() can use lists including: “snowball”, “stopwords-iso”, “smart”, “marimo”, “ancient”, and “nltk”. More generally, if we decide to use stop words then we often need to augment such lists with project-specific words. We can do this by creating a count of individual words in the corpus, and then sorting by the most common and adding those to the stop words list as appropriate.
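As a rough illustration of that counting step, the following sketch uses only base R on two of the first sentences from above. The names `first_sentences`, `words`, and `word_counts` are ours, and the simple splitting rule is an assumption made for illustration rather than the tokenizer that quanteda uses.

```r
# Toy corpus: the first sentences of two of the books used above
first_sentences <- c(
  "124 was spiteful. Full of Baby's venom.",
  "There was no possibility of taking a walk that day."
)

# Lower-case, then split on anything that is not a letter, digit, or apostrophe
words <- first_sentences |>
  tolower() |>
  strsplit(split = "[^a-z0-9']+") |>
  unlist()

# Drop empty strings that can arise at the start of a document
words <- words[words != ""]

# Count each word and sort from most to least common
word_counts <- sort(table(words), decreasing = TRUE)

head(word_counts, 3)
```

The words at the top of such a list are candidates for a project-specific stop word list, but whether to add them remains a judgement call.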

# Add project-specific words to the existing stop word pattern
stop_word_list_updated <-
  paste(
    "village |",
    "spiteful |",
    "possibility |",
    stop_word_list,
    collapse = " | "
  )

bookshelf |>
  mutate(extra_space = str_replace_all(
    # Add double spaces
    string = first_sentence,
    pattern = " ",
    replacement = "  "
  )) |>
  mutate(no_stops = str_replace_all(
    # Remove stop words
    string = extra_space,
    pattern = stop_word_list_updated,
    replacement = ""
  )) |>
  mutate(no_stops = str_squish( # Remove extra spaces within string
    string = no_stops
  )) |>
  select(no_stops)
# A tibble: 3 × 1
1 In La Mancha, name I desire call mind, lived long since one gentlemen keep la…
2 124 spiteful. Full Baby's venom.                                              
3 There taking walk day.                                                        

We can integrate the removal of stop words into our construction of the DFM with dfm_remove().

books_dfm |>
  dfm_remove(stopwords(source = "snowball"))
Document-feature matrix of: 3 documents, 32 features (64.58% sparse) and 0 docvars.
docs          village la mancha , name desire call mind lived long
  Don Quixote       1  1      1 5    1      1    1    1     1    1
  Beloved           0  0      0 0    0      0    0    0     0    0
  Jane Eyre         0  0      0 0    0      0    0    0     0    0
[ reached max_nfeat ... 22 more features ]

When we remove stop words we artificially adjust our dataset. Sometimes there may be a good reason to do that. But it must not be done unthinkingly. For instance, in Chapter 6 we discussed how sometimes datasets may need to be censored, truncated, or manipulated in other similar ways, to preserve the privacy of respondents. It is possible that the integration of the removal of stop words as a default step in natural language processing was due to computational power, which may have been more limited when these methods were developed. In any case, Jurafsky and Martin (2022, 62) conclude that removing stop words does not improve performance for text classification. And Schofield, Magnusson, and Mimno (2017) find that inference from topic models is not improved by the removal of anything other than the most frequent words. If stop words are to be removed, they recommend doing this after topics are constructed.

16.2.2 Case, numbers, and punctuation

There are times when all we care about is the word, not the case or punctuation. This may be appropriate, for instance, if the text corpus was particularly messy, or if only the existence of particular words was informative. We trade off the loss of information for the benefit of making things simpler. We can convert to lower case with str_to_lower(), and use str_replace_all() to remove punctuation with “[:punct:]”, and numbers with “[:digit:]”.

bookshelf |>
  mutate(lower_sentence = str_to_lower(string = first_sentence)) |>
  select(lower_sentence)
# A tibble: 3 × 1
1 in a village of la mancha, the name of which i have no desire to call to mind…
2 124 was spiteful. full of baby's venom.                                       
3 there was no possibility of taking a walk that day.                           
bookshelf |>
  mutate(no_punctuation = str_replace_all(
    string = first_sentence,
    pattern = "[:punct:]",
    replacement = " "
  )) |>
  select(no_punctuation)
# A tibble: 3 × 1
1 "In a village of La Mancha  the name of which I have no desire to call to min…
2 "124 was spiteful  Full of Baby s venom "                                     
3 "There was no possibility of taking a walk that day "                         
bookshelf |>
  mutate(
    no_numbers = str_replace_all(
      string = first_sentence,
      pattern = "[:digit:]",
      replacement = " "
    )
  ) |>
  select(no_numbers)
# A tibble: 3 × 1
1 "In a village of La Mancha, the name of which I have no desire to call to min…
2 "    was spiteful. Full of Baby's venom."                                     
3 "There was no possibility of taking a walk that day."                         

As an aside, we can remove letters, numbers, and punctuation with “[:graph:]” in str_replace_all(). While this is rarely needed in textbook examples, it is especially useful with real datasets, because they will typically have a small number of unexpected symbols that we need to identify and then remove. We use it to remove everything that we are used to, leaving only that which we are not.

bookshelf |>
  mutate(
    remove_obvious = str_replace_all(
      string = first_sentence,
      pattern = "[:graph:]",
      replacement = " "
    )
  ) |>
  select(remove_obvious)
# A tibble: 3 × 1
1 "                                                                            …
2 "                                       "                                     
3 "                                                   "                         

More generally, we can use arguments in tokens() from quanteda to do this.

books_corpus |>
  tokens(remove_numbers = TRUE, remove_punct = TRUE)
Tokens consisting of 3 documents.
Don Quixote :
 [1] "In"      "a"       "village" "of"      "La"      "Mancha"  "the"    
 [8] "name"    "of"      "which"   "I"       "have"   
[ ... and 33 more ]

Beloved :
[1] "was"      "spiteful" "Full"     "of"       "Baby's"   "venom"   

Jane Eyre :
 [1] "There"       "was"         "no"          "possibility" "of"         
 [6] "taking"      "a"           "walk"        "that"        "day"        

16.2.3 Typos and uncommon words

Then we need to decide what to do about typos and other minor issues. Firstly, every real-world text has typos. Sometimes these should clearly be fixed. But if they are made in a systematic way, for instance, a certain writer always makes the same mistakes, then they would have value if we were interested in grouping by the writer. The use of OCR will introduce common issues as well, as was seen in Chapter 7. For instance, “the” is commonly incorrectly changed to “thc”.

We could fix typos in the same way that we removed stop words, i.e. with lists of corrections. When it comes to uncommon words, we can build this into our document-feature matrix creation with dfm_trim(). For instance, we could use “min_termfreq = 2” to remove any word that does not occur at least twice. Or, with “docfreq_type = "prop"” so that document frequencies are read as proportions rather than counts, we could use “min_docfreq = 0.05” to remove any word that is not in at least 5 per cent of documents, or “max_docfreq = 0.90” to remove any word that is in more than 90 per cent of documents.

books_corpus |>
  tokens(remove_numbers = TRUE, remove_punct = TRUE) |>
  dfm(tolower = TRUE) |>
  dfm_trim(min_termfreq = 2)
Document-feature matrix of: 3 documents, 9 features (40.74% sparse) and 0 docvars.
docs          in a of the no to there that was
  Don Quixote  2 4  3   2  1  2     1    1   0
  Beloved      0 0  1   0  0  0     0    0   1
  Jane Eyre    0 1  1   0  1  0     1    1   1

16.2.4 Tuples

A tuple is an ordered list of elements, and in the context of text, it is a series of words. If it is two words then we term this a “bi-gram”, three words is a “tri-gram”, etc. These are an issue when it comes to text cleaning and preparation because we typically need to separate terms based on a space, and this would result in inappropriate separation.

This is a clear issue when it comes to place names. For instance, consider “British Columbia”, “New Hampshire”, “United Kingdom”, and “Port Hedland”. One way forward is to create a list of such places and then use str_replace_all() to add an underbar, for instance, “British_Columbia”, “New_Hampshire”, “United_Kingdom”, and “Port_Hedland”. Another option is to use tokens_compound() from quanteda.

some_places <- c("British Columbia", 
                 "New Hampshire", 
                 "United Kingdom", 
                 "Port Hedland")
a_sentence <-
c("Vancouver is in British Columbia and New Hampshire is not")

tokens(a_sentence) |>
  tokens_compound(pattern = phrase(some_places))
Tokens consisting of 1 document.
text1 :
[1] "Vancouver"        "is"               "in"               "British_Columbia"
[5] "and"              "New_Hampshire"    "is"               "not"             
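The first option, replacing the space within each known place name with an underbar, can be sketched as follows. We use gsub() from base R here, rather than str_replace_all(), so that the example runs without additional packages; `joined_sentence` is a name we introduce for illustration.

```r
some_places <- c(
  "British Columbia",
  "New Hampshire",
  "United Kingdom",
  "Port Hedland"
)
a_sentence <- "Vancouver is in British Columbia and New Hampshire is not"

# Replace each known place name with an underbar-joined version
joined_sentence <- a_sentence
for (place in some_places) {
  joined_sentence <- gsub(
    pattern = place,
    replacement = gsub(" ", "_", place, fixed = TRUE),
    x = joined_sentence,
    fixed = TRUE # Treat the place name as literal text, not a regex
  )
}

joined_sentence
```

This only works for tuples that we already know about, which is why the list of places needs to be built first.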

In that case, we knew what the tuples were. But it might be that we were not sure what the common tuples were in the corpus. We could then use tokens_ngrams() to identify them. We could ask for, say, all bi-grams in an excerpt from the book “Don Quixote”.


library(gutenbergr)

don_quixote <-
  gutenberg_download(
    gutenberg_id = 996
  ) |>
  filter(text != "") |>
  slice_sample(n = 1000)

write_csv(don_quixote, "books-don_quixote.csv")

don_quixote <- read_csv(
  "books-don_quixote.csv",
  col_types = cols(
    gutenberg_id = col_integer(),
    text = col_character()
  )
)

don_q_text <- tibble(
  book = "Don Quixote",
  text = paste(don_quixote$text, collapse = " ") |>
    str_replace_all(
      pattern = "[:punct:]",
      replacement = " "
    ) |>
    str_replace_all(
      # Add double spaces
      pattern = " ",
      replacement = "  "
    ) |>
    str_replace_all(
      # Remove stop words
      pattern = stop_word_list,
      replacement = ""
    ) |>
    str_squish() # Remove extra spaces within string
)

don_q_corpus <-
  corpus(don_q_text, docid_field = "book", text_field = "text")

ngrams <- tokens_ngrams(tokens(don_q_corpus), n = 2)

ngram_counts <-
  tibble(ngrams = unlist(ngrams)) |>
  count(ngrams, sort = TRUE)

head(ngram_counts)

# A tibble: 6 × 2
  ngrams            n
  <chr>         <int>
1 Don_Quixote      62
2 said_Don         17
3 Full_Size        14
4 I_know            9
5 Sancho_said       9
6 knight_errant     8

Having identified some common bi-grams, we could add them to the list to be changed. This example includes names like Don Quixote and Sancho Panza which would need to remain together for analysis.

16.2.5 Stemming and lemmatizing

Stemming and lemmatizing words is another common approach for reducing the dimensionality of a text dataset. Stemming means to remove the last part of the word, in the expectation that this will result in more general words. For instance, we might like “Canadians”, “Canadian”, and “Canada” to all stem to “Canad”, although what we get in practice depends on the stemmer, and the output below shows that the one used here is more conservative. Lemmatizing is similar, but is more involved. It means changing words based not just on their spelling, but on their canonical form (Grimmer, Roberts, and Stewart 2022, 54). For instance, “Canadians”, “Canadian”, “Canucks”, and “Canuck”, may all be changed to “Canada”.

We can do this with dfm_wordstem().

char_wordstem(c("Canadians", "Canadian", "Canada"))
[1] "Canadian" "Canadian" "Canada"  
books_corpus |>
  tokens(remove_numbers = TRUE, remove_punct = TRUE) |>
  dfm(tolower = TRUE) |>
  dfm_wordstem()
Document-feature matrix of: 3 documents, 46 features (61.59% sparse) and 0 docvars.
docs          in a villag of la mancha the name which i
  Don Quixote  2 4      1  3  1      1   2    1     1 1
  Beloved      0 0      0  1  0      0   0    0     0 0
  Jane Eyre    0 1      0  1  0      0   0    0     0 0
[ reached max_nfeat ... 36 more features ]

Again, while this is a common step in using text as data, Schofield et al. (2017) find that in the context of LDA, which we cover later, stemming has little effect and there is little need to do it.

16.2.6 Duplication

Duplication is a major concern with text datasets because of their size. Bandy and Vincent (2021) showed that around 30 per cent of the data were inappropriately duplicated in a large text dataset commonly used in computer science. And Schofield, Thompson, and Mimno (2017) show that this is a major concern and could substantially affect results. However, it can be a subtle and difficult to diagnose problem. For instance, in Chapter 12 when we considered counts of page numbers for various authors in the context of Poisson regression, we could easily have accidentally included each Shakespeare entry twice because not only are there entries for each play, but also many anthologies that contained all of them. Careful consideration of our dataset identified the issue, but that would be difficult at scale.
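A first check for duplication is to look for documents that are identical once case, punctuation, and spacing are ignored. The following base R sketch shows one way to do that; the `documents` vector is made up for illustration, and the approach will not catch near-duplicates, such as one text contained within a longer one, so it is a starting point rather than the approach used in the papers cited above.

```r
documents <- c(
  "To be, or not to be.",
  "to be or not to be",
  "All the world's a stage."
)

# Normalize case, punctuation, and spacing before comparing
normalized <- documents |>
  tolower() |>
  gsub(pattern = "[[:punct:]]", replacement = "") |>
  gsub(pattern = "\\s+", replacement = " ") |>
  trimws()

# TRUE for any document that repeats an earlier one after normalization
is_duplicate <- duplicated(normalized)

# Keep only the first occurrence of each document
documents[!is_duplicate]
```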

16.3 TF-IDF

We now explore a real dataset using astrologer (Gelfand 2022), which is a package that contains a dataset of horoscopes. This package is not on CRAN, so we need to install it using devtools (Wickham et al. 2022).



We can then access the “horoscopes” dataset.

horoscopes
# A tibble: 1,272 × 4
   startdate  zodiacsign  horoscope                                        url  
   <date>     <fct>       <chr>                                            <chr>
 1 2015-01-05 Aries       Considering the fact that this past week (espec… http…
 2 2015-01-05 Taurus      It's time Taurus. You aren't one to be rushed a… http…
 3 2015-01-05 Gemini      Soon it will be time to review what you know, t… http…
 4 2015-01-05 Cancer      Feeling  feelings and being full of flavorful s… http…
 5 2015-01-05 Leo         Look, listen, watch, meditate and engage in pra… http…
 6 2015-01-05 Virgo       Last week's astrology is still reverberating th… http…
 7 2015-01-05 Libra       Get out your markers and your glue sticks. Get … http…
 8 2015-01-05 Scorpio     Time to pay extra attention to the needs of you… http…
 9 2015-01-05 Sagittarius Everything right now is about how you say it, h… http…
10 2015-01-05 Capricorn   The full moon on January 4th/5th was a healthy … http…
# … with 1,262 more rows

There are four variables: “startdate”, “zodiacsign”, “horoscope”, and “url” (note that the URLs are out-of-date because the website has been updated). We are interested in the words that are used to distinguish the horoscope of each zodiac sign.

horoscopes |>
  count(zodiacsign)
# A tibble: 12 × 2
   zodiacsign      n
   <fct>       <int>
 1 Aries         106
 2 Taurus        106
 3 Gemini        106
 4 Cancer        106
 5 Leo           106
 6 Virgo         106
 7 Libra         106
 8 Scorpio       106
 9 Sagittarius   106
10 Capricorn     106
11 Aquarius      106
12 Pisces        106

We can see that there are 106 horoscopes for each zodiac sign. In this example we first tokenize by word, and then create counts based on zodiac sign only, not date. We use tidytext (Silge and Robinson 2016), as it is used extensively in Hvitfeldt and Silge (2021).


library(tidytext)

horoscopes_by_word <-
  horoscopes |>
  select(-startdate, -url) |>
  unnest_tokens(
    output = word,
    input = horoscope,
    token = "words"
  )

horoscopes_counts_by_word <-
  horoscopes_by_word |>
  count(zodiacsign, word, sort = TRUE)

head(horoscopes_counts_by_word)

# A tibble: 6 × 3
  zodiacsign  word      n
  <fct>       <chr> <int>
1 Cancer      to     1440
2 Sagittarius to     1377
3 Aquarius    to     1357
4 Aries       to     1335
5 Pisces      to     1313
6 Leo         to     1302

We can see that the most popular words appear to be similar for the different zodiac signs. At this point, we could use the data in a variety of ways. We might be interested to know which words characterize each group, that is to say, which words are commonly used only in each group. We can do that by first looking at a word’s term frequency (TF), which is how often a word is used in the horoscopes for each zodiac sign, as a share of all the words used for that sign. The issue is that there are a lot of words that are commonly used regardless of context. As such, we may also like to look at the inverse document frequency (IDF) in which we “penalize” words that occur in the horoscopes for many zodiac signs. A word that occurs in the horoscopes of many zodiac signs would have a lower IDF than a word that only occurs in the horoscopes of one. The term frequency–inverse document frequency (tf-idf) is then the product of these.
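To make the arithmetic concrete, the following base R sketch computes these measures by hand for a made-up set of counts. We assume that TF is a word's share of all words for a sign, and that IDF is the natural log of the number of signs divided by the number of signs that use the word. This appears to match the horoscopes output that follows, where a word used by only one of the 12 signs has an IDF of \(\ln(12) \approx 2.48\). The `counts` data are hypothetical.

```r
# Hypothetical counts of words by zodiac sign
counts <- data.frame(
  zodiacsign = c("Aries", "Aries", "Taurus", "Taurus", "Gemini", "Gemini"),
  word = c("to", "ram", "to", "bull", "to", "twins"),
  n = c(8, 2, 9, 1, 5, 5)
)

# Term frequency: share of a sign's words accounted for by each word
counts$tf <- counts$n / ave(counts$n, counts$zodiacsign, FUN = sum)

# Inverse document frequency: log of (number of signs / signs using the word)
# Each sign-word pair appears once, so counting rows per word counts signs
n_signs <- length(unique(counts$zodiacsign))
signs_using_word <- ave(counts$n, counts$word, FUN = length)
counts$idf <- log(n_signs / signs_using_word)

# tf-idf is the product of the two
counts$tf_idf <- counts$tf * counts$idf

counts
```

Because “to” is used by every sign its IDF, and so its tf-idf, is zero, however often it occurs. The words used by only one sign are the ones that get a positive tf-idf.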

We can create this value using bind_tf_idf() from tidytext. It will create new variables for each of these measures.

horoscopes_counts_by_word_tf_idf <-
  horoscopes_counts_by_word |>
  bind_tf_idf(
    term = word,
    document = zodiacsign,
    n = n
  ) |>
  arrange(desc(tf_idf)) # Largest tf-idf first

horoscopes_counts_by_word_tf_idf

# A tibble: 41,850 × 6
   zodiacsign  word            n       tf   idf   tf_idf
   <fct>       <chr>       <int>    <dbl> <dbl>    <dbl>
 1 Capricorn   goat            6 0.000236  2.48 0.000585
 2 Pisces      pisces         14 0.000531  1.10 0.000584
 3 Sagittarius sagittarius    10 0.000357  1.39 0.000495
 4 Cancer      cancer         10 0.000348  1.39 0.000483
 5 Gemini      gemini          7 0.000263  1.79 0.000472
 6 Taurus      bulls           5 0.000188  2.48 0.000467
 7 Aries       warns           5 0.000186  2.48 0.000463
 8 Cancer      organize        7 0.000244  1.79 0.000437
 9 Cancer      overwork        5 0.000174  2.48 0.000433
10 Taurus      let's          10 0.000376  1.10 0.000413
# … with 41,840 more rows

In Table 16.1 we look at the words that distinguish the horoscopes of each zodiac sign. The first thing to notice is that some of them have their own zodiac sign. On the one hand, there is an argument for removing this, but on the other hand, the fact that it does not happen for all of them is perhaps informative of the nature of the horoscopes for each sign.

horoscopes_counts_by_word_tf_idf |>
  group_by(zodiacsign) |>
  slice(1:10) |>
  select(zodiacsign, word) |>
  group_by(zodiacsign) |>
  summarise(all = paste0(word, collapse = "; ")) |>
  knitr::kable(
    col.names = c(
      "Zodiac sign",
      "Most common words unique to that sign"
    ),
    booktabs = TRUE
  )
Table 16.1: Most common words in horoscopes that are unique to a particular Zodiac sign

Zodiac sign  Most common words unique to that sign
Aries        warns; vesta; aries; fearful; chase; bait; dragons; façade; hostile; laughing
Taurus       bulls; let’s; painfully; virgin; taurus; divest; fights; 15th; advances; brightly
Gemini       gemini; mood; output; admit; faces; harrowing; modeling; warning; wink; await
Cancer       cancer; organize; overwork; procrastinate; scuttle; unrelenting; vessels; offended; schedules; uncles
Leo          trines; blessed; regrets; leo; agree; danger; increased; sector; hearth; loyal
Virgo        digesting; trace; liberate; someone’s; final; narratives; adept; adrenaline; assimilating; avenue
Libra        proof; inevitably; recognizable; reference; disguise; she; harmony; missed; domestic; ail
Scorpio      skate; advocate; knots; bottle; meditating; oneself; pleasant; 2012; cazimi; fortify
Sagittarius  sagittarius; rolodex; distorted; coat; reinvest; benefactors; blazing; constriction; determining; diversify
Capricorn    goat; capricorn; capricorns; signify; neighborhood; funny; noticing; rested; amidst; existed
Aquarius     saves; consult; yearnings; sexy; athene; enjoyable; pallas; recalibrate; amusing; appease
Pisces       pisces; wasted; missteps; node; shoes; prayer; site; deities; expenses; fishes

16.4 Topic models

Sometimes we have a statement, and we want to know what it is about. Sometimes this will be easy, but we do not always have titles for statements, and even when we do, sometimes we do not have titles that define topics in a well-defined and consistent way. One way to get consistent estimates of the topics of each statement is to use topic models. While there are many variants, one way is to use the latent Dirichlet allocation (LDA) method of Blei, Ng, and Jordan (2003), as implemented by stm (Roberts, Stewart, and Tingley 2019).

The key assumption behind the LDA method is that each statement, “a document”, is made by a person who decides the topics they would like to talk about in that document, and who then chooses words, “terms”, that are appropriate to those topics. A topic could be thought of as a collection of terms, and a document as a collection of topics. The topics are not specified ex ante; they are an outcome of the method. Terms are not necessarily unique to a particular topic, and a document could be about more than one topic. This provides more flexibility than other approaches such as a strict word count method. The goal is to have the words found in documents group themselves to define topics.

LDA considers each statement to be a result of a process where a person first chooses the topics they want to speak about. After choosing the topics, the person then chooses appropriate words to use for each of those topics. More generally, the LDA topic model works by considering each document as having been generated by some probability distribution over topics. For instance, if there were five topics and two documents, then the first document may be composed mostly of the first few topics; the other document may be mostly about the final few topics (Figure 16.1).

Figure 16.1: Probability distributions over topics, with panels (a) Distribution for Document 1 and (b) Distribution for Document 2

Similarly, each topic could be considered a probability distribution over terms. To choose the terms used in each document the speaker picks terms from each topic in the appropriate proportion. For instance, if there were ten terms, then one topic could be defined by giving more weight to terms related to immigration; and some other topic may give more weight to terms related to the economy (Figure 16.2).

Figure 16.2: Probability distributions over terms, with panels (a) Distribution for Topic 1 and (b) Distribution for Topic 2

By way of background, the Dirichlet distribution is a multivariate generalization of the beta distribution that is commonly used as a prior for categorical and multinomial variables. If there are just two categories, then the Dirichlet and the beta distributions are the same. In the special case of a symmetric Dirichlet distribution with \(\eta=1\), it is equivalent to a uniform distribution. If \(\eta<1\), then the distribution is sparse and concentrated on a smaller number of the values, and this number decreases as \(\eta\) decreases. Here \(\eta\) is a hyperparameter, which, in this usage, is a parameter of a prior distribution.
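We can get a feel for the effect of \(\eta\) by simulating. Base R does not have a function to draw from a Dirichlet distribution, so this sketch relies on the standard result that independent gamma draws, divided by their sum, follow a Dirichlet distribution. The function `draw_dirichlet()` is our own helper, not part of any package.

```r
set.seed(853)

# Draw once from a symmetric Dirichlet by normalizing independent gamma draws
draw_dirichlet <- function(n_categories, eta) {
  gamma_draws <- rgamma(n = n_categories, shape = eta, rate = 1)
  gamma_draws / sum(gamma_draws)
}

# Five categories: a large eta tends to give similar weights to each category,
# while a small eta tends to put most of the weight on only a few of them
even_weights <- draw_dirichlet(n_categories = 5, eta = 50)
sparse_weights <- draw_dirichlet(n_categories = 5, eta = 0.1)

round(even_weights, 2)
round(sparse_weights, 2)
```

Each draw is itself a probability distribution over the five categories, which is what makes the Dirichlet a natural prior for the distributions over topics and over terms.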

After the documents are created, they are all that we can analyze. The term usage in each document is observed, but the topics are hidden, or “latent”. We do not know the topics of each document, nor how terms defined the topics. That is, we do not know the probability distributions of Figure 16.1 or Figure 16.2. In a sense we are trying to reverse the document generation process—we have the terms, and we would like to discover the topics.
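The document generation process that we are trying to reverse can itself be simulated. In this sketch we fix, rather than estimate, the distribution of one document over two topics, and of each topic over six terms. In the full LDA model these distributions would be draws from Dirichlet priors. The terms, topic names, and probabilities are all made up for illustration.

```r
set.seed(853)

terms <- c("border", "visa", "migrant", "tax", "budget", "jobs")

# Assumed topic-term distributions: each row is a topic and sums to one
topic_term_probs <- rbind(
  immigration = c(0.30, 0.30, 0.30, 0.03, 0.03, 0.04),
  economy     = c(0.03, 0.03, 0.04, 0.30, 0.30, 0.30)
)
colnames(topic_term_probs) <- terms

# Assumed document-topic distribution for one document
doc_topic_probs <- c(immigration = 0.8, economy = 0.2)

# For each of 20 words: first pick a topic, then pick a term from that topic
n_words <- 20
chosen_topics <- sample(
  names(doc_topic_probs),
  size = n_words,
  replace = TRUE,
  prob = doc_topic_probs
)
simulated_document <- vapply(
  chosen_topics,
  function(topic) sample(terms, size = 1, prob = topic_term_probs[topic, ]),
  character(1)
)

table(chosen_topics)
unname(simulated_document)
```

When we fit a topic model we observe only something like `simulated_document`, and try to recover `doc_topic_probs` and `topic_term_probs`.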

If the earlier process around how the documents were generated is assumed and we observe the terms in each document, then we can obtain estimates of the topics (Steyvers and Griffiths 2006). The outcomes of the LDA process are probability distributions and these define the topics. Each term will be given a probability of being a member of a particular topic, and each document will be given a probability of being about a particular topic.

The initial practical step when implementing LDA given a corpus of documents is usually to remove stop words. Although, as mentioned earlier, this is not necessary, and may be better done after the groups are created. We often also remove punctuation and capitalization. We then construct our document-feature matrix using dfm() from quanteda (Benoit et al. 2018).

After the dataset is ready, stm (Roberts, Stewart, and Tingley 2019) can be used to implement LDA and approximate the posterior. The process attempts to find a topic for a particular term in a particular document, given the topics of all other terms for all other documents. Broadly, it does this by first assigning every term in every document to a random topic, specified by Dirichlet priors. It then selects a particular term in a particular document and assigns it to a new topic based on the conditional distribution where the topics for all other terms in all documents are taken as given (Grün and Hornik 2011, 6). Once this has been estimated, then estimates for the distribution of words into topics and topics into documents can be backed out.

This conditional distribution assigns topics depending on how often a term has been assigned to that topic previously, and how common the topic is in that document (Steyvers and Griffiths 2006). The initial random allocation of topics means that the results of early passes through the corpus of documents are poor, but given enough time the algorithm converges to an appropriate estimate.
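
To make the sampling steps concrete, the following is a minimal sketch of a collapsed Gibbs sampler written in base R. The toy documents, the prior values alpha and beta, the number of passes, and all object names are assumptions made for illustration, and the sketch is not intended to reproduce what stm does internally.

```r
# Toy collapsed Gibbs sampler for LDA, using base R only.
# The documents, priors, and object names are made up for illustration.
set.seed(853)

docs <- list(
  c("carbon", "pipeline", "energy", "carbon", "climate"),
  c("tax", "budget", "income", "tax", "workers"),
  c("pipeline", "energy", "tax", "budget", "climate"),
  c("income", "workers", "budget", "carbon", "energy")
)
vocab <- sort(unique(unlist(docs)))
n_topics <- 2
alpha <- 0.1 # Dirichlet prior on topics within documents (assumed value)
beta <- 0.01 # Dirichlet prior on terms within topics (assumed value)

# Step 1: assign every term in every document to a random topic
z <- lapply(docs, function(d) sample(n_topics, length(d), replace = TRUE))

# Count matrices implied by the current assignments
doc_topic <- matrix(0, length(docs), n_topics)
topic_term <- matrix(0, n_topics, length(vocab), dimnames = list(NULL, vocab))
for (d in seq_along(docs)) {
  for (i in seq_along(docs[[d]])) {
    k <- z[[d]][i]
    w <- docs[[d]][i]
    doc_topic[d, k] <- doc_topic[d, k] + 1
    topic_term[k, w] <- topic_term[k, w] + 1
  }
}

# Step 2: repeatedly resample each term's topic, taking all others as given
for (pass in 1:200) {
  for (d in seq_along(docs)) {
    for (i in seq_along(docs[[d]])) {
      w <- docs[[d]][i]
      k_old <- z[[d]][i]
      # Remove the current term from the counts
      doc_topic[d, k_old] <- doc_topic[d, k_old] - 1
      topic_term[k_old, w] <- topic_term[k_old, w] - 1
      # Weight: how often w sits in each topic, times how common
      # each topic is in document d
      weight <-
        (topic_term[, w] + beta) /
        (rowSums(topic_term) + length(vocab) * beta) *
        (doc_topic[d, ] + alpha)
      k_new <- sample(n_topics, 1, prob = weight)
      z[[d]][i] <- k_new
      doc_topic[d, k_new] <- doc_topic[d, k_new] + 1
      topic_term[k_new, w] <- topic_term[k_new, w] + 1
    }
  }
}

# Step 3: back out the two sets of probability distributions
term_given_topic <- (topic_term + beta) / rowSums(topic_term + beta)
topic_given_doc <- (doc_topic + alpha) / rowSums(doc_topic + alpha)

round(topic_given_doc, 2)
```

Each row of term_given_topic is a distribution over terms for one topic, and each row of topic_given_doc is a distribution over topics for one document. With a corpus this small the estimates are only illustrative.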

The choice of the number of topics, k, affects the results, and must be specified a priori. If there is a strong reason for a particular number, then this can be used. Otherwise, one way to choose an appropriate number is to use a test and training set process. Essentially, this means fitting the model on the training set for a variety of possible values of k and then picking a value that performs well on the test set.
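
One possible measure of how well a value of k performs on test documents is perplexity, where lower values mean the model found the unseen terms less surprising. The following sketch uses base R and made-up numbers: the fitted distributions, the held-out counts, and the perplexity() helper are all assumptions for illustration and are not output from a fitted model. In practice, stm provides searchK() to run this kind of comparison across values of k.

```r
# Hypothetical term distributions for a two-topic model (made-up numbers)
vocab <- c("carbon", "energy", "tax", "budget")
term_given_topic <- rbind(
  c(0.45, 0.45, 0.05, 0.05),
  c(0.05, 0.05, 0.45, 0.45)
)
colnames(term_given_topic) <- vocab

# Held-out documents as term counts, with assumed topic shares for each
heldout_counts <- rbind(
  c(3, 2, 0, 0),
  c(0, 1, 2, 3)
)
topic_given_doc <- rbind(
  c(0.9, 0.1),
  c(0.1, 0.9)
)

# Illustrative helper: exponentiated negative average log-likelihood per term
perplexity <- function(counts, topic_given_doc, term_given_topic) {
  # Probability of each term in each document implied by the model
  term_given_doc <- topic_given_doc %*% term_given_topic
  exp(-sum(counts * log(term_given_doc)) / sum(counts))
}

two_topic <- perplexity(heldout_counts, topic_given_doc, term_given_topic)

# Baseline: a single topic that treats every term as equally likely
one_topic <- perplexity(
  heldout_counts,
  matrix(1, nrow = 2, ncol = 1),
  matrix(0.25, nrow = 1, ncol = 4)
)

c(two_topic = two_topic, one_topic = one_topic)
```

Comparing the two values shows the idea: we would prefer the number of topics whose model gives the lower perplexity on documents that were not used for fitting.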

One weakness of the LDA method is that it considers a “bag of words” where the order of those words does not matter (Blei 2012). It is possible to extend the model to reduce the impact of the bag-of-words assumption and add conditionality to word order. Additionally, alternatives to the Dirichlet distribution can be used to extend the model to allow for correlation between topics.
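
A small base R sketch shows what the bag-of-words assumption discards. The two sentences and the count_words() helper are made up for illustration.

```r
# Two sentences with the same words in a different order
sentence_a <- "the minister thanked the member"
sentence_b <- "the member thanked the minister"

# Illustrative helper: split on spaces and count each word
count_words <- function(sentence) {
  words <- strsplit(tolower(sentence), " ", fixed = TRUE)[[1]]
  table(words)
}

count_words(sentence_a)

# The sentences differ, but their word counts do not
same_sentence <- identical(sentence_a, sentence_b)
same_counts <- identical(count_words(sentence_a), count_words(sentence_b))
c(same_sentence = same_sentence, same_counts = same_counts)
```

Because the two sentences share the same word counts, a bag-of-words model cannot distinguish who thanked whom.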

16.4.1 What is talked about in Canadian parliament?

Following the example of the British, the written record of what is said in the Canadian parliament is called “Hansard”. It is not completely verbatim, but is very close. It is available in CSV format from LiPaD, which was constructed by Beelen et al. (2017).

We are interested in what was talked about in the Canadian parliament in 2018. To get started we can download the entire corpus from here, and then discard all of the years apart from 2018. If the datasets are in a folder called “2018”, we can use read_csv() to read and combine all the CSVs.


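library(fs)
library(tidyverse)

# Find every CSV under the "2018" folder, up to two levels down
files_of_interest <-
  dir_ls(
    path = "2018/",
    glob = "*.csv",
    recurse = 2
  )

# Read and combine the CSVs, keeping only the columns used below
hansard_canada_2018 <-
  read_csv(
    files_of_interest,
    col_types = cols(
      basepk = col_integer(),
      hid = col_character(),
      speechdate = col_date(),
      pid = col_character(),
      opid = col_integer(),
      speakeroldname = col_character(),
      speakerposition = col_character(),
      maintopic = col_character(),
      subtopic = col_character(),
      subsubtopic = col_character(),
      speechtext = col_character(),
      speakerparty = col_character(),
      speakerriding = col_character(),
      speakername = col_character(),
      speakerurl = col_character()
    ),
    col_select = c(
      basepk, speechdate, speechtext, speakername, speakerparty, speakerriding
    )
  ) |>
  # Assumed condition: drop rows with no named speaker (directions and similar)
  filter(!is.na(speakername))

hansard_canada_2018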

# A tibble: 33,105 × 6
    basepk speechdate speechtext                         speak…¹ speak…² speak…³
     <int> <date>     <chr>                              <chr>   <chr>   <chr>  
 1 4732776 2018-01-29 "Mr. Speaker, I would like to wis… Julie … Liberal Toront…
 2 4732777 2018-01-29 "Mr. Speaker, I want to thank my … Matthe… New De… Beloei…
 3 4732778 2018-01-29 "Mr. Speaker, I am here today to … Stepha… Conser… Calgar…
 4 4732779 2018-01-29 "Resuming debate.\nThere being no… Anthon… Liberal Nipiss…
 5 4732780 2018-01-29 "Mr. Speaker, we are nearing the … Alain … Conser… Richmo…
 6 4732781 2018-01-29 "The question is on the motion. I… Anthon… Liberal Nipiss…
 7 4732782 2018-01-29 "Agreed.\n No."                    Some h… <NA>    <NA>   
 8 4732783 2018-01-29 "All those in favour of the motio… Anthon… Liberal Nipiss…
 9 4732784 2018-01-29 "Yea."                             Some h… <NA>    <NA>   
10 4732785 2018-01-29 "All those opposed will please sa… Anthon… Liberal Nipiss…
# … with 33,095 more rows, and abbreviated variable names ¹​speakername,
#   ²​speakerparty, ³​speakerriding

The use of filter() at the end is needed because sometimes “directions” and similar non-speech aspects are included in the Hansard. For instance, if we do not include that filter() then the first line is “The House resumed from November 9, 2017, consideration of the motion.” We can then construct a corpus.

hansard_canada_2018_corpus <-
  corpus(hansard_canada_2018, docid_field = "basepk", text_field = "speechtext")

Corpus consisting of 33,105 documents and 4 docvars.
4732776 :
"Mr. Speaker, I would like to wish everyone in this place a h..."

4732777 :
"Mr. Speaker, I want to thank my colleague from Richmond—Arth..."

4732778 :
"Mr. Speaker, I am here today to discuss a motion that asks t..."

4732779 :
"Resuming debate. There being no further debate, the hon. mem..."

4732780 :
"Mr. Speaker, we are nearing the end of the discussion and de..."

4732781 :
"The question is on the motion. Is the pleasure of the House ..."

[ reached max_ndoc ... 33,099 more documents ]

We use the tokens in the corpus to construct a document-feature matrix. To make our life a little easier, computationally, we remove any word that does not occur at least twice, and any word that does not occur in at least two documents.

hansard_dfm <-
  hansard_canada_2018_corpus |>
  # Tokenize the corpus, dropping punctuation and symbols
  tokens(
    remove_punct = TRUE,
    remove_symbols = TRUE
  ) |>
  dfm() |>
  dfm_trim(min_termfreq = 2, min_docfreq = 2) |>
  dfm_remove(stopwords(source = "snowball"))

hansard_dfm

Document-feature matrix of: 33,105 documents, 29,245 features (99.76% sparse) and 4 docvars.
docs      mr speaker like wish everyone place happy new year great
  4732776  1       1    2    1        1     4     2   3    5     1
  4732777  1       1    5    0        1     1     0   0    0     1
  4732778  1       1    2    0        0     1     0   0    4     1
  4732779  0       0    0    0        0     0     0   0    0     0
  4732780  1       1    4    0        1     1     0   0    2     0
  4732781  0       0    0    0        0     0     0   0    0     0
[ reached max_ndoc ... 33,099 more documents, reached max_nfeat ... 29,235 more features ]
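
To see what a document-feature matrix and the trimming step involve, the following base R sketch builds one by hand for three toy documents. The documents and object names are made up for illustration, and the thresholds mirror the ones described above: a term is kept only if it occurs at least twice overall and in at least two documents.

```r
# Three toy documents (made up for illustration)
toy_docs <- c(
  d1 = "carbon tax carbon pipeline",
  d2 = "carbon budget tax",
  d3 = "budget budget hansard"
)

# Tokenize on spaces and count each term within each document
toy_tokens <- strsplit(toy_docs, " ", fixed = TRUE)
toy_vocab <- sort(unique(unlist(toy_tokens)))
toy_dfm <- t(sapply(
  toy_tokens,
  function(x) table(factor(x, levels = toy_vocab))
))

# Keep terms that occur at least twice overall and in at least two documents
term_freq <- colSums(toy_dfm)
doc_freq <- colSums(toy_dfm > 0)
toy_dfm_trimmed <- toy_dfm[, term_freq >= 2 & doc_freq >= 2, drop = FALSE]

toy_dfm
toy_dfm_trimmed
```

Rows are documents and columns are terms, so trimming only removes columns. Here “hansard” and “pipeline” are dropped because each occurs once, in a single document.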

At this point we can use stm() from stm (Roberts, Stewart, and Tingley 2019) to implement an LDA model. We need to specify a document-feature matrix and the number of topics. Topic models are essentially just summaries. Instead of a document being a collection of words, it becomes a collection of topics with some probability associated with each topic. But because a topic model only provides collections of words that tend to be used at similar times, rather than actual underlying meaning, we need to specify the number of topics that we are interested in. This decision will have a big impact, and it is important to consider a few different numbers.


hansard_topics <- stm(documents = hansard_dfm, K = 10)

beepr::beep()

write_rds(
  hansard_topics,
  file = "hansard_topics.rda"
)

This will take some time, likely 15 to 30 minutes, so it is useful to use beep() from beepr to get a notification when the model is done, and to save it at that point using write_rds(). We could then read the results back in with read_rds().

hansard_topics <- read_rds(
  file = "hansard_topics.rda"
)

We can look at the words in each topic with labelTopics().

labelTopics(hansard_topics)

Topic 1 Top Words:
     Highest Prob: carbon, energy, environment, government, pipeline, canada, climate 
     FREX: energy, pipeline, climate, oil, environmental, gas, pollution 
     Lift: 1.5-billion, barrels, cetaceans, coastline, environmentalists, fossil, genetically 
     Score: carbon, oil, pipeline, energy, emissions, pollution, climate 
Topic 2 Top Words:
     Highest Prob: minister, prime, government, liberals, mr, speaker, liberal 
     FREX: prime, aluminum, phoenix, nay, trump, fundraisers, aga 
     Lift: cue, insiders, mongolia, aluminum, fundraisers, giants, nay 
     Score: prime, liberals, minister, liberal, ethics, deficits, nay 
Topic 3 Top Words:
     Highest Prob: bill, canadians, canada, information, elections, security, election 
     FREX: elections, electoral, firearms, gun, cannabis, c-76, marijuana 
     Lift: landlords, att, c-23, c-76, cigarettes, csis, e-cigarettes 
     Score: firearms, elections, c-76, gun, tobacco, electoral, cannabis 
Topic 4 Top Words:
     Highest Prob: canada, community, people, mr, speaker, many, today 
     FREX: latin, refugees, organ, asylum, refugee, celebrate, iran 
     Lift: asylum, church, multiculturalism, #makeanimpact, #myactionsmatter, #myfeminism, #thankscoach 
     Score: latin, organ, filipino, sikh, iran, celebrate, refugees 
Topic 5 Top Words:
     Highest Prob: people, government, one, can, going, want, get 
     FREX: veterans, something, things, lot, really, conservative, conservatives 
     Lift: vets, -plus, 1,000-plus, 30-year-olds, 887, alcona, anachronistic 
     Score: veterans, going, conservatives, get, people, things, think 
Topic 6 Top Words:
     Highest Prob: indigenous, canada, government, canadians, work, women, support 
     FREX: peoples, accessibility, cptpp, indigenous, disabilities, gender, neutrality 
     Lift: 1.84, 10.7, 12-page, 13.2, 13.5, 2,560, 2016-19 
     Score: indigenous, peoples, women, housing, disabilities, sustainable, cptpp 
Topic 7 Top Words:
     Highest Prob: member, speaker, mr, house, hon, colleague, question 
     FREX: hon, member, yea, resuming, unanimous, comment, colleague 
     Lift: transcona, 100-plus, 1316, 1324, 15-minute, 1625, 1642 
     Score: hon, member, question, colleague, pursuant, yea, resuming 
Topic 8 Top Words:
     Highest Prob: justice, system, criminal, victims, violence, court, rights 
     FREX: sexual, offences, c-75, correctional, prison, segregation, inmates 
     Lift: accused, adjourned, complainant, garnier, guards, iii, imprisonment 
     Score: criminal, victims, justice, c-75, correctional, harassment, violence 
Topic 9 Top Words:
     Highest Prob: tax, canada, budget, government, canadians, workers, families 
     FREX: income, postal, cra, banking, unemployment, ei, low-income 
     Lift: ccb, maternity, cupw, 0.2, 0.3, 0.6, 0.9 
     Score: tax, budget, taxes, workers, billion, economy, housing 
Topic 10 Top Words:
     Highest Prob: bill, committee, act, legislation, process, report, amendments 
     FREX: committees, committee, witnesses, amendments, divorce, reading, process 
     Lift: 200-mile, 673, 883, allot, applicability, auspices, besc 
     Score: bill, amendments, committee, legislation, amendment, act, process 

16.5 Exercises


  1. (Plan) Consider the following scenario: You run a news website and are trying to understand whether to allow anonymous comments. You decide to do an A/B test, where you keep everything the same, but only allow anonymous comments on one version of the site. The only data you will have to inform your decision is the text data that you obtain from the test. Please sketch out what that dataset could look like and then sketch a graph that you could build to show all observations.
  2. (Simulate) Please further consider the scenario described and simulate the situation. Please include at least ten tests based on the simulated data.
  3. (Acquire) Please describe one possible source of such a dataset.
  4. (Explore) Please use ggplot2 to build the graph that you sketched.
  5. (Communicate) Please write two paragraphs about what you did.



Please follow the code of Hvitfeldt and Silge (2021) in Supervised Machine Learning for Text Analysis in R, Chapter 5.2 “Understand word embeddings by finding them yourself”, freely available here, to implement your own word embeddings for one year’s worth of data from LiPaD.