6.1 The Corpus
Within quanteda, the main way to store documents is in the form of a corpus object. This object contains all the information that comes with the texts and does not change during our analysis. Instead, we make copies of the main corpus, convert them to the type we need, and run our analyses on those. The advantage of this is that we can always go back to our original data.
There are various ways to make a corpus. One we already saw in Chapter 4, where we used readtext to generate a data frame with one variable giving us a doc_id and another giving us the text. As readtext and quanteda work closely together, we can turn this data frame directly into a corpus. Alternatively, we can take a character vector, where each element of the vector is treated as an individual document. If the vector is named, those names will be used as document names; if not, new ones are generated.
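As a quick illustration of the character-vector route (a minimal sketch with invented example texts):

```r
library(quanteda)

# A named character vector: each element becomes a document,
# and the names become the document names
texts <- c(doc1 = "This is the first document.",
           doc2 = "And this is the second one.")
corp <- corpus(texts)
corp
```

Had the vector been unnamed, quanteda would have generated names such as "text1" and "text2" instead.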
Here, let us go back to the Cold War page we scraped earlier and use the resulting text from that as our input:
library(rvest)

url <- "https://en.wikipedia.org/wiki/Cold_War"
html <- read_html(url)

# Extract the text of all <p> elements as a character vector
data_coldwar <- html %>%
  html_nodes("p") %>%
  html_text2()

# Drop the first two elements, which do not contain body text
data_coldwar <- data_coldwar[-c(1, 2)]
Note that the resulting data_coldwar vector is unnamed, so quanteda will generate document names for us. Now, we simply put this into the corpus command:
library(quanteda)
## Warning: package 'quanteda' was built under R version 4.3.3
## Package version: 4.0.2
## Unicode version: 15.1
## ICU version: 74.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
data_corpus <- corpus(data_coldwar)
Apart from importing texts ourselves, quanteda contains several corpora as well. Here, we use one of these, which contains the inaugural speeches of all the US Presidents. For this, we first have to load the main package and then load the data into R:
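A minimal sketch of this step: the inaugural addresses ship with quanteda itself as the built-in object data_corpus_inaugural, so loading the package makes them available directly.

```r
library(quanteda)

# The inaugural speeches are built into quanteda
data_corpus_inaugural

# Overview of the first five speeches (with their document variables)
summary(data_corpus_inaugural, n = 5)
```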
Now we have our corpus, we can start with the analysis. As noted, we try not to carry out any analysis on the corpus itself. Instead, we keep it as it is and work on copies of it. Often, this means transforming the data into another shape. One of the more popular shapes is the document-feature matrix (dfm). This is a matrix that contains the documents in the rows and the counts for each word (or feature) in the columns.
Before we can do so, we have to split up our texts into individual words. To do this, we first construct a tokens object. In the command that we use to do this, we can specify how we want to split our texts (here we use the standard option) and how we want to clean our data. For example, we can specify that we want to convert all the texts into lowercase and remove any numbers and special characters.
data_tokens <- tokens(
  data_corpus,
  what = "word",
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE,
  split_hyphens = FALSE,
  include_docvars = TRUE,
  padding = FALSE,
  verbose = TRUE
)
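To check the result of the tokenisation, we can peek at the tokens of the first document (a quick sketch; the exact output depends on the texts in your corpus):

```r
# Tokens of the first document as a character vector
head(data_tokens[[1]], 10)

# Number of tokens per document
head(ntoken(data_tokens))
```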
We can also remove certain stopwords so that words like “and” or “the” do not influence our analysis too much. We can either specify these words ourselves or use a list that is already present in R. To see this list, type stopwords("english") in the console. Stopwords for other languages are also available (such as German, French and Spanish), and there are even more in the stopwords package, which works well with quanteda. For now, we will use the English ones. As all the stopwords in this list are lowercase, we have to lowercase our tokens as well. Notice that this also applies to any acronyms in our text (so “NATO” becomes “nato”):
data_tokens <- tokens_tolower(data_tokens, keep_acronyms = FALSE)
data_tokens <- tokens_select(data_tokens, stopwords("english"),
                             selection = "remove")
Then, we can construct our dfm:
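A minimal sketch of this final step, using quanteda's dfm() constructor on the tokens object:

```r
# Convert the tokens object into a document-feature matrix
data_dfm <- dfm(data_tokens)

# Inspect the ten most frequent features across all documents
topfeatures(data_dfm, 10)
```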