4.1 Corpus and DFM

In quanteda, the primary object for storing and managing text data is the corpus. A corpus object holds your documents and any associated metadata, known as document-level variables or docvars. The key characteristic of a corpus object is that it remains immutable during your analysis. Instead of modifying the original corpus, you create derivative objects (like tokens or document-feature matrices) for analysis. This approach ensures reproducibility and allows you to easily return to your original data for different analyses or pre-processing steps.

Creating a corpus in quanteda is straightforward. As seen in Chapter 3, one standard method is to use the readtext package, which reads various file formats and returns a data frame with document IDs and text. This data frame can be passed directly to the corpus() function.
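A minimal sketch of this workflow, assuming a hypothetical folder "data/manifestos" containing plain-text files:

```r
library(readtext)
library(quanteda)

# Read all .txt files; readtext() returns a data frame with doc_id and text columns
manifesto_texts <- readtext("data/manifestos/*.txt")

# Convert the data frame into a corpus object
manifesto_corpus <- corpus(manifesto_texts)
summary(manifesto_corpus, n = 3)
```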

Alternatively, you can create a corpus from a simple character vector, where each element represents a document. If the vector elements are named, these names will be used as document identifiers (doc_id); otherwise, quanteda will generate default IDs.
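For example, a corpus can be built from a small named character vector (the texts here are invented for illustration):

```r
library(quanteda)

# A named character vector: the names become the document IDs
texts <- c(doc1 = "Text analysis is fun.",
           doc2 = "Quanteda makes it straightforward.")
small_corpus <- corpus(texts)
docnames(small_corpus)  # "doc1" "doc2"
```

If the vector were unnamed, quanteda would instead assign default IDs such as "text1" and "text2".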

Incorporating document-level variables (docvars) is highly recommended. These variables store crucial metadata about each document, such as the author, publication date, source, or other relevant categorical or numerical information. Docvars are essential for grouping, filtering, and conducting analyses that relate textual features to external characteristics of the documents. When you create a corpus from a data frame, quanteda automatically attempts to include other columns as docvars. You can add or modify docvars later using the docvars() function.
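A short sketch of adding docvars after corpus creation (the texts and variable values are invented for illustration):

```r
library(quanteda)

texts <- c(speech1 = "We will cut taxes.",
           speech2 = "We will invest in schools.")
my_corpus <- corpus(texts)

# Attach document-level variables with docvars()
docvars(my_corpus, "party") <- c("Con", "Lab")
docvars(my_corpus, "year")  <- c(1997, 1997)

docvars(my_corpus)  # view the docvars as a data frame
```

These variables can then be used with functions such as corpus_subset() to filter documents, as we do below.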

library(quanteda)  # Core text analysis functions
library(quanteda.corpora)  # Access to built-in text corpora
library(quanteda.textstats)  # Text statistics
library(quanteda.textplots)  # Visualizing text statistics
library(ggplot2)  # Plots
library(reshape2)  # For melting data frames 

data(data_corpus_ukmanifestos)  # Load the UK political manifestos corpus

selected_parties <- c("Con", "Lab", "LD", "UKIP", "SNP", "PCy", "SF")

# Keep only manifestos from 1997 onwards by the selected parties
data_corpus_ukmanifestos <- corpus_subset(data_corpus_ukmanifestos,
    Year > 1996 & Party %in% selected_parties)

The document-feature matrix (DFM) is the core data structure for many quantitative text analysis methods. It represents the corpus as a matrix in which rows correspond to documents and columns to features (typically words or n-grams, after pre-processing). The cell values are usually the counts of each feature in each document. A DFM is created from a tokens object (which splits each document into its individual words) using the dfm() function.

data_tokens <- tokens(data_corpus_ukmanifestos)
data_dfm <- dfm(data_tokens)

head(data_dfm)

By default, dfm() counts the occurrences of each feature in each document (term frequency). If needed, you can apply alternative weighting schemes, such as TF-IDF via dfm_tfidf(), to emphasise words that are distinctive to a document relative to the rest of the corpus. At this point, however, the DFM still contains many features we do not need, such as function words like β€˜the’ and numbers like β€˜1997’. For this reason, we rarely build a DFM directly from raw tokens; instead, we first carry out some pre-processing, to which we now turn.
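As a brief sketch, TF-IDF weighting can be applied to the DFM created above with dfm_tfidf(); the default weighting uses raw term frequency and inverse document frequency:

```r
# Reweight the DFM so that terms concentrated in few documents score higher
data_dfm_tfidf <- dfm_tfidf(data_dfm)
head(data_dfm_tfidf)
```

Note that TF-IDF weighted DFMs should not be passed to methods that expect raw counts (for example, topic models).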