6.2 Wordfish

Unlike Wordscores, Wordfish is an unsupervised scaling method: it does not require pre-scored reference texts. Instead, it models word frequencies in documents with a Poisson distribution while simultaneously estimating document positions and word-specific parameters. The model assumes that a word’s frequency in a document depends on the document’s position on a single latent dimension, the word’s overall frequency, the word’s specific association with that dimension, and the document’s length.
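In the formulation of Slapin and Proksch, the count \(y_{ij}\) of word \(j\) in document \(i\) is modelled as

\[
y_{ij} \sim \text{Poisson}(\lambda_{ij}), \qquad \log \lambda_{ij} = \alpha_i + \psi_j + \beta_j \theta_i,
\]

where \(\theta_i\) is the position of document \(i\) on the latent dimension, \(\beta_j\) and \(\psi_j\) are word \(j\)’s weight and fixed effect, and \(\alpha_i\) is a document fixed effect that absorbs differences in document length.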

The output of Wordfish includes:

* \(\theta\) (theta): The estimated position of each document on the latent dimension.
* \(\beta\) (beta): The estimated weight for each word, representing its association with the latent dimension. Words with large positive beta values are more likely to appear in documents with high theta values, and vice versa for large negative beta values.
* \(\psi\) (psi): A word-specific fixed effect capturing each word's overall frequency.
* \(\alpha\) (alpha): A document-specific fixed effect that captures variation in document length.

Wordfish estimates a single latent dimension. The direction of this dimension is arbitrary (e.g. left to right or right to left). We can orient it by specifying two anchor documents via the dir argument of textmodel_wordfish(), which constrains the first document's estimated position to lie below the second's.

Let’s apply Wordfish to a corpus of US presidential inaugural speeches. We will use speeches after 1900 and preprocess the texts similarly to the Wordscores example. To orient the scale, we could choose speeches from presidents typically considered to sit at opposite ends of a relevant dimension (such as a left-right scale). For example, we could use the 1965 Johnson and the 1985 Reagan speeches to define the direction, with one speech arbitrarily anchoring each end of the scale.

library(quanteda)
library(quanteda.textmodels)
data(data_corpus_inaugural)
set.seed(42)

corpus_inaugural <- corpus_subset(data_corpus_inaugural, Year > 1900)

# Tokenise and preprocess the corpus
data_inaugural_tokens <- tokens(
 corpus_inaugural,
 what = "word",
 remove_punct = TRUE, # Remove punctuation
 remove_symbols = TRUE, # Remove symbols
 remove_numbers = TRUE, # Remove numbers
 remove_url = TRUE, # Remove URLs
 remove_separators = TRUE, # Remove separators
 split_hyphens = FALSE, # Do not split hyphenated words
 include_docvars = TRUE # Include document variables (metadata)
)

data_inaugural_tokens <- tokens_tolower(data_inaugural_tokens) 
data_inaugural_tokens <- tokens_select(data_inaugural_tokens, stopwords("english"), selection = "remove")

data_inaugural_dfm <- dfm(data_inaugural_tokens)

# Print document names: we need the order of documents to specify anchor texts by index
docnames(data_inaugural_dfm)
##  [1] "1901-McKinley"   "1905-Roosevelt"  "1909-Taft"       "1913-Wilson"    
##  [5] "1917-Wilson"     "1921-Harding"    "1925-Coolidge"   "1929-Hoover"    
##  [9] "1933-Roosevelt"  "1937-Roosevelt"  "1941-Roosevelt"  "1945-Roosevelt" 
## [13] "1949-Truman"     "1953-Eisenhower" "1957-Eisenhower" "1961-Kennedy"   
## [17] "1965-Johnson"    "1969-Nixon"      "1973-Nixon"      "1977-Carter"    
## [21] "1981-Reagan"     "1985-Reagan"     "1989-Bush"       "1993-Clinton"   
## [25] "1997-Clinton"    "2001-Bush"       "2005-Bush"       "2009-Obama"     
## [29] "2013-Obama"      "2017-Trump"      "2021-Biden"      "2025-Trump"
# Identify the indices of the anchor documents (1965 Johnson and 1985 Reagan)
johnson_index <- which(docnames(data_inaugural_dfm) == "1965-Johnson")
reagan_index <- which(docnames(data_inaugural_dfm) == "1985-Reagan")
wordfish <- textmodel_wordfish(data_inaugural_dfm, dir = c(johnson_index, reagan_index))
summary(wordfish)

The summary() output for a Wordfish model provides information about the model fit and the estimated parameters (\(\theta\), \(\alpha\), \(\beta\), \(\psi\)). The \(\theta\) values are the estimated positions of the documents on the latent dimension. As with Wordscores, we can use the predict() function to obtain the estimated document positions with confidence intervals; the estimated position (theta) is labelled fit in the output.
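The estimates can also be pulled out of the fitted object directly for further analysis. This is a sketch assuming the object stores its document names, positions, and standard errors in components named docs, theta, and se.theta, as textmodel_wordfish objects in quanteda.textmodels do:

```r
# Collect the estimated document positions (theta) alongside document names
positions <- data.frame(
  doc = wordfish$docs,     # document names stored on the fitted object
  theta = wordfish$theta,  # estimated positions on the latent dimension
  se = wordfish$se.theta   # standard errors of the position estimates
)

# Order documents from the lowest to the highest estimated position
positions[order(positions$theta), ]
```

Sorting by theta gives a quick numerical ranking of the speeches along the latent dimension before any plotting.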

pred_wordfish <- predict(wordfish, interval = "confidence")
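To inspect the result, note that predict() with interval = "confidence" returns a list whose fit element holds the point estimates together with the interval bounds (columns fit, lwr, and upr in quanteda.textmodels):

```r
# Fitted document positions with lower and upper confidence bounds
head(pred_wordfish$fit)
```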

Using the textplot_scale1d() function, similar to Wordscores, we can visualise the estimated document positions and the word parameters. Plotting the word parameters (margin = "features") shows which words are associated with which end of the latent dimension.

library(quanteda.textplots)
textplot_scale1d(wordfish,
                 margin = "features", # Plot features (words)
                 highlighted = c("america", "great", "freedom", "government", "taxes", "people", "world")
                 )

A word’s position on this scale corresponds to its \(\beta\) value. Words at one end of the scale are more likely to appear in documents with high \(\theta\) values, and words at the other end are more likely to appear in documents with low \(\theta\) values. How we interpret the scale depends on the words we find at the extremes and on any anchor texts used for orientation. Plotting document positions (margin = "documents") visualises the estimated values of \(\theta\).
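We can also rank the words by their estimated \(\beta\) values numerically rather than reading them off the plot. This sketch assumes the fitted object stores the word labels and word parameters in components named features, beta, and psi, as textmodel_wordfish objects do:

```r
# Pair each feature with its estimated beta (association with the dimension)
word_params <- data.frame(
  feature = wordfish$features,
  beta = wordfish$beta,
  psi = wordfish$psi
)

# Words most strongly associated with the high-theta end of the dimension
head(word_params[order(-word_params$beta), ], 10)

# Words most strongly associated with the low-theta end
head(word_params[order(word_params$beta), ], 10)
```

Comparing these two lists against the anchor texts is usually the fastest way to decide what the latent dimension substantively represents.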

# Plot the distribution of document positions (theta values) with confidence
# intervals. Theta values are the estimated document scores on the latent
# dimension.
textplot_scale1d(wordfish, margin = "documents") # Plot documents

This graph shows the estimated position of each inaugural address on the latent dimension, ordered by year. The confidence intervals indicate the uncertainty of these estimates. Interpreting this dimension requires careful consideration of the anchor texts used and the words that load highly on the dimension (from the word plot). For example, suppose we anchor with a president who is typically considered ‘liberal’ at one end and ‘conservative’ at the other, and the word plot shows terms related to social programmes at one end and terms related to individual freedom at the other. In that case, we might interpret this as a left-right political dimension. However, Wordfish can uncover any dominant latent dimension in the text, which may not always conform to preconceived notions such as a simple left-right scale.

Wordfish is a valuable tool for discovering latent dimensions in text data without relying on external scores. Its unsupervised nature can be both a strength (no need for reference data) and a weakness (latent dimension interpretation is not always straightforward and requires careful analysis of word loadings).