8.2 Seeded Latent Dirichlet Allocation (sLDA)
An alternative to the approach above is seeded LDA, which uses seed words to steer the model toward topics we define in advance. One source of such seed words is a dictionary that tells the algorithm which words belong together in which categories. To use it, we first load the packages and set a seed:
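The chunk that loads the packages is not reproduced here, so the following is only a minimal sketch of what it might look like. It assumes that the seededlda, quanteda, and ggplot2 packages are installed; the seed value 42 is an arbitrary choice for illustration, not necessarily the value used for the results in this section.

```r
# Assumed package set: seededlda for the model, quanteda for the
# dictionary and dfm objects, ggplot2 for the plot further below.
library(seededlda)
library(quanteda)
library(ggplot2)

# Fix the random number generator so that repeated runs are reproducible.
# The value 42 is arbitrary (an assumption for this sketch).
set.seed(42)
```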
Next, we need to specify the seed words in dictionary form. While we could construct a dictionary ourselves, here we use the Laver and Garry dictionary we saw earlier. We then use this dictionary to run our seeded LDA:
dictionary_LaverGarry <- dictionary(data_dictionary_LaverGarry)
seededmodel <- textmodel_seededlda(data_inaugural_dfm, dictionary = dictionary_LaverGarry)
Note that using the dictionary ensures that the model's topics are the categories in the dictionary. This means we can look at which topic is most likely for each inaugural speech and which terms are most likely for each topic. Let us start with the topics:
topics <- topics(seededmodel)
topics_table <- ftable(topics)
topics_prop_table <- as.data.frame(prop.table(topics_table))
ggplot(data = topics_prop_table, aes(x = topics, y = Freq)) +
  geom_bar(stat = "identity") +
  labs(x = "Topics", y = "Topic Proportion") +
  scale_y_continuous(expand = c(0, 0)) +
  theme_classic() +
  theme(axis.text.x = element_text(size = 10, angle = 90, hjust = 1))
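Because bar heights can be hard to compare by eye, it may help to print the same proportions sorted from largest to smallest. The sketch below uses a small made-up vector, toy_topics, as a hypothetical stand-in for the output of topics(seededmodel); its values are illustrative only and are not results from the inaugural corpus.

```r
# Hypothetical stand-in for topics(seededmodel): one topic label per document.
toy_topics <- factor(c("CULTURE", "ECONOMY", "CULTURE", "VALUES",
                       "CULTURE", "ECONOMY"))

# Share of documents assigned to each topic (the shares sum to 1).
toy_prop <- as.data.frame(prop.table(table(toy_topics)))

# Sort so that the most frequent topic comes first.
toy_prop <- toy_prop[order(toy_prop$Freq, decreasing = TRUE), ]
print(toy_prop)
```

With the fitted model, replacing toy_topics by the topics object created above should give the figures behind the plot.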
The plot shows that Culture was the most favoured topic, followed by the Economy and Values. Finally, we can look at the most likely terms for each topic, sorted by the categories in the dictionary:
terms <- terms(seededmodel)
terms_table <- ftable(terms)
terms_df <- as.data.frame(terms_table)
head(terms_df)
##   Var1    Var2        Freq
## 1    A CULTURE      people
## 2    B CULTURE    american
## 3    C CULTURE citizenship
## 4    D CULTURE     service
## 5    E CULTURE    presence
## 6    F CULTURE      demand
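The letters in Var1 are not produced by the model itself. As far as we can tell, they appear because base R assigns letter labels to the unnamed rows of a matrix when it is converted to a table. The sketch below reproduces this with a made-up two-topic matrix, toy_terms, standing in for the output of terms(seededmodel); the words in the ECONOMY column are invented for illustration.

```r
# Hypothetical stand-in for terms(seededmodel): a character matrix with
# one column per topic and rows ordered from most to least likely term.
toy_terms <- matrix(c("people", "american", "tax", "budget"),
                    nrow = 2,
                    dimnames = list(NULL, c("CULTURE", "ECONOMY")))

# Same conversion as in the text: the unnamed rows are labelled A, B, ...
toy_df <- as.data.frame(ftable(toy_terms))
print(toy_df)
```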
In this output, the letters in Var1 index the rank of each term within a topic rather than separate clusters: row ‘A’ holds the most likely term, row ‘B’ the second most likely, and so on. For Culture, the most likely word is therefore ‘people’, followed by ‘american’ and ‘citizenship’. Thus, talking about culture often references the people. We can make similar observations for the other categories.
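As noted at the start of this section, we could also construct the seed dictionary ourselves rather than rely on Laver and Garry. A minimal sketch using the dictionary() function from quanteda follows; the category names and seed words are invented for illustration, and the commented call assumes the data_inaugural_dfm object created earlier.

```r
library(quanteda)

# Hypothetical hand-built seed dictionary: each list element is a topic,
# and its character vector holds the seed words (glob patterns allowed).
custom_dictionary <- dictionary(list(
  economy = c("tax*", "trade", "market*", "budget"),
  culture = c("art*", "heritage", "tradition*")
))

print(custom_dictionary)

# The model would then be fitted as before (not run here):
# custom_model <- textmodel_seededlda(data_inaugural_dfm,
#                                     dictionary = custom_dictionary)
```

A hand-built dictionary gives full control over the topics, at the cost of having to choose seed words that actually occur in the corpus.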