8.5 Exercises
Apply standard LDA to a different corpus (e.g. UK party manifestos or movie reviews). Experiment with varying numbers of topics (k) and evaluate the topics qualitatively based on the most frequent words and quantitatively using semantic coherence and perplexity if these are available in the package used. Select an appropriate k and interpret the main topics found.
For the LDA model fitted in Exercise 1, visualise the document-topic distributions for a few selected documents. Which documents have the highest proportions of the most interesting topics? Validate your interpretation by examining the content of these documents. As part of your validation process, calculate and compare the log-likelihood or perplexity for models with different numbers of topics.
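As a point of reference for the perplexity comparison, the following is a minimal base R sketch of how an in-sample perplexity can be computed by hand from a document-topic matrix and a topic-term matrix. The matrices `theta`, `phi` and `counts` are hypothetical toy values, not output from any package, and topic-model packages may compute perplexity differently (for example, on held-out documents).

```r
# Hypothetical document-topic proportions (rows = documents, columns = topics)
theta <- matrix(c(0.9, 0.1,
                  0.2, 0.8), nrow = 2, byrow = TRUE)

# Hypothetical topic-term probabilities (rows = topics, columns = terms)
phi <- matrix(c(0.50, 0.40, 0.05, 0.05,
                0.05, 0.05, 0.50, 0.40), nrow = 2, byrow = TRUE)

# Hypothetical term counts per document (same term order as phi)
counts <- matrix(c(3, 2, 0, 0,
                   0, 1, 2, 2), nrow = 2, byrow = TRUE)

# Probability of each term in each document implied by the model
p_wd <- theta %*% phi

# Log-likelihood of the observed counts and the resulting perplexity
loglik <- sum(counts * log(p_wd))
perplexity <- exp(-loglik / sum(counts))
```

Lower perplexity indicates a better fit to the counts. A model that assigned equal probability to all four terms would score 4 here, so values below that suggest the topics carry some information.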
Create a custom dictionary with seed words relevant to a research question, then apply seeded LDA to a suitable corpus. Interpret the seeded topics and examine the dominant topic assignments per document. If available, compare the resulting top terms and document assignments to those from a standard LDA model on the same corpus as a form of validation.
Apply STM to a corpus with rich metadata (e.g. a dataset of news articles containing the date and source or a corpus of speeches containing speaker attributes). Select the number of topics using searchK(), carefully evaluating the diagnostic plots. Fit the STM with a prevalence formula that includes the relevant metadata. Use estimateEffect() to analyse how the metadata affects the prevalence of specific topics, visualising the results and interpreting the significance of the effects as part of the validation process. Examine the labelTopics() output for qualitative validation of the topics.
Apply LSA to a corpus using the textmodel_lsa() function. Experiment with different numbers of dimensions (nd). Examine the term and document loadings for a selected number of dimensions, then try to interpret the latent concepts discovered by LSA. Calculate and visualise the cumulative variance explained by the dimensions to help select nd.
For your LSA model from Exercise 5, calculate and examine the cosine similarity between a few pairs of selected documents or terms that you expect to be similar or dissimilar based on your knowledge of the corpus. Assess whether the LSA similarity scores align with your expectations to validate the results.
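The two LSA calculations mentioned above (cumulative explained variance and cosine similarity) can be sketched in base R using a plain singular value decomposition. This is only an illustration on a hypothetical toy matrix `X`. It does not use textmodel_lsa() itself, and "explained variance" is taken here to mean the share of squared singular values, which is an assumption about the intended measure.

```r
# Hypothetical document-term matrix (rows = documents, columns = terms)
X <- matrix(
  c(2, 1, 0, 0,
    3, 2, 0, 1,
    0, 0, 4, 2,
    0, 1, 3, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("tax", "budget", "film", "actor"))
)

# Singular value decomposition, the core computation behind LSA
s <- svd(X)

# Cumulative share of the squared singular values by dimension
cum_var <- cumsum(s$d^2) / sum(s$d^2)

# Document coordinates in the first nd dimensions
nd <- 2
doc_coords <- s$u[, 1:nd, drop = FALSE] %*% diag(s$d[1:nd], nrow = nd)
rownames(doc_coords) <- rownames(X)

# Cosine similarity between two vectors
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# doc1 and doc2 share vocabulary; doc1 and doc3 do not
sim_same <- cosine(doc_coords["doc1", ], doc_coords["doc2", ])
sim_diff <- cosine(doc_coords["doc1", ], doc_coords["doc3", ])
```

Calling `plot(cum_var, type = "b")` gives a simple curve for choosing nd, and comparing `sim_same` with `sim_diff` is the kind of expectation check the exercise asks for.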
Research and implement a method to compute topic coherence (e.g. pointwise mutual information between top words) for an LDA or STM model, for cases where searchK() or labelTopics() does not provide sufficient detail, or where another package is being used. Use this to quantitatively validate your chosen topic models and compare models with different k.
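One possible starting point is the base R sketch below. It averages the pointwise mutual information over all pairs of a topic's top words, estimating probabilities from document co-occurrence. The function `pmi_coherence()`, the smoothing constant `eps` and the toy matrix `docs` are hypothetical names and values chosen for illustration. Published coherence measures differ in their smoothing and normalisation, so treat this as one variant among several.

```r
# Hypothetical document-term incidence matrix (rows = documents)
docs <- matrix(
  c(1, 1, 0, 0,
    1, 1, 0, 0,
    1, 1, 1, 0,
    0, 0, 1, 1,
    0, 0, 1, 1,
    0, 1, 1, 1),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("tax", "budget", "film", "actor"))
)

# Mean PMI over all pairs of top words; assumes every word occurs
# in at least one document, otherwise the denominator is zero
pmi_coherence <- function(top_words, incidence, eps = 1e-12) {
  pairs <- combn(top_words, 2)
  scores <- apply(pairs, 2, function(p) {
    p_i  <- mean(incidence[, p[1]] > 0)   # share of documents containing word i
    p_j  <- mean(incidence[, p[2]] > 0)   # share of documents containing word j
    p_ij <- mean(incidence[, p[1]] > 0 & incidence[, p[2]] > 0)
    log((p_ij + eps) / (p_i * p_j))       # eps avoids log(0) for pairs that never co-occur
  })
  mean(scores)
}

coherent   <- pmi_coherence(c("tax", "budget"), docs)
incoherent <- pmi_coherence(c("tax", "actor"), docs)
```

Words that tend to appear in the same documents score above zero, while pairs that never co-occur are heavily penalised. Averaging this score across all topics of a model gives a single number that can be compared across different values of k.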