8.5 Exercises

  1. Apply standard LDA to a different corpus (e.g. UK party manifestos or movie reviews). Experiment with varying numbers of topics (k) and evaluate the topics qualitatively, based on each topic's most probable words, and quantitatively, using semantic coherence and perplexity if the package you use provides them. Select an appropriate k and interpret the main topics found. (A starter sketch for this exercise and the next appears after the exercise list.)

  2. For the LDA model fitted in Exercise 1, visualise the document-topic distributions for a few selected documents. Which documents have the highest proportions of the most interesting topics? Validate your interpretation by examining the content of these documents. As part of your validation process, calculate and compare the log-likelihood or perplexity for models with different numbers of topics.

  3. Create a custom dictionary with seed words relevant to a research question, then apply seeded LDA to a suitable corpus. Interpret the seeded topics and examine the dominant topic assignments per document. Where feasible, compare the resulting top terms and document assignments with those from a standard (unseeded) LDA model fitted to the same corpus as a form of validation. (See the seeded-LDA sketch after the list.)

  4. Apply STM to a corpus with rich metadata (e.g. news articles with date and source, or speeches with speaker attributes). Select the number of topics using searchK(), carefully evaluating the diagnostic plots. Fit the STM model with a prevalence formula that includes the relevant metadata. Use estimateEffect() to analyse how the metadata affects the prevalence of specific topics, visualising the results and interpreting the significance of the effects as part of the validation process. Examine the labelTopics() output for qualitative validation of the topics. (A sketch follows the list.)

  5. Apply LSA to a corpus using the textmodel_lsa() function. Experiment with different numbers of dimensions (nd). Examine the term and document loadings for a selected number of dimensions, then try to interpret the latent concepts LSA has discovered. Calculate and visualise the cumulative variance explained by the dimensions to help select nd. (A sketch covering this exercise and the next follows the list.)

  6. For your LSA model from Exercise 5, calculate and examine the cosine similarity between a few pairs of selected documents or terms that you expect to be similar or dissimilar based on your knowledge of the corpus. Assess whether the LSA similarity scores align with your expectations to validate the results.

  7. Research and implement a method to compute topic coherence (e.g. average pointwise mutual information between a topic's top words) for an LDA or STM model, for cases where searchK() or labelTopics() does not provide sufficient detail, or where another package is being used. Use this measure to validate your chosen topic models quantitatively and to compare models with different k. (A sketch follows the list.)
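
The sketches below are optional starting points for these exercises; all object names (dfmat, fits, and so on) are illustrative, and the corpora are only suggestions. This first sketch, for Exercises 1 and 2, fits standard LDA via the topicmodels package, assuming the movie-review corpus shipped with quanteda.textmodels.

```r
library(quanteda)
library(quanteda.textmodels)   # provides data_corpus_moviereviews
library(topicmodels)           # LDA(), perplexity(), posterior()

# Standard preprocessing: tokenise, drop punctuation and stopwords, trim rare terms
dfmat <- data_corpus_moviereviews |>
  tokens(remove_punct = TRUE) |>
  tokens_remove(stopwords("en")) |>
  dfm() |>
  dfm_trim(min_termfreq = 5)

dtm <- convert(dfmat, to = "topicmodels")

# Fit several candidate k and compare perplexity (lower is better,
# but always inspect the topics qualitatively as well)
ks   <- c(5, 10, 20)
fits <- lapply(ks, function(k) LDA(dtm, k = k, control = list(seed = 123)))
sapply(fits, perplexity)

terms(fits[[2]], 10)                    # ten most probable words per topic (k = 10)
head(posterior(fits[[2]])$topics, 3)    # document-topic proportions (Exercise 2)
```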
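
For Exercise 3, a minimal seeded-LDA sketch with the seededlda package. The dictionary keys and seed words are placeholders for illustration, and my_corpus stands in for whichever corpus you select.

```r
library(quanteda)
library(seededlda)

# Illustrative seed dictionary; replace keys and patterns with terms
# tied to your own research question
dict <- dictionary(list(
  economy = c("tax*", "budget*", "spend*", "growth"),
  health  = c("hospital*", "doctor*", "patient*", "care"),
  defence = c("army", "defence", "military", "troops")
))

dfmat <- my_corpus |>                 # 'my_corpus' is a placeholder
  tokens(remove_punct = TRUE) |>
  tokens_remove(stopwords("en")) |>
  dfm()

# residual = TRUE adds one unseeded topic to absorb off-topic content
fit_seeded <- textmodel_seededlda(dfmat, dictionary = dict, residual = TRUE)
terms(fit_seeded, 10)        # top terms per seeded topic
head(topics(fit_seeded))     # dominant topic per document

# Validation: cross-tabulate against an unseeded LDA with the same k
fit_lda <- textmodel_lda(dfmat, k = length(dict) + 1)
table(seeded = topics(fit_seeded), standard = topics(fit_lda))
```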
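
For Exercise 4, a sketch of the STM workflow. It assumes the dfm's document variables include party and year; substitute whatever metadata your corpus actually carries.

```r
library(quanteda)
library(stm)

out <- convert(dfmat, to = "stm")     # documents, vocab, and meta for stm

# Diagnostics over candidate numbers of topics; inspect held-out likelihood,
# residuals, and semantic coherence in the resulting plot
kres <- searchK(out$documents, out$vocab, K = c(10, 20, 30),
                prevalence = ~ party + s(year), data = out$meta)
plot(kres)

fit <- stm(out$documents, out$vocab, K = 20,
           prevalence = ~ party + s(year), data = out$meta, seed = 123)

labelTopics(fit)                      # qualitative validation of topic content

# Effect of the metadata on the prevalence of topics 1-5
eff <- estimateEffect(1:5 ~ party + s(year), fit, metadata = out$meta)
summary(eff)
plot(eff, covariate = "year", topics = 1, method = "continuous")
```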
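
For Exercises 5 and 6, a sketch using textmodel_lsa() from quanteda.textmodels, again starting from a prepared dfm dfmat. Fitting with a generous nd first makes it easier to see how much each additional dimension contributes before settling on a smaller value.

```r
library(quanteda)
library(quanteda.textmodels)

fit_lsa <- textmodel_lsa(dfmat, nd = 50)   # generous nd; trim after inspection

# Cumulative share of variance among the retained singular values
var_share <- cumsum(fit_lsa$sk^2) / sum(fit_lsa$sk^2)
plot(var_share, type = "b", xlab = "dimension",
     ylab = "cumulative explained variance")

# Term and document loadings on the first dimension (Exercise 5)
head(sort(fit_lsa$features[, 1], decreasing = TRUE), 10)
head(fit_lsa$docs[, 1:3])

# Cosine similarity between two documents in the latent space (Exercise 6)
cos_sim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))
cos_sim(fit_lsa$docs[1, ], fit_lsa$docs[2, ])
```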
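
For Exercise 7, one way to hand-roll a coherence measure: average pairwise pointwise mutual information among a topic's top words, computed from document-level co-occurrence in the dfm. This is only one of several coherence variants; the function and its epsilon smoothing are my own choices, and fits refers to the topicmodels fits from the first sketch.

```r
library(quanteda)

# Average pairwise PMI among a topic's top words, using document-level
# co-occurrence in 'dfmat'; 'eps' avoids log(0) for pairs that never co-occur
pmi_coherence <- function(top_words, dfmat, eps = 1e-12) {
  top_words <- intersect(top_words, featnames(dfmat))
  x  <- as.matrix(dfm_select(dfmat, pattern = top_words)) > 0  # incidence matrix
  pw <- colMeans(x)                        # P(word occurs in a document)
  word_pairs <- combn(top_words, 2)
  pmi <- apply(word_pairs, 2, function(p) {
    pij <- mean(x[, p[1]] & x[, p[2]])     # joint occurrence probability
    log((pij + eps) / (pw[p[1]] * pw[p[2]]))
  })
  mean(pmi)
}

tw  <- terms(fits[[2]], 10)                        # top 10 words per topic
coh <- apply(tw, 2, pmi_coherence, dfmat = dfmat)
coh         # per-topic coherence
mean(coh)   # model-level score for comparing different k
```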
