Chapter 8 Unsupervised Methods

Unlike supervised methods, which require labelled data to train a model for classifying or predicting values for new text, unsupervised learning methods in text analysis aim to discover inherent patterns and structures within text data without relying on pre-assigned labels. These methods are useful for exploring large unannotated corpora: they can identify recurring themes (topic modelling), group similar documents (clustering), and reduce the dimensionality of the data.

Supervised models excel at well-defined classification tasks where labelled data is available, but unsupervised methods allow us to uncover latent structures that we may not have anticipated or that would be prohibitively expensive to label. This chapter focuses on probabilistic topic modelling, a popular suite of unsupervised methods for identifying abstract 'topics' within a body of text. Each document is treated as a mixture of topics, and each topic is characterised by a distribution over words. We will cover Latent Dirichlet Allocation (LDA); seeded LDA, which incorporates prior knowledge; the Structural Topic Model (STM), which allows the inclusion of document metadata to model topic prevalence and content; and Latent Semantic Analysis (LSA) as a dimensionality reduction technique.
