Chapter 7 Supervised Methods
Supervised learning methods use pre-labelled data to recognise patterns. Think of it like a student learning from examples with answers provided by a teacher. Labelled data, also known as the training set, consists of a collection of texts, each of which has a known category or value assigned to it. Examples include movie reviews labelled as ‘positive’ or ‘negative’, news articles categorised by topic (e.g. ‘sports’, ‘politics’, ‘finance’), and survey responses assigned a satisfaction score. A supervised learning algorithm examines the features of these labelled texts — typically, which words appear and how often — and tries to learn the relationship between the words and the labels. For instance, it may recognise that reviews containing words such as ‘amazing’, ‘love’ and ‘excellent’ are often categorised as ‘positive’. Once the algorithm has learned these patterns from the training set, it can predict the labels of new, unlabelled texts — usually called the test set.
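The idea of a labelled training set and a held-out test set can be sketched in a few lines of R. The texts, labels and split below are invented purely for illustration:

```r
# A minimal sketch of labelled data and a train/test split (all values invented)
reviews <- data.frame(
  text  = c("An amazing film, I loved it",
            "Excellent acting and a great plot",
            "Terrible pacing, a real disappointment",
            "I hated every minute of it",
            "A wonderful, moving story",
            "Dull and badly written"),
  label = c("positive", "positive", "negative",
            "negative", "positive", "negative"),
  stringsAsFactors = FALSE
)

# Hold out two documents as the test set; the model never sees their labels
set.seed(42)
test_ids  <- sample(nrow(reviews), size = 2)
train_set <- reviews[-test_ids, ]  # used for learning word-label patterns
test_set  <- reviews[test_ids, ]   # used only to evaluate predictions
```

In practice the split is usually proportional (for example 80/20) and, for small data sets, repeated or cross-validated so that the evaluation does not depend on one lucky draw.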
Supervised methods are effective when manually labelling a subset of the data is feasible but much larger volumes of text must be classified automatically. They differ from dictionary methods, which rely on fixed lists of words, and from unsupervised methods, such as topic modelling or clustering, which discover patterns without needing pre-assigned labels.
Several algorithms for supervised text classification are available within the R ecosystem, particularly those integrated with quanteda. We will cover two widely used, relatively simple yet effective algorithms: Support Vector Machines (SVM) and Naive Bayes (NB). SVM is a powerful discriminative classifier that identifies an optimal boundary, or hyperplane, to separate the categories in a high-dimensional feature space. Naive Bayes is a probabilistic classifier based on Bayes’ theorem; it makes the simplifying (‘naive’) assumption that features are independent of one another given the class.
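As a first sketch of how these classifiers are used with quanteda, the following assumes the quanteda and quanteda.textmodels packages are installed; the toy texts and labels are invented for illustration:

```r
library(quanteda)             # tokenisation and document-feature matrices
library(quanteda.textmodels)  # textmodel_nb() and textmodel_svm()

# Invented toy corpus with known labels
txts <- c("amazing film, loved it",  "excellent and great",
          "terrible, a disappointment", "hated it, awful",
          "a wonderful story",       "dull and bad")
labs <- factor(c("pos", "pos", "neg", "neg", "pos", "neg"))

# Represent each text by its word counts
dfmat <- dfm(tokens(txts))

# Naive Bayes: probabilistic, fast, assumes feature independence
nb_mod  <- textmodel_nb(dfmat, y = labs)

# Linear SVM: finds a separating hyperplane in feature space
svm_mod <- textmodel_svm(dfmat, y = labs)

# Predict the label of a new, unseen text; dfm_match() aligns its
# features with those seen during training
new_dfm <- dfm_match(dfm(tokens("an excellent, amazing film")),
                     features = featnames(dfmat))
predict(nb_mod, newdata = new_dfm)
```

With such a tiny corpus the predictions are not meaningful, but the workflow (build a dfm, fit on labelled documents, predict on aligned new documents) is the same at any scale.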
We will also explore methods for visualising model performance and for implementing cross-validation using the caret package, which provides a unified interface to many machine-learning algorithms and validation techniques.
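A minimal sketch of cross-validation with caret follows. It assumes caret (and, for the "svmLinear" method, the kernlab package) is installed; the predictor columns and labels are invented toy word counts:

```r
library(caret)

# Invented toy data: word counts as predictor columns, one row per document
x <- data.frame(amazing  = c(1, 1, 0, 0, 1, 0),
                terrible = c(0, 0, 1, 1, 0, 1),
                film     = c(1, 0, 1, 0, 1, 1))
y <- factor(c("pos", "pos", "neg", "neg", "pos", "neg"))

# 3-fold cross-validation: the data are split into three parts, and each
# part in turn is held out while the model is trained on the other two
ctrl <- trainControl(method = "cv", number = 3)
fit  <- train(x = x, y = y, method = "svmLinear", trControl = ctrl)

# Summarise held-out performance as a confusion matrix averaged over folds
confusionMatrix(fit)
```

Swapping `method = "svmLinear"` for another of caret's supported model names is all that is needed to cross-validate a different algorithm, which is the main attraction of its unified interface.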