Chapter 5 Dictionary Analysis
One of the simplest forms of quantitative text analysis is dictionary analysis. The idea here is to count the presence of pre-defined categories of words or phrases within texts to classify documents or measure the extent to which documents relate to particular topics or sentiments. By relying on a fixed set of terms and associated categories, dictionary methods provide a transparent and computationally efficient approach to text analysis. Unlike statistical or machine learning methods that learn patterns from data, dictionary methods are non-statistical and depend entirely on the quality and relevance of the dictionary used. A well-known application is measuring the tone of newspaper articles, speeches, children’s writing, etc., using sentiment analysis dictionaries. These dictionaries categorise words as positive, negative or sometimes neutral, allowing a sentiment score to be calculated for a text. Another example is measuring the policy content of different documents, as illustrated by the Policy Agendas Project dictionary (Albaugh et al., 2013), which assigns words to various policy domains.
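The counting logic described above can be sketched in a few lines. This is a minimal illustration, not the workflow used later in the chapter: the dictionary `TOY_DICTIONARY`, the helper names, the simple regular-expression tokeniser, and the choice to normalise by the total token count are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical toy dictionary: category -> set of terms.
# This is an illustrative assumption, not a validated sentiment resource.
TOY_DICTIONARY = {
    "positive": {"good", "great", "excellent", "enjoyable"},
    "negative": {"bad", "poor", "boring", "terrible"},
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_categories(text, dictionary):
    """Count how many tokens of the text fall into each dictionary category."""
    counts = Counter({category: 0 for category in dictionary})
    for token in tokenize(text):
        for category, terms in dictionary.items():
            if token in terms:
                counts[category] += 1
    return counts


def sentiment_score(text, dictionary):
    """Return (positive - negative) / number of tokens; 0.0 for an empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = count_categories(text, dictionary)
    return (counts["positive"] - counts["negative"]) / len(tokens)


review = "A great cast and an enjoyable plot, but the ending was boring."
print(count_categories(review, TOY_DICTIONARY))
print(sentiment_score(review, TOY_DICTIONARY))
```

Dividing by the number of tokens is only one possible normalisation; dividing by the number of matched dictionary terms instead would give a score that ignores words outside the dictionary.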
While straightforward, dictionary analysis has limitations. To begin with, it relies on the assumption that the meaning of a word is relatively stable across different contexts and documents. This can be a strong assumption, as the same word can have different meanings (polysemy), or its meaning can be negated or altered by the surrounding words or tone (e.g. "not good", or sarcasm). In addition, dictionary methods cannot identify concepts or themes that are not explicitly included in the dictionary. Despite these limitations, dictionary analysis remains a valuable tool, especially for exploratory analysis, when validated dictionaries are available, or when it is combined with other methods.
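Two of these limitations are easy to reproduce. The sketch below is a toy illustration under stated assumptions: the word lists `POSITIVE` and `NEGATIVE` and the function `naive_score` are hypothetical, and the scorer is deliberately the simplest possible one (matched positive terms minus matched negative terms).

```python
import re

# Hypothetical toy term lists, chosen only to illustrate the limitations.
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "terrible"}


def naive_score(text):
    """Count positive matches minus negative matches, ignoring all context."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positive_hits = sum(token in POSITIVE for token in tokens)
    negative_hits = sum(token in NEGATIVE for token in tokens)
    return positive_hits - negative_hits


plain = "The film was good."
negated = "The film was not good."
unlisted = "The film was superb."

# Negation: both sentences receive the same score, because "not" is ignored.
print(naive_score(plain), naive_score(negated))

# Coverage: "superb" is not in the dictionary, so the sentence scores as neutral.
print(naive_score(unlisted))
```

The scorer treats the negated sentence exactly like the plain one, and it assigns a neutral score to a clearly positive sentence whose key word is missing from the dictionary.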
Here, we will conduct three analyses using dictionaries: the first is a standard content analysis approach using a political dictionary, and the other two focus on sentiment. For the former, we will use the same political party manifestos we used in the previous chapter, while for the latter, we will use film reviews and Twitter data.