1 Getting Started with Quantitative Text Analysis

Quantitative text analysis (QTA), like many other techniques, is at its core a method. This means that while it provides you with the tools to answer a particular question, it does not provide you with a theoretical framework. Nor is there any reality to be discovered: the only thing we can do with QTA methods is to provide (hopefully) accurate summaries of our texts.

With this in mind, there are five questions that we can hope to answer with QTA (Grimmer et al., 2022):

What do our texts look like?
What are our texts about?
What do our texts measure?
What can our texts predict?
What can our texts prove?

In the case of the first question, we might simply be interested in how many words we have in different documents, which authors wrote together, or whether certain texts have a distinctive type of wording. Questions like these help us to get a better idea of the type of data we are dealing with, to work out what might be interesting to look at and to identify potential problems early on.

We can then ask ourselves what the texts are actually about. Here we could run topic models to look at (a representation of) the underlying structure of our texts. We could look at the sentiment of the texts using different dictionaries, or calculate different readability statistics. In each case, we get a better understanding of what our texts represent and what they might be about. For example, we might discover that some texts cluster together unexpectedly, or have more themes in common than we expected. This might then lead us to focus on them exclusively, to collect more texts on the same topic, or to focus on different documents altogether.
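To make the first, descriptive question concrete, here is a minimal sketch in Python of what "looking at" a set of texts might involve: counting words per document and checking how varied the wording is. The two documents, their names, and the helper functions `tokenize` and `describe` are invented for illustration, and the tokenizer is deliberately crude; a real project would likely use a dedicated text analysis library.

```python
import re
from collections import Counter

# Hypothetical toy corpus: names and texts are invented for this sketch.
docs = {
    "speech_a": "We will cut taxes. We will cut spending.",
    "speech_b": "Our schools and hospitals need more funding, not less.",
}


def tokenize(text):
    """Lowercase the text and split it into simple alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def describe(text):
    """Return a few descriptive statistics for a single document."""
    tokens = tokenize(text)
    types = set(tokens)  # distinct word forms
    return {
        "tokens": len(tokens),
        "types": len(types),
        # Share of distinct words: a rough indicator of how varied the wording is.
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "most_common": Counter(tokens).most_common(3),
    }


summary = {name: describe(text) for name, text in docs.items()}
for name, stats in summary.items():
    print(name, stats)
```

Even numbers this simple can flag potential problems early on, such as documents that are unexpectedly short or unusually repetitive.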

Now that we have our texts and know what we want from them, we can start to use them to measure a concept we are interested in. For example, we could use the codes from the Manifesto Project to measure the political left-right position of our texts. Or we could measure the occurrence of different issues in these documents to find out the agendas of different parties. Once this is done, we could use these measurements to predict what kind of agenda a new party might have. Going a step further, we could use them to estimate what effect a particular text will have on, say, voters or legislators.
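As a rough illustration of measuring the occurrence of different issues, the sketch below counts how often keywords from a small issue dictionary appear in a text and converts the counts into shares. The dictionary, its categories, the example sentence, and the function `issue_shares` are all hypothetical; they are not the Manifesto Project's coding scheme, which relies on trained human coders rather than keyword matching.

```python
import re
from collections import Counter

# Hypothetical issue dictionary: categories and keywords are invented
# for this sketch and are NOT the Manifesto Project's codes.
ISSUE_DICTIONARY = {
    "economy": {"tax", "taxes", "jobs", "budget"},
    "welfare": {"schools", "hospitals", "pensions"},
}


def issue_shares(text, dictionary):
    """Return each issue's share of all dictionary keyword matches in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for issue, keywords in dictionary.items():
            if token in keywords:
                counts[issue] += 1
    total = sum(counts.values())
    # If no keyword matched at all, report a share of 0.0 for every issue.
    return {issue: (counts[issue] / total if total else 0.0) for issue in dictionary}


# Invented example text standing in for a party document.
manifesto = (
    "Lower taxes create jobs. Better schools and hospitals help everyone, "
    "and so do jobs."
)
shares = issue_shares(manifesto, ISSUE_DICTIONARY)
print(shares)
```

The resulting shares are one possible measurement of a party's agenda. Whether they measure it well depends entirely on how good the dictionary is, which is why such measurements need to be validated.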

Note that we can stop this process at any point and still do interesting work. That is, if we go about rigorously collecting texts from different sources, structuring them, cleaning them, and describing what they are like, this can be interesting in itself. Similarly, measuring the positions of texts on a variety of scales can be enough to make for interesting research. Ultimately, how far you go depends on the questions you want to ask and the problems you want to solve.

References

Grimmer, J., Roberts, M. E., & Stewart, B. M. (2022). Text as Data: A New Framework for Machine Learning and the Social Sciences. Princeton University Press.