2.1 Concepts
First of all, let us define what Quantitative Text Analysis actually is. While there are various definitions, all agree that with QTA, we focus on text as data. Rather than taking words as they are, we transform them (and other textual features) into numerical representations, which we then use as the input for various applications. Although this transformation means that we have to abstract the meaning of a word into numerical values, by doing so, we can easily examine (very) large collections of documents – better known as corpora – and can test pre-defined hypotheses, generalise findings across large datasets, synthesise information, and uncover broad trends that might not be apparent from close reading alone. This ability to work with large datasets is one of the reasons for the explosive growth of interest in QTA in recent years.
We can divide the different applications of QTA into four groups, each with its own focus and preferred techniques:
Describe: What are the characteristics of our corpus? In other words, how many documents do we have? What is their word count and vocabulary size? What are the most common terms, and how are they distributed? And what about the metadata – who are the authors of our texts, and when were they written? This descriptive information is crucial for familiarising ourselves with the data, identifying areas for further research and spotting anomalies early on.
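To make this descriptive step concrete, here is a minimal Python sketch that computes the document count, per-document word counts, vocabulary size and most common terms for a small invented corpus. The documents, their names and the whitespace tokenisation are all simplifying assumptions; a real analysis would use a proper tokeniser and a far larger corpus:

```python
from collections import Counter

# A tiny invented corpus; in practice this would be loaded from files or an API.
corpus = {
    "manifesto_a": "taxes must fall and markets must stay free",
    "manifesto_b": "public services need investment and taxes fund services",
    "manifesto_c": "free markets and free trade create growth",
}

# Naive lowercasing and whitespace tokenisation, for illustration only.
tokens = {doc: text.lower().split() for doc, text in corpus.items()}

n_docs = len(corpus)                                       # number of documents
word_counts = {doc: len(toks) for doc, toks in tokens.items()}  # words per document
vocabulary = {t for toks in tokens.values() for t in toks}      # unique terms
term_freq = Counter(t for toks in tokens.values() for t in toks)

print(n_docs, word_counts, len(vocabulary))
print(term_freq.most_common(3))
```

Even this toy example already answers the descriptive questions above: how many documents we have, how long they are, how large the vocabulary is, and which terms dominate.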
Explore: What are the main ideas, opinions, perspectives and themes embedded in our texts? Methods such as topic modelling can reveal the underlying thematic structure of a corpus; dictionary-based approaches, machine learning classifiers or sentiment analysis can quantify emotional tone; and readability statistics can provide insights into the complexity and accessibility of the language used. All of these can improve our understanding of the content of the texts and reveal unexpected relationships between them.
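As a toy illustration of the readability idea, the sketch below computes two crude complexity proxies – average sentence length and average word length – for an invented passage. This is not a standard readability formula such as Flesch; it only shows the general principle of deriving such numbers from text:

```python
# Invented example passage; the naive splitting on ". " and "." is an
# assumption that only works for simple, well-punctuated prose.
text = (
    "The committee approved the proposal. "
    "It passed. "
    "Implementation of the recommendations begins immediately after ratification."
)

sentences = [s for s in text.split(". ") if s]
words = text.replace(".", "").split()

avg_sentence_length = len(words) / len(sentences)          # words per sentence
avg_word_length = sum(len(w) for w in words) / len(words)  # characters per word

print(round(avg_sentence_length, 2), round(avg_word_length, 2))
```

Longer sentences and longer words generally signal more complex, less accessible language, which is the intuition behind proper readability statistics.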
Measure: Can we measure or construct (latent) concepts with our texts? For example, we can use scores derived from dictionary-based approaches to measure the political left-right position of a party manifesto, or use the frequency of topics in the texts to reveal the agenda priorities of authors or organisations.
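A minimal sketch of this dictionary-based idea follows, using two hypothetical (and far too small) seed dictionaries of left and right terms; a real analysis would rely on a validated dictionary and more careful matching:

```python
# Hypothetical seed dictionaries, invented for illustration only.
left_terms = {"welfare", "equality", "public", "regulation"}
right_terms = {"market", "tax", "freedom", "deregulation"}

def left_right_score(text: str) -> float:
    """Relative balance of right vs. left dictionary hits, in [-1, 1]."""
    tokens = text.lower().split()
    left = sum(t in left_terms for t in tokens)
    right = sum(t in right_terms for t in tokens)
    if left + right == 0:
        return 0.0  # no dictionary hits: treat as centrist / unknown
    return (right - left) / (right + left)

print(left_right_score("cut every tax and trust the market"))
print(left_right_score("expand public welfare and strengthen regulation"))
```

The score is a simple latent-concept measurement: each text is reduced to a single number on a left-right scale, which can then feed into further analysis.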
Predict: How can our data predict future events or test hypotheses about causal effects? Because we treat our texts as data, we can use them as input to statistical models, such as a regression, to predict outcomes. For example, we could analyse how much a new party manifesto is likely to change the behaviour of voters or legislators, or whether debates on social media affected a previous election.
Note that in most cases, a single research study will combine several of these approaches to answer its research question. They also build on each other: a descriptive analysis usually precedes an exploratory one, and both are needed before we can measure or predict.
Often, Quantitative Text Analysis is contrasted with Qualitative Text Analysis, which focuses on the in-depth interpretation and nuanced understanding of texts, typically working with smaller numbers of documents. Compared to its quantitative counterpart, it emphasises deep immersion in the texts in order to understand their meaning, context and subtleties, allowing us to explore complex phenomena, develop new theoretical insights and provide rich and thick descriptions of our texts. Central to this approach are techniques such as coding to identify key concepts, thematic analysis to uncover recurring patterns of meaning, discourse analysis to study language in its social context, and narrative analysis to understand stories and accounts. Thus, while QTA often focuses on the ‘what’ aspect of a text, the qualitative approach focuses on answering the ‘how’ and ‘why’ questions related to a text.
However, rather than seeing them as opposing methodologies, we should see them as complementary. That is, we can use the two approaches to inform and enrich each other in many ways. For example, a deep qualitative understanding of the context and nuances of a text is invaluable in developing a robust coding scheme or specialised dictionaries that we can use in a quantitative analysis, while quantitative methods can efficiently scan large corpora to find overarching patterns, anomalies or specific subsets of texts that deserve more focused, in-depth qualitative study. As a result, combining the two approaches can lead to more convincing and interesting research and help each approach to address the other's shortcomings.
\(~\)
| Aspect | Quantitative Text Analysis | Qualitative Text Analysis |
|---|---|---|
| Goal | To measure, count, and identify patterns, frequencies, and statistical relationships. To test hypotheses. | To understand, interpret, and explore meanings, themes, and context. To generate hypotheses or theories. |
| Data | Numerical data derived from text (e.g., word counts, frequencies, coded categories represented numerically). | Textual data (e.g., interview transcripts, documents, open-ended survey responses, field notes). |
| Approach | Objective; aims for generalisability; deductive (tests pre-defined hypotheses). | Subjective; focuses on depth and richness of specific cases; inductive (develops understanding from the data). |
| Methods | Statistical analysis, content analysis (frequency-based), computational linguistics, automated sentiment analysis, topic modelling. | Thematic analysis, discourse analysis, narrative analysis, interpretative phenomenological analysis, grounded theory, close reading. |
| Analysis | Statistical tests, identifying correlations, creating visualisations of numerical patterns. | Coding (assigning labels to text segments), identifying themes, interpreting meanings, and building narratives. |
| Size | Large datasets to ensure statistical significance and generalisability. | Small, purposefully selected datasets to allow for in-depth analysis. |
| Focus | Breadth of information across many texts; identifying what and how much/often. | Depth of understanding within texts; exploring why and how. |
| Outcomes | Summaries, statistical significance, generalisable findings, identification of trends. | Rich descriptions, in-depth understanding of context, identification of themes, and development of concepts or theories. |
\(~\)
For now, let us return to QTA. As we have already noted, the increased availability of (digital) texts and advances in computing power have led to a sharp increase in the popularity of QTA. As a result, there has also been a growing interest in its theoretical underpinnings, exemplified by the work of Grimmer & Stewart (2013) and Grimmer et al. (2022). These authors have sought to identify what QTA can (and cannot) do, and how best to do it. Overall, they stress the importance of six points:
- Theory: Although QTA is data-driven, theory is inevitable. Thus, any QTA needs some form of theoretical guidance to tell us which texts to choose, how many features to extract, and what questions to ask. Indeed, the question of which method to use depends on our theoretical framework, and any findings we make will only make sense within a particular theoretical context.
- Models are models: Language is complex. Models, no matter how elaborate and well-constructed, will always be simplifications. As such, we must always remember that we are not working with a text, but with a model of that text.
- Augmentation: Computers – and thus QTA methods – are very good at organising, calculating and processing large amounts of text quickly, but not so good at understanding, interpreting and reasoning, which is where humans excel. So, a good QTA does not replace us, but complements us.
- No best method: Our method will always depend on our research question, theoretical framework, and the type of data we are working with. This means that it is perfectly acceptable to choose a simpler method over a more complex one, as long as it is the right choice for our problem.
- Iteration: A QTA analysis is rarely straightforward. Instead, the process is often iterative, where we explore the data, apply methods, evaluate results, and frequently return to earlier steps to try other ideas.
- Validation: Because we are dealing with models, validation is essential to avoid misleading conclusions. We should validate as often and in as many ways as possible, including comparing automated results with human judgment, checking the face validity of results, and testing the predictive power of our measures.