1.1 QTA in steps
So what does a QTA look like in practice? Let’s say you already have an idea of what you want to do and what questions you want to ask. Then you need to go through the following steps:
Choose and select. If you want to look at political manifestos, do you want to include all of them or only those of the major parties? And do you want only the most recent ones or all of them?
Find and collect. Find all the texts you need and save them in one place. Make sure that everything you want is included, that you have the right version of each document, and that the documents are in a readable format (.pdf, .txt or .doc).
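As a minimal sketch of the collection step (in Python, with invented filenames and texts), the following reads every .txt file in a folder into a dictionary keyed by filename:

```python
from pathlib import Path
import tempfile

def collect_texts(folder):
    """Read all .txt files in `folder` into a {filename: text} dictionary."""
    return {path.name: path.read_text(encoding="utf-8")
            for path in sorted(Path(folder).glob("*.txt"))}

# Tiny demonstration with two invented manifesto fragments:
tmp = Path(tempfile.mkdtemp())
(tmp / "party_a.txt").write_text("We promise lower taxes.", encoding="utf-8")
(tmp / "party_b.txt").write_text("We promise better schools.", encoding="utf-8")

texts = collect_texts(tmp)
```

Keeping the reading step in one function like this makes it easy to re-run the collection later when documents are added or corrected.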
Check. If your documents are in .txt format, are there any conversion errors? For example, is a letter like “Ü” visible in the document, or do you see something like “™” instead? Note that most researchers work in English, and non-English characters and non-Latin alphabets can cause problems. The best option is to ensure that all your documents are encoded in UTF-8.
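One way to catch such conversion errors early is to verify that each file's raw bytes actually decode as UTF-8, rather than letting a wrong decoding silently produce garbled characters. A minimal sketch in Python (the example strings are invented):

```python
def check_utf8(raw_bytes):
    """Return (True, text) if the bytes decode as UTF-8, else (False, None)."""
    try:
        return True, raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return False, None

# A correctly encoded document decodes cleanly...
ok, text = check_utf8("Über uns".encode("utf-8"))
# ...while the same text saved in another encoding (here Latin-1) does not.
bad, _ = check_utf8("Über uns".encode("latin-1"))
```

Failing loudly at this stage is preferable to discovering mangled words in the middle of an analysis.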
Create a corpus. Load all the texts you want to analyse and associate them with any metadata you want to include. Then transform the texts into a document-feature matrix (DFM). This matrix has your individual texts in the rows and all the unique words in the corpus in the columns, turning your corpus of textual data into a matrix of numbers.
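The idea of a DFM can be sketched with only the Python standard library (dedicated packages such as quanteda in R or scikit-learn in Python would normally build this for you; the three documents below are invented):

```python
import re
from collections import Counter

docs = ["Taxes down, growth up.",
        "Schools first, taxes fair.",
        "Growth through fair trade."]

# Tokenise: lowercase and keep word characters only.
tokens = [re.findall(r"\w+", d.lower()) for d in docs]

# The columns: every unique word across the corpus, in a fixed order.
vocab = sorted({w for doc in tokens for w in doc})

# The rows: one vector of word counts per document.
dfm = [[Counter(doc)[w] for w in vocab] for doc in tokens]
```

Each row of `dfm` is now a numeric representation of one text, which is exactly the form that statistical models expect.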
Preprocess. Remove words you do not need (such as stop words), remove punctuation, and apply stemming or lemmatisation algorithms.
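A minimal preprocessing sketch in Python: the stop-word list here is illustrative, and the suffix stripping is a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Illustrative stop-word list (real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "we"}

def crude_stem(word):
    """Strip a few common suffixes (a toy stemmer, for illustration only)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    words = re.findall(r"[a-z]+", text.lower())  # drops punctuation and digits
    return [crude_stem(w) for w in words if w not in STOPWORDS]

cleaned = preprocess("We promise lower taxes and better schools!")
```

In practice you would use an established stemmer or lemmatiser, but the pipeline shape (lowercase, tokenise, filter, normalise) is the same.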
Describe. Check your data. Are there any words that occur very often (and which you might want to remove)? Are there any strange patterns? Is the data in the right form?
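A quick descriptive check might simply count word frequencies across the corpus (the documents here are invented):

```python
import re
from collections import Counter

docs = ["the economy and the budget",
        "the budget deficit",
        "economy first"]

# Count every word across all documents.
counts = Counter(w for d in docs for w in re.findall(r"\w+", d.lower()))

# The most frequent words: a stop word like "the" dominates, which
# suggests it should have been removed during preprocessing.
top = counts.most_common(3)
```

Skimming such a frequency table is often the fastest way to spot leftover stop words, boilerplate, or conversion artefacts.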
Run your model. Select your model, run it, and check that all the hyperparameter settings are correct. Check that all the steps are correct, and repeat the process at least once to ensure that your analysis will be reproducible.
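The reproducibility point can be sketched by fixing the random seed: with the same seed and hyperparameters, repeating a run gives identical results. The "model" below is a trivial placeholder, not a real text model:

```python
import random

def run_model(n_docs, n_topics, seed):
    """A placeholder model: randomly assign one topic per document."""
    rng = random.Random(seed)  # the seed is recorded like any hyperparameter
    return [rng.randrange(n_topics) for _ in range(n_docs)]

# Run the model, then repeat it with identical settings.
first = run_model(n_docs=5, n_topics=3, seed=42)
second = run_model(n_docs=5, n_topics=3, seed=42)
```

If the two runs ever diverge, some setting was not recorded, which is exactly the kind of problem this step is meant to catch.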
Visualise and interpret. Look at your results using tables and graphs and try to see if you can answer your research question.
Of all these steps, you will often find that the last two receive the most attention. However, it is equally important to ensure that your data is correct and of sufficient quality to be of real use: problems that appear late in the analysis are often caused by problems introduced in the data early on.