5.1 Classical Dictionary Analysis

We start with the classical version of dictionary analysis. We can either create the dictionaries ourselves or use an off-the-shelf version. As with the data, the quanteda ecosystem provides access to several off-the-shelf dictionaries relevant to the social sciences, here through the quanteda.dictionaries package:

library(quanteda.dictionaries)

We then apply one of these dictionaries to a document-feature matrix (DFM), which is typically created from our corpus after appropriate pre-processing steps, such as tokenisation, lowercasing, and stopword removal, as we discussed in the previous chapter. As a dictionary, we will use the one created by Laver & Garry (2000), which is designed to estimate policy positions from political texts. We first load this dictionary into R and then run it on the DFM using the dfm_lookup() function. Here, you can use the DFM we created in the previous chapter, though, for this example, we make it from scratch without any additional pre-processing (note that dfm() still lowercases the tokens by default):

library(quanteda)
library(quanteda.corpora)
library(quanteda.dictionaries)

data(data_corpus_ukmanifestos)
data_corpus_ukmanifestos

data_tokens <- tokens(data_corpus_ukmanifestos)
data_dfm <- dfm(data_tokens)

First, let us have a look at the dictionary:

data_dictionary_LaverGarry  # Display information about the dictionary object
## Dictionary object with 9 primary key entries and 2 nested levels.
## - [CULTURE]:
##   - people, war_in_iraq, civil_war
##   - [CULTURE-HIGH]:
##     - art, artistic, dance, galler*, museum*, music*, opera*, theatre*
##   - [CULTURE-POPULAR]:
##     - media
##   - [SPORT]:
##     - angler*
## - [ECONOMY]:
##   - [+STATE+]:
##     - accommodation, age, ambulance, assist, benefit, care, carer*, child*, class, classes, clinics, collective*, contribution*, cooperative*, co-operative*, deprivation, disabilities, disadvantaged, educat*, elderly [ ... and 30 more ]
##   - [=STATE=]:
##     - accountant, accounting, accounts, advert*, airline*, airport*, audit*, bank*, bargaining, breadwinner*, budget*, buy*, cartel*, cash*, charge*, commerce*, compensat*, consum*, cost*, credit* [ ... and 51 more ]
##   - [-STATE-]:
##     - assets, autonomy, barrier*, bid, bidders, bidding, burden*, charit*, choice*, compet*, confidence, confiscatory, constrain*, contracting*, contractor*, controlled, controlling, controls, corporate, corporation* [ ... and 42 more ]
## - [ENVIRONMENT]:
##   - [CON ENVIRONMENT]:
##     - produc*
##   - [PRO ENVIRONMENT]:
##     - car, catalytic, chemical*, chimney*, clean*, congestion, cyclist*, deplet*, ecolog*, emission*, energy-saving, environment*, fur, green, habitat*, hedgerow*, husbanded, litter*, opencast, open-cast* [ ... and 8 more ]
## - [GROUPS]:
##   - [ETHNIC]:
##     - asian*, buddhist*, ethnic*, race, raci*
##   - [WOMEN]:
##     - girls, woman, women
## - [INSTITUTIONS]:
##   - [CONSERVATIVE]:
##     - authority, continu*, disrupt*, inspect*, jurisdiction*, legitimate, manag*, moratorium, rul*, strike*, whitehall
##   - [NEUTRAL]:
##     - administr*, advis*, agenc*, amalgamat*, appoint*, assembly, chair*, commission*, committee*, constituen*, council*, department*, directorate*, executive*, headquarters, legislat*, mechanism*, minister*, office, offices [ ... and 18 more ]
##   - [RADICAL]:
##     - abolition, accountable, answerable, consult*, corrupt*, democratic*, elect*, implement*, modern*, monitor*, rebuild*, reexamine*, reform*, re-organi*, repeal*, replace*, representat*, scandal*, scrap, scrap* [ ... and 3 more ]
## - [LAW_AND_ORDER]:
##   - [LAW-CONSERVATIVE]:
##     - assaults, bail, burglar*, constab*, convict*, court, courts, custod*, dealing, delinquen*, deter, deter*, disorder, drug*, fine, fines, firmness, force*, fraud*, guard* [ ... and 32 more ]
##   - [LAW-LIBERAL]:
##     - harassment, non-custodial
## [ reached max_nkey ... 3 more keys ]

Then, we apply it to the DFM:

dictionary_results <- dfm_lookup(data_dfm, data_dictionary_LaverGarry)
dictionary_results
## Document-feature matrix of: 101 documents, 20 features (17.23% sparse) and 6 docvars.
##                      features
## docs                  CULTURE.CULTURE-HIGH CULTURE.CULTURE-POPULAR
##   UK_natl_1945_en_Con                    5                       0
##   UK_natl_1945_en_Lab                    3                       0
##   UK_natl_1945_en_Lib                    5                       0
##   UK_natl_1950_en_Con                    2                       0
##   UK_natl_1950_en_Lab                    1                       0
##   UK_natl_1950_en_Lib                    2                       0
##                      features
## docs                  CULTURE.SPORT CULTURE ECONOMY.+STATE+ ECONOMY.=STATE=
##   UK_natl_1945_en_Con             0      14              66             119
##   UK_natl_1945_en_Lab             0      24              61             150
##   UK_natl_1945_en_Lib             0       8              43             106
##   UK_natl_1950_en_Con             0      11              74             217
##   UK_natl_1950_en_Lab             0      23              56             146
##   UK_natl_1950_en_Lib             0       3              44              99
##                      features
## docs                  ECONOMY.-STATE- ENVIRONMENT.CON ENVIRONMENT
##   UK_natl_1945_en_Con              61                          12
##   UK_natl_1945_en_Lab              67                          18
##   UK_natl_1945_en_Lib              42                           6
##   UK_natl_1950_en_Con              86                          22
##   UK_natl_1950_en_Lab              77                          15
##   UK_natl_1950_en_Lib              48                          11
##                      features
## docs                  ENVIRONMENT.PRO ENVIRONMENT GROUPS.ETHNIC
##   UK_natl_1945_en_Con                           0             0
##   UK_natl_1945_en_Lab                           1             0
##   UK_natl_1945_en_Lib                           4             0
##   UK_natl_1950_en_Con                           3             0
##   UK_natl_1950_en_Lab                           2             0
##   UK_natl_1950_en_Lib                           0             0
## [ reached max_ndoc ... 95 more documents, reached max_nfeat ... 10 more features ]

Here, we see that, for example, the 1945 Conservative Party manifesto contained 5 words related to High Culture and none related to Popular Culture. Overall, the dfm_lookup() function takes a DFM and a dictionary object as input and returns a new DFM in which the features are the categories defined in the dictionary and the values are the aggregated counts of the terms belonging to each category within each document.
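To make this aggregation concrete, the following minimal sketch applies dfm_lookup() to two made-up sentences. The texts, the object names (toy_txt, toy_dict, toy_dfm, toy_lookup) and the two-category dictionary are illustrative assumptions and are not part of the manifesto example.

```r
library(quanteda)

# Two made-up documents (illustrative only)
toy_txt <- c(
  doc1 = "The museum and the opera receive funding, and the media report it.",
  doc2 = "Taxes fund the theatre."
)

# A small hypothetical dictionary with two categories
toy_dict <- dictionary(list(
  culture_high    = c("museum*", "opera*", "theatre*"),
  culture_popular = c("media")
))

# Tokenise, build the DFM, and aggregate the features into categories
toy_dfm <- dfm(tokens(toy_txt, remove_punct = TRUE))
toy_lookup <- dfm_lookup(toy_dfm, toy_dict)

# doc1: "museum" and "opera" count towards culture_high, "media" towards culture_popular
# doc2: only "theatre" matches, so culture_popular is 0
as.matrix(toy_lookup)
```

The returned object has one column per dictionary category rather than one per word, which is the same shape as the manifesto result above.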

We can create our own dictionaries that better suit our research question or context. For this, we draw on our theoretical framework to develop different categories and their associated words. Another approach is using reference texts or expert knowledge to identify relevant category terms. We can also combine different dictionaries, as illustrated by Young & Soroka (2012), or integrate keywords from manual coding schemes (Lind et al., 2019). In addition, we can use techniques involving expert or crowd-coding assessments to refine dictionaries or determine the words that best fit different categories (Haselmayer & Jenny, 2017).

If we want to create our own dictionary in quanteda, we use the dictionary() function. To do this, we specify the words in a named list. The names of this list are the keys (the categories), and the values are character vectors containing the words or phrases belonging to each category. We can use wildcard characters (such as * for glob matching) to include word variations. We then convert this list into a dictionary object. Here, we choose some words that we think will allow us to identify different political stances or issues:

dic_list <- list(
    economy = c("tax*", "invest*", "trade", "fiscal policy"), # Include a multi-word phrase
    war = c("army", "troops", "fight*", "military"),
    diplomacy = c("nato", "un", "international relations"), 
    government = c("london", "commons", "downing street", "westminster")
)

# tolower = TRUE is often recommended unless you have a specific reason not to lowercase

dic_created <- dictionary(dic_list, tolower = TRUE)
dic_created
## Dictionary object with 4 key entries.
## - [economy]:
##   - tax*, invest*, trade, fiscal policy
## - [war]:
##   - army, troops, fight*, military
## - [diplomacy]:
##   - nato, un, international relations
## - [government]:
##   - london, commons, downing street, westminster

If you compare dic_list and the resulting dic_created object with data_dictionary_LaverGarry, you will see that they have a similar structure, defining categories and their associated terms. To apply our own dictionary to the DFM, we use the same dfm_lookup() function:

# dfm() lowercases features by default, so they line up with the lowercased
# dictionary; dfm_lookup() also matches case-insensitively by default

dictionary_dfm <- dfm_lookup(data_dfm, dic_created)
dictionary_dfm
## Document-feature matrix of: 101 documents, 4 features (16.58% sparse) and 6 docvars.
##                      features
## docs                  economy war diplomacy government
##   UK_natl_1945_en_Con      24   4         0          0
##   UK_natl_1945_en_Lab      13   5         0          0
##   UK_natl_1945_en_Lib      12   5         0          1
##   UK_natl_1950_en_Con      31   4         0          4
##   UK_natl_1950_en_Lab      11   1         0          0
##   UK_natl_1950_en_Lib      20   3         0          1
## [ reached max_ndoc ... 95 more documents ]

Note that dfm_lookup() matches individual features (words) in the DFM to the dictionary entries. This means that it will correctly assign "tax", "taxes", and "taxation" to the "economy" category if "tax*" is in the dictionary. However, if our dictionary contains multi-word expressions (like "fiscal policy" or "international relations" in our example dic_list), dfm_lookup() will not find them, because each feature in the DFM is a single token and the information about word order is lost.
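This limitation can be seen on a toy example. The sketch below uses a made-up sentence and a hypothetical two-category dictionary (toy_txt, toy_toks, and toy_dict are illustrative names) and contrasts dfm_lookup() with tokens_lookup(), which operates on the tokens object and therefore still has access to word order.

```r
library(quanteda)

# One made-up document containing two multi-word expressions
toy_txt <- c(doc1 = "Fiscal policy and tax reform shape international relations.")
toy_toks <- tokens(toy_txt, remove_punct = TRUE)

# Hypothetical dictionary mixing single-word and multi-word entries
toy_dict <- dictionary(list(
  economy   = c("tax*", "fiscal policy"),
  diplomacy = c("international relations")
))

# Lookup on the DFM: only the single-token entry "tax*" can match
from_dfm <- dfm_lookup(dfm(toy_toks), toy_dict)

# Lookup on the tokens: the multi-word entries can match as well
from_tokens <- dfm(tokens_lookup(toy_toks, toy_dict))

as.matrix(from_dfm)     # economy = 1, diplomacy = 0
as.matrix(from_tokens)  # economy = 2, diplomacy = 1
```

The only difference between the two results is the stage at which the dictionary is applied; the texts and the dictionary are identical.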

To correctly count multi-word expressions defined in a dictionary, we should apply the dictionary directly to the tokens object, before creating the DFM, using the tokens_lookup() function. tokens_lookup() preserves the order of the tokens and can therefore match multi-word phrases. The output of tokens_lookup() is a tokens object in which the matched tokens are replaced by their dictionary categories. We can then convert the resulting tokens object into a DFM if necessary:

# Use tokens_lookup() to handle multi-word expressions.
# exclusive = TRUE (the default) keeps only the dictionary keys and drops unmatched tokens

dictionary_tokens <- tokens_lookup(data_tokens, dic_created, exclusive = TRUE)

dictionary_tokens_dfm <- dfm(dictionary_tokens)

Comparing dictionary_dfm and dictionary_tokens_dfm should show that the latter also counts the multi-word expressions defined in dic_created. The exclusive argument controls what happens to tokens that match no dictionary entry. With exclusive = TRUE (the default), these tokens are dropped, so the resulting DFM contains only the dictionary categories. With exclusive = FALSE, they are kept in place alongside the category keys (which are then capitalised by default to set them apart), so the resulting DFM would also contain every unmatched word as a feature. Furthermore, while we can view the resulting DFM by calling it in the console or viewing it in the environment, we can also convert it into a regular data frame for easier manipulation and visualisation. For this, we can use the convert() function included in quanteda:

dictionary_df <- convert(dictionary_tokens_dfm, to = "data.frame")

You can then use this data frame to normalise these raw counts and compare dictionary results across documents of different lengths by dividing the category counts by either the total number of tokens or the total number of dictionary words in each document.
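As a minimal sketch of both normalisation options, the example below again uses made-up texts and a hypothetical dictionary; all object and column names (toy_txt, toy_df, n_tokens, economy_per_token, and so on) are illustrative assumptions. The first option divides by the total number of tokens obtained with ntoken(), and the second uses dfm_weight() with scheme = "prop" to express each category as a share of all dictionary matches in the document.

```r
library(quanteda)

# Two made-up documents of different lengths
toy_txt <- c(
  short = "Tax cuts and the army.",
  long  = "Tax policy, trade policy and investment policy matter more than the army does."
)
toy_toks <- tokens(toy_txt, remove_punct = TRUE)

# Hypothetical dictionary
toy_dict <- dictionary(list(
  economy = c("tax*", "invest*", "trade"),
  war     = c("army", "troops")
))

# Raw category counts per document
toy_counts <- dfm(tokens_lookup(toy_toks, toy_dict))
toy_df <- convert(toy_counts, to = "data.frame")

# Option 1: category count as a share of all tokens in the document
toy_df$n_tokens <- unname(ntoken(toy_toks))
toy_df$economy_per_token <- toy_df$economy / toy_df$n_tokens
toy_df$war_per_token <- toy_df$war / toy_df$n_tokens

# Option 2: category count as a share of all dictionary matches in the document
toy_prop <- dfm_weight(toy_counts, scheme = "prop")

toy_df
as.matrix(toy_prop)
```

Which denominator is appropriate depends on the question: the share of all tokens reflects how much of a document is devoted to a category, while the share of dictionary matches reflects the balance between the categories.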

References

Haselmayer, M., & Jenny, M. (2017). Sentiment analysis of political communication: Combining a dictionary approach with crowdcoding. Quality & Quantity, 51(6), 2623–2646. https://doi.org/10.1007/s11135-016-0412-4
Laver, M., & Garry, J. (2000). Estimating policy positions from political texts. American Journal of Political Science, 44(3), 619–634. https://doi.org/10.2307/2669268
Lind, F., Eberl, J.-M., Heidenreich, T., & Boomgaarden, H. G. (2019). When the journey is as important as the goal: A roadmap to multilingual dictionary construction. International Journal of Communication, 13, 4000–4020.
Young, L., & Soroka, S. (2012). Lexicoder sentiment dictionary. http://www.snsoroka.com/data-lexicoder/