5.1 Classical Dictionary Analysis
We start with the classical version of dictionary analysis. We can either create the dictionaries ourselves or use an off-the-shelf version. As with the data, quanteda provides access to several off-the-shelf dictionaries relevant to the social sciences.
We then apply one of these dictionaries to a document-feature matrix (DFM), which is typically created from our corpus after appropriate pre-processing steps, such as tokenisation, lowercasing, and stopword removal, as we discussed in the previous chapter. As a dictionary, we will use the one created by Laver & Garry (2000), which is designed to estimate political positions from political texts. We first load this dictionary into R and then run it on the DFM using the dfm_lookup() function. Here, you can use the DFM we created in the previous chapter, though, for this example, we make it from scratch (without any pre-processing beyond the lowercasing that dfm() applies by default):
library(quanteda)
library(quanteda.corpora)
library(quanteda.dictionaries)
data(data_corpus_ukmanifestos)
data_corpus_ukmanifestos
data_tokens <- tokens(data_corpus_ukmanifestos)
data_dfm <- dfm(data_tokens)
First, let us have a look at the dictionary:
data_dictionary_LaverGarry
## Dictionary object with 9 primary key entries and 2 nested levels.
## - [CULTURE]:
## - people, war_in_iraq, civil_war
## - [CULTURE-HIGH]:
## - art, artistic, dance, galler*, museum*, music*, opera*, theatre*
## - [CULTURE-POPULAR]:
## - media
## - [SPORT]:
## - angler*
## - [ECONOMY]:
## - [+STATE+]:
## - accommodation, age, ambulance, assist, benefit, care, carer*, child*, class, classes, clinics, collective*, contribution*, cooperative*, co-operative*, deprivation, disabilities, disadvantaged, educat*, elderly [ ... and 30 more ]
## - [=STATE=]:
## - accountant, accounting, accounts, advert*, airline*, airport*, audit*, bank*, bargaining, breadwinner*, budget*, buy*, cartel*, cash*, charge*, commerce*, compensat*, consum*, cost*, credit* [ ... and 51 more ]
## - [-STATE-]:
## - assets, autonomy, barrier*, bid, bidders, bidding, burden*, charit*, choice*, compet*, confidence, confiscatory, constrain*, contracting*, contractor*, controlled, controlling, controls, corporate, corporation* [ ... and 42 more ]
## - [ENVIRONMENT]:
## - [CON ENVIRONMENT]:
## - produc*
## - [PRO ENVIRONMENT]:
## - car, catalytic, chemical*, chimney*, clean*, congestion, cyclist*, deplet*, ecolog*, emission*, energy-saving, environment*, fur, green, habitat*, hedgerow*, husbanded, litter*, opencast, open-cast* [ ... and 8 more ]
## - [GROUPS]:
## - [ETHNIC]:
## - asian*, buddhist*, ethnic*, race, raci*
## - [WOMEN]:
## - girls, woman, women
## - [INSTITUTIONS]:
## - [CONSERVATIVE]:
## - authority, continu*, disrupt*, inspect*, jurisdiction*, legitimate, manag*, moratorium, rul*, strike*, whitehall
## - [NEUTRAL]:
## - administr*, advis*, agenc*, amalgamat*, appoint*, assembly, chair*, commission*, committee*, constituen*, council*, department*, directorate*, executive*, headquarters, legislat*, mechanism*, minister*, office, offices [ ... and 18 more ]
## - [RADICAL]:
## - abolition, accountable, answerable, consult*, corrupt*, democratic*, elect*, implement*, modern*, monitor*, rebuild*, reexamine*, reform*, re-organi*, repeal*, replace*, representat*, scandal*, scrap, scrap* [ ... and 3 more ]
## - [LAW_AND_ORDER]:
## - [LAW-CONSERVATIVE]:
## - assaults, bail, burglar*, constab*, convict*, court, courts, custod*, dealing, delinquen*, deter, deter*, disorder, drug*, fine, fines, firmness, force*, fraud*, guard* [ ... and 32 more ]
## - [LAW-LIBERAL]:
## - harassment, non-custodial
## [ reached max_nkey ... 3 more keys ]
Then, we apply it to the DFM:
dfm_lookup(data_dfm, data_dictionary_LaverGarry)
## Document-feature matrix of: 101 documents, 20 features (17.23% sparse) and 6 docvars.
## features
## docs CULTURE.CULTURE-HIGH CULTURE.CULTURE-POPULAR
## UK_natl_1945_en_Con 5 0
## UK_natl_1945_en_Lab 3 0
## UK_natl_1945_en_Lib 5 0
## UK_natl_1950_en_Con 2 0
## UK_natl_1950_en_Lab 1 0
## UK_natl_1950_en_Lib 2 0
## features
## docs CULTURE.SPORT CULTURE ECONOMY.+STATE+ ECONOMY.=STATE=
## UK_natl_1945_en_Con 0 14 66 119
## UK_natl_1945_en_Lab 0 24 61 150
## UK_natl_1945_en_Lib 0 8 43 106
## UK_natl_1950_en_Con 0 11 74 217
## UK_natl_1950_en_Lab 0 23 56 146
## UK_natl_1950_en_Lib 0 3 44 99
## features
## docs ECONOMY.-STATE- ENVIRONMENT.CON ENVIRONMENT
## UK_natl_1945_en_Con 61 12
## UK_natl_1945_en_Lab 67 18
## UK_natl_1945_en_Lib 42 6
## UK_natl_1950_en_Con 86 22
## UK_natl_1950_en_Lab 77 15
## UK_natl_1950_en_Lib 48 11
## features
## docs ENVIRONMENT.PRO ENVIRONMENT GROUPS.ETHNIC
## UK_natl_1945_en_Con 0 0
## UK_natl_1945_en_Lab 1 0
## UK_natl_1945_en_Lib 4 0
## UK_natl_1950_en_Con 3 0
## UK_natl_1950_en_Lab 2 0
## UK_natl_1950_en_Lib 0 0
## [ reached max_ndoc ... 95 more documents, reached max_nfeat ... 10 more features ]
Here, we see that, for example, the 1945 Conservative Party manifesto contained 5 words related to High Culture and none related to Popular Culture. Overall, the dfm_lookup() function takes a DFM and a dictionary object as input and returns a new DFM in which the features are the categories defined in the dictionary and the values are the aggregated counts of terms belonging to each category within each document.
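To make the aggregation step concrete, the following is a minimal sketch on two invented sentences. The texts, the object names (toy_texts, toy_dict, toy_lookup) and the two-category dictionary are assumptions made for illustration only; they are not part of the manifesto corpus or of the Laver & Garry dictionary.

```r
library(quanteda)

# Two invented mini-documents (not taken from the manifesto corpus)
toy_texts <- c(doc_a = "The museum and the opera need funding.",
               doc_b = "Music on the media, music in the museum.")

# Hypothetical two-category dictionary with glob patterns
toy_dict <- dictionary(list(
  high    = c("museum*", "music*", "opera*"),
  popular = c("media")
))

# Build a DFM (dfm() lowercases by default) and look up the dictionary
toy_dfm    <- dfm(tokens(toy_texts, remove_punct = TRUE))
toy_lookup <- dfm_lookup(toy_dfm, toy_dict)

# The features are now the dictionary keys; the cells are summed matches
toy_matrix <- as.matrix(toy_lookup)
toy_matrix
```

In this sketch, doc_a has two matches for high (museum, opera) and none for popular, while doc_b has three for high (music twice, museum once) and one for popular (media).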
We can create our own dictionaries that better suit our research question or context. For this, we draw on our theoretical framework to develop different categories and their associated words. Another approach is using reference texts or expert knowledge to identify relevant category terms. We can also combine different dictionaries, as illustrated by Young & Soroka (2012), or integrate keywords from manual coding schemes (Lind et al., 2019). In addition, we can use techniques involving expert or crowd-coding assessments to refine dictionaries or determine the words that best fit different categories (Haselmayer & Jenny, 2017).
If we want to create our own dictionary in quanteda, we use the dictionary() function. To do this, we specify the words in a named list. The names of the list are the keys (the categories), and the values are character vectors containing the words or phrases belonging to each category. We can use wildcard characters (such as * for glob matching) to include word variations. We then convert this list into a dictionary object. Here, we choose some words that we think will allow us to identify different political stances or issues:
dic_list <- list(
  economy = c("tax*", "invest*", "trade", "fiscal policy"), # Include a multi-word phrase
  war = c("army", "troops", "fight*", "military"),
  diplomacy = c("nato", "un", "international relations"),
  government = c("london", "commons", "downing street", "westminster")
)
# tolower = TRUE is often recommended unless you have a specific reason not to lowercase
dic_created <- dictionary(dic_list, tolower = TRUE)
dic_created
## Dictionary object with 4 key entries.
## - [economy]:
## - tax*, invest*, trade, fiscal policy
## - [war]:
## - army, troops, fight*, military
## - [diplomacy]:
## - nato, un, international relations
## - [government]:
## - london, commons, downing street, westminster
If you compare dic_list and the resulting dic_created object with data_dictionary_LaverGarry, you will see that they have a similar structure: both define categories and their associated terms. To apply our own dictionary to the dfm, we use the same dfm_lookup() function:
# Ensure the dfm used here is preprocessed with tolower = TRUE if the
# dictionary is lowercased
dictionary_dfm <- dfm_lookup(data_dfm, dic_created)
dictionary_dfm
## Document-feature matrix of: 101 documents, 4 features (16.58% sparse) and 6 docvars.
## features
## docs economy war diplomacy government
## UK_natl_1945_en_Con 24 4 0 0
## UK_natl_1945_en_Lab 13 5 0 0
## UK_natl_1945_en_Lib 12 5 0 1
## UK_natl_1950_en_Con 31 4 0 4
## UK_natl_1950_en_Lab 11 1 0 0
## UK_natl_1950_en_Lib 20 3 0 1
## [ reached max_ndoc ... 95 more documents ]
Note that dfm_lookup() matches individual features (words) in the dfm to the dictionary entries. This means that it will correctly match "tax", "taxes", and "taxation" to the "economy" category if "tax*" is in the dictionary. However, if our dictionary contains multi-word expressions (like "fiscal policy" or "international relations" in our example dic_list), dfm_lookup() will not find them, because each feature in the dfm is a single token and the dfm no longer holds any word order information.
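This limitation can be seen in a small sketch. The sentence and the one-key dictionary below are invented for illustration (toy_text, toy_dict and toy_counts are hypothetical names); the point is only that the glob pattern is matched while the two-word phrase is not.

```r
library(quanteda)

# Invented sentence containing one multi-word phrase and two "tax*" words
toy_text <- c(d1 = "Fiscal policy matters: taxes and taxation fund it.")

# Hypothetical dictionary mixing a glob pattern and a multi-word phrase
toy_dict <- dictionary(list(economy = c("tax*", "fiscal policy")))

# The dfm holds single-token features only
toy_dfm <- dfm(tokens(toy_text, remove_punct = TRUE))

# Only "taxes" and "taxation" are counted; "fiscal policy" is missed
toy_counts <- as.matrix(dfm_lookup(toy_dfm, toy_dict))
toy_counts
```

Under these assumptions the economy count is 2, although a human reader would count three references to the economy.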
To correctly count multi-word expressions defined in a dictionary, we should apply the dictionary directly to the tokens object, before creating the dfm, using the tokens_lookup() function. tokens_lookup() preserves the order of the tokens and can therefore match multi-word phrases. The output of tokens_lookup() is a tokens object in which the original tokens are replaced by their dictionary categories. We can then convert the resulting tokens object into a dfm if necessary:
# Use tokens_lookup() to handle multi-word expressions.
# exclusive = TRUE (the default) keeps only the dictionary keys and drops unmatched tokens.
dictionary_tokens <- tokens_lookup(data_tokens, dic_created, exclusive = TRUE)
dictionary_tokens_dfm <- dfm(dictionary_tokens)
Comparing dictionary_dfm and dictionary_tokens_dfm shows that the latter also counts the multi-word expressions defined in dic_created. The exclusive argument controls what happens to tokens that match no dictionary entry: with exclusive = TRUE (the default), only the dictionary keys are kept, so the resulting dfm contains one feature per category. With exclusive = FALSE, matched tokens are replaced by their keys while all other tokens are left in place, which is useful for inspecting matches in context but yields a dfm that mixes categories with ordinary words. Furthermore, while we can view the resulting dfm by calling it in the console or viewing it in the environment, we can also convert it into a regular data frame for easier manipulation and visualisation. For this, we can use the convert() function included in quanteda.
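As a minimal sketch of this conversion, the example below runs tokens_lookup() on an invented sentence and passes the resulting dfm to convert() with to = "data.frame". The text, the dictionary and the object names (toy_text, toy_dict, toy_df) are assumptions for illustration, and the lookup relies on the default exclusive = TRUE.

```r
library(quanteda)

# Invented sentence and hypothetical dictionary with a multi-word phrase
toy_text <- c(d1 = "Fiscal policy matters: taxes and taxation fund it.")
toy_dict <- dictionary(list(economy = c("tax*", "fiscal policy")))

# Look up on the tokens so that "fiscal policy" can be matched as a phrase
toy_toks <- tokens(toy_text, remove_punct = TRUE)
toy_dfm  <- dfm(tokens_lookup(toy_toks, toy_dict))

# Convert the dfm into an ordinary data frame (one row per document)
toy_df <- convert(toy_dfm, to = "data.frame")
toy_df
```

The data frame has a doc_id column followed by one column per dictionary key; here the economy column holds 3, because the phrase and both tax* words are counted.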
You can then use this data frame to normalise these raw counts and compare dictionary results across documents of different lengths by dividing the category counts by either the total number of tokens or the total number of dictionary words in each document.
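Both normalisations can be sketched as follows. The two documents, the dictionary and all column names (n_tokens, n_dict, economy_per_token, economy_per_dict) are invented for illustration; with the manifesto data, the same steps would presumably be applied to the data frame obtained from convert().

```r
library(quanteda)

# Two invented documents of different lengths
toy_texts <- c(
  short = "Tax cuts and trade.",
  long  = "The army and the troops discussed tax, trade, investment and much more besides."
)

# Hypothetical two-category dictionary
toy_dict <- dictionary(list(
  economy = c("tax*", "invest*", "trade"),
  war     = c("army", "troops")
))

toy_toks  <- tokens(toy_texts, remove_punct = TRUE)
counts_df <- convert(dfm(tokens_lookup(toy_toks, toy_dict)), to = "data.frame")

# Denominator 1: total number of tokens in each document
counts_df$n_tokens <- unname(ntoken(toy_toks))

# Denominator 2: total number of dictionary matches in each document
counts_df$n_dict <- counts_df$economy + counts_df$war

# Normalised scores for the economy category
counts_df$economy_per_token <- counts_df$economy / counts_df$n_tokens
counts_df$economy_per_dict  <- counts_df$economy / counts_df$n_dict
counts_df
```

The two denominators answer different questions: dividing by all tokens gives the share of the text devoted to a category (0.5 for the short document and 3/13 for the long one), while dividing by dictionary matches gives the category's share among the coded words only (1 and 0.6 respectively).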