4.4 Text Statistics

Apart from graphics, we can also calculate a wide range of statistics about our texts, and the quanteda.textstats package offers a range of functions for this. Most of these calculations use the pre-processed DFM (data_dfm_trimmed) or tokens (data_tokens_stemmed), while a few work on the original corpus. We will go through the various options in turn.

4.4.1 Summary

textstat_summary() provides basic summary statistics for each document in the corpus, such as the number of characters, tokens, types, sentences, and paragraphs. Note that this command works on the original corpus, and not on the cleaned DFM:

corpus_summary <- textstat_summary(data_corpus_ukmanifestos)
head(corpus_summary)
##               document  chars sents tokens types puncts numbers symbols urls
## 1  UK_natl_1997_en_Con 131975  1188  23398  3174   2256     346       0    0
## 2  UK_natl_1997_en_Lab 111444   822  19372  3007   1787     114      15    0
## 3   UK_natl_1997_en_LD  90883   852  15988  2472   1766     104      38    0
## 4  UK_natl_1997_en_PCy 103411   765  17892  2969   1743      72       2    0
## 5   UK_natl_1997_en_SF  37998   254   6540  1629    620      55       8    0
## 6 UK_natl_1997_en_UKIP  72973   488  13103  2631   1195      61       8    0
##   tags emojis
## 1    0      0
## 2    0      0
## 3    0      0
## 4    0      0
## 5    0      0
## 6    0      0

4.4.2 Frequencies

textstat_frequency() provides detailed frequency counts for the features in a DFM, including term frequency (total occurrences, as we saw earlier) and document frequency (the number of documents in which the term appears). It can also group frequencies by a document variable, which allows us to compare term usage across different categories of documents.

feature_frequencies <- textstat_frequency(data_dfm_trimmed)
head(feature_frequencies, 10)
##    feature frequency rank docfreq group
## 1   govern       657    1      20   all
## 2     work       645    2      20   all
## 3     need       613    3      20   all
## 4  support       596    4      20   all
## 5  develop       589    5      19   all
## 6  increas       475    6      20   all
## 7  britain       465    7      16   all
## 8   provid       462    8      20   all
## 9    ensur       461    9      20   all
## 10  school       449   10      18   all
party_frequencies <- textstat_frequency(data_dfm_trimmed, groups = docvars(data_dfm_trimmed, "Party"))  # Group by Party
head(party_frequencies, 10)
##    feature frequency rank docfreq group
## 1  britain       141    1       3   Con
## 2   govern       124    2       3   Con
## 3   school        97    3       3   Con
## 4     give        87    4       3   Con
## 5     work        82    5       3   Con
## 6    peopl        79    6       3   Con
## 7      can        74    7       3   Con
## 8  continu        73    8       3   Con
## 9   labour        73    8       3   Con
## 10 pension        72   10       3   Con
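
To make the difference between the frequency and docfreq columns concrete, the following is a minimal sketch in base R. The counts in toy_counts are made up for illustration and do not come from the manifestos; the sketch only mirrors the idea behind the two columns and is not how textstat_frequency() is implemented.

```r
# Toy document-feature counts (made up for illustration; not from the corpus)
toy_counts <- matrix(
  c(3, 0, 1,
    2, 0, 0,
    0, 4, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("tax", "school", "work"))
)

# Term frequency: total number of occurrences of each feature
term_frequency <- colSums(toy_counts)

# Document frequency: number of documents in which each feature occurs
document_frequency <- colSums(toy_counts > 0)

toy_frequencies <- data.frame(
  feature = colnames(toy_counts),
  frequency = unname(term_frequency),
  docfreq = unname(document_frequency)
)

# Sort from most to least frequent
toy_frequencies <- toy_frequencies[order(-toy_frequencies$frequency), ]
toy_frequencies
```

In this toy example "school" occurs four times but in only one document, whereas "work" occurs only twice but in two documents, so the two columns can rank features quite differently.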

4.4.3 Lexical diversity

Lexical diversity measures the variety of vocabulary in a text. textstat_lexdiv() calculates various measures like the Type-Token Ratio (TTR), which is the number of types (unique tokens) divided by the total number of tokens. A higher TTR generally indicates greater lexical diversity. This function operates on a tokens object:

corpus_lexdiv <- textstat_lexdiv(data_tokens_stemmed, measure = "TTR")
corpus_lexdiv
##                document       TTR
## 1   UK_natl_1997_en_Con 0.2733320
## 2   UK_natl_1997_en_Lab 0.3136331
## 3    UK_natl_1997_en_LD 0.3274956
## 4   UK_natl_1997_en_PCy 0.3124604
## 5    UK_natl_1997_en_SF 0.4374573
## 6  UK_natl_1997_en_UKIP 0.3505444
## 7   UK_natl_2001_en_Con 0.3433852
## 8   UK_natl_2001_en_Lab 0.2420945
## 9    UK_natl_2001_en_LD 0.2798416
## 10  UK_natl_2001_en_PCy 0.4446326
## 11   UK_natl_2001_en_SF 0.4259938
## 12  UK_natl_2001_en_SNP 0.3553472
## 13  UK_natl_2005_en_Con 0.4268849
## 14  UK_natl_2005_en_Lab 0.2891995
## 15   UK_natl_2005_en_LD 0.3209892
## 16  UK_natl_2005_en_PCy 0.4569697
## 17   UK_natl_2005_en_SF 0.2959165
## 18  UK_natl_2005_en_SNP 0.5817634
## 19 UK_natl_2005_en_UKIP 0.3882198
## 20  UK_regl_2003_en_PCy 0.2551293
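
As a small worked example of the definition above, the TTR can be computed by hand in base R. The vector toy_tokens is a made-up example and not taken from the corpus:

```r
# A made-up sequence of (already stemmed) tokens
toy_tokens <- c("tax", "cut", "tax", "school", "cut", "tax", "work", "school")

n_tokens <- length(toy_tokens)         # total number of tokens
n_types <- length(unique(toy_tokens))  # number of distinct tokens (types)

# Type-Token Ratio: types divided by tokens
ttr <- n_types / n_tokens
ttr
```

With 4 types among 8 tokens the TTR is 0.5. Keep in mind that the TTR tends to fall as texts get longer, since words are repeated more often, so comparisons between documents of very different lengths should be made with care.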

4.4.4 Readability

Readability statistics estimate the difficulty of understanding a text based on characteristics like sentence length and the number of syllables per word. textstat_readability() calculates various standard scores (e.g., Flesch-Kincaid, Gunning Fog). This function works directly on a corpus object or a character vector:

corpus_readability <- textstat_readability(data_corpus_ukmanifestos, measure = "Flesch.Kincaid")
head(corpus_readability)
##               document Flesch.Kincaid
## 1  UK_natl_1997_en_Con       10.79018
## 2  UK_natl_1997_en_Lab       12.70561
## 3   UK_natl_1997_en_LD       11.10189
## 4  UK_natl_1997_en_PCy       13.48201
## 5   UK_natl_1997_en_SF       14.27023
## 6 UK_natl_1997_en_UKIP       13.82250
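
To see how sentence length and syllables enter such a score, here is a minimal sketch of the standard Flesch-Kincaid grade level formula in base R. The word, sentence, and syllable counts are made up for illustration, and the helper flesch_kincaid_grade() is our own. textstat_readability() estimates these counts from the text itself, so results from hand counts may differ from its output.

```r
# Flesch-Kincaid grade level from raw counts (standard published formula)
flesch_kincaid_grade <- function(n_words, n_sentences, n_syllables) {
  0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
}

# Made-up counts: 100 words in 5 sentences, with 150 syllables in total
toy_grade <- flesch_kincaid_grade(n_words = 100, n_sentences = 5, n_syllables = 150)
toy_grade
```

Longer sentences and longer words both push the score up. The result is read as an approximate US school grade level, so the manifestos above, with scores of roughly 11 to 14, sit at about upper secondary school level or beyond.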

4.4.5 Similarity and Distance

These functions calculate the similarity or distance between documents or features based on their representation in a DFM. They help quantify how alike or different texts or words are based on their shared vocabulary and term frequencies. Common measures include cosine similarity and Euclidean distance:

# method = 'cosine': use the cosine similarity measure
# margin = 'documents': calculate the similarity between documents (rows of the DFM)

corpus_similarities <- textstat_simil(data_dfm_trimmed, method = "cosine", margin = "documents")

corpus_similarities_matrix <- as.matrix(corpus_similarities)
corpus_similarities_matrix[1:5, 1:5]
##                     UK_natl_1997_en_Con UK_natl_1997_en_Lab UK_natl_1997_en_LD
## UK_natl_1997_en_Con           1.0000000           0.7569519          0.7502951
## UK_natl_1997_en_Lab           0.7569519           1.0000000          0.7528290
## UK_natl_1997_en_LD            0.7502951           0.7528290          1.0000000
## UK_natl_1997_en_PCy           0.6093673           0.6155356          0.6148482
## UK_natl_1997_en_SF            0.4817308           0.4968049          0.4681049
##                     UK_natl_1997_en_PCy UK_natl_1997_en_SF
## UK_natl_1997_en_Con           0.6093673          0.4817308
## UK_natl_1997_en_Lab           0.6155356          0.4968049
## UK_natl_1997_en_LD            0.6148482          0.4681049
## UK_natl_1997_en_PCy           1.0000000          0.4771206
## UK_natl_1997_en_SF            0.4771206          1.0000000
# method = 'euclidean': use the Euclidean distance measure
# margin = 'documents': calculate the distance between documents (rows of the DFM)

corpus_distances <- textstat_dist(data_dfm_trimmed, margin = "documents", method = "euclidean")

# Convert the distance object to a matrix for inspection of pairwise distances

corpus_distances_matrix <- as.matrix(corpus_distances)
corpus_distances_matrix[1:5, 1:5]
##                     UK_natl_1997_en_Con UK_natl_1997_en_Lab UK_natl_1997_en_LD
## UK_natl_1997_en_Con              0.0000            230.1652           230.7423
## UK_natl_1997_en_Lab            230.1652              0.0000           197.8080
## UK_natl_1997_en_LD             230.7423            197.8080             0.0000
## UK_natl_1997_en_PCy            290.9605            262.1793           249.6117
## UK_natl_1997_en_SF             309.0372            257.2198           229.1244
##                     UK_natl_1997_en_PCy UK_natl_1997_en_SF
## UK_natl_1997_en_Con            290.9605           309.0372
## UK_natl_1997_en_Lab            262.1793           257.2198
## UK_natl_1997_en_LD             249.6117           229.1244
## UK_natl_1997_en_PCy              0.0000           266.7133
## UK_natl_1997_en_SF             266.7133             0.0000
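
The two measures behave quite differently with respect to document length. The following sketch in base R illustrates this with three made-up count vectors. The helper functions cosine_similarity() and euclidean_distance() are our own, written from the textbook definitions, and are not part of quanteda:

```r
# Made-up feature counts for three documents
doc_a <- c(tax = 3, school = 0, work = 4)
doc_b <- c(tax = 6, school = 0, work = 8)  # same proportions as doc_a, twice as long
doc_c <- c(tax = 0, school = 5, work = 0)  # shares no features with doc_a

# Cosine similarity: dot product divided by the product of the vector lengths
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Euclidean distance: square root of the summed squared differences
euclidean_distance <- function(x, y) {
  sqrt(sum((x - y)^2))
}

cosine_similarity(doc_a, doc_b)   # 1: identical proportions
cosine_similarity(doc_a, doc_c)   # 0: no shared features
euclidean_distance(doc_a, doc_b)  # 5: non-zero because doc_b is longer
```

Cosine similarity only looks at the relative use of features and therefore ignores document length, whereas the Euclidean distance on raw counts grows with the difference in length. This is worth remembering when interpreting distances between manifestos of very different sizes.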

If we want, we can also visualise the distances between the documents as a dendrogram by clustering the distance object:

plot(hclust(as.dist(corpus_distances)))

The results here are quite interesting. For one, while the 1997 and 2003 Plaid Cymru documents are very similar to each other, they are clustered together with the 2001 Labour Party document, which in turn is quite far from its 1997 and 2005 counterparts.

4.4.6 Keyness

Keyness statistics identify terms that are unusually frequent or infrequent in a target group of documents compared to a reference group. This is useful for identifying characteristic terms within a corpus subset and for understanding what distinguishes one set of texts from another. Common measures are the chi-squared statistic, which we use here, and the log-likelihood ratio:

# Create a logical vector that is TRUE for documents from the Conservative party
# and FALSE for all others. This vector defines the 'target' group.

docvars(data_dfm_trimmed, "is_conservative") <- docvars(data_dfm_trimmed, "Party") == "Con"

# Compute keyness statistics comparing the Conservative manifestos (target) to
# all other manifestos (reference).
# target = docvars(data_dfm_trimmed, "is_conservative"): specifies the target
# group using the logical vector

keyness_conservative <- textstat_keyness(data_dfm_trimmed, target = docvars(data_dfm_trimmed, "is_conservative"))

# View the most distinctive terms for the Conservative party (highest keyness
# scores) and the least distinctive terms (lowest keyness scores, which are
# characteristic of the reference group)

head(keyness_conservative, 20)
##                     feature      chi2            p n_target n_reference
## 1       conservative_govern 144.44686 0.000000e+00       38          14
## 2                   britain  94.85417 0.000000e+00      141         324
## 3  next_conservative_govern  93.74717 0.000000e+00       17           0
## 4                    famili  60.39675 7.771561e-15       69         137
## 5                     choic  54.78260 1.346701e-13       59         113
## 6                         n  52.41623 4.489742e-13       13           3
## 7                      keep  47.73544 4.877987e-12       47          85
## 8                   conserv  46.54836 8.938517e-12       61         131
## 9                    fallen  46.16172 1.088840e-11       19          14
## 10         time_common_sens  45.96519 1.203726e-11       10           1
## 11                    pound  45.15771 1.817879e-11       23          24
## 12                     give  44.16372 3.020273e-11       87         228
## 13                   contin  36.10094 1.873561e-09       10           3
## 14               foot_mouth  36.10094 1.873561e-09       10           3
## 15                lower_tax  35.07639 3.170212e-09        9           2
## 16                    widow  35.07639 3.170212e-09        9           2
## 17                  opt-out  34.50407 4.253600e-09        8           1
## 18                        w  34.50407 4.253600e-09        8           1
## 19               politician  30.71197 2.993130e-08       19          24
## 20                  continu  30.55484 3.245615e-08       73         207
tail(keyness_conservative, 20)
##               feature      chi2            p n_target n_reference
## 6238          poverti -12.58534 3.887848e-04        3         109
## 6239              snp -12.69011 3.675955e-04        0          75
## 6240       agricultur -13.31662 2.630645e-04        1          91
## 6241      all-ireland -13.87526 1.953533e-04        0          82
## 6242           access -13.90699 1.920823e-04       10         186
## 6243         strategi -14.88492 1.142728e-04        7         164
## 6244          resourc -15.33557 9.000581e-05       12         214
## 6245      environment -17.22185 3.325894e-05        4         148
## 6246             ukip -17.77023 2.492533e-05        0         105
## 6247           provis -18.75683 1.484903e-05        5         168
## 6248      plaid_cymru -18.78653 1.461959e-05        0         111
## 6249 liberal_democrat -19.72714 8.932357e-06        1         129
## 6250          develop -20.11047 7.309549e-06       47         542
## 6251               eu -20.27349 6.712432e-06       12         247
## 6252           promot -23.02328 1.600513e-06       18         321
## 6253            equal -23.45134 1.281138e-06        2         163
## 6254        sinn_féin -26.24228 3.011560e-07        0         155
## 6255         scotland -27.04384 1.988927e-07        3         196
## 6256               uk -28.18851 1.100557e-07       10         278
## 6257             wale -57.12540 4.085621e-14        2         362

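The output of textstat_keyness() includes the feature, the keyness score (here chi2), the p-value, and the number of times the feature occurs in the target and the reference group (n_target and n_reference). Positive keyness scores indicate terms that are unusually frequent in the target group. In contrast, negative scores indicate terms that are unusually infrequent in the target group (and thus characteristic of the reference group). Here, the terms most characteristic of the Conservative manifestos include “conservative_govern”, “britain”, the family (“famili”), and choice (“choic”), while terms such as “wale”, “uk”, and “scotland” are characteristic of the other parties. We can of course also visualise this: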

textplot_keyness(keyness_conservative)
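
To illustrate the logic behind a signed chi-squared keyness score, the sketch below builds a single 2x2 table in base R and tests it with chisq.test(). The counts are made up and not taken from the manifestos, and the sketch is only meant to convey the idea. We do not claim that it reproduces the exact computation inside textstat_keyness().

```r
# Made-up counts for one term: how often it occurs in the target and the
# reference group, and how often all other terms occur there
toy_table <- matrix(
  c(30,  970,   # target:    the term, all other terms
    20, 2980),  # reference: the term, all other terms
  nrow = 2, byrow = TRUE,
  dimnames = list(c("target", "reference"), c("term", "other_terms"))
)

# Chi-squared test of independence (for 2x2 tables, chisq.test() applies
# Yates' continuity correction by default)
toy_test <- chisq.test(toy_table)

# Count we would expect in the target group if the term were used equally often
expected_in_target <- toy_test$expected["target", "term"]

# Attach a sign: positive if the term is over-represented in the target group
signed_chi2 <- unname(toy_test$statistic) *
  sign(toy_table["target", "term"] - expected_in_target)

signed_chi2
toy_test$p.value
```

Here the term occurs 30 times in the target group where only 12.5 occurrences would be expected, so the score is positive. Swapping the two rows would give the same magnitude with a negative sign.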

4.4.7 Entropy

Entropy measures the randomness or evenness of feature distributions. Here, we can use it to assess the diversity of terms within documents (document entropy) or the evenness of a term’s distribution across the corpus (feature entropy). High document entropy means a document uses a wide variety of terms relatively evenly, while low entropy means a few terms dominate. High feature entropy means a term is spread relatively evenly across documents, while low entropy means it’s concentrated in a few documents.

# margin = 'documents': Calculate entropy for each document (rows of the DFM)

corpus_entropy_docs <- textstat_entropy(data_dfm_trimmed, margin = "documents")
corpus_entropy_docs <- as.data.frame(corpus_entropy_docs)
head(corpus_entropy_docs)
##               document   entropy
## 1  UK_natl_1997_en_Con 10.332738
## 2  UK_natl_1997_en_Lab 10.383004
## 3   UK_natl_1997_en_LD 10.186036
## 4  UK_natl_1997_en_PCy 10.233130
## 5   UK_natl_1997_en_SF  9.619364
## 6 UK_natl_1997_en_UKIP 10.068977
# margin = 'features': Calculate entropy for each feature (columns of the DFM)

corpus_entropy_feats <- textstat_entropy(data_dfm_trimmed, margin = "features")
corpus_entropy_feats <- as.data.frame(corpus_entropy_feats)
corpus_entropy_feats <- corpus_entropy_feats[order(-corpus_entropy_feats$entropy),
    ]
head(corpus_entropy_feats, 10)
##      feature  entropy
## 4      elect 4.140895
## 16   economi 4.129012
## 86      mean 4.122883
## 380    parti 4.114558
## 1254    forc 4.110995
## 479   polici 4.110578
## 130  increas 4.107211
## 582     mani 4.101916
## 782  greater 4.101321
## 688  societi 4.097388
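
The intuition behind high and low entropy can be checked with a small sketch in base R. The helper shannon_entropy() is our own and implements the textbook Shannon entropy. We assume base-2 logarithms, which is consistent with the values above, and the count vectors are made up:

```r
# Shannon entropy of a vector of counts (zero counts contribute nothing)
shannon_entropy <- function(counts, base = 2) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p, base = base))
}

# A made-up term spread across four documents in three different ways
even_spread <- shannon_entropy(c(5, 5, 5, 5))     # perfectly even
skewed_spread <- shannon_entropy(c(17, 1, 1, 1))  # mostly in one document
single_doc <- shannon_entropy(c(20, 0, 0, 0))     # only in one document

c(even = even_spread, skewed = skewed_spread, single = single_doc)
```

The perfectly even distribution reaches the maximum of log2(4) = 2 bits, the skewed one is lower, and a term that occurs in a single document has an entropy of 0. By the same logic, with 20 documents the feature entropy cannot exceed log2(20), or about 4.32, which is why the top features above all lie just below that value.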