6.4 Text Statistics
Finally, quanteda also allows us to calculate a range of textual statistics. These are all collected in the quanteda.textstats helper package. Here, we will look at several of them, starting with a simple overview of our corpus in the form of a summary. This tells us the number of characters, sentences, tokens, etc. for each of the texts:
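The code producing this summary is not shown above; a minimal sketch, assuming the corpus is stored in an object called data_corpus (as in the code further below) and using textstat_summary() from quanteda.textstats, could look like this:

library(quanteda)
library(quanteda.textstats)
library(ggplot2)

# Summarise the corpus: one row per document with counts of characters,
# sentences, tokens, types, punctuation marks, and so on
corpus_summary <- textstat_summary(data_corpus)
head(corpus_summary)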
If we want, we can then use this data to make some simple graphs telling us various things about the texts in our corpus. As an example, let’s look at the number of sentences in the various paragraphs:
ggplot(data = corpus_summary, aes(x = document, y = sents, group = 1)) +
  geom_line() +
  geom_point() +
  ylab("Number of Sentences") +
  xlab("Paragraph") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90))
Other things we can look at are the readability and lexical diversity of the texts. The former refers to how readable a text is (i.e. how easy or difficult it is to read), while the latter tells us how many different types of words a text contains and thus how diverse it is in its word choice and use. Given that there are many ways to calculate both metrics, have a look at the help files to see which one works best for you. Here, we use two popular measures: the Flesch-Kincaid grade level for readability and the corrected type-token ratio (CTTR) for lexical diversity:
corpus_readability <- textstat_readability(data_corpus, measure = "Flesch.Kincaid")
corpus_lexdiv <- textstat_lexdiv(data_tokens, measure = "CTTR")
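Both functions return a data frame with one row per document. The readability scores are not plotted below, but we can inspect them directly; for example:

# Peek at the Flesch-Kincaid grade level per paragraph
head(corpus_readability)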
As before, we can plot this data in a graph to see how lexical diversity developed over the course of the article:
ggplot(data=corpus_lexdiv, aes(x=document, y=CTTR, group=1)) +
geom_line()+
geom_point()+
ylab("Lexical Diversity (CTTR)")+
xlab("Paragraph")+
theme_classic()+
theme(axis.text.x = element_text(angle = 90))
## Warning: Removed 1 row containing missing values or values outside the scale range
## (`geom_point()`).
Another thing we can do is look at the similarities and distances between documents. With this, we can answer questions such as: how different are these documents from each other, and if they are different (or similar), how different (or similar) are they? The idea is that the larger the similarity between two texts, the smaller the distance between them. A good way to understand similarity is to consider how many operations you would need to perform to change one text into the other: the more "replace" operations you have to carry out, the more different the texts are. As for the distances, it is best to think of the texts as points on a Cartesian plane (with their positions determined by their word counts). The distance between two points (Euclidean, Manhattan, or other) is then the distance between the texts.
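To make the distance idea more concrete, here is a small toy example (not part of our corpus) with two documents represented as word-count vectors:

# Two toy documents as word-count vectors over the same three words
doc_a <- c(the = 3, war = 2, peace = 0)
doc_b <- c(the = 1, war = 0, peace = 2)

# Euclidean distance: straight-line distance between the two count vectors
sqrt(sum((doc_a - doc_b)^2))

# Manhattan distance: sum of the absolute differences per word
sum(abs(doc_a - doc_b))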
Let’s start with a look at these similarities (note again that there are many different methods to calculate this):
corpus_similarities <- textstat_simil(data_dfm, method = "correlation", margin = "documents")
corpus_similarities <- as.data.frame(corpus_similarities)
Note that while we look here at the documents, we could also look at individual words (by setting margin = "features"). For now, let us look at the distances between the documents, choosing the Euclidean distance as our metric:
corpus_distances <- textstat_dist(data_dfm, margin = "documents", method = "euclidean")
corpus_distances_df <- as.data.frame(corpus_distances)
If we want to, we can even convert this data into a dendrogram. We do this by taking the distances out of the corpus_distances object, turning them into a triangular distance matrix, clustering the documents, and plotting the result:
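The code for this step is not shown in the original; a minimal sketch, assuming the corpus_distances object created above, could look like this:

# Coerce the distance object into a (lower-triangular) dist structure,
# cluster the documents hierarchically, and plot the resulting dendrogram
corpus_distances_matrix <- as.dist(as.matrix(corpus_distances))
corpus_clustering <- hclust(corpus_distances_matrix)
plot(corpus_clustering)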
Finally, let us look at the entropy of our texts. The entropy of a document measures the 'amount' of information its elements carry. To get an idea of what this means, consider that 'e' is a very common letter in English text, while 'z' is not. A word containing a 'z' is therefore more unusual and thus more likely to carry unique and interesting information. In the same way, the entropy of a document tells us how evenly its word frequencies are distributed, and thus how (un)predictable its word use is:
corpus_entropy_docs <- textstat_entropy(data_dfm, "documents")
corpus_entropy_docs <- as.data.frame(corpus_entropy_docs)
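For intuition, the values reported here are Shannon entropies (base 2 by default in textstat_entropy) of the feature counts in each document. A small sketch of the underlying calculation, done by hand for the first document of our dfm, should correspond to the first value returned above:

# Shannon entropy (base 2) of a vector of counts: H = -sum(p * log2(p)),
# where p are the relative frequencies of the features
shannon_entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

shannon_entropy(as.numeric(as.matrix(data_dfm[1, ])))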
While not as common as the similarity and distance metrics above, entropy is sometimes also used to compare texts. More useful for us, it can tell us something about the importance of individual words: a word that is spread evenly across all documents does little to distinguish them, and we could therefore consider treating it as a stop word:
corpus_entropy_feats <- textstat_entropy(data_dfm, "features")
corpus_entropy_feats <- as.data.frame(corpus_entropy_feats)
corpus_entropy_feats <- corpus_entropy_feats[order(-corpus_entropy_feats$entropy),]
head(corpus_entropy_feats, 10)
## feature entropy
## 8 soviet 6.620700
## 2 war 6.443909
## 9 union 6.081728
## 7 states 5.923134
## 6 united 5.823699
## 57 military 5.612866
## 1 cold 5.573738
## 107 communist 5.374768
## 222 us 5.293932
## 12 western 5.289149
Looking at the data, we find that 'soviet', 'war' and 'union' have pretty high entropies. This indicates that these words occur throughout the documents and do little to distinguish them, which makes them candidates for removal from our corpus.
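If we decided to act on this, a minimal (hypothetical) follow-up could remove, say, the ten highest-entropy words from the dfm using dfm_remove():

# Hypothetical follow-up: drop the ten highest-entropy words from the dfm
high_entropy_words <- as.character(head(corpus_entropy_feats$feature, 10))
data_dfm_trimmed <- dfm_remove(data_dfm, pattern = high_entropy_words)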