10.1 Support Vector Machines
To show how SVM works, we will look at an example of SVM in RTextTools and one in quanteda, as well as an example of NB in quanteda.
10.1.1 SVM with RTextTools
For the SVM, we will start with an example using our Twitter data and the RTextTools package. First, we load the Twitter data, clean up the text, and recode the sentiment labels:
library("RTextTools")
library("car")
urlfile <- "https://raw.githubusercontent.com/SCJBruinsma/qta-files/master/Tweets.csv"
tweets <- read.csv(url(urlfile))
tweets$text <- gsub("http.*","", tweets$text)
tweets$text <- gsub("https.*","", tweets$text)
tweets$text <- gsub("\\$", "", tweets$text)
tweets$text <- gsub("@\\w+", "", tweets$text)
tweets$text <- gsub("[[:punct:]]", "", tweets$text)
tweets$text <- gsub("[ |\t]{2,}", "", tweets$text)
tweets$text <- gsub("^ ", "", tweets$text)
tweets$text <- gsub(" $", "", tweets$text)
tweets$text <- gsub("RT", "", tweets$text)
tweets$text <- gsub("href", "", tweets$text)
labels <- tweets$airline_sentiment
labels <- car::recode(labels, "'positive'=1;'negative'=-1;'neutral'=0")
The goal of the supervised learning task is to use part of this dataset to train an algorithm and then use the trained algorithm to assign categories to the remaining tweets. Since we know the coded categories for these remaining tweets, we can evaluate how well the training worked by comparing the predicted codes with the actual ones. We start by creating a document-term matrix:
doc_matrix <- create_matrix(tweets$text,
language = "english",
removeNumbers = TRUE,
stemWords = TRUE,
removeSparseTerms = 0.998)
doc_matrix
## <<DocumentTermMatrix (documents: 14640, terms: 693)>>
## Non-/sparse entries: 84521/10060999
## Sparsity : 99%
## Maximal term length: 18
## Weighting : term frequency (tf)
Note that RTextTools gives you plenty of options for preprocessing. Apart from the options used above, we can also strip whitespace, remove punctuation, and remove stopwords. Stemming and stopword removal are language-specific, so when we select the language in the option above (language = "english"), RTextTools will carry these out according to our language of choice. As of now, the package supports Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, and Swedish.
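For illustration, a version of the matrix with these extra cleaning steps switched on explicitly could look like this (parameter names as in the RTextTools documentation; note that the resulting matrix will differ from the one shown above):

doc_matrix_clean <- create_matrix(tweets$text,
language = "english",
removeNumbers = TRUE,
stemWords = TRUE,
removeStopwords = TRUE,
removePunctuation = TRUE,
stripWhitespace = TRUE,
removeSparseTerms = 0.998)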
We then create a container, parsing the document matrix into a training set and a test set. We will use the training set to train the algorithm and the test set to assess how well this algorithm was trained. The following command instructs R to use the first 10,000 tweets for the training set and the remaining 4,640 tweets for the test set. Moreover, we specify that the labels assigned by the coders should be appended to the document matrix:
container <- create_container(doc_matrix,
labels,
trainSize = 1:10000,
testSize = 10001:14640,
virgin = FALSE)
We can then train a model using one of the available algorithms. For instance, we can use the Support Vector Machines algorithm (SVM) as follows:
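Here we use the train_model() function, storing the result in an object we call svm_model (the name is our own choice; we will need it again for the classification step below):

svm_model <- train_model(container, "SVM")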
Other algorithms are available if you change the SVM option. Options exist for Lasso and Elastic-Net Regularized Generalized Linear Models (GLMNET), maximum entropy (MAXENT), scaled linear discriminant analysis (SLDA), bagging (BAGGING), boosting (BOOSTING), random forests (RF), neural networks (NNET), and classification trees (TREE).
We then use the model we trained to classify the texts in the test set. The following command instructs R to classify the documents in the test set of the container using the SVM model that we trained earlier.
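Assuming the trained model is stored in svm_model as above, the call looks like this, with the results stored in an object we call svm_classify:

svm_classify <- classify_model(container, svm_model)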
We can also view the classification that the SVM model performed. The first column gives the label that the SVM assigned to each of the tweets in the test set, and the second column gives the probability with which the algorithm assigned that label. As you can see, while this probability is quite high for some tweets, it is quite low for others, even though the classification always picks the category with the highest probability.
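To inspect the first few rows of these results (again assuming they are stored in svm_classify):

head(svm_classify)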
The next step is to check the classification performance of our model. To do this, we first create an analytics object that contains several summaries. For instance, we can request summaries based on the labels attached to the tweets, on the documents (here, the tweets) by label, or on the algorithm.
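A minimal way to create and summarise this object, assuming the classification results are in svm_classify, uses the create_analytics() function:

analytics <- create_analytics(container, svm_classify)
summary(analytics)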
## ENSEMBLE SUMMARY
##
## n-ENSEMBLE COVERAGE n-ENSEMBLE RECALL
## n >= 1 1 0.8
##
##
## ALGORITHM PERFORMANCE
##
## SVM_PRECISION SVM_RECALL SVM_FSCORE
## 0.6833333 0.6766667 0.6800000
Here, precision gives the proportion of tweets that SVM classified as belonging to a category and that do belong to that category (true positives) out of all the tweets classified into that category (true positives plus false positives). Recall is the proportion of tweets that SVM classified as belonging to a category and that do belong to that category (true positives) out of all the tweets that actually belong to that category (true positives plus false negatives). The F-score is the harmonic mean of precision and recall and ranges from 0 to 1.
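In formulas, with TP, FP, and FN denoting true positives, false positives, and false negatives:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$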
Finally, we can compare the scores between the labels given by the coders and those based on our SVM:
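One way to build this comparison, assuming the classification results are stored in svm_classify as above (with the SVM labels in the SVM_LABEL column), is to bind the coder labels for the test set to the labels assigned by the SVM and tabulate them (V1 and V2 are the default column names that as.data.frame() assigns):

compare <- as.data.frame(cbind(labels[10001:14640], as.character(svm_classify$SVM_LABEL)))
table(compare$V1, compare$V2)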
## V2
## V1 -1 0 1
## -1 3018 292 109
## 0 288 347 56
## 1 130 58 342
10.1.2 SVM with Quanteda
Instead of using a separate package, we can also use quanteda to carry out an SVM. For this, we load some movie reviews, select 1000 of them at random, and place them into our corpus:
set.seed(42)
library(quanteda)
library(quanteda.classifiers)

# Draw a random sample of 1000 reviews from the Large Movie Review Dataset
corpus_reviews <- corpus_sample(data_corpus_LMRD, 1000)
Our aim here is to see how well the SVM algorithm can predict the rating of the reviews. To do this, we first create a new variable, prediction, which contains the same scores as the original rating. Then, we remove 30% of the scores and replace them with NA. We do so by creating a missing variable that contains 30% 0s and 70% 1s, and then replacing the scores for which missing is 0 with NA. These NA scores are the ones we want the algorithm to predict. Finally, we add the new variable to the corpus:
# Copy the original ratings and set a random 30% of them to NA
prediction <- corpus_reviews$rating
missing <- rbinom(1000, 1, 0.7)
prediction[missing == 0] <- NA

# Add the partially missing ratings to the corpus as a document variable
docvars(corpus_reviews, "prediction") <- prediction
We then tokenise the corpus, removing stopwords, numbers, and punctuation, and transform the result into a document-feature matrix (dfm):
data_reviews_tokens <- tokens(
corpus_reviews,
what = "word",
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE,
remove_separators = TRUE,
split_hyphens = FALSE,
include_docvars = TRUE,
padding = FALSE,
verbose = TRUE
)
data_reviews_tokens <- tokens_tolower(data_reviews_tokens, keep_acronyms = FALSE)
data_reviews_tokens <- tokens_select(data_reviews_tokens, stopwords("english"), selection = "remove")
dfm_reviews <- dfm(data_reviews_tokens)
Now we can run the SVM algorithm. To do so, we tell the model on which dfm we want to run it and which variable contains the scores to train the algorithm. Here, this is our prediction variable with the missing data:
library(quanteda.textmodels)
svm_reviews <- textmodel_svm(dfm_reviews, y = docvars(dfm_reviews, "prediction"))
svm_reviews
##
## Call:
## textmodel_svm.dfm(x = dfm_reviews, y = docvars(dfm_reviews, "prediction"))
##
## 672 training documents; 121,240 fitted features.
## Method: L2-regularized L2-loss support vector classification dual (L2R_L2LOSS_SVC_DUAL)
Here we see that the algorithm used 672 texts to train the model (the ones with a score) and fitted 121,240 features. The latter refers to the total number of words in the training texts and not only the unique ones. Now we can use this model to predict the ratings we removed earlier:
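A minimal way to do so, storing the predictions in an object called svm_predict (the name we use in the comparison below) and passing the dfm explicitly, is:

svm_predict <- predict(svm_reviews, newdata = dfm_reviews)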
While we can of course look at the resulting numbers, we can also place them in a two-way table with the actual rating, to see how well the algorithm did:
# Combine the predicted and original ratings and cross-tabulate them
rating <- corpus_reviews$rating
table_data <- as.data.frame(cbind(svm_predict, rating))
table(table_data$svm_predict, table_data$rating)
##
## 1 2 3 4 7 8 9 10
## 1 172 15 9 16 5 3 3 3
## 2 7 69 6 5 2 2 1 3
## 3 7 0 82 3 1 3 0 1
## 4 5 2 5 86 7 6 0 4
## 7 3 1 1 1 55 3 3 2
## 8 0 2 2 2 8 90 7 7
## 9 4 0 0 3 6 9 76 12
## 10 5 2 4 2 7 6 6 138
Here, the table shows the prediction of the algorithm from top to bottom and the original rating from left to right. Ideally, all cases would fall on the diagonal, meaning that the prediction is the same as the original rating. This happens in the majority of cases here, and in only a few cases is the algorithm far off.