4.1 Text in R

R stores any form of text as a character vector. In their simplest form, these vectors only have a single character in them. At their most complicated, they can contain many sentences or even whole stories. To see how many characters a vector has, we can use the nchar() function:

vector1 <- "This is the first of our character vectors"
nchar(vector1)
## [1] 42
length(vector1)
## [1] 1

This example also shows the logic of R. First, we assign the text we have to an object. We do so using the <- arrow. This arrow points from the text we have to the object R stores it in, which we here call vector1. We then ask R to give us the number of characters inside this object, which is 42 in this case (spaces count as characters too). The length() function returns something else, namely 1. This means that our object has a single element, whether that element is a word, a sentence, or a longer text. If we want to, we can place more elements inside our object using the c() function:

vector2 <- c("This is an example", "This is another", "And so we can go on.")
length(vector2)
## [1] 3
nchar(vector2)
## [1] 18 15 20
sum(nchar(vector2))
## [1] 53
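
To make the difference between length() and nchar() more concrete, here is a minimal sketch. The vector words_example is a hypothetical one made up for this illustration and does not appear elsewhere in this section:

```r
# Hypothetical example vector (an assumption for illustration only)
words_example <- c("apple", "kiwi", "")

# length() counts the elements of the vector, including the empty string
length(words_example)
## [1] 3

# nchar() counts the characters in each element; the empty string has zero
nchar(words_example)
## [1] 5 4 0

# Keep only the elements that contain at least one character
words_example[nchar(words_example) > 0]
## [1] "apple" "kiwi"
```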

Another thing we can do is extract part of a sentence. For this, we use the substr() function. With this function, R gives us all the characters from a start position up to and including a stop position. Spaces count as characters here as well. So, when we want the 4th to the 10th character, we write:

vector3 <- "This is yet another sentence"
substr(vector3, 4, 10)
## [1] "s is ye"
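
Because substr() works with positions rather than words, extracting a whole word means knowing where it starts and stops. The sketch below illustrates this with the same sentence, stored under the hypothetical name example_sentence so that it does not overwrite any object used in this section:

```r
# Hypothetical object name; the text is the same sentence as above
example_sentence <- "This is yet another sentence"

# The word "yet" occupies positions 9 to 11, both included
substr(example_sentence, 9, 11)
## [1] "yet"

# The number of characters returned equals stop - start + 1
nchar(substr(example_sentence, 4, 10))
## [1] 7

# Combining substr() with nchar() gives the last eight characters
n <- nchar(example_sentence)
substr(example_sentence, n - 7, n)
## [1] "sentence"
```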

We can also split a character vector into smaller parts. We often do this when we want to split a longer text into several sentences. To do so, we use the strsplit() function, which takes the text and the separator at which to split:

vector3 <- "Here is a sentence - And a second"
parts1 <- strsplit(vector3, "-")
parts1
## [[1]]
## [1] "Here is a sentence " " And a second"
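
Note that the two parts still carry the spaces that surrounded the separator. As a sketch of one possible way to tidy this up, the example below pulls the character vector out of the result with [[1]] and trims it with trimws(). The names example_text, example_parts and pieces are hypothetical and chosen only for this illustration:

```r
# Hypothetical object names; the text is the same as in the example above
example_text <- "Here is a sentence - And a second"
example_parts <- strsplit(example_text, "-")

# [[1]] extracts the character vector stored in the first list element
pieces <- example_parts[[1]]
length(pieces)
## [1] 2

# trimws() removes leading and trailing whitespace from each piece
trimws(pieces)
## [1] "Here is a sentence" "And a second"

# The separator is read as a regular expression by default;
# fixed = TRUE makes R match it literally, which matters for "."
strsplit("one.two", ".", fixed = TRUE)[[1]]
## [1] "one" "two"
```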

If we now look in the Environment window, we will see that R calls parts1 a list. This is another type of object that R uses to store information. We will see it more often later on. For now, it is good to remember that lists in R can have many elements (the layers of the list) and that each of these elements can hold an object of its own, such as a vector. Here, our list has only a single element: a character vector that contains the two parts of our sentence. To create a longer list, we have to make more lists, and then join them together, again using the c() function:

vector4 <- "Here is another sentence - And one more"
parts2 <- strsplit(vector4, "-")
parts3 <- c(parts1, parts2)
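
Besides the Environment window, the console can also show how such a list is built up. The following sketch rebuilds the two lists under the hypothetical names list_a, list_b and combined, so that it can be run on its own:

```r
# Hypothetical object names; the texts are the same as in the examples above
list_a <- strsplit("Here is a sentence - And a second", "-")
list_b <- strsplit("Here is another sentence - And one more", "-")
combined <- c(list_a, list_b)

# The combined list has one element per original list
length(combined)
## [1] 2

# [[2]] returns the character vector stored in the second element
combined[[2]]
## [1] "Here is another sentence " " And one more"

# unlist() flattens the list into a single character vector
length(unlist(combined))
## [1] 4
```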

We can now look at this new list in the Environment and check that it indeed has two elements.

A further thing we can do is to join several pieces of text together. For this, we can use the paste() function. Here, the sep argument defines the separator that R places between the pieces:

fruits <- paste("oranges", "lemons", "pears", sep = "-")
fruits
## [1] "oranges-lemons-pears"
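
The sep argument works between the separate arguments we give to paste(). When the pieces already sit inside a single vector, paste() has a second argument, collapse, that joins the elements of that vector. A minimal sketch, using a hypothetical vector called fruit_vector:

```r
# Hypothetical example vector (an assumption for illustration only)
fruit_vector <- c("oranges", "lemons", "pears")

# With a single vector as input, sep has nothing to separate
paste(fruit_vector, sep = "-")
## [1] "oranges" "lemons"  "pears"

# collapse joins the elements of the vector into one string
paste(fruit_vector, collapse = "-")
## [1] "oranges-lemons-pears"

# With two arguments, paste() works element by element and recycles "juice"
paste(fruit_vector, "juice", sep = " ")
## [1] "oranges juice" "lemons juice"  "pears juice"
```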

Note that we can also use paste() to join objects that we made earlier. For example:

sentences <- paste(vector3, vector4, sep = ".")
sentences
## [1] "Here is a sentence - And a second.Here is another sentence - And one more"

Finally, we can change the case (lowercase, uppercase) of the sentence. To do this, we can use tolower and toupper:

tolower(sentences)
## [1] "here is a sentence - and a second.here is another sentence - and one more"
toupper(sentences)
## [1] "HERE IS A SENTENCE - AND A SECOND.HERE IS ANOTHER SENTENCE - AND ONE MORE"

Again, we can run the same functions when we have more than a single element in our vector:

sentences2 <- c("This is a piece of example text", "This is another piece of example text")
toupper(sentences2)
## [1] "THIS IS A PIECE OF EXAMPLE TEXT"      
## [2] "THIS IS ANOTHER PIECE OF EXAMPLE TEXT"
tolower(sentences2)
## [1] "this is a piece of example text"      
## [2] "this is another piece of example text"
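
One common reason to change the case is to compare pieces of text regardless of how they are capitalised. The sketch below shows the idea with a hypothetical vector called answers, which is made up for this illustration:

```r
# Hypothetical example vector (an assumption for illustration only)
answers <- c("Yes", "YES", "no", "yes")

# Without changing the case, only the exact match is found
answers == "yes"
## [1] FALSE FALSE FALSE  TRUE

# After tolower(), all spellings of "yes" match
tolower(answers) == "yes"
## [1]  TRUE  TRUE FALSE  TRUE

# Count how often "yes" occurs, ignoring case
sum(tolower(answers) == "yes")
## [1] 3
```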

And that is it. As you can see, the options for text analysis in base R are rather limited. This is why packages such as quanteda exist in the first place. Note, though, that even quanteda uses the same logic of character vectors and combinations that we saw here.