3.1 Basics

R stores text in character vectors, which are one of R’s basic (atomic) vector types, the others being logical, integer, double (numeric), complex, and raw. A character vector can contain a single string (such as a single word or phrase) or multiple strings, which could represent a collection of sentences, paragraphs, or even entire documents. Because character vectors are vectors, we can apply many of the same operations to them as to other vector types, such as subsetting, combining, or checking their properties. For example, the nchar() function returns the number of characters within each string element, while length() returns the number of elements (or individual strings) contained in the vector.

Let’s start by defining a character vector containing a single string and examining its properties using these two functions. Note that in R, we must enclose our strings in either double quotes (") or single quotes ('):

vector1 <- "This is the first of our character vectors"
nchar(vector1)  # Number of characters in the string
length(vector1)  # Number of elements (strings) in the vector

As you can see from the output of the above code, the nchar() function tells us that the string has \(42\) characters. However, length() tells us the vector contains only \(1\) element. This is because vector1 contains only a single string, even though that string is quite long. To illustrate a vector with multiple strings, we use the c() function (short for “combine”). When used with strings, c() combines multiple individual strings into a single character vector. Let’s create a vector with three different strings and see how length() and nchar() behave:

vector2 <- c("This is one example", "This is another", "And so we can continue.")
length(vector2)  # Number of elements in vector
nchar(vector2)  # Returns characters for each element
sum(nchar(vector2))  # Total number of characters in all elements

When we run this code, length(vector2) returns \(3\) because the vector contains three separate strings. Because vector2 now has multiple elements, nchar() returns a vector of results, one for each string. To get the total number of characters in all three strings, we can wrap (or nest) nchar(vector2) inside the sum() function. Note that R evaluates nested function calls from the inside out. So, it will first calculate the number of characters for each string in vector2 (producing an intermediate numeric vector), after which the sum() function will add up the values of this intermediate vector.
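To make the inside-out evaluation concrete, the following sketch splits the nested call into two explicit steps. The names chars_per_string and total_chars are introduced here only for illustration:

```r
# Same three strings as in the text
vector2 <- c("This is one example", "This is another", "And so we can continue.")

# Step 1: the inner call runs first and yields one count per string
chars_per_string <- nchar(vector2)
chars_per_string  # 19 15 23

# Step 2: the outer call reduces that intermediate vector to a single number
total_chars <- sum(chars_per_string)
total_chars  # 57

# The nested one-liner gives exactly the same result
identical(total_chars, sum(nchar(vector2)))  # TRUE
```

Storing the intermediate result is often worthwhile when you need the per-string counts again later, for instance to find the longest string.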

The next step is to work with parts of our vectors. For example, we can extract specific parts of a string and create substrings using the substr(x, start, stop) function, where x is the character vector and start and stop are the first and last character positions to extract. Note that the positions are counted from the beginning of the string, starting at 1 (and thus not 0):

substr(vector1, 1, 5)  # Extracts characters from position 1 to 5
substr(vector1, 7, 11)  # Extracts characters from position 7 to 11
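A few further uses of substr() can be sketched as follows. The block redefines the vectors from the text so that it runs on its own; n and modified are names chosen purely for illustration:

```r
vector1 <- "This is the first of our character vectors"
vector2 <- c("This is one example", "This is another", "And so we can continue.")

# substr() is vectorised: the same positions are applied to every element
substr(vector2, 1, 4)  # "This" "This" "And "

# Combined with nchar(), it can extract the last n characters of a string
n <- 7
substr(vector1, nchar(vector1) - n + 1, nchar(vector1))  # "vectors"

# The replacement form overwrites characters in place (on a copy here,
# so that vector1 itself stays unchanged)
modified <- vector1
substr(modified, 1, 4) <- "THIS"
modified  # "THIS is the first of our character vectors"
```

Note that spaces count as characters, which is why the third result of the first call ends in a space.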

We can also combine multiple strings into a single one using the paste() function. When the strings are passed as separate arguments, paste() joins them with a space in between by default, but we can change this separator using the sep argument. When the strings are instead stored as elements of a single vector, we still use paste(), but now with the collapse argument, which specifies the text placed between the elements as they are joined into one string:

fruits <- paste("oranges", "lemons", "pears", sep = "-")
fruits

paste(vector2, collapse = "")
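Because sep and collapse are easy to confuse, here is a small side-by-side sketch. The names fruit_names and labels are made up for this example:

```r
fruit_names <- c("oranges", "lemons", "pears")

# sep joins *separate arguments*; passing one vector leaves it as three strings
paste(fruit_names, sep = "-")       # "oranges" "lemons" "pears"

# collapse joins the *elements of a vector* into one string
paste(fruit_names, collapse = "-")  # "oranges-lemons-pears"

# Both can be combined: sep works element-wise first, collapse joins afterwards
labels <- paste("item", 1:3, sep = "_")
labels                              # "item_1" "item_2" "item_3"
paste(labels, collapse = ", ")      # "item_1, item_2, item_3"

# paste0() is a shortcut for paste(..., sep = "")
paste0("item", 1:3)                 # "item1" "item2" "item3"
```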

We can also change the text itself, for example, changing its case (lowercase or uppercase) using tolower() (which converts all characters to lowercase) and toupper() (which converts them to uppercase):

sentences2 <- c("This is a piece of example text", "This is another piece of example text")
tolower(sentences2)
toupper(sentences2)
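One common reason to change case is to make comparisons case-insensitive. The sketch below uses a made-up vector called answers to show the idea:

```r
# Hypothetical survey answers with inconsistent capitalisation
answers <- c("Yes", "YES", "yes", "No")

# A direct comparison is case-sensitive
answers == "yes"           # FALSE FALSE  TRUE FALSE

# Converting to lowercase first treats all spellings alike
tolower(answers) == "yes"  # TRUE  TRUE  TRUE FALSE

# The same trick makes counting distinct answers meaningful
table(tolower(answers))
```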

Another powerful feature of R is its ability to find and replace specific patterns within strings. Four base R functions cover the most common cases, and they become especially powerful when combined with regular expressions:

  • grep(pattern, x): This function searches for the specified pattern within each string element of the character vector x. It returns a vector of the indices (positions) of the elements in x that contain a match for the pattern.
  • grepl(pattern, x): Similar to grep(), but instead of returning indices, it returns a logical vector of the same length as x. Each element in the resulting vector is TRUE if the corresponding string in x matches the pattern and FALSE otherwise. This is useful for filtering or subsetting.
  • sub(pattern, replacement, x): This function finds the specified pattern in each string element of x and replaces it with the replacement string. Importantly, sub() only replaces the first occurrence of the pattern found within each string.
  • gsub(pattern, replacement, x): This function is identical to sub(), but with one key difference: it replaces all occurrences of the pattern found within each string element, not just the first one.

Let’s see these pattern-matching and replacement functions in action:

text_vector <- c("This is a test sentence.", "Another test here.", "No match in this one.")

grep("test", text_vector)  # Indices of the elements that contain 'test' (as a substring, not only as a whole word)
grepl("test", text_vector)  # TRUE/FALSE for each element: does it contain 'test'?
sub("test", "sample", text_vector)  # Replace the first instance of 'test' with 'sample' in each string
gsub(" ", "_", text_vector)  # Replace all spaces ' ' with underscores '_'
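The sketch below illustrates a few details of these functions that are easy to miss: subsetting with the results, substring versus whole-word matching, and the ignore.case and fixed arguments. The short strings not taken from the text are invented for the example:

```r
text_vector <- c("This is a test sentence.", "Another test here.", "No match in this one.")

# Use the logical vector from grepl() to keep only the matching elements
text_vector[grepl("test", text_vector)]
# grep() with value = TRUE returns the matching strings directly
grep("test", text_vector, value = TRUE)

# Patterns match substrings: 'test' is also found inside 'contest'
grepl("test", "A contest entry")        # TRUE
# The regular expression \\b (word boundary) restricts it to the whole word
grepl("\\btest\\b", "A contest entry")  # FALSE

# Matching is case-sensitive unless ignore.case = TRUE
grepl("this", text_vector)                      # FALSE FALSE  TRUE
grepl("this", text_vector, ignore.case = TRUE)  #  TRUE FALSE  TRUE

# fixed = TRUE treats the pattern as literal text; without it,
# '.' would be a regular-expression wildcard matching any character
gsub(".", "!", "One. Two.", fixed = TRUE)  # "One! Two!"
```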

The opposite of pasting strings together is splitting them apart. The strsplit(x, split) function is designed to break down the elements of a character vector x into smaller pieces based on a specified split pattern (often a single character like a space, comma, or dash). Because each string in the input vector x might be split into a different number of resulting pieces, strsplit() returns a list. Each element of this list corresponds to an original string from x, and within each list element is a character vector containing the substrings that resulted from the split.

sentence <- "This sentence will be split into words"
strsplit(sentence, " ")  # Splits the single string by spaces, returns a list with one element (a vector of words)

dates <- c("2023-01-15", "2024-11-01")
strsplit(dates, "-")  # Splits each date string by the dash, returns a list with two elements (vectors of year, month, day)
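Since strsplit() returns a list, the pieces usually need to be pulled out again before further use. A minimal sketch of some common ways to do so, where parts and years are illustrative names:

```r
dates <- c("2023-01-15", "2024-11-01")
parts <- strsplit(dates, "-")

# Double brackets extract one list element (a character vector)
parts[[1]]     # "2023" "01" "15"
parts[[2]][1]  # "2024"

# Take the first piece of every element, then convert it to integers
years <- as.integer(sapply(parts, function(p) p[1]))
years          # 2023 2024

# lengths() reports how many pieces each original string produced
lengths(strsplit(c("one two three", "four five"), " "))  # 3 2

# For a single input string, unlist() flattens the result to a plain vector
unlist(strsplit("This sentence will be split into words", " "))

# split is a regular expression by default, so splitting on a literal dot
# needs fixed = TRUE
strsplit("a.b.c", ".", fixed = TRUE)[[1]]  # "a" "b" "c"
```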

While basic functions like print() and cat() are sufficient for simple output, sprintf() provides much finer control over how numbers, strings, and other data types are formatted within a string, similar to the printf function found in C. You construct a format string containing placeholders (like %d for integers, %s for strings, and %.2f for floating-point numbers with two decimal places), and sprintf() replaces these placeholders with the values of subsequent arguments, respecting the specified formatting rules. This is particularly useful for creating consistent output or messages.

my_var <- "Hello"
print(my_var)  # Prints the variable's value, often with quotes for strings
cat(my_var, "world!\n")  # Concatenates and prints, useful for console output

value <- 42.567
sprintf("The value is %.2f", value)  # Formats 'value' to 2 decimal places within the string
sprintf("Integer: %d, String: %s", 100, "example")  # Inserts an integer and a string into the format string
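Like most string functions in R, sprintf() is vectorised, which makes it convenient for generating many formatted strings at once. The following sketch uses made-up data (ids and words) to show this along with a few other format specifications:

```r
# Hypothetical document numbers; the L suffix makes them integers for %d
ids <- c(1L, 25L, 300L)

# %03d pads integers with leading zeros to a width of 3
sprintf("doc_%03d", ids)  # "doc_001" "doc_025" "doc_300"

# Several vectors can be combined element by element
words <- c("text", "analysis")
sprintf("%s has %d characters", words, nchar(words))
# "text has 4 characters" "analysis has 8 characters"

# A literal percent sign is written as %%
sprintf("%.1f%%", 45.678)  # "45.7%"

# A width pads strings: right-aligned by default, left-aligned with '-'
sprintf("%5s|", "ab")      # "   ab|"
sprintf("%-5s|", "ab")     # "ab   |"
```

Note that sprintf() returns the formatted strings rather than printing them, so the result can be stored in a variable or passed on to cat().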

While base R provides these essential tools for working with text, specialised packages such as quanteda, tm, stringr, or tidytext offer more comprehensive, efficient, and often more user-friendly functions for complex text processing and analysis tasks. These packages typically build upon the fundamental vector concepts and functions available in base R, providing extended capabilities that include more powerful regular expression handling, tokenisation, stemming, stop-word removal, and advanced text manipulation tools.