4.3 Import .pdf Files

Apart from .txt, .pdf files are another common format for texts. Yet, as .pdf files contain a lot of information besides the text itself (such as tables, figures, and graphs), we first have to extract the text. To do so, we use the pdftools package to convert the .pdf files into .txt files, which we then read (as above) with readtext. Note that this only works if the .pdf files are machine-readable, meaning that we can select (and copy-paste) the text in them. readtext thus does not work with .pdf files in which we cannot select the text. This is often the case when a .pdf is a scan or consists mainly of images. In such cases, you might have to use optical character recognition (OCR), such as offered by the tesseract package, to generate the .txt files.

As with the .txt files above, here we place our .pdf files in a folder called PDF in our Working Directory. We also have an (empty) folder called Texts to which R will write the new .txt files. We then tell R where these folders are:

library(pdftools)
library(readtext)

pdf_directory <- paste0(getwd(), "/PDF")
txt_directory <- paste0(getwd(), "/Texts")
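If the Texts folder does not exist yet, we can also create it from within R before writing to it. This is an optional safeguard, assuming txt_directory is defined as above:

```r
# Create the Texts folder if it does not exist yet
if (!dir.exists(txt_directory)) {
  dir.create(txt_directory)
}
```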

Then, we ask R for a list of all the files in the .pdf directory. This both ensures that we are not overlooking anything and tells R which files are in the folder. Setting recursive = FALSE means that we only list the files in the main folder itself and ignore any files in subfolders of this main folder.

files <- list.files(pdf_directory, pattern = "\\.pdf$", recursive = FALSE, full.names = TRUE)

files

While we could convert a single document at a time, more often we have to deal with several documents at once. To read all of them, we write a little function. This function does the following. First, we define a new function that we label extract, which takes a single input we call filename. At this point filename is an empty placeholder; later, it will refer to each of the files we want to extract. Inside the function, we print the file name so that we can see which file R is working on while the function runs. We then ask R to try to read this file with the pdf_text function and save the result in an object called text. Finally, we strip the path and the .pdf extension from the file name and write the extracted text to a new file. This file is the extracted .pdf in .txt form:

extract <- function(filename) {
  print(filename)                      # show which file is being processed
  try({
    text <- pdf_text(filename)         # one character string per page
    # Strip the path and the .pdf extension to get the bare file name
    title <- gsub("(.*)/([^/]*)\\.pdf$", "\\2", filename)
    write(text, file.path(txt_directory, paste0(title, ".txt")))
  })
}
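Note that pdf_text returns a character vector with one element per page, and write then puts each element on its own line in the .txt file. If we instead want a whole document as a single string, we can collapse the pages first. A minimal sketch, assuming files contains at least one readable .pdf:

```r
# Collapse all pages of the first PDF into one string, joined by newlines
text_one <- paste(pdf_text(files[1]), collapse = "\n")
```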

We then use this function to extract the text from all the .pdf files in the pdf_directory folder. To do so, we use a for loop. The logic of this loop is that for each individual file in the element files, we run the extract function we just created. In each iteration, the loop assigns the current file to an element called file, and the function writes the corresponding .txt file to the txt_directory:

for (file in files) {
  extract(file)
}
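As an aside, the same loop can be written more compactly with lapply, which applies extract to every element of files (invisible suppresses the list of return values that lapply would otherwise print):

```r
# Equivalent to the for loop above: run extract() on each file in turn
invisible(lapply(files, extract))
```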

We can now read the .txt files into R. To do so, we use paste0(txt_directory, "/*") to tell readtext to look into our txt_directory and read any file in there. Besides this, we need to specify the encoding. Most often this is UTF-8, though you may sometimes encounter latin1 or Windows-1252 encodings. While readtext will convert all of these to UTF-8, you have to specify the original encoding. To find out which one you need, look at the properties of the .txt file.
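If you are unsure which encoding a file has, the readr package (an extra dependency, not used elsewhere in this chapter) can make an educated guess from the raw bytes. A small self-contained sketch using a temporary sample file:

```r
library(readr)

# Write a small sample file, then let readr guess its encoding
sample_file <- tempfile(fileext = ".txt")
writeLines("Some example text", sample_file)
guess_encoding(sample_file)  # likely encodings with a confidence score
```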

Assuming our texts are in UTF-8 encoding, we run:

data_texts <- readtext(paste0(txt_directory, "/*"), encoding = "UTF-8")

The result of this is a data frame of texts, which we can transform into a corpus for use in quanteda or keep as it is for other types of analyses.
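The quanteda step looks as follows — a minimal sketch, assuming the quanteda package is installed and data_texts was created as above:

```r
library(quanteda)

# readtext objects can be passed directly to corpus()
corpus_texts <- corpus(data_texts)
summary(corpus_texts)
```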