4.6 Import using Web Scraping
If there is no specific API, we can also choose to scrape the website. The logic of web scraping is that we use the structure of the underlying HTML document to find and download the text we want. Note, though, that not all websites encourage (or even allow) scraping, which means that we need to have a look at their disclaimer beforehand. You can do this by either checking the website's Terms and Conditions page or the robots.txt file, which you can usually find by appending its name to the home address (e.g. https://www.facebook.com/robots.txt).
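If you prefer to check this from within R, the robotstxt package offers a convenient way to do so. The following is only an optional sketch (the package is not needed for the rest of this section): it downloads a site's robots.txt file and tests whether a given path may be scraped.
library(robotstxt)

# Download and print a site's robots.txt file
get_robotstxt(domain = "facebook.com")

# Check whether a given path (here an arbitrary example) may be scraped (TRUE/FALSE)
paths_allowed(paths = "/events/", domain = "facebook.com")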
One easy way to scrape a website is to check whether someone else has already built a tool that automates the web scraping process for the particular website we are interested in. For instance, since Twitter ended free access to its API, we can look at places like Apify to find a suitable scraper for popular websites such as Twitter/X, Wikipedia, Instagram (to scrape data on public profiles), Google Search, Google Maps, TikTok, Amazon (to scrape data on its products), and so on. Apify is not free, but registering an account for a free trial and using a month's subscription may be enough for a small project.
If you cannot find a ready-made scraper for your project, or if you cannot pay for such services, you can use the rvest package to build a web scraper of your own. One of the most popular sites to scrape is Wikipedia, as it is very welcoming to web scraping and has pages with a clear structure (indeed, related sites such as Wikidata are built with automated data access in mind). Here, let us take the following page on the Cold War as an example: https://en.wikipedia.org/wiki/Cold_War. If you have a quick look at the website, you will see that there is a lot of information on there, including figures, tables, and the actual body of text. Here, we are only interested in the latter.
To begin with, we store the address of the webpage in an object and ask R to read it for us:
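A sketch of this step could look as follows; we load rvest here as well, and store the parsed page in an object called url so that it matches the code further below:
library(rvest)

# Store the address of the page and read (parse) the underlying HTML
url <- read_html("https://en.wikipedia.org/wiki/Cold_War")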
We now have the HTML page (though R will not yet show it), but before we can do anything further, we have to figure out how to get to the content we want. To do so, we have to inspect the HTML document to find the right element. The easiest way to do this is in the browser: open the page and use Cmd + Shift + C to open the Developer Tools. In the panel that opens, we can hover over the main body of text on the page and see which element it belongs to. Here, we find that the main body of text is stored in an element called mw-content-text, and that within it the individual paragraphs are stored in elements labelled **p**. This latter designation is the standard HTML tag for individual paragraphs and is what we are after here.
We now ask R to take the HTML extracted from the URL, look only for the nodes labelled **p**, extract the text from them, place the results into a data frame, and then turn that into a character vector:
# Extract all paragraph nodes, get their rendered text, and place the result in a data frame
data_coldwar <- url %>%
  html_nodes("p") %>%
  html_text2() %>%
  as.data.frame()

# Turn the single column (named ".") into a character vector
data_coldwar <- data_coldwar$.

# Remove the first two (empty) paragraphs
data_coldwar <- data_coldwar[-c(1, 2)]
Note that to extract the text we can choose between html_text and html_text2. The difference between the two is that while the former gives us the raw underlying text, the latter gives us the text as it is rendered in the browser, which is what we opt for here. Looking at the resulting data frame, we have 207 observations representing the 207 paragraphs of the text (the first two of which are empty, which is why we removed them above). We can then use this as the input for any further analysis (and we will come back to this later). If you would like to learn more about web scraping in the context of quantitative text analysis, have a look at the textbook by Munzert et al. (2014).
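To see the difference between the two functions in practice, we can apply them to a single paragraph node and then have a quick look at the cleaned result. This is only an illustrative check and not part of the main pipeline:
# Compare the two extraction functions on the first non-empty paragraph node
paragraphs <- url %>% html_nodes("p")
html_text(paragraphs[3])   # raw text as it appears in the HTML source
html_text2(paragraphs[3])  # text as it would appear in the browser

# A quick look at the cleaned character vector
length(data_coldwar)            # number of remaining paragraphs
substr(data_coldwar[1], 1, 80)  # first 80 characters of the first paragraph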