3.6 Import using Web Scraping
When an API is unavailable, web scraping—extracting data directly from the HTML structure of web pages—can be an alternative. Again, there are a few things to keep in mind:
- Legality/Ethics: Always check the website's robots.txt file (e.g., www.example.com/robots.txt) and Terms of Service before scraping. Many sites prohibit or restrict scraping. Respect website resources; avoid overly aggressive scraping that could overload servers.
- Website Structure: Scraping relies on the stability of a website's HTML structure. If the site changes, your scraper might break.
- Static or Dynamic Content: Simple websites with content loaded directly in the initial HTML are easier to scrape (using packages like rvest). Websites that load content dynamically using JavaScript after the initial page load often require browser automation tools such as RSelenium.
- Complexity: Scraping can be complex and requires knowledge of HTML and CSS selectors (or XPath).
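The robots.txt check mentioned above can also be done programmatically. A minimal sketch using the robotstxt package — an assumption on our part, as the package is not used elsewhere in this chapter:

```r
# Assumption: the robotstxt package is installed (install.packages("robotstxt"))
library(robotstxt)

# Check whether bots are permitted to crawl the Wikipedia article path
paths_allowed(
  paths  = "/wiki/Cold_War",
  domain = "en.wikipedia.org"
)
# Returns TRUE if robots.txt permits access to this path, FALSE otherwise
```

This is a convenience check, not legal advice; the Terms of Service still need to be read separately.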
To see how scraping works, let us scrape the page on the Cold War from Wikipedia using rvest:
# Load necessary libraries
library(rvest)
library(dplyr)
library(stringr)
library(tibble)
# Define the URL of the Wikipedia page
url <- "https://en.wikipedia.org/wiki/Cold_War"
# Read the HTML content from the page
html_content <- read_html(url)

Now, we need to identify the HTML elements containing the desired text. Using browser developer tools (often opened with F12 or Cmd+Shift+C) helps inspect the page structure. The main content paragraphs of Wikipedia articles are typically within <p> tags inside a main content div:
# Extract paragraph text from the content section
paragraphs <- html_content %>%
html_nodes("#mw-content-text .mw-parser-output p") %>%
html_text2() # Extract text, attempting to preserve formatting like line breaks
coldwar_text_df <- tibble(paragraph = paragraphs) # Convert to a tibble/data frame
coldwar_text_df <- coldwar_text_df %>%
filter(nchar(trimws(paragraph)) > 0) # Remove empty or whitespace-only paragraphs
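The extracted paragraphs still carry Wikipedia's bracketed citation markers such as [1] or [note 2]. A minimal cleaning sketch using stringr (loaded above); the regular expression is our assumption about how those markers are formatted:

```r
library(dplyr)
library(stringr)

# Strip citation markers like [1], [23], or [note 4],
# then collapse repeated whitespace left behind
coldwar_text_df <- coldwar_text_df %>%
  mutate(paragraph = paragraph %>%
           str_remove_all("\\[(\\d+|note \\d+)\\]") %>%
           str_squish())
```

Depending on the article, other residue (e.g., pronunciation guides) may need additional patterns.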
# Display the first few rows
print(head(coldwar_text_df))

For more complex scraping involving logins, button clicks, or dynamically loaded content, explore the RSelenium package, which programmatically controls a web browser. For more on web scraping, see Wickham et al. (2023); the book is freely available online.
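To illustrate the browser-automation approach, the sketch below navigates to a page with RSelenium and hands the rendered HTML to rvest. It assumes a working Selenium setup (a browser plus a matching driver), which RSelenium does not install for you:

```r
# Assumption: a compatible browser and driver are installed on this machine
library(RSelenium)
library(rvest)

# Start a Selenium server and a Firefox session
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr  <- driver$client

# Navigate, then pause briefly so JavaScript can finish rendering
remDr$navigate("https://en.wikipedia.org/wiki/Cold_War")
Sys.sleep(2)

# Pass the fully rendered HTML to rvest for the usual extraction steps
page_source <- remDr$getPageSource()[[1]]
paragraphs  <- read_html(page_source) %>%
  html_nodes("#mw-content-text .mw-parser-output p") %>%
  html_text2()

# Clean up the browser session and the Selenium server
remDr$close()
driver$server$stop()
```

For a static page like this one, plain rvest suffices; RSelenium pays off only when content appears after the initial page load.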