This tutorial covers how to extract and process text data from web pages or other documents for later analysis. The automated download of HTML pages is called Crawling. The extraction of the textual data and/or metadata (for example, article date, headlines, author names, article text) from the HTML source code (or the DOM document object model of the website) is called Scraping. For these tasks, we use the package “rvest”. In a third exercise, we will extract text data from various formats such as PDF, DOC, DOCX and TXT files with the “readtext” package.

Download a single web page and extract its content
Extract links from a overview page and extract articles
Extract text data from PDF and other formats on disk

1 Preparation

Create a new R Project (File -> New Project -> Existing directory) in the provided tutorial folder. Then create a new R script (File -> New File -> R Script) named “Tutorial_1.R”. In this script you will enter and execute all commands. If you want to run the complete script in RStudio, you can use Ctrl-A to select the complete source code and execute with Ctrl-Return. If you want to execute only one line, you can simply press Ctrl-Return on the respective line. If you want to execute a block of several lines, select the block and press Ctrl-Return.

Tip: Copy individual sections of the source code directly into the console (2) and run it step by step. Get familiar with the function calls included in the Help function.

First, make sure your working directory is the data directory we provided for the exercises.

# important option for text analysis
options(stringsAsFactors = F)
# check working directory. It should be the destination folder of the extracted 
# zip file. If necessary, use `setwd("your-tutorial-folder-path")` to change it.
getwd()

2 Crawl single webpage

In a first exercise, we will download a single web page from “The Guardian” and extract text together with relevant metadata such as the article date. Let’s define the URL of the article of interest and load the rvest package, which provides very useful functions for web crawling and scraping.

url <- "https://www.theguardian.com/world/2017/jun/26/angela-merkel-and-donald-trump-head-for-clash-at-g20-summit"
require("rvest")

A convenient method to download and parse a webpage provides the function read_html which accepts a URL as a parameter. The function downloads the page and interprets the html source code as an HTML / XML object.

html_document <- read_html(url)

HTML / XML objects are a structured representation of HTML / XML source code, which allows to extract single elements (headlines e.g. <h1>, paragraphs <p>, links <a>, …), their attributes (e.g. <a href="http://...">) or text wrapped in between elements (e.g. <p>my text...</p>). Elements can be extracted in XML objects with XPATH-expressions.

XPATH (see https://en.wikipedia.org/wiki/XPath) is a query language to select elements in XML-tree structures. We use it to select the headline element from the HTML page. The following xpath expression queries for first-order-headline elements h1, anywhere in the tree // which fulfill a certain condition [...], namely that the class attribute of the h1 element must contain the value content__headline.

The next expression uses R pipe operator %>%, which takes the input from the left side of the expression and passes it on to the function ion the right side as its first argument. The result of this function is either passed onto the next function, again via %>% or it is assigned to the variable, if it is the last operation in the pipe chain. Our pipe takes the html_document object, passes it to the html_node function, which extracts the first node fitting the given xpath expression. The resulting node object is passed to the html_text function which extracts the text wrapped in the h1-element.

title_xpath <- "//h1[contains(@class, 'content__headline')]"
title_text <- html_document %>%
  html_node(xpath = title_xpath) %>%
  html_text(trim = T)

Let’s see, what the title_text contains:

cat(title_text)

## Angela Merkel and Donald Trump head for clash at G20 summit

Now we modify the xpath expressions, to extract the article info, the paragraphs of the body text and the article date. Note that there are multiple paragraphs in the article. To extract not only the first, but all paragraphs we utilize the html_nodes function and glue the resulting single text vectors of each paragraph together with the paste0 function.

intro_xpath <- "//div[contains(@class, 'content__standfirst')]//p"
intro_text <- html_document %>%
  html_node(xpath = intro_xpath) %>%
  html_text(trim = T)

cat(intro_text)

## German chancellor plans to make climate change, free trade and mass migration key themes in Hamburg, putting her on collision course with US

body_xpath <- "//div[contains(@class, 'content__article-body')]//p"
body_text <- html_document %>%
  html_nodes(xpath = body_xpath) %>%
  html_text(trim = T) %>%
  paste0(collapse = "\n")

cat(body_text)

## A clash between Angela Merkel and Donald Trump appears unavoidable after Germany signalled that it will make climate change, free trade and the manage

date_xpath <- "//time"
date_object <- html_document %>%
  html_node(xpath = date_xpath) %>%
  html_attr(name = "datetime") %>%
  as.Date()

cat(format(date_object, "%Y-%m-%d"))

## 2017-06-26

The variables title_text, intro_text, body_text and date_object now contain the raw data for any subsequent text processing.

3 Follow links

Usually, we do not want download a single document, but a series of documents. In our second exercise, we want to download all Guardian articles tagged with “Angela Merkel”. Instead of a tag page, we could also be interested in downloading results of a site-search engine or any other link collection. The task is always two-fold: First, we download and parse the tag overview page to extract all links to articles of interest:

url <- "https://www.theguardian.com/world/angela-merkel"
html_document <- read_html(url)

Second, we download and scrape each individual article page. For this, we extract all href-attributes from a-elements fitting a certain CSS-class. To select the right contents via XPATH-selectors, you need to investigate the HTML-structure of your specific page. Modern browsers such as Firefox and Chrome support you in that task by a function called “Inspect Element” (or similar), available through a right-click on the page element.

links <- html_document %>%
  html_nodes(xpath = "//div[contains(@class, 'fc-item__container')]/a") %>%
  html_attr(name = "href")

Now, links contains a list of 20 hyperlinks to single articles tagged with Angela Merkel.

head(links, 3)

## [1] "https://www.theguardian.com/business/2017/sep/11/macron-merkel-eurozone-france-germany"                               
## [2] "https://www.theguardian.com/world/2017/sep/11/could-germany-make-a-new-pizza-connection-if-merkel-signs-up-the-greens"
## [3] "https://www.theguardian.com/world/2017/sep/10/merkel-backs-iran-style-diplomatic-solution-for-north-korea"

But stop! There is not only one page of links to tagged articles. If you have a look on the page in your browser, the tag overview page has several more than 60 sub pages, accessible via a paging navigator at the bottom. By clicking on the second page, we see a different URL-structure, which now contains a link to a specific paging number. We can use that format to create links to all sub pages by combining the base URL with the page numbers.

page_numbers <- 1:3
base_url <- "https://www.theguardian.com/world/angela-merkel?page="
paging_urls <- paste0(base_url, page_numbers)

# View first 3 urls
head(paging_urls, 3)

## [1] "https://www.theguardian.com/world/angela-merkel?page=1"
## [2] "https://www.theguardian.com/world/angela-merkel?page=2"
## [3] "https://www.theguardian.com/world/angela-merkel?page=3"

Now we can iterate over all URLs of tag overview pages, to collect more/all links to articles tagged with Angela Merkel. We iterate with a for-loop over all URLs and append results from each single URL to a vector of all links.

all_links <- NULL
for (url in paging_urls) {
  # download and parse single ta overview page
  html_document <- read_html(url)

  # extract links to articles
  links <- html_document %>%
    html_nodes(xpath = "//div[contains(@class, 'fc-item__container')]/a") %>%
    html_attr(name = "href")
  
  # append links to vector of all links
  all_links <- c(all_links, links)
}

An effective way of programming is to encapsulate repeatedly used code in a specific function. This function then can be called with specific parameters, process something and return a result. We use this here, to encapsulate the downloading and parsing of a Guardian article given a specific URL. The code is the same as in our exercise 1 above, only that we combine the extracted texts and metadata in a data.frame and wrap the entire process in a function-Block.

scrape_guardian_article <- function(url) {
  
  html_document <- read_html(url)
  
  title_xpath <- "//h1[contains(@class, 'content__headline')]"
  title_text <- html_document %>%
    html_node(xpath = title_xpath) %>%
    html_text(trim = T)
  
  intro_xpath <- "//div[contains(@class, 'content__standfirst')]//p"
  intro_text <- html_document %>%
    html_node(xpath = intro_xpath) %>%
    html_text(trim = T)
  
  body_xpath <- "//div[contains(@class, 'content__article-body')]//p"
  body_text <- html_document %>%
    html_nodes(xpath = body_xpath) %>%
    html_text(trim = T) %>%
    paste0(collapse = "\n")
  
  date_xpath <- "//time"
  date_text <- html_document %>%
    html_node(xpath = date_xpath) %>%
    html_attr(name = "datetime") %>%
    as.Date()
  
  article <- data.frame(
    url = url,
    date = date_text,
    title = title_text,
    body = paste0(intro_text, "\n", body_text)
  )
  
  return(article)
  
}

Now we can use that function scrape_guardian_article in any other part of our script. For instance, we can loop over each of our collected links. We use a running variable i, taking values from 1 to length(all_links) to access the single links in all_links and write some progress output.

all_articles <- data.frame()
for (i in 1:length(all_links)) {
  cat("Downloading", i, "of", length(all_links), "URL:", all_links[i], "\n")
  article <- scrape_guardian_article(all_links[i])
  # Append current article data.frame to the data.frame of all articles
  all_articles <- rbind(all_articles, article)
}

## Downloading 1 of 60 URL: https://www.theguardian.com/business/2017/sep/11/macron-merkel-eurozone-france-germany 
## Downloading 2 of 60 URL: https://www.theguardian.com/world/2017/sep/11/could-germany-make-a-new-pizza-connection-if-merkel-signs-up-the-greens 
## Downloading 3 of 60 URL: https://www.theguardian.com/world/2017/sep/10/merkel-backs-iran-style-diplomatic-solution-for-north-korea

# View first articles
head(all_articles, 3)

# Write articles to disk
write.csv2(all_articles, file = "data/guardian_merkel.csv")

The last command write the extracted articles to a CSV-file in the data directory for any later use.

4 Read various file formats

In case you have already a collection of text data files on your disk, you can import them into R in a very convenient provided by the readtext package. The package depends on some other programs or libraries in your system, to provide extraction for Word- and PDF-documents. So there may be some hurdles to install the package.

But once it is successfully installed, it allows for very easy extraction of text data from various file formats. Fist, we request a list of files in the directory to extract text from. For demonstration purpose, we provide in data/documents a random selection of various text formats.

data_files <- list.files(path = "data/documents", full.names = T, recursive = T)

# View first file paths
head(data_files, 3)

## [1] "data/documents/bundestag/17_16_580-F_neu.pdf"                                   
## [2] "data/documents/bundestag/prot_17_95.pdf"                                        
## [3] "data/documents/bundestag/stellungnahme---buendnis-buergerenergie-e--v--data.pdf"

The readtext function from the package with the same name, detects automatically file formats of the given files list and extracts the content into a data.frame. The parameter docvarsfrom allows you to set metadata variables by splitting path names. In our example, docvar3 contains a source type variable derived from the sub folder name.

require(readtext)

extracted_texts <- readtext(data_files, docvarsfrom = "filepaths", dvsep = "/")

# View first rows of the extracted texts
head(extracted_texts)

# View beginning of tenth extracted text
substr(trimws(extracted_texts$text[10]) , 0, 100)

Again, the extracted_texts can be written by write.csv2 to disk for later use.

write.csv2(extracted_texts, file = "data/text_extracts.csv")

2017, Andreas Niekler and Gregor Wiedemann. GPLv3. tm4ss.github.io

Tutorial 1: Data import and web scraping

Andreas Niekler, Gregor Wiedemann

2017-09-11

1 Preparation

2 Crawl single webpage

3 Follow links

4 Read various file formats