rostrum.blog - Web scraping the {polite} way

Martin Freeman as Douglas Adams's Arthur Dent, taking a sip of tea and saying 'oh, come on, that's lovely'

tl;dr

If you webscrape with R, you should use the the {polite} package. It helps you respect website terms by seeking permission before you scrape.

Ahoy-hoy

Ah, salutations, and welcome to this blog post about polite web scraping. Please do come in. I’ll take your coat. How are you? Would you like a cup of tea? Oh, I insist!

Speaking of tea, perhaps you’d care to join me in genial conversation about it. Where to begin? Let’s draw inspiration from popular posts on the Tea subreddit of Reddit. I’ll fetch the post titles using the {rvest} package from Hadley Wickham and get the correct CSS selector using SelectorGadget by Andrew Cantino and Kyle Maxwell.

# Load some packages we need
library(rvest)  # scrape a site
library(dplyr)  # data manipulation

# CSS for post titles found using SelectorGadget
# (This is a bit of an odd one)
css_select <- "._3wqmjmv3tb_k-PROt7qFZe ._eYtD2XCVieq6emjKBH3m"

# Scrape a specific named page
tea_scrape <- read_html("https://www.reddit.com/r/tea") %>%  # read the page
  html_nodes(css = css_select) %>%  # read post titles
  html_text()  # convert to text

print(tea_scrape)

[1] "What's in your cup? Daily discussion, questions and stories - September 08, 2019"                                                                 
[2] "Marketing Monday! - September 02, 2019"                                                                                                           
[3] "Uncle Iroh asking the big questions."                                                                                                             
[4] "The officially licensed browser game of Game of Thrones has launched! Millions of fans have put themselves into the battlefield! What about you?" 
[5] "They mocked me. They said that I was a fool for drinking leaf water."                                                                             
[6] "100 years old tea bush on my estate in Uganda."                                                                                                   
[7] "Cold brew colors"                                                                                                                                 
[8] "Finally completed the interior of my tea house only needed a fire minor touches not now it’s perfect, so excited to have this as a daily tea spot"

That’ll provide us with some conversational fodder, wot wot.

It costs nothing to be polite

Mercy! I failed to doff adequately my cap before entering the website! They must take me for some sort of street urchin.

Forgive me. Perhaps you’ll allow me to show you a more respectful method via the {polite} package in development from the esteemed gentleman Dmytro Perepolkin? An excellent way ‘to promote responsible web etiquette’.

A reverential bow()

Perhaps the website owners don’t want people to keep barging in willy-nilly without so much as a ‘ahoy-hoy’.

We should identify ourselves and our intent with a humble bow(). We can expect a curt but informative response from the site—via its robots.txt file—that tells us where we can visit and how frequently.

# remotes::install_github("dmi3kno/polite")  # to install
library(polite)  # respectful webscraping

# Make our intentions known to the website
reddit_bow <- bow(
  url = "https://www.reddit.com/",  # base URL
  user_agent = "M Dray <https://rostrum.blog>",  # identify ourselves
  force = TRUE
)

print(reddit_bow)

## <polite session> https://www.reddit.com/
##      User-agent: M Dray <https://rostrum.blog>
##      robots.txt: 32 rules are defined for 4 bots
##     Crawl delay: 5 sec
##   The path is scrapable for this user-agent

Super-duper. The (literal) bottom line is that we’re allowed to scrape. The website does have 32 rules to stop unruly behaviour though, and even calls out four very naughty bots that are obviously not very polite. We’re invited to give a five-second delay between requests to allow for maximum respect.

Give a nod()

Ahem, conversation appears to be wearing a little thin; perhaps I can interest you by widening the remit of our chitchat? Rather than merely iterating though subpages of the same subreddit, we can visit the front pages of a few different subreddits. Let’s celebrate the small failures and triumphs of being British; a classic topic of polite conversation in Britain.

We’ve already given a bow() and made out intentions clear; a knowing nod() will be sufficient for the next steps. Here’s a little function to nod() to the site each time we iterate over a vector of subreddit names. Our gentlemanly agreement remains intact from our earlier bow().

library(purrr)  # functional programming tools
library(tidyr)  # tidy-up data structure

get_posts <- function(subreddit_name, bow = reddit_bow, css_select){
  
  # 1. Agree modification of session path with host
  session <- nod(
    bow = bow,
    path = paste0("r/", subreddit_name)
  )
  
  # 2. Scrape the page from the new path
  scraped_page <- scrape(session)
  
  # 3. Extract from xpath on the altered URL
  node_result <- html_nodes(
    scraped_page,
    css = css_select
  )
  
  # 4. Render result as text
  text_result <- html_text(node_result)
  
  # 5. Return the text value
  return(text_result)
  
}

Smashing. Care to join me in applying this function over a vector of subreddit names? Tally ho.

# A vector of subreddits to iterate over
subreddits <- set_names(c("BritishProblems", "BritishSuccess"))

# Get top posts for named subreddits
top_posts <- map_df(
  subreddits,
  ~get_posts(.x, css_select = "._3wqmjmv3tb_k-PROt7qFZe ._eYtD2XCVieq6emjKBH3m")
) %>% 
  gather(
    BritishProblems, BritishSuccess,
    key = subreddit, value = post_text
  )

knitr::kable(top_posts)

Bravo, what excellent manners we’ve demonstrated. You can also iterate over different query strings – for example if your target website displays information over several subpages – with the params argument of the scrape() function.

Oh, you have to leave? No, no, you haven’t overstayed your welcome! It was truly marvellous to see you. Don’t forget your brolly, old chap, and don’t forget to print the session info for this post. Pip pip!

Environment

Session info

Last rendered: 2023-08-02 23:36:13 BST

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/London
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.2 compiler_4.3.1    fastmap_1.1.1     cli_3.6.1        
 [5] tools_4.3.1       htmltools_0.5.5   rstudioapi_0.15.0 yaml_2.3.7       
 [9] rmarkdown_2.23    knitr_1.43.1      jsonlite_1.8.7    xfun_0.39        
[13] digest_0.6.33     rlang_1.1.1       evaluate_0.21

Reuse

CC BY-NC-SA 4.0