Justin Law and Jordan Rosenblum

April 2, 2015

What can you do using rvest?

The list of capabilities below is partially borrowed from Hadley Wickham (the creator of rvest), and we will go through several of them throughout this presentation:

- Create an HTML document from a URL, a file on disk, or a string with html().
- Select parts of a document using CSS selectors with html_nodes().
- Extract components with html_text(), html_attr(), html_attrs(), and html_table().
- Extract, modify, and submit forms with html_form(), set_values(), and submit_form().
- Navigate around a website as if you're in a browser with html_session(), jump_to(), follow_link(), back(), and forward().


Starting off simple: Scraping The Lego Movie on IMDb

#install.packages("rvest")

library(rvest)

# Download and parse The Lego Movie's IMDb page
lego_movie <- html("http://www.imdb.com/title/tt1490017/")

# Scrape the website for the movie rating
rating <- lego_movie %>% 
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
rating
## [1] 7.8
# Scrape the website for the cast
cast <- lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
cast
##  [1] "Will Arnett"     "Elizabeth Banks" "Craig Berry"    
##  [4] "Alison Brie"     "David Burrows"   "Anthony Daniels"
##  [7] "Charlie Day"     "Amanda Farinos"  "Keith Ferguson" 
## [10] "Will Ferrell"    "Will Forte"      "Dave Franco"    
## [13] "Morgan Freeman"  "Todd Hansen"     "Jonah Hill"
# Scrape the website for the URL of the movie poster
poster <- lego_movie %>%
  html_nodes("#img_primary img") %>%
  html_attr("src")
poster
## [1] "http://ia.media-imdb.com/images/M/MV5BMTg4MDk1ODExN15BMl5BanBnXkFtZTgwNzIyNjg3MDE@._V1_SX214_AL_.jpg"

# Extract the first review
review <- lego_movie %>%
  html_nodes("#titleUserReviewsTeaser p") %>%
  html_text()
review
## [1] "The stand out feature of the Lego Movie for me would be the way the Lego Universe was created. The movie paid great attention to detail making everything appear as it would made from Lego, including the water and clouds, and the surfaces people walked on all had the circles sticking upwards a Lego piece would have. Combined with all the yellow faces, and Lego part during building, I was convinced action took place in the Lego Universe.A combination of adult and child friendly humour should entertain all, the movie has done well to ensure audiences of all ages are catered to. The voice cast were excellent, especially Liam Neeson's split personality police officer, making the 2 personalities sound distinctive, and giving his Bad Cop the usual Liam Neeson tough guy. The plot is about resisting an over-controlling ruler, highlighted by the name of the hero's \"resistance piece\". It is well thought through, well written, and revealing at the right times. Full of surprises, The Lego Movie won't let You see what's coming. Best animated film since Wreck it Ralph! Please let there be sequels."

Scraping indeed.com for jobs

# Submit the form on indeed.com for a job description and location using html_form() and set_values()
query = "data science"
loc = "New York"
session <- html_session("http://www.indeed.com")
form <- html_form(session)[[1]]
form <- set_values(form, q = query, l = loc)

# rvest's submit_form function is still under construction: it does not work for
# websites that encode the form in the URL query string (i.e. GET requests),
# though it does seem to work for POST requests.
#url <- submit_form(session, indeed)

# Version 1 of our submit_form function
submit_form2 <- function(session, form){
  # Resolve the form's action URL against the session URL
  url <- XML::getRelativeURL(form$url, session$url)
  url <- paste(url, '?', sep = '')
  # Let rvest's internal request builder produce the name/value pairs
  values <- as.vector(rvest:::submit_request(form)$values)
  att <- names(values)
  # Drop a trailing unnamed (submit-button) entry, if present
  if (tail(att, n = 1) == "NULL"){
    values <- values[-length(values)]
    att <- att[-length(att)]
  }
  # Assemble the query string: name=value pairs joined by '&', spaces encoded as '+'
  q <- paste(att, values, sep = '=')
  q <- paste(q, collapse = '&')
  q <- gsub(" ", "+", q)
  url <- paste(url, q, sep = '')
  html_session(url)
}
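To see what this produces for the query above, here is the rough shape of the resulting URL (illustrative only; the exact path and parameter names depend on the live form):

# submit_form2(session, form)$url would look something like:
# "http://www.indeed.com/jobs?q=data+science&l=New+York"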


# Version 2 of our submit_form function
library(httr)
# Appends the elements of one list to another without changing the class of x;
# this matters because httr's build_url function requires an object of the url class
appendList <- function (x, val)
{
  stopifnot(is.list(x), is.list(val))
  xnames <- names(x)
  for (v in names(val)) {
    x[[v]] <- if (v %in% xnames && is.list(x[[v]]) && is.list(val[[v]]))
      appendList(x[[v]], val[[v]])
    else c(x[[v]], val[[v]])
  }
  x
}
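A quick illustration of why preserving the class matters (hypothetical values; build_url()'s exact percent-encoding may vary):

# parse_url() returns a list of class "url"; appendList() merges in a query
# while keeping that class, so build_url() still accepts the result:
# u <- parse_url("http://www.indeed.com/jobs")
# u <- appendList(u, list(query = list(q = "data science", l = "New York")))
# build_url(u)
# ## "http://www.indeed.com/jobs?q=data%20science&l=New%20York"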
 
# Simulating submit_form for GET requests
submit_geturl <- function(session, form)
{
  # Reuse rvest's internal request builder, keeping only the name/value pairs
  query <- rvest:::submit_request(form)
  query$method <- NULL
  query$encode <- NULL
  query$url <- NULL
  # Rename the remaining element so it becomes the query component of the URL
  names(query) <- "query"

  # Resolve the form's action URL and parse it into its components
  relativeurl <- XML::getRelativeURL(form$url, session$url)
  basepath <- parse_url(relativeurl)

  # Merge the query into the parsed URL and reassemble it with httr
  fullpath <- appendList(basepath, query)
  fullpath <- build_url(fullpath)
  fullpath
}
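Both helpers target the same search results. The difference: submit_form2() returns a session directly, while submit_geturl() returns only the URL string (delegating assembly to httr's build_url()), which can then be opened with html_session(), as in the LinkedIn example further below.

# Quick sanity check (illustrative; the two URLs may differ only in how
# spaces are encoded, '+' versus '%20'):
# submit_form2(session, form)$url
# html_session(submit_geturl(session, form))$url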


# Submit form and get new url
session1 <- submit_form2(session, form)

# Get reviews of last company using follow_link()
session2 <- follow_link(session1, css = "#more_9 li:nth-child(3) a")
reviews <- session2 %>% html_nodes(".description") %>% html_text()
reviews
## [1] "Custody Client Services"                                       
## [2] "An exciting position on a trading floor"                       
## [3] "Great work environment"                                        
## [4] "A company that helps its employees to advance career."         
## [5] "Decent Company to work for while you still have the job there."
# Get average salary for each job listing based on title and location
salary_links <- html_nodes(session1, css = "#resultsCol li:nth-child(2) a") %>% html_attr("href")
salary_links <- paste(session$url, salary_links, sep='')
salaries <- lapply(salary_links, . %>% html() %>% html_nodes("#salary_display_table .salary") %>% html_text())
salary <- unlist(salaries)

# Keep a handle on the search-results session
data_sci_indeed <- session1

# Get job titles
job_title <- data_sci_indeed %>% 
  html_nodes("[itemprop=title]") %>%
  html_text()

# Get companies
company <- data_sci_indeed %>%
  html_nodes("[itemprop=hiringOrganization]") %>%
  html_text()

# Get locations
location <- data_sci_indeed %>%
  html_nodes("[itemprop=addressLocality]") %>%
  html_text()

# Get descriptions
description <- data_sci_indeed %>%
  html_nodes("[itemprop=description]") %>%
  html_text()

# Get the links
link <- data_sci_indeed %>%
  html_nodes("[itemprop=title]") %>%
  html_attr("href")
link <- paste('[Link](https://www.indeed.com', link, sep='')
link <- paste(link, ')', sep='')

indeed_jobs <- data.frame(job_title,company,location,description,salary,link)

library(knitr)
kable(indeed_jobs, format = "html")
| job_title | company | location | description | salary | link |
|-----------|---------|----------|-------------|--------|------|
| Data Scientist | Career Path Group | New York, NY 10018 (Clinton area) | Or higher in Computer Science or related field. Design, develop, and optimize our data and analytics system…. | $109,000 | Link |
| Data Scientist or Statistician | Humana | New York, NY | Experience with unstructured data analysis. Humana is seeking an experienced statistician with demonstrated health and wellness data analysis expertise to join… | $60,000 | Link |
| Analyst | 1010data | New York, NY | Data providers can also use 1010data to share and monetize their data. 1010data is the leading provider of Big Data Discovery and data sharing solutions…. | $81,000 | Link |
| Data Scientist & Visualization Engineer | Enstoa | New York, NY | 2+ years professional experience analyzing complex data sets, modeling, machine learning, and/or large-scale data mining…. | $210,000 | Link |
| Data Scientist - Intelligent Solutions | JPMorgan Chase | New York, NY | Experience managing and growing a data science team. Data Scientist - Intelligent Solutions. Analyze communications data and Utilize statistical natural… | $109,000 | Link |
| Analytics Program Lead | AIG | New York, NY | Lead the analytical team for Data Solutions. Graduate degree from a renowned institution in any advanced quantitative modeling oriented discipline including but… | $126,000 | Link |
| Data Engineer | Standard Analytics | New York, NY | Code experience in a production environment (familiar with data structures, parallelism, and concurrency). We aim to organize the world’s scientific information… | $122,000 | Link |
| Summer Intern - Network Science and Big Data Analytics | IBM | Yorktown Heights, NY | The Network Science and Big Data Analytics department at the IBM T. Our lab has access to large computing resources and data…. | $36,000 | Link |
| Data Scientist | The Nielsen Company | New York, NY | As a Data Scientist in the Data Integration group, you will be involved in the process of integrating data to enable analyses of patterns and relationships… | $109,000 | Link |
| Data Analyst, IM Data Science | BNY Mellon | New York, NY | The Data Analyst will support a wide variety of projects and initiatives of the Data Science Group, including the creation of back-end data management tools,… | $84,000 | Link |


More examples with LinkedIn

# Attempt to crawl LinkedIn; a browser user-agent string is required to access LinkedIn pages
uastring <- "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
session <- html_session("https://www.linkedin.com/job/", user_agent(uastring))
form <- html_form(session)[[1]]
form <- set_values(form, keywords = "Data Science", location="New York")
 
new_url <- submit_geturl(session,form)
new_session <- html_session(new_url, user_agent(uastring))
jobtitle <- new_session %>% html_nodes(".job [itemprop=title]") %>% html_text()
company <- new_session %>% html_nodes(".job [itemprop=name]") %>% html_text()
location <- new_session %>% html_nodes(".job [itemprop=addressLocality]") %>% html_text()
description <- new_session %>% html_nodes(".job [itemprop=description]") %>% html_text()
url <- new_session %>% html_nodes(".job [itemprop=title]") %>% html_attr("href")
url <- paste('[Link](', url, sep='')
url <- paste(url, ')', sep='')
df <- data.frame(jobtitle, company, location, url)

df %>% kable
| jobtitle | company | location | url |
|----------|---------|----------|-----|
| Data Science Lead: Metis | Kaplan | New York City, NY, US | Link |
| Data Science Lead: Metis | Kaplan Test Prep | New York, NY | Link |
| Think Big Senior Data Scientist | Think Big, A Teradata Company | US-NY-New York | Link |
| Think Big Principal Data Scientist | Think Big, A Teradata Company | US-NY-New York | Link |
| Data Scientist - Professional Services Consultant (East … | MapR Technologies | Greater New York City Area | Link |
| Think Big Senior Data Scientist | Teradata | New York City, NY, US | Link |
| Think Big Principal Data Scientist | Teradata | New York City, NY, US | Link |
| Sr. Software Engineer - Data Science - HookLogic | HookLogic, Inc. | New York City, NY, US | Link |
| Think Big Data Scientist | Think Big, A Teradata Company | US-NY-New York | Link |
| Director of Data Science Programs | DataKind | New York City, NY, US | Link |
| Lead Data Scientist - VP - Intelligent Solutions | JPMorgan Chase & Co. | US-NY-New York | Link |
| Senior Data Scientist for US Quantitative Fund, NYC | GQR Global Markets | Greater New York City Area | Link |
| Google Cloud Solutions Practice, Google Data Solution … | PricewaterhouseCoopers | New York City, NY, US | Link |
| Senior Data Scientist | Dun and Bradstreet | Short Hills, NJ, US | Link |
| Senior data scientist | Mezzobit | New York City, NY, US | Link |
| Think Big Data Scientist | Teradata | New York City, NY, US | Link |
| Data Scientist - Intelligent Solutions | JPMorgan Chase & Co. | US-NY-New York | Link |
| Technical Trainer EMEA | Datameer | New York | Link |
| Elementary School Science Teacher | Success Academy Charter Schools | Greater New York City Area | Link |
| Middle School Science Teacher | Success Academy Charter Schools | Greater New York City Area | Link |
| Data Scientist (various levels) | Burtch Works | Greater New York City Area | Link |
| Sr. Data Scientist – Big Data, Online Advertising, Search | Magnetic | New York, NY | Link |
| Sr. Big Data Engineer FlexGraph | ADP | New York, NY | Link |
| Data Science Lead Instructor - Data Science, Teaching | CyberCoders | New York City, NY | Link |
| Director, Data Consulting | Havas Media | Greater New York City Area | Link |

Attempting to scrape Columbia LionShare

# Attempt to crawl Columbia Lionshare for jobs
session <- html_session("http://www.careereducation.columbia.edu/lionshare")
form <- html_form(session)[[1]]
form <- set_values(form, username = "uni")
# The code below is commented out in the Markdown, since it prompts for a password interactively

#pw <- .rs.askForPassword("Password?")
#form <- set_values(form, password = pw)
#rm(pw)
#session2 <- submit_form(session, form)
#session2 <- follow_link(session2, "Job")
#form2 <- html_form(session2)[[1]]
#form2 <- set_values(form2, PositionTypes = 7, Keyword = "Data")
#session3 <- submit_form(session2, form2)

# Unable to scrape: the table containing the job data is rendered by JavaScript,
# which rvest does not execute, so the data never appears in the fetched HTML

rvest has no equivalent of waiting for a document to finish loading before scraping, because it never runs the page's JavaScript in the first place. The general recommendation for such pages is to use something entirely different, such as Selenium, to scrape the data.
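As a rough sketch of that route in R, the RSelenium package can drive a real browser and hand the fully rendered HTML back to rvest. This is an untested sketch: it assumes a Selenium server is running locally on the default port, and the selectors for LionShare would still need to be worked out.

# install.packages("RSelenium")
library(RSelenium)

# Connect to a locally running Selenium server and open a browser
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4444L, browserName = "firefox")
remDr$open()
remDr$navigate("http://www.careereducation.columbia.edu/lionshare")

# Give the JavaScript a moment to finish rendering, then grab the page
# source from the browser and parse it with rvest as usual
Sys.sleep(5)
rendered <- html(remDr$getPageSource()[[1]])
remDr$close()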

If you are web scraping with Python, chances are you have already tried urllib, httplib, requests, and the like. These are excellent libraries, but some websites don't like to be scraped. In those cases you may need to disguise your scraping bot as a human being. Selenium is just the tool for that. Selenium is a webdriver: it takes control of your browser, which then does all the work. Hence what the website "sees" is Chrome, Firefox, or IE; it does not see Python or Selenium. That makes it a lot harder for the website to tell your bot from a human being.