
Introduction to scraping

A version of Gustave Caillebotte's Floor-scrapers (Les raboteurs de parquet), 1876

These are resources for a one-day class on the basics of web scraping taught at Wilfrid Laurier University on Friday, July 20th, 2018, as part of the Laurier Summer Institute of Research Methods. Here's a brief description of the course:

One of the most time-consuming aspects of performing any sort of data analysis is getting that data in the first place. Often, a straightforward, well-structured database doesn't exist, which means you need to build one yourself, from scratch. That's where scraping comes in: you can build a program to automate this collection for you, saving countless hours of tedious and error-prone data entry. In this one-day class, you'll learn how to decide on the structure for your data, pick the right scraping approach, create a scraper and systematize your data collection. The class will introduce the basic concepts and strategies behind scraping, and focus on extracting data from both websites and offline documents (such as PDFs).

Though this course assumes a basic working knowledge of R, the resources should be straightforward enough that they can be followed by someone with a background in a different programming language, such as JavaScript, Ruby or Python.

  • About me
  • Today's schedule
  • What is scraping?
  • When is it useful? (extracting text, tables, images, bulk downloading files, automating form entry)
  • Types of scraping (manual entry, text pattern matching, using APIs, parsing the DOM, headless browsers)
  • What you will learn

15-minute break, 10:30am to 10:45am

  • Basic regular expressions
  • Exercise: Let's write some regular expressions (see the first sketch after this list)
  • Selectors and XPath
  • Identifying patterns in markup
  • Writing a basic JavaScript selector query
  • Exercise: Let's write some queries (see the second sketch after this list)
  • Additional resources: https://regexone.com/
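
To make the regex portion concrete, here's a minimal R sketch using the stringr package. The input strings and the patterns are invented for illustration:

```r
library(stringr)

# Invented example strings: messy text mixing dates and dollar amounts
lines <- c(
  "Contract awarded 2018-07-20 for $1,250.00",
  "No useful data on this line",
  "Renewed 2019-01-05, value $980.50"
)

# Pull out ISO-style dates (four digits, dash, two digits, dash, two digits)
str_extract(lines, "\\d{4}-\\d{2}-\\d{2}")
#> [1] "2018-07-20" NA "2019-01-05"

# Pull out dollar amounts, allowing commas as thousands separators
str_extract(lines, "\\$[\\d,]+\\.\\d{2}")
#> [1] "$1,250.00" NA "$980.50"
```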
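
And a minimal sketch of selector queries using rvest, assuming a hypothetical page with elements of class "headline". In a browser console, the same CSS selector would work as document.querySelectorAll(".headline a"):

```r
library(rvest)

# Hypothetical URL; any page with repeated, consistent markup will do
page <- read_html("https://example.com/news")

# CSS selector: every <a> inside an element with class "headline"
page %>%
  html_nodes(".headline a") %>%
  html_text()

# The same query expressed as XPath
page %>%
  html_nodes(xpath = "//*[contains(@class, 'headline')]//a") %>%
  html_text()
```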

Lunch break, 12pm to 1:15pm

  • Quick tidyverse crash course
  • Exercise: Using tidyverse packages to read, manipulate, pipe and save data (see the first sketch after this list)
  • Connecting to a webpage and extracting information
  • Exercise: Getting familiar with rvest (see the second sketch after this list)
  • Building a scraper to build a scraper
  • Caveats for high-traffic sites such as Facebook, Google and Amazon
  • Throttling your scrape
  • Exercise: Let's build a throttler (see the third sketch after this list)
  • Scrape first, clean later
  • Always err on the side of collecting more data rather than less
  • Make it reproducible
  • Exercise: Adapt our scraper to a new website
  • Advanced scraping with RSelenium (see the fourth sketch after this list)
  • Additional resources: http://uc-r.github.io/scraping
  • RSelenium: https://www.r-bloggers.com/scraping-with-selenium/
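
A minimal tidyverse sketch of the read-manipulate-pipe-save cycle, assuming a hypothetical contracts.csv with an amount column:

```r
library(tidyverse)

# Hypothetical input file with an "amount" column like "$1,250.00"
raw <- read_csv("contracts.csv")

clean <- raw %>%
  filter(!is.na(amount)) %>%                # drop rows with no dollar figure
  mutate(amount = parse_number(amount)) %>% # "$1,250.00" becomes 1250
  arrange(desc(amount))                     # largest contracts first

write_csv(clean, "contracts-clean.csv")
```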
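
A sketch of connecting to a webpage and extracting information with rvest; the URL and the presence of an HTML table are assumptions for illustration:

```r
library(rvest)

# Hypothetical page containing at least one HTML <table>
page <- read_html("https://example.com/salaries")

# html_table() turns <table> nodes into data frames
salaries <- page %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(header = TRUE)

head(salaries)
```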
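
A minimal throttler sketch: the same rvest calls wrapped in a loop that pauses a random few seconds between requests (the URLs are invented):

```r
library(rvest)

# Invented set of paginated URLs to visit
urls <- paste0("https://example.com/news?page=", 1:10)

results <- lapply(urls, function(url) {
  Sys.sleep(runif(1, min = 2, max = 5)) # pause 2 to 5 seconds between requests
  read_html(url) %>%
    html_nodes(".headline a") %>%
    html_text()
})
```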
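
And a bare-bones RSelenium sketch, assuming Firefox and a Selenium server are available locally; the port, URL and selector are arbitrary:

```r
library(RSelenium)

# Start a Selenium server plus a Firefox session (port is arbitrary)
rD <- rsDriver(browser = "firefox", port = 4545L)
remDr <- rD$client

# Drive the browser to a page and read a rendered element
remDr$navigate("https://example.com")
el <- remDr$findElement(using = "css selector", value = "h1")
el$getElementText()[[1]]

# Shut everything down when done
remDr$close()
rD$server$stop()
```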

15-minute break, 3pm to 3:15pm

  • Using scrapers to download PDFs and other files (see the sketch after this list)
  • Challenges and picking the right tool for the job
  • Tabula and Adobe Acrobat
  • Tesseract, pdfplumber, docs2csv
  • Exercise: Let's use Tabula to extract tables from a PDF
  • Additional resources: Parsing prickly PDFs
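
A sketch of bulk-downloading PDFs with rvest, assuming a hypothetical index page that links to .pdf files:

```r
library(rvest)

# Hypothetical index page linking to a set of .pdf reports
page <- read_html("https://example.com/reports")

pdf_urls <- page %>%
  html_nodes("a[href$='.pdf']") %>%
  html_attr("href") %>%
  xml2::url_absolute(base = "https://example.com/reports") # resolve relative links

for (url in pdf_urls) {
  download.file(url, destfile = basename(url), mode = "wb")
  Sys.sleep(2) # be polite between downloads
}
```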

Part 6 (time allowing): Let's build a scraper from scratch! (3:45pm to end)

  • We'll pick a target website and write a scraper together
  • Exercise: Individual scraping
  • That's it!