
Introduction to scraping

A version of Gustave Caillebotte's Floor-scrapers (Les raboteurs de parquet), 1876

These are resources for a one-day class on the basics of web scraping taught at Wilfrid Laurier University on Friday, July 20th, 2018, as part of the Laurier Summer Institute of Research Methods. Here's a brief description of the course:

One of the most time-consuming aspects of performing any sort of data analysis is getting that data in the first place. Often, a straightforward, well-structured database doesn't exist, which means you need to build one yourself, from scratch. That's where scraping comes in: you can build a program to automate this collection for you, saving countless hours of tedious and error-prone data entry. In this one-day class, you'll learn how to decide on the structure for your data, pick the right scraping approach, create a scraper and systematize your data collection. The class will introduce the basic concepts and strategies behind scraping, and focus on extracting data from both websites and offline documents (such as PDFs).

Though this course assumes a basic working knowledge of R, the resources should be straightforward enough that they can be followed by someone with a background in a different programming language, such as JavaScript, Ruby or Python.

  • About me
  • Today's schedule
  • What is scraping?
  • When is it useful? (extracting text, tables, images, bulk downloading files, automating form entry)
  • Types of scraping (manual entry, text pattern matching, using APIs, parsing the DOM, headless browsers)
  • What you will learn

15-minute break, 10:30am to 10:45am

  • Basic regular expressions
  • Exercise: Let's write some regular expressions (see the first sketch after this list)
  • Selectors and XPath
  • Identifying patterns in markup
  • Writing a basic JavaScript selector query
  • Exercise: Let's write some queries (see the second sketch after this list)
  • Additional resources: https://regexone.com/
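
To make the regex portion concrete, here's a minimal R sketch using the stringr package. The input strings and the patterns are invented for illustration:

```r
library(stringr)

# Invented example strings: messy text mixing dates and dollar amounts
lines <- c(
  "Contract awarded 2018-07-20 for $1,250.00",
  "No useful data on this line",
  "Renewed 2019-01-05, value $980.50"
)

# Pull out ISO-style dates (four digits, dash, two digits, dash, two digits)
str_extract(lines, "\\d{4}-\\d{2}-\\d{2}")
#> [1] "2018-07-20" NA "2019-01-05"

# Pull out dollar amounts, allowing commas as thousands separators
str_extract(lines, "\\$[\\d,]+\\.\\d{2}")
#> [1] "$1,250.00" NA "$980.50"
```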
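
And a minimal sketch of selector queries using rvest, assuming a hypothetical page with elements of class "headline". In a browser console, the same CSS selector would work as document.querySelectorAll(".headline a"):

```r
library(rvest)

# Hypothetical URL; any page with repeated, consistent markup will do
page <- read_html("https://example.com/news")

# CSS selector: every <a> inside an element with class "headline"
page %>%
  html_nodes(".headline a") %>%
  html_text()

# The same query expressed as XPath
page %>%
  html_nodes(xpath = "//*[contains(@class, 'headline')]//a") %>%
  html_text()
```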

Lunch break, 12pm to 1:15pm

  • Quick tidyverse crash course
  • Exercise: Using tidyverse packages to read, manipulate, pipe and save data (see the first sketch after this list)
  • Connecting to a webpage and extracting information
  • Exercise: Getting familiar with rvest (see the second sketch after this list)
  • Building a scraper to build a scraper
  • Caveats for high-traffic sites such as Facebook, Google and Amazon
  • Throttling your scrape
  • Exercise: Let's build a throttler (see the third sketch after this list)
  • Scrape first, clean later
  • Always err on the side of collecting more data rather than less
  • Make it reproducible
  • Exercise: Adapt our scraper to a new website
  • Advanced scraping with RSelenium (see the fourth sketch after this list)
  • Additional resources: http://uc-r.github.io/scraping
  • RSelenium: https://www.r-bloggers.com/scraping-with-selenium/
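
A minimal tidyverse sketch of the read-manipulate-pipe-save cycle, assuming a hypothetical contracts.csv with an amount column:

```r
library(tidyverse)

# Hypothetical input file with an "amount" column like "$1,250.00"
raw <- read_csv("contracts.csv")

clean <- raw %>%
  filter(!is.na(amount)) %>%                # drop rows with no dollar figure
  mutate(amount = parse_number(amount)) %>% # "$1,250.00" becomes 1250
  arrange(desc(amount))                     # largest contracts first

write_csv(clean, "contracts-clean.csv")
```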
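
A sketch of connecting to a webpage and extracting information with rvest; the URL and the presence of an HTML table are assumptions for illustration:

```r
library(rvest)

# Hypothetical page containing at least one HTML <table>
page <- read_html("https://example.com/salaries")

# html_table() turns <table> nodes into data frames
salaries <- page %>%
  html_nodes("table") %>%
  .[[1]] %>%
  html_table(header = TRUE)

head(salaries)
```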
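
A minimal throttler sketch: the same rvest calls wrapped in a loop that pauses a random few seconds between requests (the URLs are invented):

```r
library(rvest)

# Invented set of paginated URLs to visit
urls <- paste0("https://example.com/news?page=", 1:10)

results <- lapply(urls, function(url) {
  Sys.sleep(runif(1, min = 2, max = 5)) # pause 2 to 5 seconds between requests
  read_html(url) %>%
    html_nodes(".headline a") %>%
    html_text()
})
```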
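
And a bare-bones RSelenium sketch, assuming Firefox and a Selenium server are available locally; the port, URL and selector are arbitrary:

```r
library(RSelenium)

# Start a Selenium server plus a Firefox session (port is arbitrary)
rD <- rsDriver(browser = "firefox", port = 4545L)
remDr <- rD$client

# Drive the browser to a page and read a rendered element
remDr$navigate("https://example.com")
el <- remDr$findElement(using = "css selector", value = "h1")
el$getElementText()[[1]]

# Shut everything down when done
remDr$close()
rD$server$stop()
```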

15-minute break, 3pm to 3:15pm

  • Using scrapers to download PDFs and other files (see the sketch after this list)
  • Challenges and picking the right tool for the job
  • Tabula and Adobe Acrobat
  • Tesseract, pdfplumber, docs2csv
  • Exercise: Let's use Tabula to extract tables from a PDF
  • Additional resources: Parsing prickly PDFs
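
A sketch of bulk-downloading PDFs with rvest, assuming a hypothetical index page that links to .pdf files:

```r
library(rvest)

# Hypothetical index page linking to a set of .pdf reports
page <- read_html("https://example.com/reports")

pdf_urls <- page %>%
  html_nodes("a[href$='.pdf']") %>%
  html_attr("href") %>%
  xml2::url_absolute(base = "https://example.com/reports") # resolve relative links

for (url in pdf_urls) {
  download.file(url, destfile = basename(url), mode = "wb")
  Sys.sleep(2) # be polite between downloads
}
```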

Part 6 (time allowing): Let's build a scraper from scratch! (3:45pm to end)

  • We'll pick a target website and write a scraper together
  • Exercise: Individual scraping
  • That's it!