Skip to content

Latest commit

 

History

History
51 lines (36 loc) · 1.36 KB

README.md

File metadata and controls

51 lines (36 loc) · 1.36 KB

GutenbergScrapper

This repo contains a multi-threading scrapper for the Gutenberg's project website which contains 56,920 books free to read and download. It can scrape the whole website in just 5 hours. Also in this repo, you can find a text file containing the whole data until April 2018 containing only the Ebook-No., title, authors and language for every book because these attributes are the only ones that I cared about.

You can add these attributes as well:

  • Subject
  • LoC Class
  • Category
  • Release Date
  • Copyright Status
  • Downloads
  • Price

If you want to add any of/all these attributes, you can modify the script to add whatever you want by only modifying the member variable INCLUDE like so:

self.INCLUDE = set(['Title', 'Author', 'EBook-No.', 'Language'])

Then, run the script.

The collected data will be like this:

ID: 1
Author: Jefferson, Thomas, 1743-1826
Title: The Declaration of Independence of the United States of America
Language: English

ID: 2
Author: United States
Title: The United States Bill of Rights
The Ten Original Amendments to the Constitution of the United States
Language: English

ID: 3
Author: Kennedy, John F. (John Fitzgerald), 1917-1963
Title: John F. Kennedy's Inaugural Address
Language: English

...

prerequisites

You need to install:

  • Python3
  • Beautiful Soap 4.0
  • requests
  • multiprocessing