GutenbergScrapper

This repo contains a multi-threaded scraper for the Project Gutenberg website, which hosts 56,920 books that are free to read and download. It can scrape the whole website in just 5 hours. The repo also includes a text file with the data collected up to April 2018. The file holds only the EBook-No., title, authors and language of every book, because those were the only attributes I cared about.

You can add these attributes as well:

Subject
LoC Class
Category
Release Date
Copyright Status
Downloads
Price

To collect any or all of these attributes, you only need to modify the member variable INCLUDE in the script, like so:

self.INCLUDE = set(['Title', 'Author', 'EBook-No.', 'Language'])

Then, run the script.
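As a rough illustration of what the INCLUDE set does, here is a minimal sketch that keeps only the listed attributes from one scraped record. The helper name `filter_attributes` and the sample record are hypothetical and are not taken from the script itself. The script may apply the filter differently.

```python
# Hypothetical example: 'Subject' has been added to the default four attributes.
INCLUDE = set(['Title', 'Author', 'EBook-No.', 'Language', 'Subject'])


def filter_attributes(record, include):
    """Return a new dict holding only the keys listed in include."""
    return {key: value for key, value in record.items() if key in include}


# A made-up record standing in for the attributes found on one book page.
scraped = {
    'EBook-No.': '1',
    'Author': 'Jefferson, Thomas, 1743-1826',
    'Title': 'The Declaration of Independence of the United States of America',
    'Language': 'English',
    'Subject': '(placeholder subject)',
    'Downloads': '(placeholder count)',
}

kept = filter_attributes(scraped, INCLUDE)
for key, value in kept.items():
    print(key + ': ' + value)
```

Here 'Downloads' is dropped because it is not in INCLUDE, while 'Subject' is kept.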

The collected data looks like this:

ID: 1
Author: Jefferson, Thomas, 1743-1826
Title: The Declaration of Independence of the United States of America
Language: English

ID: 2
Author: United States
Title: The United States Bill of Rights
The Ten Original Amendments to the Constitution of the United States
Language: English

ID: 3
Author: Kennedy, John F. (John Fitzgerald), 1917-1963
Title: John F. Kennedy's Inaugural Address
Language: English

...
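If you want to load this file back into Python, the sketch below shows one possible way to parse it. It assumes that records are separated by blank lines, that every field starts with one of the four keys shown above, and that any other line (such as the second title line of book 2) continues the previous field. The function name `parse_records` is hypothetical. If you add more attributes to INCLUDE, the `known` tuple would need to be extended to match.

```python
def parse_records(text):
    """Split the dump into dicts, one per blank-line-separated block."""
    known = ('ID', 'Author', 'Title', 'Language')
    records = []
    for block in text.strip().split('\n\n'):
        record, last_key = {}, None
        for line in block.splitlines():
            key, sep, value = line.partition(': ')
            if sep and key in known:
                record[key] = value
                last_key = key
            elif last_key is not None:
                # Continuation line, e.g. a subtitle on its own line.
                record[last_key] += '\n' + line
        if record:
            records.append(record)
    return records


sample = """ID: 1
Author: Jefferson, Thomas, 1743-1826
Title: The Declaration of Independence of the United States of America
Language: English

ID: 2
Author: United States
Title: The United States Bill of Rights
The Ten Original Amendments to the Constitution of the United States
Language: English
"""

books = parse_records(sample)
for book in books:
    print(book['ID'], '|', book['Title'].replace('\n', ' / '))
```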

Prerequisites

You need to install:

Python3
Beautiful Soup 4.0
requests
multiprocessing (part of the Python standard library, so no separate install is needed)
