
Ebook Corpus - A parser and extractor for electronic books

dohliam/ebook-corpus

Ebook Corpus is a set of tools for parsing and extracting the text of ebooks in various formats, designed for the purpose of creating large multilingual ebook-based text corpora.

Many people have amassed enormous collections of ebooks, often containing millions of lines of text when taken as a whole, so it is always surprising to find that there aren't more tools and libraries available to work with ebooks as a corpus source. It seems that almost all the existing tools are focused on consuming (reading) ebooks, while the remaining few provide the functionality to create ebooks to be thus consumed.

As wonderful as ebooks are, they are often packaged in formats that are incredibly underspecified, or worse, that don't follow the specifications that do exist. A remarkable number of parsing libraries choke on very simple books even in presumably well-supported formats like EPUB3.

There are many ways for an ebook to defy the expectations of the parser -- perhaps it has been written in Unicode and the parser only handles US-ASCII, or the parser expects Unicode and it's written in KOI-8. Maybe the ebook contains an OPF file called content.opf in the root directory, or maybe it's in a separate CONTENT subfolder -- or called something completely different, like mytoc.opf or 目录.opf.
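To illustrate the encoding mismatch described above, here is a minimal sketch of decoding raw bytes defensively in Ruby. The helper name `to_utf8` and the list of fallback encodings are assumptions made for this example; they are not taken from ebook.rb.

```ruby
# Hypothetical helper (not part of ebook.rb): return the given bytes as a
# UTF-8 string, trying a list of legacy encodings if the bytes are not
# already valid UTF-8.
def to_utf8(raw, fallbacks = %w[KOI8-R Windows-1251])
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  fallbacks.each do |name|
    candidate = raw.dup.force_encoding(name)
    next unless candidate.valid_encoding?

    begin
      return candidate.encode(Encoding::UTF_8)
    rescue EncodingError
      next # this fallback cannot represent the bytes; try the next one
    end
  end

  # Last resort: keep the text usable by replacing undecodable bytes.
  utf8.scrub('?')
end
```

Single-byte encodings such as KOI8-R accept almost any byte sequence, so the first fallback in the list usually wins. The order of `fallbacks` is therefore a guess about the collection, not a form of detection.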

The Ebook Corpus tools won't solve all of these problems, but they nevertheless provide a number of options to make it easier to work with large, multilingual collections of ebooks as a raw text source.

Usage

Invoking the program on the command-line is straightforward:

./ebook.rb [options] [filename]

Where [filename] is the path to the ebook file that you want to work with. If the file has a standard extension (*.epub, *.mobi, *.fb2), its format should be detected automatically.
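Extension-based detection of this kind can be sketched as a simple lookup. This is only an illustration: the constant `FORMATS_BY_EXTENSION` and the method `detect_format` are hypothetical names, and the way ebook.rb actually performs detection may differ.

```ruby
# Hypothetical mapping from file extension to format, based on the
# extensions listed in this README.
FORMATS_BY_EXTENSION = {
  '.epub' => :epub,
  '.fb2'  => :fb2,
  '.mobi' => :mobi,
  '.prc'  => :mobi,
  '.azw'  => :mobi
}.freeze

# Returns a format symbol, or nil when the extension is not recognised.
def detect_format(path)
  FORMATS_BY_EXTENSION[File.extname(path).downcase]
end
```

Lowercasing the extension means that a file named `Book.EPUB` is treated the same as `book.epub`. A `nil` result leaves the caller free to report an unsupported file.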

Options

  • -a or --all: Extract all contents of epub
  • -c or --cover: Extract cover image
  • -f or --flatten-dir: Save all files to the current folder rather than an individual directory
  • -h or --html: Extract raw html
  • -i or --images: Extract images to a separate folder
  • -m or --metadata: Print metadata
    • -T or --title: Print title metadata only
    • -A or --author: Print author metadata only
    • -I or --isbn: Print ISBN metadata only
    • -L or --language: Print language metadata only
    • -P or --publisher: Print publisher metadata only
    • -D or --description: Print description metadata only
  • -o or --output-dir DIR: Save output to specified directory
  • -s or --save: Save (text or html) to file instead of printing
  • -t or --text: Extract plain text
  • -T or --tests: Run test suite
  • -p or --pager: View text in pager
  • -v or --view: Open images in viewer
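For readers curious how a handful of these flags could be declared, here is a minimal sketch using Ruby's standard `OptionParser`. It covers only four of the options above, and the method name `parse_options` and the keys of the returned hash are assumptions for illustration, not a description of how ebook.rb is written.

```ruby
require 'optparse'

# Hypothetical sketch: parse a subset of the documented flags and return
# the collected options together with the ebook filename.
def parse_options(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.banner = 'Usage: ./ebook.rb [options] [filename]'
    opts.on('-m', '--metadata', 'Print metadata') { options[:metadata] = true }
    opts.on('-s', '--save', 'Save (text or html) to file') { options[:save] = true }
    opts.on('-t', '--text', 'Extract plain text') { options[:text] = true }
    opts.on('-o', '--output-dir DIR', 'Save output to specified directory') do |dir|
      options[:output_dir] = dir
    end
  end

  # parse (without the bang) leaves argv untouched and returns the
  # remaining non-option arguments.
  remaining = parser.parse(argv)
  [options, remaining.first]
end
```

With this sketch, `parse_options(%w[-t -s -o out book.epub])` yields the text, save and output directory options along with `book.epub` as the filename.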

Supported formats

Format        File extension
EPUB          .epub
FictionBook   .fb2
Mobipocket    .mobi, .prc, .azw

Support for Mobipocket files is provided via a wrapper for the Python script mobiunpack.py by @kevinhendricks (released under the GPLv3). If you know of a drop-in replacement library in Ruby for parsing MOBI files (or are interested in writing one), please let me know!

Note that only ebooks without DRM will work with this script.

Contributing

PRs, suggestions, examples of ebooks that don't parse properly, and other contributions are always welcome! Providing support for additional formats or opening issues for bugs are examples of ways to help.

MOBI support has only been tested against files with the .mobi extension. It should in theory also work for other extensions. If you have access to ebooks with a .prc or .azw file extension and can confirm this, that would be appreciated!

To do

The code is pretty ad hoc at the moment and generally in need of a cleanup. Different formats are handled separately but should probably be merged.

Other things:

  • Guess alternately-named content.opf files
  • Figure out cross-platform way of opening images in default viewer (current kludge is hard-coded to open image folder in Gwenview since xdg-open doesn't play nicely with cleaning up temporary files after viewing)
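One possible approach to the first item is sketched below. In an EPUB container, the file META-INF/container.xml names the package document in the `full-path` attribute of its `rootfile` element, so the OPF file can be located without guessing its name. The method `locate_opf` and its arguments are hypothetical; the sketch takes the container XML and the archive's entry names as plain values, so it does not depend on any particular zip library. It uses a regular expression to stay dependency-free, although a real XML parser would be more robust.

```ruby
# Matches the full-path attribute of the first <rootfile> element.
ROOTFILE_PATTERN = /<rootfile\b[^>]*?\bfull-path\s*=\s*(["'])(.*?)\1/m

# Hypothetical helper: return the path of the OPF file inside an EPUB.
#   container_xml - contents of META-INF/container.xml (may be nil)
#   entry_names   - list of all file paths inside the archive
def locate_opf(container_xml, entry_names)
  match = container_xml.to_s.match(ROOTFILE_PATTERN)
  return match[2] if match

  # Fallback for archives with a missing or unreadable container.xml:
  # take the first entry with an .opf extension, whatever it is called.
  entry_names.find { |name| File.extname(name).casecmp?('.opf') }
end
```

The fallback is a guess, which is why the declared path takes precedence. It would still pick up names like mytoc.opf or 目录.opf when the container file is missing.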

License

MIT.
