This repository has been archived by the owner on Aug 24, 2023. It is now read-only.

Seize is a lightweight web-page content extractor for Node.js and the browser, inspired by Arc90 Readability and Safari Reader


peremenov/seize

seize



Seize is a lightweight web-page content extractor for Node.js and the browser, inspired by Arc90 Readability and Safari Reader.

Install

npm i --save seize

Usage

Seize works with DOM libraries such as jsdom. It only extracts and prepares the content DOM node for further use.

Example

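var Seize = require('seize'),
    // function-style API of older jsdom releases;
    // newer releases export a JSDOM class instead
    jsdom = require('jsdom').jsdom;

// Build a window from an HTML string and hand its document to Seize
var window = jsdom('<your html here>').defaultView,
    seize  = new Seize(window.document);

seize.content(); // returns the extracted content as a DOM node
seize.text();    // returns only the text of the extracted content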

Browser usage

For browser usage you should clone your DOM object or create a new one from an HTML string:

/**
 * Converts an HTML string to a Document
 * @param  {String} html  HTML document string
 * @return {Document}     detached document built from the string
 */
function HTMLParser(html){
  // Create an empty document that is not attached to the current page
  var doc = document.implementation.createHTMLDocument("example");
  doc.documentElement.innerHTML = html;
  return doc;
}
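
The snippet above covers the "create it from an HTML string" option. The other option, cloning the live document, could look like the sketch below. `extractFromClone` is a hypothetical helper, not part of Seize. Since this README does not say how Seize is loaded in the browser, the constructor is passed in as an argument. Working on a copy is presumably needed because extraction and clean-up may modify the document they receive.

```javascript
/**
 * Runs the extractor on a deep copy of a document so the page itself stays untouched.
 * Hypothetical helper for illustration; not part of the Seize API.
 * @param  {Function} Seize  extractor constructor (assumed to be loaded already)
 * @param  {Document} doc    live document, e.g. window.document
 * @return {Node}            whatever content() returns for the copy
 */
function extractFromClone(Seize, doc) {
  // cloneNode(true) makes a deep copy, including all descendants
  var copy = doc.cloneNode(true);
  return new Seize(copy).content();
}
```

In a page this would be called as `extractFromClone(Seize, document)`, assuming `Seize` is available in that scope.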

How it works

The algorithm works as follows:

  • Collect the HTML tags expected to be text or content containers, such as p, table, img, etc.
  • Filter out unnecessary tags, by content and by tag names that definitely can't be in a content container
  • Score each container by the tags it contains
  • Adjust the score by class name, id name, tag XPath score and text score
  • Sort the candidates by score
  • Take the first candidate
  • Clean up the article
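
The filter, score, sort and pick steps above can be sketched roughly as follows. This is an illustration only: it works on plain objects instead of DOM nodes, it leaves out the XPath score and the final clean-up, and the patterns, weights and names (`POSITIVE`, `NEGATIVE`, `IGNORED_TAGS`, `scoreCandidate`, `pickBest`) are assumptions, not values taken from Seize.

```javascript
// Assumed patterns and weights for illustration; Seize's real values may differ.
var POSITIVE = /article|body|content|entry|post|text/i;
var NEGATIVE = /comment|footer|menu|nav|sidebar|share|banner/i;
var IGNORED_TAGS = ['script', 'style', 'nav', 'footer', 'form'];

// candidate: { tag, className, id, text, paragraphs }
function scoreCandidate(candidate) {
  var score = 0;
  var names = candidate.className + ' ' + candidate.id;

  // Class name and id name score
  if (POSITIVE.test(names)) score += 25;
  if (NEGATIVE.test(names)) score -= 25;

  // Text score: one point per 100 characters, capped at 30
  score += Math.min(Math.floor(candidate.text.length / 100), 30);

  // Contained-tag score: three points per paragraph
  score += candidate.paragraphs * 3;

  return score;
}

function pickBest(candidates) {
  var scored = candidates
    // Drop tags that can't be a content container
    .filter(function (c) { return IGNORED_TAGS.indexOf(c.tag) === -1; })
    .map(function (c) { return { candidate: c, score: scoreCandidate(c) }; })
    // Highest score first
    .sort(function (a, b) { return b.score - a.score; });

  // Take the first candidate, or null when nothing is left
  return scored.length ? scored[0].candidate : null;
}
```

Returning `null` for an empty candidate list gives the caller an explicit signal that the page has no extractable content.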

Todo

Seize is still in development, so use it at your own risk. Help improving it is always welcome.

  • Improve readme
  • Improve text scoring
  • Improve detection of pages that can't be extracted
  • More tests
  • More examples

Contributing

You are welcome to improve this small piece of software :)
