dom-content-extraction

A Rust implementation of the paper by Fei Sun, Dandan Song and Lejian Liao:

Content Extraction via Text Density (CETD)

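use dom_content_extraction::{DensityTree, get_node_text};
use scraper::Html;

// `html` is a &str holding the page source.
let document = Html::parse_document(html);

let mut dtree = DensityTree::from_document(&document); // &scraper::Html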
let sorted_nodes = dtree.sorted_nodes();
let node_id = sorted_nodes.last().unwrap().node_id;

println!("{}", get_node_text(node_id, &document));

dtree.calculate_density_sum();
let extracted_content = dtree.extract_content(&document);

println!("{}", extracted_content);
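
To give a feel for what the density tree measures, here is a minimal sketch of the text density and density sum ideas from the CETD paper, as I understand them: the text density of a node is the number of characters in its subtree divided by the number of tags in it (a tag count of 0 is treated as 1), and the density sum of a node is the sum of its children's text densities. The `Node` type and the functions `counts`, `text_density` and `density_sum` are hypothetical names made up for this illustration. They are not part of the dom_content_extraction API, and the crate's real implementation works on scraper::Html and may differ in detail.

```rust
// Toy DOM node used only for this illustration (an assumption, not the crate's type).
enum Node {
    Text(String),
    Element(Vec<Node>), // an element is just a list of children here
}

// Returns (character count, descendant tag count) for the subtree rooted at `node`.
fn counts(node: &Node) -> (usize, usize) {
    match node {
        Node::Text(text) => (text.chars().count(), 0),
        Node::Element(children) => children.iter().fold((0, 0), |(chars, tags), child| {
            let (child_chars, child_tags) = counts(child);
            // A child element contributes one tag of its own.
            let own = usize::from(matches!(child, Node::Element(_)));
            (chars + child_chars, tags + child_tags + own)
        }),
    }
}

// Text density: characters per tag; a tag count of 0 is treated as 1.
fn text_density(node: &Node) -> f64 {
    let (chars, tags) = counts(node);
    chars as f64 / tags.max(1) as f64
}

// Density sum: the sum of the text densities of the direct children.
fn density_sum(node: &Node) -> f64 {
    match node {
        Node::Text(_) => 0.0,
        Node::Element(children) => children.iter().map(text_density).sum(),
    }
}
```

For a toy element holding two paragraphs of 40 and 20 characters, `text_density` gives 60 / 2 = 30 and `density_sum` gives 40 + 20 = 60, while a 4-character link wrapped in two levels of navigation markup scores only 2. Text-heavy nodes stand out, which is the property the extraction relies on.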

Add it with:

cargo add dom-content-extraction

or add it to your Cargo.toml:

dom-content-extraction = "0.3"

Run examples

Check the examples directory.

This one extracts content from a generated "lorem ipsum" page:

cargo run --example check -- lorem-ipsum

There is also a scoring example, where I'm trying to implement scoring. You will need to download the GoldenStandard and finalrun-input datasets from:

https://sigwac.org.uk/cleaneval/

and unpack the archives into the data/ directory.

cargo run --example ce_score

As far as I can see, there is a problem opening some files:

Error processing file 730: Failed to read file: "data/finalrun-input/730.html"

Caused by:
    stream did not contain valid UTF-8
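
One possible workaround, sketched here under the assumption that the failure comes from reading the file straight into a `String`: read the raw bytes and decode them lossily, so that invalid UTF-8 sequences become U+FFFD instead of aborting the file. The helper names `decode_lossy` and `read_html_lossy` are hypothetical and are not taken from the ce_score example. Lossy decoding will garble text that is really in another encoding (for example Latin-1), so detecting the actual encoding with a crate such as encoding_rs may be the better long-term fix.

```rust
use std::fs;
use std::io;
use std::path::Path;

// Decode bytes as UTF-8, replacing every invalid sequence with U+FFFD.
fn decode_lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

// Read a file as raw bytes and decode it lossily, so that non-UTF-8 input
// no longer fails with "stream did not contain valid UTF-8".
fn read_html_lossy(path: &Path) -> io::Result<String> {
    let bytes = fs::read(path)?;
    Ok(decode_lossy(&bytes))
}
```

The resulting `String` can then be handed to `scraper::Html::parse_document` like any other page source.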

But overall extraction works pretty well:

Overall Performance:
  Files processed: 370
  Average Precision: 0.87
  Average Recall: 0.82
  Average F1 Score: 0.75
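
For readers unfamiliar with these metrics, here is a minimal sketch of per-file precision, recall and F1 based on whitespace-separated word overlap between the extracted text and the gold standard text. This is an assumption about one common way to score content extraction, not a description of what the ce_score example actually does; `token_counts` and `score` are hypothetical names.

```rust
use std::collections::HashMap;

// Count how often each whitespace-separated token occurs in `text`.
fn token_counts(text: &str) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for token in text.split_whitespace() {
        *counts.entry(token).or_insert(0) += 1;
    }
    counts
}

// Returns (precision, recall, f1) for one extracted/gold pair.
fn score(extracted: &str, gold: &str) -> (f64, f64, f64) {
    let ext = token_counts(extracted);
    let gold_counts = token_counts(gold);

    // Tokens present in both texts, counted with multiplicity.
    let overlap: usize = ext
        .iter()
        .map(|(tok, &n)| n.min(gold_counts.get(*tok).copied().unwrap_or(0)))
        .sum();
    let ext_total: usize = ext.values().sum();
    let gold_total: usize = gold_counts.values().sum();

    let precision = if ext_total == 0 { 0.0 } else { overlap as f64 / ext_total as f64 };
    let recall = if gold_total == 0 { 0.0 } else { overlap as f64 / gold_total as f64 };
    let f1 = if precision + recall == 0.0 {
        0.0
    } else {
        2.0 * precision * recall / (precision + recall)
    };
    (precision, recall, f1)
}
```

Scoring "the quick brown fox" against "the quick red fox jumps" gives 3 shared tokens, so precision is 3/4, recall is 3/5 and F1 is 2/3. Note that an average of per-file F1 scores does not have to equal the F1 computed from the averaged precision and recall.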

Read the documentation on docs.rs.

Desired features

implement proper scoring
create a real-world dataset
improve the algorithm
