Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

strip out navigational and other superfluous elements #862

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

woutervanwijk
Copy link

Small tweaks to strip out navigational elements and other common superfluous elements. Works really nicely

Copy link
Contributor

@gijsk gijsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @woutervanwijk , thanks for the patch. Did you see this broke the linter, and probably tests?

The current code unfortunately cannot assume that querySelectorAll or similar are available on the DOM implementation, so using class/ID selectors for _clean probably doesn't quite work.

It would also be really helpful to have some examples where this change produces some improvement (apart from the likely required changes to existing testcases), to understand the motivation behind some of the changes.

Comment on lines +901 to +902
var fullArticleText = this._doc.body.innerText;
if (fullArticleText.length) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is breaking all the tests as innerText is not defined.

@@ -1546,7 +1582,7 @@ Readability.prototype = {

// get article published time
metadata.publishedTime = jsonld.datePublished ||
values["article:published_time"] || null;
values["article:published_time"] || null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a mistake as this is a continuation line vs. the line before.

Copy link
Contributor

@gijsk gijsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing the linting - but we really need some better tests, and ideally we'd make these changes as individual ones with a test per change, so that it's more obvious what changes in behaviour each of the suggested modifications has.

Comment on lines +711 to +715
//scale down h2-h5 because it's too large most of the time (intro's in h2, etc)
this._replaceNodeTags(this._getAllNodesWithTag(articleContent, ["h5"]), "h6");
this._replaceNodeTags(this._getAllNodesWithTag(articleContent, ["h4"]), "h5");
this._replaceNodeTags(this._getAllNodesWithTag(articleContent, ["h3"]), "h4");
this._replaceNodeTags(this._getAllNodesWithTag(articleContent, ["h2"]), "h3");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment doesn't seem like a great reason to change the semantics of the headers - the "right" solution would probably to revert intro paragraphs into, well, paragraphs, instead of headers, if they meet some threshold / algorithm.

UNLIKELY_ROLES: [ "menu", "menubar", "complementary", "navigation", "alert", "alertdialog", "dialog" ],
UNLIKELY_ROLES: [ "menu", "menubar", "complementary", "navigation", "alert", "alertdialog", "dialog", "nav" ],

NODES_TO_CLEAN_FIRST: ["object", "embed", "footer", "link", "aside", "nav", ".icons", ".byline", ".sub-nav", ".identity", ".logo", ".video-player", ".jw-player", ".jw-wrapper", ".video", ".byline", ".author", ".dateline", ".credentials", ".writtenby", ".p-author", ".article-author", ".navigation", ".hidden-xs", ".hidden-sm", ".brand", ".modalContent", ".noPrint", ".noprint", ".screenonly", ".breadcrumb", ".breadcrumbs", "amp-iframe", "amp-img", "amp-ad", ".advert", ".ads", ".brand", ".search", ".nav", ".user", ".users", "#onetrust-consent-sdk", "#branding", "#branding-content" ],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of these aren't nodenames, so _clean won't actually remove them, I think?

It would also be really helpful to better understand what the source of this list of elements is.

Comment on lines +125 to +131
unlikelyCandidates: /-ad-|ai2html|banner|combx|comment|community|cover-wrap|credentials|date|hide|hidden|disqus|extra|footer|gdpr|legends|nav|paywall|meta|menu|related|remark|replies|rss|shoutbox|sidebar|skyscraper|social|sponsor|supplemental|ad-break|agegate|pagination|pager|popup|share|sharing|yom-remote|byline|topbar|article-meta|brand|tooltip/i,
okMaybeItsACandidate: /and|article|body|column|content|main|shadow|header|summary/i,

positive: /article|body|content|entry|hentry|h-entry|main|page|pagination|post|text|blog|story/i,
negative: /-ad-|hidden|^hid$| hid$| hid |^hid |banner|combx|comment|com-|contact|foot|footer|footnote|gdpr|masthead|media|meta|outbrain|promo|related|scroll|share|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|widget/i,
extraneous: /print|archive|comment|discuss|e[\-]?mail|share|reply|all|login|sign|single|utility/i,
byline: /byline|author|dateline|writtenby|p-author/i,
positive: /article|body|content|entry|header|hentry|h-entry|intro|intro|intro|intro|main|main-article|main-content|page|lead|leading|pagination|primary|post|text|blog|story|summary|strapline/i,
negative: /-ad-|affiliate|credentials|controls|date|desktop|hidden|nav|^hid$| hid$| hid |^hid |hide|banner|login|gate|combx|comment|com-|contact|foot|footer|footnote|gdpr|icon|^icon|icons$|icons|masthead|media|meta|paywall|nav|outbrain|promo|related|scroll|share|sharing|shoutbox|sidebar|skyscraper|sponsor|shopping|tags|tool|tooltip|widget|video-player|video|jw-player|jw-aspect|modal|carousel|overlay|byline|brand|disclosure|nav|logo|account|cart|dock/i,
extraneous: /print|affiliate|archive|button|comment|controls|discuss|e[\-]?mail|meta|icons|share|reply|all|login|sign|single|utility|icons|nav|video-player|jw-player|modal|video|paidcontent|carousel|overlay|social|topbar|article-meta|onetrust-consent-sdk|logo|account|cart|hamburger|traffic|weather|search/i,
byline: /byline|author|dateline|credentials|writtenby|p-author|article-author/i,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are some really significant changes, and we'd need to have some testcases to help explain why we're making these changes.

Some of the changes also seem like mistakes, e.g. nav is in there several times, and icon will always match when ^icon and icons$ and icons match.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants