
Retain <a> tags in top article node? #56

Closed
abelsonlive opened this issue Jun 9, 2014 · 1 comment

Comments

@abelsonlive

I'm wondering if there's a way to retain <a> tags in article.top_node, or alternatively to extract all the URLs from within the article's HTML. My goal is to detect when an article links to other articles. I'm currently digging through the source to find the best place to add this, but could use some guidance. Thanks!

@codelucas
Owner

Reference this line:
https://github.com/codelucas/newspaper/blob/master/newspaper/article.py#L207

The top_node DOM object is about to undergo heavy modifications
(after that highlighted line), where chunks of it will be removed to ensure quality article extraction.

The drawback is that important pieces of the top_node DOM may be
removed, so we keep a deep-copied clone of top_node, named clean_top_node, made
immediately before the changes, so we still have access to the entire top_node.
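
To illustrate the clone-before-cleanup idea described above, here is a minimal sketch. It uses the standard library's ElementTree instead of newspaper's own parser, and the sample markup and the "cleanup" step are made up for the example, so treat it as an analogy and not as the library's actual code.

```python
import copy
import xml.etree.ElementTree as ET

# Made-up stand-in for an article's top node.
top_node = ET.fromstring(
    '<div><p>Body text <a href="http://example.com/other">link</a></p>'
    '<p class="share"><a href="http://example.com/share">share</a></p></div>'
)

# Snapshot the whole subtree before any destructive cleanup.
clean_top_node = copy.deepcopy(top_node)

# Hypothetical cleanup step: drop the second paragraph from the working node.
top_node.remove(top_node[1])

# The working node has lost a link; the clone still has both.
remaining = [a.get('href') for a in top_node.iter('a')]
preserved = [a.get('href') for a in clean_top_node.iter('a')]
```

Because the copy is deep, later removals on top_node do not touch clean_top_node.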

However, after thinking about it, I think it would make a lot of sense to have a feature called
extract_hrefs() or something similar that automatically uses clean_top_node. Hmm...

For now, you can just access the clean node via:

from newspaper import Article

a = Article('http://...')
a.download()
a.parse()
# './/a/@href' keeps the query relative to clean_top_node
a.clean_top_node.xpath('.//a/@href')

Or something like that. Update me on your status.
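
Until something like extract_hrefs() exists in the library, a small helper along these lines might do the job. This is only a sketch: the function name, its signature, and the filtering rules are assumptions for illustration, not part of newspaper. It expects any node with an lxml-style xpath() method, such as clean_top_node, plus the article URL for resolving relative links.

```python
from urllib.parse import urljoin, urlparse


def extract_hrefs(node, base_url):
    """Return absolute http(s) URLs linked inside `node`, deduplicated, in order.

    Hypothetical helper: `node` is assumed to be an lxml element such as
    article.clean_top_node; `base_url` is used to resolve relative links.
    """
    seen = set()
    urls = []
    # './/a/@href' is relative to `node`, so only links inside it are returned.
    for href in node.xpath('.//a/@href'):
        absolute = urljoin(base_url, href.strip())
        if urlparse(absolute).scheme not in ('http', 'https'):
            continue  # skip mailto:, javascript:, and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


# Possible usage, after a.download() and a.parse():
# links = extract_hrefs(a.clean_top_node, a.url)
```

Resolving against the article URL means relative links such as /story/2 come back as full URLs, which makes it easier to match them against other articles.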
