Retain <a> tags in top article node? #56

abelsonlive · 2014-06-09T01:32:37Z

I'm wondering if there's a way to retain <a> tags in article.top_node, or alternatively just extract all the URLs from within the article's HTML. My hope is to be able to detect when an article links to other articles. I'm currently digging around in the source to find the best place to include this, but could use some guidance. Thanks!

codelucas · 2014-06-10T06:01:26Z

Reference this line:
https://github.com/codelucas/newspaper/blob/master/newspaper/article.py#L207

The top_node DOM object is about to undergo heavy modifications
(after that highlighted line), where chunks of it are removed to ensure quality article extraction.

The drawback is that important pieces of the top_node DOM may be removed along the way,
so we keep a deep-copied clone of top_node, named clean_top_node, taken
immediately before those changes. That clone gives us access to the entire original top_node.

However, after thinking about it, I think it would make a lot of sense to have a feature called
extract_hrefs() or something which automatically uses the clean_top_node. Hmm..

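For now, you can just access the clean node via:

from newspaper import Article

a = Article('http://...')
a.download()
a.parse()
# './/' keeps the query relative to clean_top_node; a bare '//' starts from the document root
a.clean_top_node.xpath('.//a/@href')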

Or something like that. Update me on your status.
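
As a rough sketch of what the extract_hrefs() idea mentioned above could look like: the helper below is hypothetical and not part of newspaper. It assumes the node exposes the ElementTree-style iter() and get() methods, which lxml elements such as clean_top_node also provide. The base_url parameter, the skipped schemes, and the sample markup are assumptions for illustration; the standard-library parser is used only so the demo runs without a download.

```python
from urllib.parse import urljoin
import xml.etree.ElementTree as ET  # stdlib stand-in for the demo only


def extract_hrefs(node, base_url=""):
    """Return de-duplicated link targets found under `node`, in document order.

    Hypothetical helper: `node` only needs ElementTree-style iter() and get().
    """
    seen = set()
    hrefs = []
    for anchor in node.iter("a"):
        href = (anchor.get("href") or "").strip()
        # Skip empty targets, in-page fragments, and non-navigational schemes.
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            continue
        # Resolve relative links against the article URL when one is given.
        absolute = urljoin(base_url, href) if base_url else href
        if absolute not in seen:
            seen.add(absolute)
            hrefs.append(absolute)
    return hrefs


# Made-up markup standing in for an article's top node.
snippet = ET.fromstring(
    '<div><p>See <a href="/story/2">this</a> and '
    '<a href="http://example.org/a">that</a>.</p>'
    '<a href="#top">top</a><a href="/story/2">again</a></div>'
)
links = extract_hrefs(snippet, base_url="http://example.com/story/1")
print(links)
```

With a parsed article, the same function could presumably be called as extract_hrefs(a.clean_top_node, base_url=a.url), so that relative links resolve against the article's own address.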

codelucas closed this as completed Jun 14, 2014
