I'm wondering if there's a way to retain <a> tags in article.top_node, or alternatively to just extract all the URLs from within the article's HTML. My hope is to be able to detect when an article links to any number of other articles. I'm currently digging around in the source to find the best place to add this, but could use some guidance. Thanks!
The top_node DOM object is about to undergo heavy modifications
(after that highlighted line), where chunks of it are removed to ensure quality article extraction.
The drawback is that important pieces of the top_node DOM may be removed in the process,
so we keep a deep-copied clone of top_node, named clean_top_node, taken
immediately before those changes, which gives us access to the entire original top_node.
However, after thinking about it, I think it would make a lot of sense to have a feature called extract_hrefs() or something which automatically uses the clean_top_node. Hmm..
For now, you can just access the clean node via:
from newspaper import Article

a = Article('http://...')
a.download()
a.parse()
# hrefs of every <a> tag in the unmodified copy of the top node
a.clean_top_node.xpath('//a/@href')
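If you would rather work from the article's HTML as a string (the second option in the question), here is a minimal sketch using only the standard library. Note that extract_hrefs and its base_url parameter are hypothetical names for illustration, not part of newspaper. It assumes you already have the HTML fragment as a string, and it resolves relative links against base_url and drops duplicates.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _HrefCollector(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(html, base_url=""):
    """Hypothetical helper (not a newspaper API).

    Returns the links found in an HTML fragment, in document order,
    resolved against base_url and with duplicates removed.
    """
    collector = _HrefCollector()
    collector.feed(html)
    collector.close()

    seen = set()
    links = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

Passing the article's own URL as base_url turns relative links such as /some/path into absolute ones, which makes it easier to compare them against other article URLs.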