GoogleScholarScraper::ScrapeDoc doesn't work #114

GerHobbelt · 2019-10-19T17:51:44Z

The first line in there:

HtmlNodeCollection NoAltElements_outer = doc.DocumentNode.SelectNodes("//*[@class='gs_r']");

always returns null.

Net result: the Google Scholar Scrape activity never delivers anything.

Analysis / Resolution

https://stackoverflow.com/questions/13771083/html-agility-pack-get-all-elements-by-class

vs. Google Scholar spitting out HTML like this:

              <div class="gs_r gs_or gs_scl" data-cid="ngzxngyKFX0J" data-did="ngzxngyKFX0J" data-lid="" data-rp="0">
                <div class="gs_ggs gs_fl">
                  <div class="gs_ggsd">
                    <div class="gs_or_ggsm" ontouchstart="gs_evt_dsp(event)" tabindex="-1"><a href="https://commons.emich.edu/cgi/viewcontent.cgi?article=1193&amp;context=loexquarterly" data-clk="hl=nl&amp;sa=T&amp;oi=gga&amp;ct=gga&amp;cd=0&amp;d=9013262016062753950&amp;ei=NkirXeGUFIjTmQGml5nIAg&amp;scisig=AAGBfm2DYn4rw9QE9Orpp4GAaHPhnaBB8w&amp;nossl=1" data-clk-atid="ngzxngyKFX0J"><span class=gs_ctg2>[PDF]</span> emich.edu</a></div>
                  </div>
                </div>
                <div class="gs_ri">
                  <h3 class="gs_rt" ontouchstart="gs_evt_dsp(event)"><span class="gs_ctc"><span class="gs_ct1">[PDF]</span><span class="gs_ct2">[PDF]</span></span> <a id="ngzxngyKFX0J" href="https://commons.emich.edu/cgi/viewcontent.cgi?article=1193&amp;context=loexquarterly" data-clk="hl=nl&amp;sa=T&amp;oi=ggp&amp;ct=res&amp;cd=0&amp;d=9013262016062753950&amp;ei=NkirXeGUFIjTmQGml5nIAg&amp;scisig=AAGBfm2DYn4rw9QE9Orpp4GAaHPhnaBB8w&amp;nossl=1" data-clk-atid="ngzxngyKFX0J">TechMatters:&quot; <b>Qiqqa</b>&quot; than you can say Reference Management: <b>A </b>Tool to Organize <b>the </b>Research Process</a></h3>
                  <div class="gs_a">K GrahamÂ - LOEX Quarterly, 2013 - commons.emich.edu</div>
                  <div class="gs_rs">â€¦ however, they are well documented in <b>the</b> <b>manual</b> that is included in <b>the</b> â€œguestâ€� library that come<br>
                    with initial download of <b>the</b> softwareÂ â€¦ Whether searching for <b>a</b> means to organize your own research<br>
                    process or seeking <b>a</b> tool that you can recommend to students, <b>Qiqqa</b> is mostÂ â€¦ 
                  </div>

(after pulling the HTML through an online formatter which very probably b0rked the Unicode in there...)

shows that the solution can be had from the last SO answer:

HtmlNodeCollection NoAltElements_outer = doc.DocumentNode.SelectNodes("//*[contains(@class,'gs_r')]");

GerHobbelt · 2019-10-19T17:54:57Z

Solution reference: https://stackoverflow.com/questions/13771083/html-agility-pack-get-all-elements-by-class#answer-14087707

…not work for users living outside US/UK. Also further fixes jimmejardine#114. Also fixes jimmejardine#117 by enforcing UTF8 encoding on the content: we're downloading from Google Scholar there, so we should be good. Google Scrape finally finds decent titles, author lists and even PDF download links once again. TODO: update the 'Google Scholar' view part in the PDFReader control.
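The commit itself is not shown in this thread, so the following is only a minimal sketch of what "enforcing UTF8 encoding on the content" could look like, assuming the raw response bytes are available. The class name Utf8DecodeDemo and both helper methods are hypothetical. It also shows how a stray Â of the kind visible in the HTML sample above arises when UTF-8 bytes are decoded with a single-byte encoding.

```csharp
using System;
using System.Text;

public static class Utf8DecodeDemo
{
    // Decode the raw bytes as UTF-8 instead of trusting another guess.
    public static string DecodeUtf8(byte[] raw) => Encoding.UTF8.GetString(raw);

    // The same bytes decoded with a wrong single-byte encoding.
    public static string DecodeLatin1(byte[] raw) =>
        Encoding.GetEncoding("ISO-8859-1").GetString(raw);

    public static void Main()
    {
        // "K Graham" + NO-BREAK SPACE + "- LOEX", as UTF-8 bytes.
        byte[] raw = Encoding.UTF8.GetBytes("K Graham\u00A0- LOEX");

        Console.WriteLine(DecodeUtf8(raw) == "K Graham\u00A0- LOEX"); // True
        Console.WriteLine(DecodeLatin1(raw).Contains("\u00C2"));      // True: a stray Â appears
    }
}
```

The no-break space U+00A0 is two bytes in UTF-8 (C2 A0), so a single-byte decoder turns it into Â followed by a no-break space.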

…der left pane -- this is where most of the scrape info lands. ( jimmejardine#114 / jimmejardine#115 / jimmejardine#117 )

… made it through in the previous fix commit for jimmejardine#114 / jimmejardine#115 : this corrects/augments these commits: SHA-1: 65e5707 + SHA-1: a5faaaf

GerHobbelt · 2019-10-19T21:55:01Z

As .NET only supports XPath 1.0, the tokenize(...) solution (an XPath 2.0 function) doesn't fly.

Binging on SO:

https://stackoverflow.com/questions/1390568/how-can-i-match-on-an-attribute-that-contains-a-certain-string -- pad the class value with whitespace via concat() and then 'fake' a word boundary by using contains() with a space-padded class name: this is basically what we'll use
https://stackoverflow.com/questions/17352340/tokenize-or-split-string-using-xsl-in-visual-studio
https://stackoverflow.com/questions/53229668/xpath-normalize-space-with-contains
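
The three XPath variants discussed in this thread can be compared side by side. This is a minimal sketch, not the actual Qiqqa code: to stay self-contained it uses the XPath 1.0 engine in System.Xml on a tiny well-formed stand-in for the Scholar markup rather than HtmlAgilityPack, and the names ClassTokenXPathDemo, Sample and Count are made up for illustration. One difference to keep in mind: XmlNode.SelectNodes returns an empty list when nothing matches, whereas HtmlAgilityPack's SelectNodes returns null, as reported at the top of this issue.

```csharp
using System;
using System.Xml;

public static class ClassTokenXPathDemo
{
    // Hypothetical, well-formed stand-in for the Scholar result markup.
    public const string Sample =
        "<body>" +
        "<div class=\"gs_r gs_or gs_scl\">" +
        "<div class=\"gs_ri\">" +
        "<h3 class=\"gs_rt\">title</h3>" +
        "<div class=\"gs_rs\">snippet</div>" +
        "</div>" +
        "</div>" +
        "</body>";

    // Exact match on the whole attribute value: misses multi-class elements.
    public const string Exact = "//*[@class='gs_r']";

    // Substring match: also hits gs_ri, gs_rt and gs_rs.
    public const string Substring = "//*[contains(@class,'gs_r')]";

    // Whitespace-padded token match; uses only XPath 1.0 functions.
    public const string Token =
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' gs_r ')]";

    // Number of elements in Sample selected by the given XPath.
    public static int Count(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Sample);
        return doc.SelectNodes(xpath).Count;
    }

    public static void Main()
    {
        Console.WriteLine(Count(Exact));      // 0
        Console.WriteLine(Count(Substring));  // 4
        Console.WriteLine(Count(Token));      // 1
    }
}
```

Only the padded variant selects the outer result element alone: the exact comparison fails because the attribute holds three class names, and the plain contains() also picks up the nested gs_ri, gs_rt and gs_rs elements.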

GerHobbelt · 2019-10-19T21:57:00Z

Fixed in dev repo for v82 release: https://github.com/GerHobbelt/qiqqa-open-source/tree/v82-build

GerHobbelt added the 🐛bug and 🕵investigate labels Oct 19, 2019

GerHobbelt added this to the v82 milestone Oct 19, 2019

GerHobbelt mentioned this issue Oct 19, 2019

PDF Reader (which does a Scholar Scrape) does not work for users living outside US/UK #115

Closed

GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Oct 19, 2019

fix jimmejardine#114 as per https://stackoverflow.com/questions/13771…

65e5707

…083/html-agility-pack-get-all-elements-by-class#answer-14087707

GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Oct 19, 2019

upgrade the HtmlAgilityPack package used by Qiqqa. This is required for

c4d7b6e

jimmejardine#114 + jimmejardine#115

GerHobbelt closed this as completed Oct 19, 2019

GerHobbelt mentioned this issue Dec 10, 2019

upgrade the embedded browser (xulrunner) to the latest version #2

Open

GerHobbelt mentioned this issue May 26, 2021

Unable to flush xulrunner cache - sniffer #330

Open
