GoogleScholarScraper::ScrapeDoc doesn't work #114

Closed
GerHobbelt opened this issue Oct 19, 2019 · 3 comments
Labels
🐛bug Something isn't working 🕵investigate Needs further analysis to find the root cause.
@GerHobbelt
Collaborator

The first line of GoogleScholarScraper::ScrapeDoc:

HtmlNodeCollection NoAltElements_outer = doc.DocumentNode.SelectNodes("//*[@class='gs_r']");

always returns null.

Net result: the Google Scholar scrape never delivers any results.

Analysis / Resolution

https://stackoverflow.com/questions/13771083/html-agility-pack-get-all-elements-by-class

vs. Google Scholar spitting out HTML like this:

              <div class="gs_r gs_or gs_scl" data-cid="ngzxngyKFX0J" data-did="ngzxngyKFX0J" data-lid="" data-rp="0">
                <div class="gs_ggs gs_fl">
                  <div class="gs_ggsd">
                    <div class="gs_or_ggsm" ontouchstart="gs_evt_dsp(event)" tabindex="-1"><a href="https://commons.emich.edu/cgi/viewcontent.cgi?article=1193&amp;context=loexquarterly" data-clk="hl=nl&amp;sa=T&amp;oi=gga&amp;ct=gga&amp;cd=0&amp;d=9013262016062753950&amp;ei=NkirXeGUFIjTmQGml5nIAg&amp;scisig=AAGBfm2DYn4rw9QE9Orpp4GAaHPhnaBB8w&amp;nossl=1" data-clk-atid="ngzxngyKFX0J"><span class=gs_ctg2>[PDF]</span> emich.edu</a></div>
                  </div>
                </div>
                <div class="gs_ri">
                  <h3 class="gs_rt" ontouchstart="gs_evt_dsp(event)"><span class="gs_ctc"><span class="gs_ct1">[PDF]</span><span class="gs_ct2">[PDF]</span></span> <a id="ngzxngyKFX0J" href="https://commons.emich.edu/cgi/viewcontent.cgi?article=1193&amp;context=loexquarterly" data-clk="hl=nl&amp;sa=T&amp;oi=ggp&amp;ct=res&amp;cd=0&amp;d=9013262016062753950&amp;ei=NkirXeGUFIjTmQGml5nIAg&amp;scisig=AAGBfm2DYn4rw9QE9Orpp4GAaHPhnaBB8w&amp;nossl=1" data-clk-atid="ngzxngyKFX0J">TechMatters:&quot; <b>Qiqqa</b>&quot; than you can say Reference Management: <b>A </b>Tool to Organize <b>the </b>Research Process</a></h3>
                  <div class="gs_a">K Graham - LOEX Quarterly, 2013 - commons.emich.edu</div>
                  <div class="gs_rs">… however, they are well documented in <b>the</b> <b>manual</b> that is included in <b>the</b> “guest” library that come<br>
                    with initial download of <b>the</b> software … Whether searching for <b>a</b> means to organize your own research<br>
                    process or seeking <b>a</b> tool that you can recommend to students, <b>Qiqqa</b> is most … 
                  </div>

(after pulling the HTML through an online formatter which very probably b0rked the Unicode in there...)

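shows that the class attribute holds several space-separated names, so the exact comparison @class='gs_r' can never match; the solution can be had from the last SO answer: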

HtmlNodeCollection NoAltElements_outer = doc.DocumentNode.SelectNodes("//*[contains(@class,'gs_r')]");
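
To make the difference between the two selectors concrete, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package and uses a hand-trimmed stand-in for the Scholar markup above; the `html` string and the variable names are illustrative and not taken from the Qiqqa code. Two things to watch for: `SelectNodes` hands back null (not an empty collection) when nothing matches, and `contains()` is a plain substring test, so it also picks up `gs_ri`, `gs_rt` and `gs_rs`.

```csharp
using System;
using HtmlAgilityPack;

// Hand-trimmed stand-in for one Google Scholar result (illustrative only).
string html =
    "<div class=\"gs_r gs_or gs_scl\">" +
      "<div class=\"gs_ri\">" +
        "<h3 class=\"gs_rt\"><a href=\"#\">Some title</a></h3>" +
        "<div class=\"gs_a\">K Graham - LOEX Quarterly, 2013</div>" +
        "<div class=\"gs_rs\">Some snippet</div>" +
      "</div>" +
    "</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Exact string comparison: "gs_r gs_or gs_scl" is not equal to "gs_r",
// so nothing matches and SelectNodes returns null.
HtmlNodeCollection exact = doc.DocumentNode.SelectNodes("//*[@class='gs_r']");
Console.WriteLine(exact == null ? "exact: null" : "exact: " + exact.Count);

// Substring test: matches the outer div, but also gs_ri, gs_rt and gs_rs.
HtmlNodeCollection loose = doc.DocumentNode.SelectNodes("//*[contains(@class,'gs_r')]");
if (loose != null)
{
    foreach (HtmlNode node in loose)
    {
        Console.WriteLine("contains: <" + node.Name + " class=\"" +
                          node.GetAttributeValue("class", "") + "\">");
    }
}
```

So any code consuming the `contains()` result either has to guard against null and filter out the nested `gs_r*` elements, or use a stricter predicate.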
GerHobbelt added the 🐛bug (Something isn't working) and 🕵investigate (Needs further analysis to find the root cause.) labels Oct 19, 2019
GerHobbelt added this to the v82 milestone Oct 19, 2019
@GerHobbelt
Collaborator Author

GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Oct 19, 2019
…not work for users living outside US/UK. Also further fixes jimmejardine#114. Also fixes jimmejardine#117 by enforcing UTF8 encoding on the content: we're downloading from Google Scholar there, so we should be good. Google Scrape finally finds decent titles, author lists and even PDF download links once again.

TODO: update the 'Google Scholar' view part in the PDFReader control.
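
How the commit enforces UTF8 is not shown in this thread. As a rough sketch of the idea only, assuming the raw response bytes are available and that HtmlAgilityPack does the parsing, decoding explicitly as UTF-8 before handing the text to the parser avoids the kind of mangled quotes seen in the snippet above. The `raw` bytes and all names here are made up for illustration.

```csharp
using System;
using System.Text;
using HtmlAgilityPack;

// Pretend these bytes came off the wire from a UTF-8 encoded page (illustrative).
byte[] raw = Encoding.UTF8.GetBytes(
    "<div class=\"gs_rs\">the \u201Cguest\u201D library</div>");

// Decoding with the wrong charset turns every multi-byte character into junk.
string wrong = Encoding.GetEncoding("ISO-8859-1").GetString(raw);

// Decoding explicitly as UTF-8 round-trips the original text.
string right = Encoding.UTF8.GetString(raw);

var doc = new HtmlDocument();
doc.LoadHtml(right);

HtmlNode snippet = doc.DocumentNode.SelectSingleNode("//div[@class='gs_rs']");
Console.WriteLine(snippet.InnerText);
Console.WriteLine("mis-decoded text differs: " + (wrong != right));
```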
GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Oct 19, 2019
GerHobbelt added a commit to GerHobbelt/qiqqa-open-source that referenced this issue Oct 19, 2019
… made it through in the previous fix commit for jimmejardine#114 / jimmejardine#115 : this corrects/augments these commits: SHA-1: 65e5707 + SHA-1: a5faaaf
@GerHobbelt
Collaborator Author

As .NET only supports XPath 1.0, the tokenize(...) solution (an XPath 2.0 function) doesn't fly.

Binging on SO:
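
For reference, one XPath 1.0 idiom that is often suggested for matching a single class token is to pad the normalized attribute with spaces and search for the padded token. The sketch below assumes the HtmlAgilityPack NuGet package; `ClassTokenXPath` is a hypothetical helper, the sample `html` is hand-made, and whether this is the exact expression the fix commits ended up with is not shown in this thread.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper: builds an XPath 1.0 expression matching elements whose
// class list contains `token` as a whole word. Assumes `token` is a plain
// class name without quotes or spaces.
static string ClassTokenXPath(string token) =>
    "//*[contains(concat(' ', normalize-space(@class), ' '), ' " + token + " ')]";

// Hand-made stand-in for one Google Scholar result (illustrative only).
string html =
    "<div class=\"gs_r gs_or gs_scl\">" +
      "<div class=\"gs_ri\">" +
        "<h3 class=\"gs_rt\"><a href=\"#\">Some title</a></h3>" +
        "<div class=\"gs_rs\">Some snippet</div>" +
      "</div>" +
    "</div>";

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Only the outer result container carries the token "gs_r";
// gs_ri, gs_rt and gs_rs no longer match.
HtmlNodeCollection results = doc.DocumentNode.SelectNodes(ClassTokenXPath("gs_r"));
if (results != null)
{
    foreach (HtmlNode node in results)
    {
        Console.WriteLine(node.GetAttributeValue("class", ""));
    }
}
```

Compared with a bare contains(@class,'gs_r'), this trades a longer expression for not having to filter out the nested elements afterwards.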

@GerHobbelt
Collaborator Author

Fixed in dev repo for v82 release: https://github.com/GerHobbelt/qiqqa-open-source/tree/v82-build
