Saves html code of site not pdf #6

Open
sblanky opened this issue Nov 24, 2022 · 1 comment

Comments


sblanky commented Nov 24, 2022

Each time I try to run this, it downloads the HTML code for a Sci-Hub page (I believe it is the "article not found" page) instead of the PDF. An example is shown below:


<html>
    <head>
	 <title>Sci-Hub - search proxy to download article</title>
	<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
        <meta name="viewport" content="width=device-width, heihht=device-height, initial-scale=1.0">
	<meta name="keywords" content="sci-hub,scihub">
	<meta name="description" content="The first pirate website in the world to open mass and public access to tens of millions research papers">
	<meta property="og:image" content="//img.sci-hub.shop/scihub/logo_en.png"/>
        <link rel="alternate" hreflang="ru" href="//sci-hub.tf/lang/ru" />
        <link rel="alternate" hreflang="en" href="//sci-hub.tf/lang/en" />
	<style type = "text/css">
		html, body, div, p, ul { margin: 0; padding: 0; font-family: Avenir }
                
                
                body { background: url('//img.sci-hub.shop/scihub/map.jpg') no-repeat; background-size: contain }
               
                #title { font-family: Tahoma; font-size: 250%; text-align: center; color: #993333 }
                #first { margin: 2%; font-size: 100%; text-align: center; color: #993333 }
                #desc p { margin: 3%;font-size: 100%; text-align: justify }
                #mission { margin-top: 4%; font-size: 120%; text-align: center; color: #993333 }
                
                #social { margin-top: 6%; text-align: center; color: #993333 }
                #social img { border: 0; margin: 2% }
				h1 {color:#993333;margin:0;margin-bottom:24px}
		a:hover {color:darkgreen}
                a {color:#aaa;margin:0;margin-bottom:24px}
                a#back { display: block; text-decoration: none; margin-top: 8%; width: 100%; padding: 2% 0 2% 0; background-color: #993333; color: white; text-align: center; font-size: 100% }
                img{max-width:100%}
				#message { text-align:left;color:#aaa;font-family: Verdana;font-size:16px;margin:32px; width: 480px }
                #noproxy { text-align:center;color:#aaa;font-family: Verdana;font-size:18px;margin-top:32px;display:none; }
                #found { text-align:center;color:green;font-family: Verdana;font-size:22px;margin-top:32px;display:none; }
	</style>
    </head>
<body>
    <div id ="about">
        <div id = "title"><h1><a href = "//sci-hub.wf">Sci-hub</a></h1></div>
        <div id = "first"><h1>
<p>Sorry, sci-hub has not included this article yet</p>
<p>You can register and log in to the <a href="http://www.wosonhj.com" target="_blank">Mutual Aid-Science Community,</a></p>
<p>and get it by posting for help</p>
<p><a href="http://www.wosonhj.com/suggest/22xs.html" target="_blank">Mutual Aid-Science Community Instructions</a></p>
<img src="https://img.sci-hub.shop/misc/img/maid1.png" height="306" width="918">
<br>
Please try to search again using DOI. DOI is the unique identifier of thesis, and searching through DOI can more accurately find the corresponding thesis documents.</h1>
		<h1><a href = "//sci-hub.wf/Find-DOI.html" target="_blank">How to quickly find the DOI number of an article</a></h1>
		<p>If you still cannot find thesis through DOI, we will include relevant articles as soon as possible,please try searching the corresponding DOI again after a while.</p>
		<div id = "message">&#9432; you can close this page and check later if the article has been downloaded</div>
                        <div id = "noproxy">no matching proxies found</div>
                        <div id = "found">proxy found, please wait</div>
<br></br>
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-7696326278603752"
     crossorigin="anonymous"></script>
<ins class="adsbygoogle"
     style="display:inline-block;width:970px;height:90px"
     data-ad-client="ca-pub-7696326278603752"
     data-ad-slot="4246281558"></ins>
<script>
(adsbygoogle = window.adsbygoogle || []).push({});
</script>
<p></div>
<script>var allurl=window.location.href;window.history.pushState({}, 0, "https://" + window.location.host);</script>
<script>setTimeout(function() {  window.history.pushState({}, 0, allurl);  }, 1000);</script>
    </div>
    <a id ="back" href = "/">&larr; return to main</a>
</body>
</html>



It does, however, seem to be able to find the paper and save it with a reasonable name, i.e. [first author][year].
I would delve into the code to see where it is going wrong, but my skills are mainly limited to basic Python.
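
For what it's worth, one way to catch this kind of failure would be to check the downloaded bytes before writing them to disk. The snippet below is only a minimal sketch and is not taken from this project's code: `looks_like_pdf` and `save_if_pdf` are hypothetical names, and it assumes the downloader has the response body available as `bytes`. It relies on the fact that PDF files begin with the `%PDF-` signature, which an HTML error page like the one above does not.

```python
from pathlib import Path

# Every PDF file starts with this signature (e.g. b"%PDF-1.7").
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF signature."""
    # Strip leading whitespace in case the server prepends blank bytes.
    return data.lstrip()[: len(PDF_MAGIC)] == PDF_MAGIC


def save_if_pdf(data: bytes, path: Path) -> bool:
    """Write `data` to `path` only if it looks like a PDF (hypothetical helper).

    Returns True when the file was written and False when the payload was
    rejected, for example because it is an HTML "not found" page.
    """
    if not looks_like_pdf(data):
        return False
    path.write_bytes(data)
    return True
```

With a check like this, the HTML page above would be rejected rather than saved under a `.pdf` name, which would at least make the failure visible.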

Cheers!


sblanky commented Nov 30, 2022

I was able to make this work by reverting to commit 9e0fc54.
