Get painting list requests get throttled by wikiart #21

Open
LorenzoCianciaruso opened this issue Mar 12, 2019 · 2 comments

Comments

@LorenzoCianciaruso

When running genre-scraper.py with the currently hardcoded randomized sleep

time.sleep(3.0*random.random())  # random sleep to decrease concurrence of requests

requests get throttled by Wikiart, which returns

[Errno 104] Connection reset by peer

I think there are 2 improvements:

  • Throttle the requests a bit more on the script side by adding a fixed latency on top of the randomized one.
  • When you run the script, only failures are printed to the console. Because of that, I initially assumed, wrongly, that the script wasn't working at all. In reality the successful requests are simply not logged, so it might be good to add some more logging.
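
The two improvements above could be sketched roughly as follows. This is only an illustration, not the script's actual code: the names polite_delay, fetch_with_logging, BASE_DELAY and JITTER are hypothetical, the 2.0 second fixed latency is an arbitrary guess, and fetch stands in for whatever function genre-scraper.py uses to download a URL.

```python
import logging
import random
import time

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("genre-scraper")

BASE_DELAY = 2.0  # hypothetical fixed latency in seconds (value is a guess)
JITTER = 3.0      # upper bound of the random part, as in 3.0*random.random()


def polite_delay(base=BASE_DELAY, jitter=JITTER):
    """Return a fixed latency plus a random component, in seconds."""
    return base + jitter * random.random()


def fetch_with_logging(url, fetch, sleep=time.sleep):
    """Wait, call fetch(url), and log successes as well as failures.

    `fetch` is an assumed callable that takes a URL and returns its content.
    """
    sleep(polite_delay())
    try:
        result = fetch(url)
    except OSError as exc:  # ConnectionResetError is a subclass of OSError
        log.warning("failed to scrape %s: %s", url, exc)
        return None
    log.info("scraped %s", url)
    return result
```

With the assumed defaults each request waits between 2 and 5 seconds, and every successful request produces an INFO line, so a run with no failures is no longer silent.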

I'm happy to open a PR for this.

@ghost

ghost commented Oct 19, 2020

I'm getting the same error, but there don't seem to be any successful downloads at all.

failed to scrape URL [Errno 54] Connection reset by peer

I'm using the following command

python genre-scraper.py --genre abstract --output_dir abstract
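
For reference, the error numbers 104 and 54 reported in this thread are both ECONNRESET, on Linux and macOS respectively, so a portable check should not hardcode either number. A minimal sketch of retrying with exponential backoff when the connection is reset is shown below. The helper names is_connection_reset and retry_on_reset, the attempt count and the delays are hypothetical, fetch is a stand-in for the script's download function, and it assumes the raw OSError reaches the caller (an HTTP library may wrap it in its own exception type).

```python
import errno
import time


def is_connection_reset(exc):
    """True if exc is a 'connection reset by peer' error on any platform."""
    return (isinstance(exc, ConnectionResetError)
            or getattr(exc, "errno", None) == errno.ECONNRESET)


def retry_on_reset(fetch, url, attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on connection resets.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between tries.
    Other errors, and a reset on the final attempt, are re-raised.
    """
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError as exc:
            if not is_connection_reset(exc) or attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Whether backing off is enough depends on how Wikiart throttles; if a whole IP range is blocked, longer waits will not help.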

@spasmann

I'm having the same issue, but only when running on a remote computer cluster. If I run on my laptop I get 100% of the images, with no throttling.

I tried removing the random.random() and it actually scraped about 500 fewer images.

Depending on the style/genre, it sometimes doesn't download any images at all. With the larger sets it usually gets somewhere between 900 and 1500 downloads.
