Verify that the continuous crawler is working #25

Closed
2 tasks
anjackson opened this issue Dec 10, 2018 · 3 comments

Comments

@anjackson
Contributor

anjackson commented Dec 10, 2018

The continuous crawler has been running successfully for weeks, but we need to verify that it is doing a sufficiently good job to justify the switch-over.

Proposal is to generate crawl volume breakdowns per host across daily and weekly crawl streams, and compare them to make sure they are roughly equivalent.
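
As a rough illustration of the comparison step, here is a minimal sketch. It assumes the two breakdowns are already available as plain `host -> URL count` dictionaries. The function name `compare_host_volumes`, the argument names, and the 20% tolerance are all hypothetical, since the issue does not define what "roughly equivalent" means.

```python
def compare_host_volumes(baseline, candidate, tolerance=0.2):
    """Return hosts whose URL counts differ by more than `tolerance`.

    `baseline` and `candidate` map host -> total URLs crawled.
    The relative difference is measured against the larger count,
    so a host missing from one side shows up with a difference of 1.0.
    The 0.2 default is an arbitrary placeholder, not a project setting.
    """
    mismatches = {}
    for host in sorted(set(baseline) | set(candidate)):
        b = baseline.get(host, 0)
        c = candidate.get(host, 0)
        largest = max(b, c)
        if largest == 0:
            continue  # nothing crawled on either side
        rel_diff = abs(b - c) / largest
        if rel_diff > tolerance:
            mismatches[host] = (b, c, round(rel_diff, 3))
    return mismatches
```

Hosts that appear in only one breakdown are reported too, which is probably the most interesting case when checking for missed seeds.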

  • Daily seeds are only being crawled every other day, due to small delays. The recrawl periods should be shortened slightly (e.g. 23 hrs rather than 24 hrs), but seed relaunch should use a narrow re-crawl window (10 min?) so that the shorter recrawl period does not cause the schedule to drift, at the cost of occasionally double-crawling seeds.
  • Ensure DNS failures are not remembered forever. See internetarchive/heritrix3#234 ("Domain name lookup failures get cached forever").
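
To make the first point concrete, here is a small sketch of one possible reading of it. This is not the crawler's actual code: `should_fetch`, its parameters, and the way the two thresholds are chosen are assumptions made for illustration only.

```python
from datetime import datetime, timedelta


def should_fetch(last_fetch, now, is_seed_launch,
                 recrawl_period=timedelta(hours=23),
                 seed_window=timedelta(minutes=10)):
    """Decide whether a URL is due to be fetched again.

    Assumed behaviour (hypothetical, based on the proposal above):
    - ordinary URLs are re-fetched once `recrawl_period` has elapsed;
    - a scheduled seed launch is only skipped if the seed was fetched
      within the last `seed_window`, so launches stay pinned to the
      schedule rather than to the time of the previous fetch.
    """
    if last_fetch is None:
        return True  # never fetched before
    threshold = seed_window if is_seed_launch else recrawl_period
    return now - last_fetch >= threshold
```

With a strict 24 hr period, a seed fetched at 09:00:05 yesterday is still "fresh" at 09:00:00 today and gets skipped, which would explain the every-other-day pattern. A 23 hr period, or the narrow seed window, lets it through.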
@anjackson
Contributor Author

Proposal is to write something to parse multiple log files, which will output

Host/Target, Launch Date, Total URLs, status codes, etc.

Not 100% clear how to do this. One option: process each log file once and write a summary per log file to a local file or DB, then summarise over those local files/DB?
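
A minimal sketch of that two-pass idea, kept in memory for brevity. It assumes whitespace-separated log lines where the second field is the status code and the fourth is the URL, which is an assumption about the log layout rather than something stated in this ticket. The names `summarise_log` and `merge_summaries` are hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def summarise_log(lines):
    """First pass: summarise one log as host -> Counter of status codes."""
    summary = defaultdict(Counter)
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        status, url = fields[1], fields[3]
        # URLs without a network location (e.g. dns: records) have no hostname
        host = urlsplit(url).hostname or "(no host)"
        summary[host][status] += 1
    return summary


def merge_summaries(summaries):
    """Second pass: combine per-log summaries into one overall summary."""
    total = defaultdict(Counter)
    for summary in summaries:
        for host, counts in summary.items():
            total[host].update(counts)
    return total
```

Total URLs per host is then `sum(counts.values())`. Launch date is left out here because it is not clear how it would be derived for each log. Each per-log summary could equally be written out as JSON or into a DB table before merging.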

@anjackson
Contributor Author

After some analysis, a couple of problems arose. See main body of ticket.

@anjackson
Contributor Author

Closing this as it doesn't really fit as a ticket.

This issue was closed.