Hostname based queue assignment variants that optionally limit queue name length #598

Open
wants to merge 1 commit into
base: master

kris-sigur
Collaborator

Adds two new queue assignment policies that extend the existing HostnameQueueAssignmentPolicy and SurtAuthorityQueueAssignmentPolicy with support for limiting the number of domain segments used to construct the queue name. The limit can be set on a per-sheet basis.

The SURT variant has been in use in our crawls for many years with no issue. Among other uses, it has proven very important for crawling sites with a large number of dynamically generated subdomains (e.g. username.blogsite.com) where we might otherwise hammer the site excessively.
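To illustrate the idea (this is not the code in this PR), here is a minimal standalone sketch of limiting the segments in a queue name. The class and method names are hypothetical, it does not use any Heritrix API, and it assumes a plain dotted hostname and a comma-separated SURT-style authority such as `com,blogsite,username,`. Ports and IP addresses are not handled, and a non-positive limit is assumed to mean "no limit".

```java
import java.util.Arrays;

// Hypothetical illustration only; not the policy classes added by this PR.
public class LimitedQueueNameSketch {

    /** Keeps only the last maxSegments labels of a dotted hostname. */
    static String limitHostname(String host, int maxSegments) {
        if (maxSegments <= 0) {
            return host; // assumption: non-positive means no limit
        }
        String[] labels = host.split("\\.");
        if (labels.length <= maxSegments) {
            return host; // already short enough
        }
        // The most significant labels are at the end (e.g. "blogsite.com").
        String[] kept = Arrays.copyOfRange(labels, labels.length - maxSegments, labels.length);
        return String.join(".", kept);
    }

    /** Keeps only the first maxSegments segments of a SURT-style authority. */
    static String limitSurtAuthority(String surtAuthority, int maxSegments) {
        if (maxSegments <= 0) {
            return surtAuthority; // assumption: non-positive means no limit
        }
        // split drops the empty string after the trailing comma
        String[] segments = surtAuthority.split(",");
        if (segments.length <= maxSegments) {
            return surtAuthority;
        }
        // In SURT form the most significant segments come first (e.g. "com,blogsite,").
        String[] kept = Arrays.copyOfRange(segments, 0, maxSegments);
        return String.join(",", kept) + ",";
    }

    public static void main(String[] args) {
        System.out.println(limitHostname("username.blogsite.com", 2));
        System.out.println(limitSurtAuthority("com,blogsite,username,", 2));
    }
}
```

With a limit of 2, every `*.blogsite.com` host in this sketch maps to the same queue name, so the per-queue politeness settings apply to the site as a whole.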

An argument could be made for adding this functionality to the existing policies. I've done it this way mostly because we've had this in our Heritrix "extras" lib, where it was simpler to subclass.

Contributor

@adam-miller adam-miller left a comment


Looks good. I've traditionally used a more complicated configuration to achieve single queues for sites with large numbers of subdomains, so this makes a common case more straightforward.
