Add a more general support for inferred path discovery #597

Merged


kris-sigur
Collaborator

ExtractorHTTP already extracts per-host favicon.ico by inference and can be set to also discover hosts' root page (/). This PR adds a list of paths that should be inferred to exist for each host.

The motivation for this was the need to check if a site has a sitemap (/sitemap.xml) even when one isn't listed in the robots.txt file. We've encountered several instances of this. Rather than just adding in one more hard-coded path, it seems better to make this configurable.

The existing config for discovering the root path has been marked deprecated in favor of this new setting, but it continues to function as before. Existing configurations should therefore not be affected by any of these changes.

Discovery of favicon.ico is hard-coded and not configurable, so it remains unaffected. Changing it seems too disruptive, but ideally all of these inferences would be managed through the new list. Indeed, if not for the favicon.ico legacy, I might instead suggest splitting this functionality off into a new "ExtractorInference", since none of it relates to extracting links from HTTP response headers.
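To illustrate the idea, here is a minimal standalone sketch of how a configurable list of paths might be turned into inferred URIs for the host of a fetched URI. It is not the code in this PR: the class name `InferredPathsSketch`, the method `inferUris`, and the example paths are all hypothetical, and it uses plain `java.net.URI` rather than Heritrix's own URI types.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; names here are not taken from Heritrix.
public class InferredPathsSketch {

    /**
     * For a fetched URI, build one inferred URI per configured path,
     * keeping the scheme and authority (host and optional port) of the original.
     * Returns an empty list if the input has no scheme or authority.
     */
    public static List<String> inferUris(String fetchedUri, List<String> inferPaths) {
        URI uri = URI.create(fetchedUri);
        List<String> inferred = new ArrayList<>();
        if (uri.getScheme() == null || uri.getAuthority() == null) {
            return inferred; // relative or opaque URI: nothing to infer
        }
        String root = uri.getScheme() + "://" + uri.getAuthority();
        for (String path : inferPaths) {
            // Assumes each configured path starts with "/"
            inferred.add(root + path);
        }
        return inferred;
    }

    public static void main(String[] args) {
        // Example paths: a sitemap and a well-known URI (assumed configuration)
        List<String> paths = List.of("/sitemap.xml", "/.well-known/security.txt");
        System.out.println(inferUris("https://example.org/a/b.html", paths));
    }
}
```

In this sketch, a fetch of `https://example.org/a/b.html` would yield `https://example.org/sitemap.xml` and `https://example.org/.well-known/security.txt` as candidates. Deduplication is left out on the assumption that the crawler already avoids re-queueing URIs it has seen.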

@ato
Collaborator

ato commented Aug 7, 2024

This seems sensible. I can see it being used for security.txt and other well-known URIs too.

@kris-sigur kris-sigur merged commit cd3a424 into internetarchive:master Aug 8, 2024
5 checks passed