Support full wildcard syntax in robots.txt directives #250

Open
anjackson opened this issue Mar 29, 2019 · 3 comments

@anjackson
Collaborator

At present we only support trailing * wildcards. Ideally we should support the full wildcard syntax (* anywhere in the path, plus the $ end-of-path anchor) as defined in https://developers.google.com/search/reference/robots_txt

The code to modify would be:

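public boolean allows(String path) {
    // Blocked only when the longest matching Disallow prefix is strictly longer than the longest matching Allow prefix.
    return !(longestPrefixLength(disallows, path) > longestPrefixLength(allows, path));
}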

Matching the wildcards themselves is not that difficult, but getting the precedence between Allow and Disallow rules right is harder. Perhaps we could use an existing library, e.g. the crawler-commons code?
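
To make the precedence question concrete, here is a rough sketch of how the comparison above might be extended. It is not the project's code: the class `RobotsWildcardSketch` and its methods `globMatches`, `longestMatchLength` and the three-argument `allows` are hypothetical names. It assumes the rule described in Google's documentation, that the longest matching pattern wins and a tie goes to Allow, which is also how the existing comparison behaves.

```java
import java.util.List;

/** Hypothetical sketch, not the project's actual class. */
final class RobotsWildcardSketch {

    /** Full-string glob match where '*' matches any run of characters (including none). */
    static boolean globMatches(String glob, String text) {
        int g = 0, t = 0;
        int starG = -1; // position of the most recent '*' in the glob
        int starT = 0;  // text position that '*' is currently assumed to extend to
        while (t < text.length()) {
            if (g < glob.length() && glob.charAt(g) == '*') {
                starG = g++;
                starT = t;
            } else if (g < glob.length() && glob.charAt(g) == text.charAt(t)) {
                g++;
                t++;
            } else if (starG >= 0) {
                // Backtrack: let the last '*' swallow one more character.
                g = starG + 1;
                t = ++starT;
            } else {
                return false;
            }
        }
        // Any remaining glob characters must all be '*'.
        while (g < glob.length() && glob.charAt(g) == '*') {
            g++;
        }
        return g == glob.length();
    }

    /** Length of the longest pattern that matches the path, or 0 if none match. */
    static int longestMatchLength(List<String> patterns, String path) {
        int best = 0;
        for (String pattern : patterns) {
            if (pattern.isEmpty()) {
                continue; // assumption: an empty rule matches nothing
            }
            boolean anchored = pattern.endsWith("$");
            // Without a trailing '$' the pattern is a prefix match, so append '*'.
            String glob = anchored
                    ? pattern.substring(0, pattern.length() - 1)
                    : pattern + "*";
            if (globMatches(glob, path)) {
                best = Math.max(best, pattern.length());
            }
        }
        return best;
    }

    /** Same shape as the existing check: Disallow wins only if strictly longer. */
    static boolean allows(List<String> allows, List<String> disallows, String path) {
        return !(longestMatchLength(disallows, path) > longestMatchLength(allows, path));
    }
}
```

Under these assumptions, with `Disallow: /shop/` and `Allow: /shop/*/public`, the path `/shop/a/public` is allowed because the Allow pattern is longer, while `/shop/a/private` is blocked. Whether pattern length is the right measure of specificity once wildcards are involved is exactly the part worth checking against an existing library.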

@jr-ewing

This may be of interest: https://github.com/google/robotstxt (Google's open-source robots.txt parser and matcher).

@ato
Collaborator

ato commented May 11, 2023

There's a Java port of Google's parser too, https://github.com/google/robotstxt-java/, but unfortunately it doesn't seem to be published to Maven Central.
