Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong URL resolution on redirects #93

Closed
kelunik opened this issue Aug 2, 2019 · 3 comments · Fixed by #98
Closed

Wrong URL resolution on redirects #93

kelunik opened this issue Aug 2, 2019 · 3 comments · Fixed by #98

Comments

@kelunik
Copy link
Contributor

kelunik commented Aug 2, 2019

The URL resolution is broken if there are any redirects and non-absolute URLs involved. The problematic code is in Crawler. It uses the original document URL instead of the URL of the latest request, which might be different from the original URL due to redirects.

Possible Solution

$this->enqueueLinks($this->loadXpath($body), Url::fromUrl((string)$response->getRequest()->getUri()), $report, $queue);

Reproduction Case

Crawl https://amphp.org/. I don't see any failures with -x0 and that fix, without the fix there are 3 errors which are false positives.

@dantleech
Copy link
Owner

Ah yes, I noticed this but subsequently couldn't reproduce, will check again

@dantleech
Copy link
Owner

dantleech commented Aug 4, 2019

I attempted to create a failing test in #94, however my investigation led me to a different path which uncovered some other bugs but did not fix this issue.

From what I can tell the redirect happens on /promises to /promises/, the problem seems to be indeed that the original URL is used to resolve the link.

Annoyingly I haven't succeeded in reproducing this bug in a test, this is the example as it stands, but fink can crawl this fine: https://github.com/dantleech/fink/pull/95/files

@dantleech
Copy link
Owner

When the Crawler gets the URL it calls the HTTP client, which may then redirect. Once redirected the URL that was given to the crawler will vary from the "actual" page we were redirected to.

The original URL is used instead of the redirected one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants