Relative urls and redirects issue #108

Open
marapper opened this issue Nov 15, 2019 · 2 comments
Labels
bug Something isn't working

Comments

@marapper
Contributor

For example, the URL https://www.sberbank.ru/ru/person/seizure redirects to https://www.sberbank.ru/seizure, and that page contains relative URLs such as ./1142.

If we crawl /seizure directly, all of these URLs are OK. But when the scan starts at /ru/person/seizure, every relative URL is incorrectly resolved against the pre-redirect URL (for example /ru/person/seizure/1142) and marked as broken.
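To illustrate the difference, here is a minimal sketch using Node's built-in URL class. The example.com URLs and the resolveLink helper are hypothetical, not taken from the crawler; the trailing slashes are an assumption so that ./1142 resolves underneath each path.

```javascript
// Hypothetical helper: resolve a link found on a page against a base URL.
function resolveLink(href, baseUrl) {
  return new URL(href, baseUrl).href;
}

// Assumed URLs for illustration only (note the trailing slashes).
const requestedUrl = 'https://example.com/ru/person/seizure/'; // URL we asked for
const finalUrl = 'https://example.com/seizure/'; // URL after the redirect

// Resolving against the requested URL gives the path reported as broken.
const wrong = resolveLink('./1142', requestedUrl);

// Resolving against the post-redirect URL gives the path the page intends.
const right = resolveLink('./1142', finalUrl);

console.log({ wrong, right });
```

The only thing that changes between the two results is which URL is used as the base, which matches the behaviour described above.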

@marapper
Contributor Author

Also, I think the <base href> tag is not taken into account when the URL is built.
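A rough sketch of what honoring it could look like, assuming the href value of the base tag has already been extracted from the page by whatever HTML parser is in use. The function names and URLs are hypothetical and not part of the crawler.

```javascript
// Hypothetical helper: compute the base URL for a document.
// pageUrl is the URL the document was loaded from;
// baseHref is the href of its <base> tag, or undefined if there is none.
function documentBaseUrl(pageUrl, baseHref) {
  if (!baseHref) {
    return pageUrl;
  }
  // The base href may itself be relative, so resolve it against the page URL.
  return new URL(baseHref, pageUrl).href;
}

// Hypothetical helper: resolve a link the way a browser would for that page.
function resolveWithBase(href, pageUrl, baseHref) {
  return new URL(href, documentBaseUrl(pageUrl, baseHref)).href;
}

// Assumed example values.
const withBase = resolveWithBase('./1142', 'https://example.com/a/b/', '/seizure/');
const withoutBase = resolveWithBase('./1142', 'https://example.com/a/b/', undefined);

console.log({ withBase, withoutBase });
```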

@marapper
Contributor Author

This cannot be fixed without changes in gaxios (see the referenced PR). If the real page URL were available in the response, this bug could be solved by changing opts.url to res.request.responseURL in index.js:149.
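Roughly, the change I have in mind is sketched below. It assumes the response object exposes the final URL as res.request.responseURL, which is exactly the part that depends on gaxios; pageBaseUrl is a hypothetical helper name, and the objects in the usage lines are stand-ins rather than real gaxios responses.

```javascript
// Hypothetical helper: pick the URL that relative links should be resolved against.
// Assumption: res.request.responseURL holds the post-redirect URL when available.
function pageBaseUrl(opts, res) {
  const finalUrl = res && res.request && res.request.responseURL;
  // Fall back to the requested URL when the final URL is not exposed.
  return finalUrl || opts.url;
}

// Stand-in objects for illustration only.
const opts = { url: 'https://example.com/ru/person/seizure/' };
const redirectedRes = { request: { responseURL: 'https://example.com/seizure/' } };
const plainRes = {};

console.log(pageBaseUrl(opts, redirectedRes));
console.log(pageBaseUrl(opts, plainRes));
```

The fallback keeps the current behaviour whenever the final URL is missing, so the change would only affect pages that were actually redirected.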

This could also become a separate feature: the crawler's JSON result could include information about which page links are redirects. There are many cases where this would be useful:

  • http links to sites that have fully upgraded to https
  • links without www.
  • redirects that lead to a different page than before
  • and others
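As a sketch of what such a result entry might carry, assuming both the requested and the final URL are known. The describeRedirect function, its field names, and the kind labels are all hypothetical; nothing like this exists in the crawler output today.

```javascript
// Hypothetical helper: describe how a link's final URL differs from the requested one.
function describeRedirect(requestedUrl, finalUrl) {
  const from = new URL(requestedUrl);
  const to = new URL(finalUrl);
  if (from.href === to.href) {
    return { redirected: false, kinds: [] };
  }

  const stripWww = (host) => host.replace(/^www\./, '');
  const sameSite = stripWww(from.hostname) === stripWww(to.hostname);
  const kinds = [];

  // http link to a site that now serves https.
  if (from.protocol === 'http:' && to.protocol === 'https:') {
    kinds.push('https-upgrade');
  }
  // Host differs only by a leading "www.".
  if (from.hostname !== to.hostname && sameSite) {
    kinds.push('www-change');
  }
  // Anything else that changed means the link may land on a different page.
  if (!sameSite || from.pathname !== to.pathname || from.search !== to.search) {
    kinds.push('different-page');
  }

  return { redirected: true, from: from.href, to: to.href, kinds };
}

console.log(describeRedirect('http://example.com/a', 'https://www.example.com/a'));
console.log(describeRedirect('https://example.com/old', 'https://example.com/new'));
```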

@JustinBeckwith JustinBeckwith added the bug Something isn't working label Nov 17, 2019