Multiple URLs almost works #60

Closed
gitressa opened this issue Mar 6, 2019 · 6 comments

Comments

@gitressa
Contributor

gitressa commented Mar 6, 2019

It's possible to check multiple URLs with this:
php fink.phar https://example.org https://example.org/hiddenpage --max-external-distance=1

External links on the second URL are checked, but the crawler doesn't seem to follow its internal links. My use case is to use example.org/hiddenpage as a list of internal links that guides the crawler to specific pages. However, fink doesn't follow these links from the hidden page; it only checks that the links work and returns status: 200.

@gitressa
Contributor Author

gitressa commented Mar 6, 2019

So I have just confirmed this bug:

  1. Created a single page, linked to from the front page (/node/1)
  2. Created a "secret" page, not linked to from the front page (/node/2)
  3. Created a sitemap, linking to both these pages (http://finkmap.lndo.site/linkmap)

This crawl checks the visible link (node/1) on the public page, as expected:
php fink.phar http://finkmap.lndo.site --max-external-distance=1 --output=linkreport-1.json

This crawl checks the links on the public page, visits /linkmap, and verifies that node/2 exists, but doesn't check the links on node/2:
php fink.phar http://finkmap.lndo.site http://finkmap.lndo.site/linkmap --max-external-distance=1 --output=linkreport-1.json

@dantleech
Owner

dantleech commented Mar 8, 2019

The problem is that the links in /linkmap are being classed as "external" (as they are not descendants of /linkmap) 🤔
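
To illustrate the diagnosis, here is a minimal Python sketch of a prefix-based classification. It is not fink's actual code (fink is written in PHP); the function name `is_descendant` and the rule itself are assumptions made for illustration. It shows why a link such as /node/2 found on /linkmap would count as external if "internal" means "at or below the start URL's path".

```python
from urllib.parse import urlsplit


def is_descendant(base_url: str, link: str) -> bool:
    """Return True if `link` is on the same host as `base_url` and under its path.

    Hypothetical rule for illustration only, not fink's implementation.
    """
    base, target = urlsplit(base_url), urlsplit(link)
    if (base.scheme, base.netloc) != (target.scheme, target.netloc):
        return False
    base_path = base.path.rstrip("/")
    # Same path, or strictly below it at a path-segment boundary.
    return (
        target.path.rstrip("/") == base_path
        or target.path.startswith(base_path + "/")
    )


site = "http://finkmap.lndo.site"
linkmap = "http://finkmap.lndo.site/linkmap"
node2 = "http://finkmap.lndo.site/node/2"

print(is_descendant(site, node2))     # node/2 is below the site root
print(is_descendant(linkmap, node2))  # node/2 is not below /linkmap
```

Under this assumed rule, /node/2 is internal relative to the site root but external relative to /linkmap, which matches the behaviour reported above.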

@dantleech
Owner

This should fix it: https://github.com/dantleech/fink/pull/72/files

dantleech added a commit that referenced this issue Mar 8, 2019
Allow the inclusion of additional links
@gitressa
Contributor Author

gitressa commented Mar 8, 2019

Thanks! It now works like this:

php fink \
http://finkmap.lndo.site \
--include-link=http://finkmap.lndo.site/linkmap \
--output=linkreport.json

@gitressa
Contributor Author

gitressa commented Mar 9, 2019

I spoke a little too soon ... It does work and checks external links, which is great. But it seems to do this only on the first page; it doesn't follow paginated links such as /linkmap?page=1.

It seems to move the query string onto the base URL during the crawl, resulting in something like this:
{"url":"http:\/\/finkmap.lndo.site?page=153","distance":2,"referrer":"http:\/\/finkmap.lndo.site\/linkmap","status":200,"request-time":733386,"exception":null}

Here, /linkmap?page=153 is the last page of my collection of links (Linkmap).
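
For comparison, this is how a query-only relative link resolves under RFC 3986, sketched with Python's standard library. It assumes the pager emits an href of the form `?page=153`, which the thread does not confirm; the code is an illustration and not part of fink. The record quoted above is parsed to show that its URL kept the query string but lost the /linkmap path.

```python
import json
from urllib.parse import urljoin, urlsplit

# The report record quoted in this comment.
record = json.loads(
    r'{"url":"http:\/\/finkmap.lndo.site?page=153","distance":2,'
    r'"referrer":"http:\/\/finkmap.lndo.site\/linkmap","status":200,'
    r'"request-time":733386,"exception":null}'
)

# Assumed href; the actual pager markup is not shown in the thread.
href = "?page=153"

# Resolve the href against the page it was found on.
expected = urljoin(record["referrer"], href)
print("resolved per RFC 3986:", expected)
print("reported by the crawl:", record["url"])

# The reported URL keeps the query string but has an empty path.
reported = urlsplit(record["url"])
print("path:", repr(reported.path), "query:", reported.query)
```

If the href really is query-only, standard resolution keeps the /linkmap path, so the difference between the two printed URLs is the symptom described here.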

@gitressa
Contributor Author

gitressa commented Mar 9, 2019

UPDATE: It works! I changed the pager (this is in Drupal) from "Paged output, full pager" to "Paged output, mini pager", and fink now crawls /linkmap?page=1, /linkmap?page=2, /linkmap?page=3, etc.
