Support for Cookies #31

Open
stevenwaterman opened this issue Jan 14, 2019 · 7 comments
@stevenwaterman

It would be good if there was a way to set cookies for requests to allow for crawling sites that require authentication.

Is there currently a way to do this, or is this feature planned?

@nazuke nazuke self-assigned this Jan 28, 2019
@nazuke
Owner

nazuke commented Jan 28, 2019

Many thanks for the suggestion @motherlymuppet,

So far, I have not planned to support crawling sites that require form-based logins. However, an option for this would very likely be reasonably straightforward to add.

One thing to bear in mind here is that crawling a site with this type of login may have unintended side-effects.

For example, if there are links that perform actions like "delete this page", or similar, then SEO Macroscope will merrily follow these links too.

This is also one of the reasons why Googlebot et al. will not crawl sites as a particular user.

@nazuke
Owner

nazuke commented Jan 28, 2019

Hi @motherlymuppet, following up, I took a look at how Screaming Frog handles this situation.

They too include a dire warning about data loss when using forms-based log ins.

Cookie support itself may be fine though.

Do you happen to have an example site that absolutely requires the setting of cookies in order to crawl it properly please?

many thanks

@stevenwaterman
Author

The program should only be sending GET requests, surely? In which case there shouldn't be any effects on the site if it's configured properly and not changing state based on GET requests. I can see how that would be an issue for misconfigured sites though.

It'd be fine for it to be a very hidden option. It just seemed crazy that it wasn't there, when it seems like a fairly fundamental part of accessing/navigating a website.

The site I wanted to use it on was my own, and authentication was enabled due to large amounts of sensitive information on the site, which was like a knowledge base. I was attempting to crawl the site to reduce the amount of duplicated information and reorganize the site to be more natural to navigate. I don't have an example to hand that you could use for testing, sorry.

@nazuke
Owner

nazuke commented Jan 29, 2019

Thanks @motherlymuppet, that feedback helps a lot.

This is one of those cases where things in the real world don't always match the specs, i.e. there will be some websites with regular links that, when clicked, have side-effects that are potentially damaging to the user. Generally, this is because these sites always expect a logged-in human, and not a robot that'll "click" everything it can get to on the page.

For example, SEO Macroscope would not know to not click this link:

<a href="/very/important/docs/delete/123">Delete this doc</a>
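
To illustrate the point, here is a minimal sketch (in Python, not the application's own code) of a naive link collector. The "View this doc" link and the `LinkCollector` name are assumptions added for contrast; only the delete link comes from the example above. To the collector, both are just `href` values to follow.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical crawler helper: gathers every href found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(value)


# The first link is an assumed harmless one; the second is the example above.
page = (
    '<a href="/very/important/docs/123">View this doc</a>\n'
    '<a href="/very/important/docs/delete/123">Delete this doc</a>\n'
)

collector = LinkCollector()
collector.feed(page)

# Both URLs end up in the crawl queue; nothing marks the second as destructive.
for link in collector.links:
    print("would GET:", link)
```

Nothing in the markup distinguishes the two links, so a crawler that sends a logged-in session cookie would request both.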

Under the hood, things are a little convoluted. The only HTTP methods used by the application are HEAD and GET.

In as many cases as possible, HEAD is used to probe a URL, with a subsequent GET where necessary.

You can see the rough flow that occurs for each fetched document here:

https://github.com/nazuke/SEOMacroscope/blob/master/SEOMacroscopeSeriesOne/src/MacroscopeDocument/MacroscopeDocument.cs

...in the public async Task<bool> Execute () method.

In fact, I just recently added an option to force GETs on web servers that don't service HEAD requests properly. The whole web is hack piled upon hack ;-)
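
The HEAD-then-GET flow described above could be sketched roughly as follows. This is a simplified Python illustration, not the logic of the `Execute ()` method: the `send` hook, the `fetch` function, the `force_get` flag name, and the rule that only HTML documents need a follow-up GET are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical transport hook: send(method, url) -> (status, content_type).
Send = Callable[[str, str], Tuple[int, str]]

# Status codes taken here to mean "this server does not service HEAD".
HEAD_UNSUPPORTED = {405, 501}


def fetch(url: str, send: Send, force_get: bool = False) -> Tuple[List[str], int]:
    """Probe with HEAD, then GET only where necessary.

    Returns the list of methods used and the final status code.
    """
    methods = []
    if not force_get:
        status, content_type = send("HEAD", url)
        methods.append("HEAD")
        # Assumption: a body is only needed for HTML pages that exist.
        needs_body = status == 200 and content_type.startswith("text/html")
        if status not in HEAD_UNSUPPORTED and not needs_body:
            return methods, status
    status, _ = send("GET", url)
    methods.append("GET")
    return methods, status


def fake_send(method: str, url: str) -> Tuple[int, str]:
    """Stand-in for a real HTTP client so the sketch runs offline."""
    if url.endswith(".png"):
        return 200, "image/png"
    return 200, "text/html; charset=utf-8"


print(fetch("https://www.company.com/logo.png", fake_send))
print(fetch("https://www.company.com/", fake_send))
print(fetch("https://www.company.com/", fake_send, force_get=True))
```

Either way, only HEAD and GET are ever issued, which matches the two methods named above.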

So far, HTTP Basic Authentication should work in most cases; but as I don't get as much time as I'd like to work on this, forms-based authentication has so far not been on my TODO list. Hm, I don't actually have a forms-based authentication website to test with at the moment either...

You make some great points though, and this will be something that I'll be taking a look at soon.

many thanks!

@nazuke
Owner

nazuke commented Jan 29, 2019

Hi again @motherlymuppet,

At a quick glance, it appears that cookie support itself is reasonably trivial.

So, the next detail would be the login process itself.

Does your login form use a GET like this:

https://www.company.com/login?username=bob&password=secret

or a POST to an endpoint somewhat like this:

https://www.company.com/login

with the credentials in the body?

If so, then this type of process would normally require the login page's URL and the credentials to be entered before the crawl takes place. Alternatively, a form field pattern would be required, with the credentials being prompted for during the crawl.

Either way, the login page would be requested first, in order for the resultant session cookie to be captured.
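
As a rough sketch of the POST variant, assuming a form-encoded body and a session cookie returned in a `Set-Cookie` header: the Python below builds the login request and turns the response's `Set-Cookie` values into a `Cookie` header for later requests. The function names, the field names `username` and `password`, and the `session=abc123` value are illustrative assumptions; nothing is actually sent over the network here.

```python
from http.cookies import SimpleCookie
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(login_url, credentials):
    """POST login: the credentials travel in the body, not in the URL."""
    body = urlencode(credentials).encode("ascii")
    return Request(
        login_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def session_cookie_header(set_cookie_values):
    """Reduce Set-Cookie response values to a single Cookie request value."""
    pairs = []
    for raw in set_cookie_values:
        parsed = SimpleCookie()
        parsed.load(raw)
        for name, morsel in parsed.items():
            # Attributes such as Path and HttpOnly are not sent back.
            pairs.append(f"{name}={morsel.value}")
    return "; ".join(pairs)


request = build_login_request(
    "https://www.company.com/login",
    {"username": "bob", "password": "secret"},
)
print(request.get_method(), request.full_url, request.data)

# Assumed example of what the login response might carry.
cookie = session_cookie_header(["session=abc123; Path=/; HttpOnly"])
print("Cookie:", cookie)
```

The resulting `Cookie` value would then be attached to every subsequent crawl request.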

thanks!

@stevenwaterman
Author

It's a POST endpoint, but that shouldn't matter. What I had in mind was a simple text field in the options where you could paste the cookie. I don't expect SEO Macroscope to navigate me to the login page or guide me through it or anything like that, and I'd prefer it didn't for security reasons.

I can use the login form myself in a web browser, then take the cookie from the developer menu. All you need to do then is provide the box to put the cookie into, and attach that cookie to all outgoing requests.

That would provide complete flexibility across all login methods, and anyone trying to solve this problem is probably advanced enough to go to the developer menu and grab a cookie.

I don't mean to be patronising if this is already obvious to you, but thought I'd give an example of what I mean:

  • Go to the network tab of the developer menu
  • Navigate to a new page on GitHub
  • On the right, you'll see in the request headers the Cookie: field. If you send a request to GitHub with that cookie attached, GitHub will respond as if you're logged in as you. I'm not 100% on which bits are the important bits for GitHub specifically, but it's probably user_session and _gh_session.
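
The "paste the cookie, attach it to every request" idea could look something like this minimal Python sketch. The `with_cookie` helper and the placeholder cookie values are assumptions for illustration; the requests are only constructed, not sent.

```python
from urllib.request import Request


def with_cookie(url, pasted_cookie, method="GET"):
    """Build a request that carries the cookie string pasted by the user."""
    value = pasted_cookie.strip()
    # Tolerate a value copied together with its header name.
    if value.lower().startswith("cookie:"):
        value = value[len("cookie:"):].strip()
    request = Request(url, method=method)
    request.add_header("Cookie", value)
    return request


# Placeholder values; a real session cookie would come from the browser.
pasted = "Cookie: user_session=EXAMPLE; _gh_session=EXAMPLE"

crawl_queue = [
    "https://github.com/",
    "https://github.com/settings/profile",
]
requests_to_send = [with_cookie(url, pasted) for url in crawl_queue]

for req in requests_to_send:
    print(req.get_method(), req.full_url, "->", req.get_header("Cookie"))
```

Because the cookie is treated as an opaque string, this works regardless of how the login itself was performed.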

@benhadad

I own several websites that require visitors to accept the use of cookies. This agreement is the "form", but it grants the user nothing beyond access to the website. This is now a very common use case in the EU, and now the US. I have just noticed that SEOMacroscope fails on these websites.
