Move to headless chrome #31

Closed
segment-srl opened this issue Jul 31, 2017 · 6 comments

Comments

@segment-srl
Collaborator

PhantomJS is no longer under development, so we need to move to headless Chrome.

@ring04h

ring04h commented Sep 20, 2017

Are you talking about Puppeteer or Chromeless?

@GuilloOme
Contributor

Since Puppeteer is the project started by the team that develops Chrome, I would be inclined to use this library.

@segment-srl
Collaborator Author

Hi, and sorry for the delayed answer... I agree with GuilloOme, Puppeteer is the best choice.
I ran some tests with Puppeteer and the crawler works pretty well with very few modifications to htcap's JS code.
I'm still facing a couple of problems:
1. It does not seem possible to load a page using a POST request (with custom headers).
2. There seems to be no reliable way to "lock navigation" as in PhantomJS.

Point 1 may be resolved by writing a Chrome extension, but point 2 is very problematic. In Chrome it's possible to intercept and abort requests, but not page navigation. For example, if we allow scripts to load, the crawler may navigate to a .js URL. It's also not possible to prevent navigation to about:blank (e.g. <a href="about:blank"...).
I'm going to run more tests to find out if I'm missing something...
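
To make the interception problem concrete, here is a minimal sketch of the decision an interception hook would have to make. It is plain Python with no browser attached; `InterceptedRequest`, `decide` and the `is_navigation` flag are hypothetical names for illustration, not Puppeteer or htcap APIs. It assumes the automation layer can tell a navigation from a subresource load, which is exactly the part that looks unreliable, and it cannot cover about:blank if that navigation never reaches the hook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterceptedRequest:
    """Hypothetical view of a request seen by an interception hook."""
    url: str
    resource_type: str   # e.g. "document", "script", "xhr", "image"
    is_navigation: bool  # assumed: True when the request would replace the page


# Assumption: the crawler wants scripts and XHR to load, nothing else.
ALLOWED_TYPES = frozenset({"document", "script", "xhr"})


def decide(request: InterceptedRequest, start_url: str) -> str:
    """Return "continue" or "abort" for one intercepted request."""
    if request.is_navigation:
        # "Lock navigation": only the initial page load may navigate.
        return "continue" if request.url == start_url else "abort"
    # Subresources are filtered by type only.
    return "continue" if request.resource_type in ALLOWED_TYPES else "abort"
```

With this policy a script fetched as a subresource is allowed, while the same .js URL requested as a navigation is aborted, so everything depends on `is_navigation` being trustworthy.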

@GuilloOme
Contributor

GuilloOme commented Dec 18, 2017

We did all this in our fork. If you want to take a look at the implementation details, they are here: https://github.com/delvelabs/htcap/tree/master/core/crawl/probe

We did a lot of work to reach a stable (enough) implementation, and it will be deployed in our production environment in January.

@segment-srl
Collaborator Author

I tried your fork and it seems to have the same issue as my test code. If a page contains a link to about:blank (<a href="about:blank"), navigation is not locked.

@GuilloOme
Contributor

@segment-srl you are right, any "special" URI scheme makes the probe hang… we haven't found a solution yet. It should be possible to solve it through the webNavigation feature available in Chrome extensions.

We chose to postpone the issue, since not many websites use schemes other than http(s) in href attributes, but it will have to be handled at some point.
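
Until that is handled, one possible stopgap would be to filter hrefs by scheme before the probe acts on them, which would avoid at least the cases where the crawler itself triggers the link. The sketch below uses only the Python standard library; `is_followable_href` and `FOLLOWABLE_SCHEMES` are hypothetical names, not part of htcap or our fork, and this is not a substitute for the webNavigation approach.

```python
from urllib.parse import urlsplit

# Assumption: only http(s) targets are safe for the probe to follow.
FOLLOWABLE_SCHEMES = frozenset({"http", "https"})


def is_followable_href(href: str) -> bool:
    """Return True for relative URLs and http(s) URLs, False otherwise."""
    scheme = urlsplit(href.strip()).scheme  # urlsplit lowercases the scheme
    # An empty scheme means a relative or scheme-relative URL, which
    # resolves against the (http/https) page that contains it.
    return scheme == "" or scheme in FOLLOWABLE_SCHEMES
```

For example, `is_followable_href("about:blank")` and `is_followable_href("javascript:void(0)")` are both rejected, while `"/relative/path"` is kept.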
