Extra error when crawling #11

Closed
barhaterahul opened this issue Jan 22, 2017 · 15 comments

Comments

@barhaterahul

I was trying to crawl a website with -m active -v and got the errors below. Could you please look into it?
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 62, in run
    self.crawl()
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 215, in crawl
    probe = self.send_probe(request, errors)
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 164, in send_probe
    probeArray = self.load_probe_json(jsn)
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 99, in load_probe_json
    return json.loads(jsn)
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 5 column 1 - line 5 column 249 (char 69 - 317)

Exception in thread Thread-5:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 62, in run
    self.crawl()
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 215, in crawl
    probe = self.send_probe(request, errors)
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 164, in send_probe
    probeArray = self.load_probe_json(jsn)
  File "/root/Desktop/htcap/core/crawl/crawler_thread.py", line 99, in load_probe_json
    return json.loads(jsn)
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
    raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 5 column 1 - line 5 column 249 (char 341 - 589)

@GuilloOme
Contributor

GuilloOme commented Jan 27, 2017

I had the same error…

Here is the content of the problematic json:

[
    ["cookies",[]],
    {"status":"ok","redirect":"http://example.com","time":0}
]
Blocked a frame with origin "file://" from accessing a frame with origin "null".  The frame requesting access has a protocol of "file", the frame being accessed has a protocol of "about". Protocols must match.{"status":"ok", "partialcontent":true}]

There is clearly some garbage in it…

After investigation, the cause is that stdout is polluted by a PhantomJS error message.

The best practice would be to use system.stdout.write('my json') (see example here) and to override console.log() to get some control over the console output. But I am not sure this is really the root cause here…
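On the Python side, the failure and one possible stopgap can be sketched as follows. This is only an illustration, not htcap's actual code: the `polluted` string is modelled on the sample quoted above, and `load_first_json` is a hypothetical helper. It uses the standard library's `json.JSONDecoder.raw_decode` to decode the first JSON value and hand back whatever trails it, which hides the symptom but does nothing about the PhantomJS output itself.

```python
import json

# Polluted probe output modelled on the sample quoted above: a valid JSON
# array followed by a PhantomJS warning and a stray JSON fragment.
polluted = (
    '[\n'
    '    ["cookies",[]],\n'
    '    {"status":"ok","redirect":"http://example.com","time":0}\n'
    ']\n'
    'Blocked a frame with origin "file://" from accessing a frame with '
    'origin "null".{"status":"ok", "partialcontent":true}]'
)


def load_first_json(text):
    """Decode the first JSON value in text; return (value, leftover text)."""
    decoder = json.JSONDecoder()
    # raw_decode does not skip leading whitespace, so find the first
    # non-blank character ourselves.
    start = len(text) - len(text.lstrip())
    value, end = decoder.raw_decode(text, start)
    return value, text[end:].strip()


# json.loads refuses the whole string because of the trailing garbage.
try:
    json.loads(polluted)
except ValueError as exc:
    print("json.loads failed:", exc)

# The tolerant helper recovers the leading array and exposes the garbage,
# which a caller could log instead of crashing the crawler thread.
probe, leftover = load_first_json(polluted)
print("probe status:", probe[1]["status"])
print("ignored trailing output:", leftover[:40])
```

Note that in the sample above the trailing `{"status":"ok", "partialcontent":true}` fragment is dropped by this approach, so it is a workaround under stated assumptions rather than a fix.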

@segment-srl
Collaborator

Thanks! It's clearly some garbage generated by phantomjs.
Could you please provide steps to reproduce the problem?

@GuilloOme
Contributor

I got the error while crawling one of our clients' websites. I tried to reproduce it in a more stable environment, without success. Sorry…

I'll try again next week

@GuilloOme
Contributor

Finally, I found a way to reproduce:

  • Run analyze.js on a local path: $ phantomjs core/crawl/probe/analyze.js /

  • It returns the same type of garbage:

[
{"status":"error","code":"load","time":0}
]
Blocked a frame with origin "file://" from accessing a frame with origin "null".  The frame requesting access has a protocol of "file", the frame being accessed has a protocol of "about". Protocols must match.

@segment-srl
Collaborator

thanks!!

@GuilloOme
Contributor

GuilloOme commented Feb 15, 2017

It looks like the error happens every time PhantomJS hits a redirect…
It has become a blocker for us here, so I'm starting to work on a fix.

After some research, the cause is that PhantomJS uses stdout to provide feedback and offers no option to deactivate it. On top of that, we can't rely on PhantomJS choosing between stdout and stderr correctly (it sends output to stdout even when it should have gone to stderr).

So one solution would be to use a temporary file shared between the CrawlerThreads and PhantomJS (with fs.write(), more here) and read the file content afterward.

Benefits of this approach:

  • increases the reliability of the PhantomJS output by providing 100% conformant JSON
  • cleans up the JS code where the output had to go through console.log() calls

Another solution would be some kind of local HTTP stream to share info between the two processes… but that seems a bit overkill for this matter.
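The temporary-file idea can be sketched from the Python side like this. It is a minimal sketch under assumptions, not the implementation used in htcap: `run_probe` is a hypothetical helper, and `FAKE_PROBE` is a small Python child process standing in for PhantomJS so the example runs anywhere. The child prints noise to stdout and writes its JSON result to the path it receives as its last argument, and the parent reads only the file.

```python
import json
import os
import subprocess
import sys
import tempfile

# Stand-in for the PhantomJS probe (an assumption, so the sketch is runnable):
# it pollutes stdout and writes its JSON result to the path in sys.argv[1].
FAKE_PROBE = (
    "import json, sys\n"
    "print('Blocked a frame with origin \"file://\"')\n"
    "with open(sys.argv[1], 'w') as fh:\n"
    "    json.dump([{'status': 'ok', 'time': 0}], fh)\n"
)


def run_probe(cmd_prefix):
    """Run cmd_prefix + [result_path] and return the JSON read from that file."""
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)  # the child process opens the file itself
    try:
        # stdout is discarded: whatever the child prints can no longer
        # corrupt the result.
        subprocess.run(cmd_prefix + [path], stdout=subprocess.DEVNULL, check=True)
        with open(path) as fh:
            return json.load(fh)
    finally:
        os.remove(path)  # clean up even if the probe or the parse fails


result = run_probe([sys.executable, "-c", FAKE_PROBE])
print(result)
```

With a real probe, the command prefix would be the PhantomJS invocation and the script would write its result with fs.write() to the path it is given. How that path is passed to the script is left open here.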

@segment-srl, What do you think?

@segment-srl
Copy link
Collaborator

I'm still unable to reproduce this issue, even with "phantomjs core/crawl/probe/analyze.js /". What version of phantomjs are you using on what os?

@GuilloOme
Contributor

$ phantomjs --version
2.1.1

@segment-srl
Collaborator

linux?

@GuilloOme
Contributor

GuilloOme commented Feb 15, 2017

Yes, Linux…
This is interesting: I get different results with the binary provided by the Ubuntu repo and the one downloaded from the project page!
With the one from the project page, I don't get any error…

@segment-srl
Collaborator

Interesting, yes. So it's an issue related to the PhantomJS build. One solution is to write the analyze.js output to a file instead of stdout.

@GuilloOme
Contributor

GuilloOme commented Feb 15, 2017

I checked the differences between the two builds (project vs. Ubuntu repo), and it seems that Ubuntu does not use the same process for building PhantomJS.
I asked them why here: https://answers.launchpad.net/ubuntu/+source/phantomjs/+question/462517

@GuilloOme
Contributor

@barhaterahul, what version of PhantomJS are you running? Is it also the version provided by Ubuntu?

GuilloOme referenced this issue in delvelabs/htcap Mar 7, 2017
@GuilloOme
Contributor

In the end, my question on Launchpad about the difference in the build process was closed without a straight answer…
So, I updated the readme: #20

@segment-srl
Collaborator

This issue is related to the PhantomJS build on some Linux distros. Using the binary from the official website should fix the problem.
Since PhantomJS is no longer supported, htcap is now moving to headless Chrome, so issues similar to this one won't be fixed.
