When launching a crawl, it seems that only the start URL and robots.txt are requested through the proxy (during the validation process).
Way to reproduce — start a crawl with:
$ ./htcap.py crawl -v -p http:127.0.0.1:8080 http://localhost/index.html test.db
You get:
Initializing . . done
Database test.db initialized, crawl started with 10 threads
crawl result for: link GET http://localhost/index.html
new request found link GET http://localhost/test1.html
crawl result for: link GET http://localhost/test1.html
new request found link GET http://localhost/test2.html
new request found link GET http://localhost/index.html
crawl result for: link GET http://localhost/test2.html
new request found link GET http://localhost/test1.html
new request found link GET http://localhost/index.html
Crawl finished, 3 pages analyzed in 0 minutes
But there are only two hits in the proxy log:
http://…/index.html
http://…/robots.txt
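To confirm which requests actually pass through the proxy, a minimal logging proxy can be run locally and a request sent through it. This is a hypothetical stdlib-only sketch (not part of htcap): `LoggingProxy` and the `seen` list are names invented here. Every URL fetched via the proxy-configured opener lands in `seen`; pointing htcap at this proxy the same way shows that only the start URL and robots.txt appear.

```python
# Minimal logging proxy (hypothetical helper, not part of htcap)
# that records every URL requested through it.
import http.server
import threading
import urllib.request

seen = []  # URLs that passed through the proxy

class LoggingProxy(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # For a plain-HTTP proxy request, self.path is the absolute URL.
        seen.append(self.path)
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html></html>")

    def log_message(self, *args):
        pass  # keep the console quiet

# Bind to an ephemeral port and serve in the background.
server = http.server.HTTPServer(("127.0.0.1", 0), LoggingProxy)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Fetch one URL through the proxy, as a crawler honoring the
# proxy setting would for every request.
proxy = f"http://127.0.0.1:{server.server_port}"
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({"http": proxy})
)
opener.open("http://localhost/index.html").read()
server.shutdown()

print(seen)  # ['http://localhost/index.html']
```

If the crawler's subrequests (test1.html, test2.html) honored the proxy, they would show up in `seen` as well; in the run above only the validation requests do.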