Skip to content

64bit/web-crawler

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

22 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Intro

Crawls the web within same domain, for example if start url is https://www.google.com, then it won't crawl https://maps.google.com. Crawler limits webpages visited to 2000. Crawler runs a BFS from start url.

####Usage

The command below prints the static assets (js, css, image, txt, pdf, doc, docx files) url.

./crawl https://www.google.com

Output of above command:

[ 
  {
    "url": "http://www.google.com/history/optout?hl=en",
    "assets": [
      "https://www.gstatic.com/images/branding/googlelogo/2x/googlelogo_light_color_74x24dp.png",
      "http://www.gstatic.com/history/static/myactivity_20170215-0135_1/angular-material.css",
      "https://fonts.googleapis.com/css?family=RobotoDraft:400,500"
    ]
  },
  ...
  ...
  {
    "url": "http://www.google.com/preferences?hl=en",
    "assets": [
      "https://www.google.com/images/branding/searchlogo/1x/googlelogo_desk_heirloom_color_150x55dp.gif",
      "https://www.google.com/images/warning.gif"
    ]
  }
]

To run in verbose mode, so as to see what crawler is doing at current moment:

DEBUG=true ./crawl https://www.google.com

Snippet of intermediate output of above command, where "Queue Size" is the BFS queue size at given moment, also relative urls are getting converted to absolute url:

...
Queue Size: 334
Fetching:  https://www.google.com/intl/en/about/products/products/
converted:  //www.google.com/  -->  https://www.google.com/
converted:  //www.google.com/  -->  https://www.google.com/
Queue Size: 333
Fetching:  https://www.google.com/intl/en/about/products/products/assistant/
converted:  //www.google.com/  -->  https://www.google.com/
converted:  //www.google.com/  -->  https://www.google.com/
Queue Size: 332
Fetching:  https://www.google.com/intl/en/about/products/products/pixel/
converted:  //www.google.com/  -->  https://www.google.com/
converted:  //www.google.com/  -->  https://www.google.com/
Queue Size: 331
Fetching:  https://www.google.com/intl/en/about/products/products/allo-duo/
converted:  //www.google.com/  -->  https://www.google.com/
converted:  //www.google.com/  -->  https://www.google.com/
Queue Size: 330
...

Running Unittests

Unittests runs an actual webserver (flask) instead of mocking requests.get(url)

./run_tests.sh

####Dependencies Python 2.7, requests, beautifulsoup, flask( required by unittests )

####Install Dependencies

pip install requests
pip install beautifulsoup4
pip install flask

Install other required modules using pip

About

Crawls the web within same domain

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published