
Scrapy MongoDB

MongoDB-based components for Scrapy that allow distributed crawling.

Available Scrapy components

  • Scheduler
  • Duplication Filter

Installation

From github

To install it via pip,

# install
pip install git+https://github.com/taicaile/scrapy-mongodb
# reinstall
pip install --ignore-installed git+https://github.com/taicaile/scrapy-mongodb

or clone it first,

git clone https://github.com/taicaile/scrapy-mongodb.git
cd scrapy-mongodb
pip install .  # preferred over the deprecated `python setup.py install`

To install a specific version,

# replace `v0.1.0` with the version you want,
pip install git+https://github.com/taicaile/scrapy-mongodb@v0.1.0

You can put the following in requirements.txt,

scrapy-mongodb@git+https://github.com/taicaile/scrapy-mongodb@v0.1.0

Usage

Enable the components in your settings.py:

# Enable the scheduler that stores the requests queue in MongoDB.
SCHEDULER = "scrapy_mongodb.scheduler.Scheduler"

# Specify the host, port, and database to use when connecting to MongoDB (optional).
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "scrapy"
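
The host, port, and database settings above map onto the standard MongoDB connection URI format. As a quick sanity check you can assemble and inspect the URI those settings describe; this snippet is purely illustrative, and it is an assumption that scrapy-mongodb reads the individual settings rather than a URI:

```python
# Assemble a standard MongoDB connection URI from the settings above.
# Illustration only -- scrapy-mongodb itself presumably reads the
# individual settings, not a URI (an assumption; check the source).
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_DB = "scrapy"

mongodb_uri = f"mongodb://{MONGODB_SERVER}:{MONGODB_PORT}/{MONGODB_DB}"
print(mongodb_uri)
```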

To keep the dupefilter state and the requests queue between crawls, enable persistence:

MONGODB_DUPEFILTER_PERSIST = False  # False by default
MONGODB_SCHEDULER_QUEUE_PERSIST = False  # False by default
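
The Duplication Filter listed under available components is presumably enabled through Scrapy's standard `DUPEFILTER_CLASS` setting; the exact class path below is an assumption, so verify it against the package source:

```python
# Hypothetical class path -- confirm against the scrapy_mongodb source.
DUPEFILTER_CLASS = "scrapy_mongodb.dupefilter.DupeFilter"
```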

Note that this project does not currently support distributed crawling.
