Scrapy MongoDB

MongoDB-based components for Scrapy that allow distributed crawling.

Available Scrapy components

Scheduler
Duplication Filter

Installation

From GitHub

To install it via pip:

# install
pip install git+https://github.com/taicaile/scrapy-mongodb
# reinstall
pip install --ignore-installed git+https://github.com/taicaile/scrapy-mongodb

Or clone the repository first and install from source:

git clone https://github.com/taicaile/scrapy-mongodb.git
cd scrapy-mongodb
python setup.py install

To install a specific version:

# replace `v0.1.0` with the version (tag) you want
pip install git+https://github.com/taicaile/scrapy-mongodb@v0.1.0

You can also pin the package in requirements.txt:

scrapy-mongodb@git+https://github.com/taicaile/scrapy-mongodb@v0.1.0
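
After installing, it can help to confirm that the package is visible to the interpreter Scrapy will run under. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of scrapy-mongodb, and the module name `scrapy_mongodb` is taken from the `SCHEDULER` path used in the Usage section.

```python
import importlib.util


def is_installed(module_name: str = "scrapy_mongodb") -> bool:
    """Return True if a top-level module can be located without importing it.

    `is_installed` is a hypothetical helper for illustration only.
    """
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Reports whether the package is importable in this environment.
    print("scrapy_mongodb importable:", is_installed())
```

If this reports `False`, the package was probably installed into a different Python environment than the one being used to run the check.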

Usage

Enable the components in your settings.py:

# Enable the scheduler that stores the request queue in MongoDB.
SCHEDULER = "scrapy_mongodb.scheduler.Scheduler"

# Specify the host, port, and database to use when connecting to MongoDB (optional).
MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017
MONGODB_DB = "scrapy"

Persistence settings:

MONGODB_DUPEFILTER_PERSIST = False  # default
MONGODB_SCHEDULER_QUEUE_PERSIST = False  # default

Note: this is not currently suitable for distributed use.
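
To check the connection values before starting a crawl, the `MONGODB_SERVER`, `MONGODB_PORT`, and `MONGODB_DB` settings shown above can be combined into a standard MongoDB connection URI. This is a minimal sketch, not how scrapy-mongodb itself connects: the helper `mongodb_uri` is hypothetical, and the fallback values simply mirror the example settings, which are assumed (not confirmed) to be the defaults.

```python
from urllib.parse import quote_plus


def mongodb_uri(settings: dict) -> str:
    """Build a mongodb:// URI from Scrapy-style settings.

    Hypothetical helper for illustration; the fallback values below are
    assumptions copied from the example settings, not confirmed defaults.
    """
    host = settings.get("MONGODB_SERVER", "localhost")
    port = int(settings.get("MONGODB_PORT", 27017))
    db = settings.get("MONGODB_DB", "scrapy")
    # quote_plus escapes characters that are not safe in a URI path segment.
    return f"mongodb://{host}:{port}/{quote_plus(db)}"


if __name__ == "__main__":
    print(mongodb_uri({"MONGODB_SERVER": "localhost", "MONGODB_PORT": 27017}))
```

The resulting string can be passed to a MongoDB client (for example pymongo's `MongoClient`) to confirm that the server is reachable with the same values the crawler will use.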
