Web Crawler

This is a simplified search engine built with Chilkat's CkSpider, Gumbo Parser and RapidJSON. The software collects a given number of web pages and builds an index for information retrieval over that collection. By default, the crawler will try to visit 100000 (one hundred thousand) pages before halting. You can change this value by modifying the PAGES_TO_COLLECT constant located in main.cpp.
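
For reference, this is a minimal sketch of what that constant might look like in main.cpp; it assumes a plain integer constant, so the actual declaration in the source may differ:

// Number of pages the crawler will try to visit before halting (hypothetical excerpt)
const int PAGES_TO_COLLECT = 100000;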

Installing

First, install Chilkat for C++.

After installing Chilkat:

$ sudo make install
$ make

This will install Gumbo Parser and build the project (you may need to run sudo ldconfig afterwards), creating an executable file within build/.

Usage

To run the application, you can either use make run (which runs with sample inputs: it automatically crawls from the predefined seed at input/seed and builds the index for those pages) or use ./build/web-crawler with custom options.

The available options are listed below; example invocations follow the list:

  • -c [SEED_FILE], replacing [SEED_FILE] with the path to the file containing your seeds (see the example below). This starts the crawling process using [SEED_FILE] as the seed.
  • -i [COLLECTION_PATH] (optional), replacing [COLLECTION_PATH] with the path where your HTML collection is stored; if omitted, it defaults to output/collection.jl. This builds an index for the documents at [COLLECTION_PATH] and an index for the vocabulary of the collection. Two output files, briefing.doc.idx and index.idx (the document index and the vocabulary index, respectively), will be created in output/.
  • -l [VOCABULARY_INDEX_PATH], which loads the vocabulary index file at [VOCABULARY_INDEX_PATH] into memory (be careful, as this loads the whole index into RAM).
  • -q [VOCABULARY_INDEX_PATH] [COLLECTION_INDEX_PATH], where both arguments are optional; however, if [COLLECTION_INDEX_PATH] is provided, [VOCABULARY_INDEX_PATH] must be provided as well. This opens the CLI for performing queries. The defaults are ./output/index.idx and ./output/briefing.doc.idx.
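
For illustration, here are some example invocations using the default paths mentioned above (adjust them to your own files as needed):

$ ./build/web-crawler -c input/seed
$ ./build/web-crawler -i output/collection.jl
$ ./build/web-crawler -l output/index.idx
$ ./build/web-crawler -q output/index.idx output/briefing.doc.idx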

The documents in the collection are indexed in batches. By default, the maximum batch size is 4096 documents, defined in include/indexer.hpp. If the batch size is too large, the application will consume a large amount of RAM; if it is too small, the execution time and disk usage may increase. A chart roughly illustrating how memory consumption scales with the batch size is presented in this document.

Be cautious when increasing the maximum batch size, as larger batches require a lot of RAM: indexing 60000 documents at once, for example, can consume over 5 GB. Also, make sure you have enough storage space.

The output will be a collection.jl file, containing the collected documents as a list of JSON objects separated by line breaks, and an index.idx file, containing the inverted index for that collection. Each document is a JSON object with two keys, url and html_content, holding the document's web address and the document's HTML content, respectively. Each line of the inverted index represents a term in the following format:

term n d1 nd1 p1,d1 p2,d1 ... pnd1,d1 ... dn ndn p1,dn ... pndn,dn

Where term is the indexed word, n is the number of documents in which the term is present, di is the i-th document where the term appears, ndi is the number of times the term appears in di, and pj,di is the position of the j-th occurrence of the term in di.
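
For instance, a hypothetical entry for the term crawler appearing three times in document 1 (at positions 4, 17 and 42) and once in document 5 (at position 9) would be:

crawler 2 1 3 4 17 42 5 1 9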

Example

Your seed file should be a list of URLs, separated by line breaks, from which the crawler will start visiting, like:

ufmg.br
kurzgesagt.org
www.cam.ac.uk
www.nasa.gov
github.com
medium.com
www.cnnbrasil.com.br
disney.com.br
en.wikipedia.org

The output collection will be formatted as below (input collections should follow the same format):

{"url": "www.document1.com", "html_content": "<html> document 1's html content... </html>"}
{"url": "www.document2.com", "html_content": "<html> document 2's html content... </html>"}
