Crawler that retrieves commoncrawl's crawled hosts and their corresponding IPs


CAIDA/commoncrawl-host-ip-mapper


CommonCrawl Host-IP Mapper

CommonCrawl Host Mapper crawls a selected CommonCrawl index and generates a host-to-IP mapping file.

It is designed to be massively parallelizable. Depending on the capacity of the system it runs on, users can run the crawl on tens or hundreds of threads to speed up retrieval.
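As a rough illustration of the thread fan-out idea, here is a minimal sketch that uses only the Rust standard library. It is not taken from this repository: `run_parallel`, the notion of a "job" (for example, one piece of the index), and the chunking strategy are all assumptions made for the example.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical helper: split `jobs` across up to `threads` workers and
/// gather every result. The order of results is not guaranteed.
pub fn run_parallel<F>(jobs: Vec<String>, threads: usize, work: F) -> Vec<String>
where
    F: Fn(&str) -> String + Send + Clone + 'static,
{
    let threads = threads.max(1);
    // Ceiling division so that every job lands in some chunk; never zero,
    // because `chunks(0)` would panic.
    let chunk_size = ((jobs.len() + threads - 1) / threads).max(1);

    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();

    for chunk in jobs.chunks(chunk_size) {
        let chunk = chunk.to_vec();
        let tx = tx.clone();
        let work = work.clone();
        handles.push(thread::spawn(move || {
            for job in &chunk {
                tx.send(work(job)).expect("receiver should still be alive");
            }
        }));
    }

    // Drop the original sender so the receiver stops once all workers finish.
    drop(tx);
    let results: Vec<String> = rx.iter().collect();

    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    results
}
```

Calling `run_parallel(jobs, 128, fetch)` with a hypothetical `fetch` function would mirror the `--threads 128` invocation in spirit; the tool's actual scheduling may differ.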

It also comes with a straightforward command-line interface and a progress bar that tracks the ongoing crawl.

Build

cargo build --release

Examples

By default, it crawls the most recent available CommonCrawl index and writes the results to the current directory as mapping-INDEX_ID.csv.gz. The available CommonCrawl indices are listed at https://index.commoncrawl.org/collinfo.json.
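To make the default naming concrete, the sketch below picks the newest index from a list of IDs and builds the matching file name. It assumes index IDs of the form `CC-MAIN-YYYY-WW` (for example `CC-MAIN-2020-50`); the helper names are hypothetical, and the tool's own selection logic may work differently.

```rust
/// Extract a sortable (year, week) key from an ID such as "CC-MAIN-2020-50".
/// Returns None for IDs that do not follow the assumed pattern.
fn index_key(id: &str) -> Option<(u32, u32)> {
    let rest = id.strip_prefix("CC-MAIN-")?;
    let (year, week) = rest.split_once('-')?;
    Some((year.parse().ok()?, week.parse().ok()?))
}

/// Pick the most recent index ID, ignoring IDs that cannot be parsed.
pub fn most_recent<'a>(ids: &[&'a str]) -> Option<&'a str> {
    ids.iter()
        .copied()
        .filter(|id| index_key(id).is_some())
        .max_by_key(|id| index_key(id))
}

/// Build the default output file name described above.
pub fn default_output_name(index_id: &str) -> String {
    format!("mapping-{}.csv.gz", index_id)
}
```

With these helpers, `default_output_name("CC-MAIN-2020-50")` yields `mapping-CC-MAIN-2020-50.csv.gz`.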

To run with 128 threads:

./target/release/cc-host-mapper --threads 128

To output to a different file:

./target/release/cc-host-mapper --threads 128 --output custom-output-file-name.csv

Output

Each line of the output file is formatted as HOST,DATE,IP.

...
college.ac,2020-11-25,172.104.36.121
college.ac,2020-11-28,172.104.36.121
door.ac,2020-11-26,54.95.55.40
door.ac,2020-11-24,54.95.55.40
door.ac,2020-11-23,54.95.55.40
door.ac,2020-12-01,54.168.46.54
...
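For downstream processing, the rows can be parsed with the standard library alone. The sketch below assumes the text has already been decompressed (reading `.csv.gz` directly would need an external crate) and that every row has exactly three comma-separated fields; `Record`, `parse_line`, and `ips_by_host` are names invented for this example.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::net::IpAddr;

/// One row of the mapping file: HOST,DATE,IP.
#[derive(Debug, PartialEq)]
pub struct Record {
    pub host: String,
    pub date: String, // kept as text, e.g. "2020-11-25"
    pub ip: IpAddr,
}

/// Parse a single line; returns None for malformed or placeholder lines.
pub fn parse_line(line: &str) -> Option<Record> {
    let mut parts = line.trim().splitn(3, ',');
    let host = parts.next()?;
    let date = parts.next()?;
    let ip: IpAddr = parts.next()?.parse().ok()?;
    if host.is_empty() || date.is_empty() {
        return None;
    }
    Some(Record {
        host: host.to_string(),
        date: date.to_string(),
        ip,
    })
}

/// Collect the distinct IPs observed for each host.
pub fn ips_by_host(text: &str) -> BTreeMap<String, BTreeSet<IpAddr>> {
    let mut map = BTreeMap::new();
    for rec in text.lines().filter_map(parse_line) {
        map.entry(rec.host)
            .or_insert_with(BTreeSet::new)
            .insert(rec.ip);
    }
    map
}
```

Applied to the sample rows above, `ips_by_host` would report one address for college.ac and two for door.ac, which shows how the DATE column lets a change of IP be tracked over time.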
