Discovering French Digital Literature (LIFRANUM ANR project)
Updated Nov 1, 2023 - Jupyter Notebook
Part of my 2022 summer internship, mainly about web scraping.
From WARC records to MongoDB documents
Common Crawl's processing tools
Process web archives (WARC format) with StormCrawler and index content into Elasticsearch or Solr
Parser for WARC (aka WebArchive) files
📇 Tools to Work with the Web Archive Ecosystem in R
metawarc: a command-line tool for extracting metadata from files stored in WARC (Web ARChive) archives
Parse and create Web ARChive (WARC) files with Node.js
Process Common Crawl data with Python and Spark