Purpose

Scrape a website using a heap snapshot created by Playwright, then parse the heap snapshot for the desired data.
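To illustrate the parsing half of this idea, here is a minimal pure-Python sketch that scans a decoded V8 heap snapshot for string nodes. It is not the parser from this repo. The helper names `iter_nodes` and `find_strings` and the tiny `demo` snapshot are made up for illustration. The sketch relies on the usual `.heapsnapshot` layout, where `nodes` is a flat integer array described by `snapshot.meta.node_fields` and node names index into `strings`.

```python
def iter_nodes(snapshot):
    """Yield (type_name, name) for every node in a decoded heap snapshot."""
    meta = snapshot["snapshot"]["meta"]
    fields = meta["node_fields"]
    type_names = meta["node_types"][0]  # first entry lists the node type names
    type_at = fields.index("type")
    name_at = fields.index("name")
    nodes = snapshot["nodes"]
    strings = snapshot["strings"]
    # Each node occupies len(fields) consecutive integers in the flat array.
    for start in range(0, len(nodes), len(fields)):
        yield type_names[nodes[start + type_at]], strings[nodes[start + name_at]]


def find_strings(snapshot, needle):
    """Return the values of all string nodes that contain `needle`."""
    return [
        name
        for kind, name in iter_nodes(snapshot)
        if kind == "string" and needle in name
    ]


# Tiny hand-written snapshot (hypothetical data, far smaller than a real one).
demo = {
    "snapshot": {
        "meta": {
            "node_fields": ["type", "name", "id", "self_size", "edge_count"],
            "node_types": [["hidden", "object", "string"],
                           "string", "number", "number", "number"],
        }
    },
    "nodes": [
        1, 0, 1, 16, 0,  # an object named "Object"
        2, 1, 3, 24, 0,  # a string node
        2, 2, 5, 24, 0,  # another string node
    ],
    "strings": ["Object", "price: 19.99", "hello"],
}

matches = find_strings(demo, "price")
```

This only looks at string nodes. Recovering whole objects with their properties would also require walking the `edges` array, which the sketch leaves out.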

How to install (currently the C++ lib is not included in this repo, so building will fail)

Clone this repo, then cd into the lib dir and run pip install ./parser. This builds the C++ library and installs it into your local Python environment. Make sure to install Playwright afterwards. Then you should be able to use the HPScraper interface :)

Requirements

CMake >= 3.25 (or change the version in the CMake files)
On Windows: only MSVC 2017 or newer is supported (this is a pybind11 limitation)
Playwright (the heap snapshot is created with this)
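As a rough sketch of how Playwright can produce the heap snapshot, the code below uses a Chrome DevTools Protocol session and the `HeapProfiler` domain. This is not necessarily how HPScraper does it. The function names `take_heap_snapshot` and `snapshot_url` are hypothetical, and the sketch assumes a Chromium browser and that all snapshot chunks have been delivered by the time the `HeapProfiler.takeHeapSnapshot` call returns.

```python
import json


def take_heap_snapshot(cdp_session):
    """Collect a heap snapshot over a CDP session and return it as a dict."""
    chunks = []
    # The browser streams the snapshot as text chunks through this event.
    cdp_session.on(
        "HeapProfiler.addHeapSnapshotChunk",
        lambda params: chunks.append(params["chunk"]),
    )
    cdp_session.send("HeapProfiler.enable")
    cdp_session.send("HeapProfiler.takeHeapSnapshot", {"reportProgress": False})
    # Joined together, the chunks form one JSON document.
    return json.loads("".join(chunks))


def snapshot_url(url):
    """Open `url` in headless Chromium and return its decoded heap snapshot."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            session = page.context.new_cdp_session(page)
            return take_heap_snapshot(session)
        finally:
            browser.close()
```

Calling `snapshot_url("https://example.com")` should return the decoded snapshot, which can then be handed to whatever parser is used. Real snapshots can be tens of megabytes, so decoding them with `json.loads` may be slow.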

Roadmap

Implement the JSON parser using SIMD instructions
Add the setup.py file to make this a self-sufficient library
Optimize by reducing unnecessary copies and passing by reference, e.g. the JSON file contents are currently copied
Allow querying by value, to make it easier to find the right query params
Docker
Async Playwright implementation and batches
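The query-by-value item above could look roughly like the following. This is a hypothetical sketch, not a planned API. It assumes the scraped data is available as nested dicts and lists, and the name `find_paths` and the `record` sample are invented for illustration. Given a value you can see on the page, it reports the key paths that lead to it, which hints at the right query params.

```python
def find_paths(data, target, path=()):
    """Yield key paths in nested dicts/lists whose leaf value equals target."""
    if isinstance(data, dict):
        for key, value in data.items():
            yield from find_paths(value, target, path + (key,))
    elif isinstance(data, list):
        for index, value in enumerate(data):
            yield from find_paths(value, target, path + (index,))
    elif data == target:
        yield path


# Hypothetical scraped record.
record = {
    "product": {
        "name": "Lamp",
        "offers": [{"price": 19.99}, {"price": 24.5}],
    }
}

paths = list(find_paths(record, 19.99))
```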

zwiebelslayer/webscraper_using_heap_snapshots
