jasontlam/snorkel-biocorpus

 
 

Snorkel BioCorpus

For now, this is a preprocessed, Snorkel-format dump of PubTator. More sources will be added soon!

Database Snapshot

The easiest way to get started is to download a preprocessed Snorkel PostgreSQL database dump. The file is 142 GB and is ready to use directly with Snorkel.

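To reload the dump, run psql snorkel-biocorpus < snorkel_biocorpus.sql. The target database (snorkel-biocorpus) must already exist, since psql connects to it before reading the file.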
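The reload step can also be scripted. The following is a minimal sketch, not part of this repository: it assumes a local PostgreSQL server reachable with default connection settings, that the database does not exist yet, and that the PostgreSQL client tools (createdb, psql) are on your PATH. The helper names reload_commands and reload are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path

# Names taken from the psql command in this README.
DB_NAME = "snorkel-biocorpus"
DUMP_PATH = Path("snorkel_biocorpus.sql")


def reload_commands(db_name, dump_path):
    """Return the commands to run, in order: create the database, then load the dump."""
    return [
        ["createdb", db_name],
        # Equivalent to `psql db_name < dump_path`, using psql's --file option.
        ["psql", "--dbname", db_name, "--file", str(dump_path)],
    ]


def reload(db_name=DB_NAME, dump_path=DUMP_PATH):
    """Run the reload; raises if a tool or the dump is missing, or a command fails."""
    for tool in ("createdb", "psql"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    if not Path(dump_path).is_file():
        raise FileNotFoundError(dump_path)
    for argv in reload_commands(db_name, dump_path):
        subprocess.run(argv, check=True)

# Usage (assumption: the dump sits in the current directory):
#   reload()
```

If the database already exists, drop the createdb step. Loading a file of this size will take a while, so it may be worth running inside a terminal multiplexer.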

Sources

  • PubMed abstracts

Summary Statistics

XXX PubMed Abstracts
XXX 19XX - 2017

Entity Tags

Building the Database

Full PubTator Snapshot

You can rebuild the entire PubTator database from scratch as follows:

run install.sh

This will download the current PubTator snapshot (~10GB compressed; 32GB raw) from ftp.ncbi.nlm.nih.gov.

Parsing using 16 cores with the spaCy parser takes around XX hours. Parsing with CoreNLP will take longer.
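To give a feel for the input before it reaches the NLP parser, here is a minimal sketch of reading PubTator-style records. It is not the code install.sh runs. It assumes the commonly used PubTator text layout: a `PMID|t|title` line, a `PMID|a|abstract` line, tab-separated annotation lines (PMID, start, end, mention, type, optional concept ID), and a blank line between documents. The names Annotation and parse_pubtator are hypothetical, and the sample record is made up for illustration.

```python
from collections import namedtuple

# Hypothetical container for one annotation line.
Annotation = namedtuple("Annotation", "pmid start end mention type concept_id")


def parse_pubtator(lines):
    """Yield (pmid, title, abstract, annotations) for each document in `lines`."""
    pmid, title, abstract, annotations = None, "", "", []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current document.
            if pmid is not None:
                yield pmid, title, abstract, annotations
            pmid, title, abstract, annotations = None, "", "", []
            continue
        head = line.split("|", 2)
        if len(head) == 3 and head[1] in ("t", "a") and "\t" not in head[0]:
            # Title or abstract line.
            pmid = head[0]
            if head[1] == "t":
                title = head[2]
            else:
                abstract = head[2]
            continue
        fields = line.split("\t")
        if len(fields) >= 5:
            # Annotation line; the concept ID column may be absent.
            concept_id = fields[5] if len(fields) > 5 else ""
            annotations.append(
                Annotation(fields[0], int(fields[1]), int(fields[2]),
                           fields[3], fields[4], concept_id)
            )
    if pmid is not None:
        # The last document may not be followed by a blank line.
        yield pmid, title, abstract, annotations


# Made-up sample record, for illustration only.
sample = [
    "1234|t|Aspirin and headache\n",
    "1234|a|Aspirin relieves headache.\n",
    "1234\t0\t7\tAspirin\tChemical\tMESH:D001241\n",
    "1234\t12\t20\theadache\tDisease\tMESH:D006261\n",
    "\n",
]

for doc_pmid, doc_title, doc_abstract, doc_annotations in parse_pubtator(sample):
    print(doc_pmid, doc_title, len(doc_annotations))
```

Because the function is a generator, it can stream over the raw snapshot file line by line without holding the whole file in memory.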
