Software mention extraction and linking from scientific articles

chanzuckerberg/software-mention-extraction

Software mention extraction and linking from scientific articles 💾

Much of cutting-edge science is built on scientific software, which often makes that software as important as the traditional scholarly literature. Despite this, software is not always treated as such, especially when it comes to funding, credit, and citations. Moreover, with the ever-growing number of open-source tools, many researchers find it impossible to keep track of the tools, databases, and methods in a specific field.

To help automate the identification and crediting of relevant and essential software in the biomedical domain, we developed a machine learning model that extracts software mentions from scientific articles. The model takes text from a scientific article as input and outputs a list of the software mentioned in it.

We applied this model to the CORD-19 full-text articles and stored the output in CORD-19 Software Mentions. Cite as: Wade, Alex D., & Williams, Ivana. (2021). CORD-19 Software Mentions [Data set]. https://doi.org/10.5061/dryad.vmcvdncs0

Getting started

Dependencies

Python 3.7+ (tested on 3.7.4)
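Python packages: pandas, numpy, keras, torch, nltk, sklearn (installed as scikit-learn), transformers, seqeval, tqdm, blink, bs4 (installed as beautifulsoup4)
Standard library modules (no installation needed): os, json, time, argparse, re, itertools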

Training

Data

Softcite:

  • Softcite data repository

  • Full corpus (downloaded on February 8, 2021)

  • Download the XML file above and place it in the ./data folder.

  • XML file processing notebook: ./notebooks/Parse softcite data.ipynb

    • Input: ./data/softcite_corpus-full.tei.xml

    • Output: ./data/labeled_dfs_all.csv
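The parsing step above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes the corpus uses the standard TEI namespace and marks software names with `<rs type="software">` elements, and it assumes a whitespace tokenizer with BIO tags. The function names `label_paragraph` and `parse_tei` are hypothetical, and the real columns of labeled_dfs_all.csv may differ.

```python
import xml.etree.ElementTree as ET

# Assumption: the corpus uses the standard TEI namespace.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def label_paragraph(p):
    """Return (token, tag) pairs for one <p> element, using BIO tags."""
    rows = []

    def add(text, inside_software):
        # Whitespace tokenization is a simplification for illustration.
        for i, token in enumerate((text or "").split()):
            if not inside_software:
                rows.append((token, "O"))
            else:
                rows.append((token, "B-software" if i == 0 else "I-software"))

    add(p.text, False)  # text before the first child element
    for child in p:
        # Assumption: software names are annotated as <rs type="software">.
        is_software = child.tag == TEI_NS + "rs" and child.get("type") == "software"
        add("".join(child.itertext()), is_software)
        add(child.tail, False)  # text between this child and the next
    return rows


def parse_tei(xml_text):
    """Return one list of (token, tag) pairs per <p> element in the document."""
    root = ET.fromstring(xml_text)
    return [label_paragraph(p) for p in root.iter(TEI_NS + "p")]
```

The resulting rows could then be loaded into a pandas DataFrame and written to CSV; that step is omitted here because the notebook's output schema is not described above.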


Model

Performance:


Inference

Software Mentions

  • Download the pretrained model 'scibert_software_sent' from s3://meta-prod-ds-storage/software_mentions_extraction/models and place it in the ./models/ folder.

  • Example of how to run the model in inference mode: ./scripts/Software mentions inference mode.ipynb

  • Example:
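As a rough illustration of the last step of inference (this is a sketch, not the repository's own example), the code below turns per-token predictions into a list of software mentions. It assumes the model emits BIO tags such as `B-software` and `I-software`, which is the tagging scheme seqeval evaluates; the model's actual label names are not documented here, and `extract_mentions` is a hypothetical helper.

```python
def extract_mentions(tokens, tags):
    """Group BIO-tagged tokens into mention strings.

    Assumes tags of the form "B-<type>", "I-<type>", or "O".
    """
    mentions, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new mention starts; close any mention still open.
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            # Continue the mention that is currently open.
            current.append(token)
        else:
            # "O", or an "I-" tag with no open mention (ignored here).
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


tokens = ["We", "used", "ImageJ", "and", "GraphPad", "Prism", "."]
tags = ["O", "O", "B-software", "O", "B-software", "I-software", "O"]
print(extract_mentions(tokens, tags))  # ['ImageJ', 'GraphPad Prism']
```

Dropping an `I-` tag that has no preceding `B-` tag is one possible policy; treating it as the start of a new mention would be an equally reasonable choice.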

Wikipedia Linking

  • This model is based on BLINK

  • Follow the instructions in the BLINK GitHub repository to install it and download the relevant models.

  • Example of how to run the model in inference mode: ./scripts/Link text to wikipedia.ipynb

  • Example:
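As a hedged sketch of how extracted mentions might be prepared for linking (not the repository's own example), the helper below builds records with the keys that BLINK's README uses for its input: `id`, `label`, `label_id`, `context_left`, `mention`, and `context_right`, all lowercased. The function name `build_blink_inputs` is hypothetical, and locating each mention with a simple first-occurrence search is an assumption made for brevity.

```python
def build_blink_inputs(text, mentions):
    """Build BLINK-style input records for each mention found in text."""
    records = []
    for mention in mentions:
        start = text.find(mention)  # first occurrence only (simplification)
        if start == -1:
            continue  # skip mentions that do not appear verbatim in the text
        end = start + len(mention)
        records.append({
            "id": len(records),
            "label": "unknown",  # placeholder, as the gold entity is unknown
            "label_id": -1,
            "context_left": text[:start].lower(),
            "mention": mention.lower(),
            "context_right": text[end:].lower(),
        })
    return records
```

According to BLINK's README at the time of writing, a list like this is passed as `test_data` to `blink.main_dense.run` after the models have been loaded; check the BLINK repository for the current interface before relying on it.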


Extract mentions of software from the CORD-19 dataset 🦠 📚

  • CORD-19 data: More information and download instructions: here
  • Save to the ./data/ folder
  • Run notebook: ./scripts/Software mentions CORD19.ipynb
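Before the model can be applied, the CORD-19 full-text files have to be read paragraph by paragraph. A minimal sketch of that step is shown below, assuming the full-text articles are stored as JSON files with `paper_id` and a `body_text` list whose entries carry `text` and `section` fields, as in the published CORD-19 JSON schema. The function name `iter_cord19_paragraphs` and the recursive folder layout are assumptions, not taken from the notebook.

```python
import json
from pathlib import Path


def iter_cord19_paragraphs(folder):
    """Yield (paper_id, section, text) for every body paragraph under folder."""
    # Assumption: full-text articles are *.json files somewhere below folder.
    for path in sorted(Path(folder).glob("**/*.json")):
        with open(path, encoding="utf-8") as handle:
            paper = json.load(handle)
        for paragraph in paper.get("body_text", []):
            yield (
                paper.get("paper_id"),
                paragraph.get("section", ""),
                paragraph.get("text", ""),
            )
```

Each yielded paragraph can then be fed to the software mention model; using a generator keeps memory use low when iterating over a large collection.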

Contributing

Contributions and ideas are welcome! Please see our contributing guide and don't hesitate to open an issue or send a pull request to improve the functionality of this project.

Code of Conduct

This project adheres to the Contributor Covenant code of conduct. By participating, you are expected to uphold this code. Please report unacceptable behavior to opensource@chanzuckerberg.com.

License

MIT
