Add Dockerfile and fix requirements #171

Open · wants to merge 2 commits into master

Conversation

leoramme

Hi!

I recently tried to reproduce the NER results, but setting up the environment was difficult. The requirement constraints no longer work: any numpy version above 1.20 raises an error when fine-tuning the model for NER, and the pinned pandas version is incompatible with the other packages. The requirements should also depend on scikit-learn rather than sklearn, since sklearn is only a dummy PyPI package that installs the latest scikit-learn release.
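
For reference, a pinned set along these lines works with the TF 1.15.5 image; the exact versions below are illustrative assumptions, not necessarily the pins in this PR:

```
tensorflow-gpu==1.15.5
numpy==1.19.5         # versions above 1.20 break NER fine-tuning
pandas==1.1.5         # last release compatible with Python 3.6
scikit-learn==0.24.2  # real package; "sklearn" is only a dummy wrapper
```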

Because of that, I updated the requirements and created a Dockerfile. Unfortunately, since the model weights are hosted on Google Drive, I couldn't automate their download inside the Dockerfile.
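
As a sketch of what such a Dockerfile can look like (assuming the pinned requirements above; this is illustrative, not necessarily the exact file in this PR):

```dockerfile
# Minimal sketch; the Dockerfile in this PR may differ in details.
FROM tensorflow/tensorflow:1.15.5-gpu-py3-jupyter

WORKDIR /biobert

# Install pinned requirements first so this layer is cached across rebuilds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the repository, including the manually downloaded model weights
# (they are hosted on Google Drive, so the download cannot be automated here).
COPY . .
```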

With both of these modifications, the NER results can be easily reproduced:

  1. Pull the official tensorflow-gpu image with `docker pull tensorflow/tensorflow:1.15.5-gpu-py3-jupyter`
  2. Download and extract BioBERT-Base v1.1 (+ PubMed 1M) inside the biobert repo:

The directory structure should look like this:

biobert/
├── biobert_v1.1_pubmed
│   ├── bert_config.json
│   ├── model.ckpt-1000000.data-00000-of-00001
│   ├── model.ckpt-1000000.index
│   ├── model.ckpt-1000000.meta
│   └── vocab.txt
├── biocodes
│   ├── [...]
├── create_pretraining_data.py
├── Dockerfile
├── download.sh
├── extract_features.py
├── figs
│   └── biobert_overview.png
├── __init__.py
├── LICENSE
├── modeling.py
├── modeling_test.py
├── optimization.py
├── optimization_test.py
├── README.md
├── requirements.txt
├── run_classifier.py
├── run_ner.py
├── run_pretraining.py
├── run_qa.py
├── run_re.py
├── sample_text.txt
├── tf_metrics.py
├── tokenization.py
└── tokenization_test.py
  3. To build the image, run `docker build -t biobert .`
  4. To start a container in interactive mode, run `docker run --gpus all -it biobert /bin/bash` (drop `--gpus all` to use the CPU instead of the GPU)
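
Put together, the host-side setup is only a handful of commands; the archive name below is an assumption, and the Google Drive download itself stays manual:

```bash
# Run from the root of the biobert repository checkout.
docker pull tensorflow/tensorflow:1.15.5-gpu-py3-jupyter

# Download BioBERT-Base v1.1 (+ PubMed 1M) from Google Drive by hand,
# then extract it into the repo root (archive name may differ):
tar -xzf biobert_v1.1_pubmed.tar.gz

docker build -t biobert .
docker run --gpus all -it biobert /bin/bash  # drop --gpus all for CPU-only
```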

In interactive mode, you can use run_ner.py and biocodes/ner_detokenize.py without problems. I figured this might be useful to anyone else who wants to reproduce the results or develop something on top of BioBERT.
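
For instance, an NER fine-tuning run inside the container follows the BERT-style flags from the BioBERT README; the paths and the dataset location below are placeholders:

```bash
# Inside the container; paths are illustrative and assume the layout above.
export BIOBERT_DIR=/biobert/biobert_v1.1_pubmed
export NER_DIR=/biobert/datasets/NER/NCBI-disease  # placeholder dataset dir

python run_ner.py \
  --do_train=true \
  --do_eval=true \
  --vocab_file=$BIOBERT_DIR/vocab.txt \
  --bert_config_file=$BIOBERT_DIR/bert_config.json \
  --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 \
  --num_train_epochs=10.0 \
  --data_dir=$NER_DIR \
  --output_dir=/tmp/bioner
```

The token-level predictions written to the output directory can then be converted back to CoNLL format with biocodes/ner_detokenize.py for scoring, as described in the BioBERT README.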
