
Relation Extraction model (our participation in the BioCreative 7 DrugProt challenge)

In this repository, we provide the source code and resources from our participation in the BioCreative 7 DrugProt challenge.


Requirements

The code was tested with Python v3.7.2 and the following libraries.

torch>=1.3.1
transformers==4.9.0
datasets==1.8.0

Our main script, run_re_hfv4.py, is based on the example code in the Hugging Face Transformers repository, modified for our pre-processing style and use case.
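
To set up an environment with these versions, a pip-based installation along the following lines should work (a minimal sketch; installing with pip in a virtual environment is our suggestion and not part of the original instructions):

python -m venv biore-env           # optional: create an isolated environment
source biore-env/bin/activate
pip install "torch>=1.3.1" "transformers==4.9.0" "datasets==1.8.0"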

Pre-processed datasets

Please download the pre-processed datasets from here

The compressed file contains the pre-processed train, development, and test datasets.

NOTE: To evaluate the model predictions on the development set:

Please replace test-mapping.tsv (6.9M) and test.tsv (67M) with the files from the dev_named_as_test folder (dev_named_as_test/test-mapping.tsv (399K) and dev_named_as_test/test.tsv (3.9M)). These two files are the development dataset, pre-processed and renamed in the format of the test dataset, so that predictions can be made without modifying run_re_hfv4.py.

The pre-processed test.tsv should have 238,624 lines, whereas the development data dev.tsv (which is identical in content to dev_named_as_test/test.tsv) should have 13,480 lines (measured with wc -l). The corresponding test-mapping.tsv files should have the same number of lines.
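
For example, assuming the archive has been extracted into <DATA PATH>/format_dt (the same directory used as $RE_DIR in the training script below) and that the dev_named_as_test folder sits alongside the test files, the swap and a quick line-count check can be done as follows (a sketch; adjust paths to your layout):

cd <DATA PATH>/format_dt
# Back up the original test-set files, then swap in the development set renamed as test.
mv test.tsv test.tsv.orig
mv test-mapping.tsv test-mapping.tsv.orig
cp dev_named_as_test/test.tsv test.tsv
cp dev_named_as_test/test-mapping.tsv test-mapping.tsv
# Sanity check: both files should now have 13,480 lines.
wc -l test.tsv test-mapping.tsv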

How to train the model / make predictions using a trained model

First, train your model using run_re_hfv4.py.

For example, the following Linux bash script will produce prediction results and checkpoints in $OUTPUT_DIR.

export SEED=0
export CASE_NUM=`printf %02d $SEED`  # zero-padded seed; used in the output directory name and passed as --seed

export LM_FULL_NAME=<LM PATH or HF Transformer name/url>
export SEQ_LEN=192
export BATCH_SIZE=16  # 16 with LR 2e-5; 32 with LR 5e-5
export LEARN_RATE=2e-5
export EPOCHS=40
export RE_DIR=<DATA PATH>/format_dt
export OUTPUT_DIR=<OUTPUT PATH>/bs-${BATCH_SIZE}_seqLen-${SEQ_LEN}_lr-${LEARN_RATE}_${EPOCHS}epoch_iter-$CASE_NUM
mkdir $OUTPUT_DIR
echo $OUTPUT_DIR

export TASK_NAME=bc7dp
export CUDA_VISIBLE_DEVICES=0 

python run_re_hfv4.py \
  --model_name_or_path ${LM_FULL_NAME} \
  --task_name $TASK_NAME \
  --do_train --do_eval --do_predict \
  --train_file $RE_DIR/train.tsv --validation_file $RE_DIR/dev.tsv --test_file $RE_DIR/test.tsv \
  --typeDict_file $RE_DIR/typeDict.json --vocab_add_file $RE_DIR/vocab_add.txt \
  --max_seq_length $SEQ_LEN \
  --per_device_train_batch_size $BATCH_SIZE \
  --per_device_eval_batch_size 512 \
  --learning_rate ${LEARN_RATE} \
  --num_train_epochs ${EPOCHS} --warmup_ratio 0.1 \
  --output_dir $OUTPUT_DIR/ \
  --logging_steps 2000 --eval_steps 2000 --save_steps 10000 \
  --seed $CASE_NUM
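
Since the output directory and random seed are parameterized by SEED and CASE_NUM, repeating the run with several seeds can be scripted with a simple loop (a sketch under the same variable settings as above; running multiple seeds is optional):

for SEED in 0 1 2; do
  export CASE_NUM=`printf %02d $SEED`
  export OUTPUT_DIR=<OUTPUT PATH>/bs-${BATCH_SIZE}_seqLen-${SEQ_LEN}_lr-${LEARN_RATE}_${EPOCHS}epoch_iter-$CASE_NUM
  mkdir $OUTPUT_DIR
  # Run the same python run_re_hfv4.py command as above; --seed $CASE_NUM and
  # --output_dir $OUTPUT_DIR/ automatically pick up the per-seed values.
done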

To convert the output predictions into the input format of the DrugProt evaluation library, use our transform_reTorch2bc7dp.py.

python transform_reTorch2bc7dp.py --task=bc7dp \
 --output_path=$OUTPUT_DIR/predict_results_bc7dp.txt \
 --bc7format_out_path=$OUTPUT_DIR/pred_relations.tsv \
 --mapping_path=$RE_DIR/test-mapping.tsv \
 --label_path=$RE_DIR/typeDict.json

This will generate the transformed output file $OUTPUT_DIR/pred_relations.tsv. Use it to evaluate your trained model.
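
As a quick sanity check before evaluation, you can inspect the transformed file (a trivial sketch; the file is a tab-separated relations list in the format expected by the evaluation library):

head -n 5 $OUTPUT_DIR/pred_relations.tsv   # inspect the first few predicted relations
wc -l $OUTPUT_DIR/pred_relations.tsv       # count the number of predicted relations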

The following bash script is an example of evaluating the model predictions on the development dataset (please read the "NOTE" in the "Pre-processed datasets" section). Please download the development dataset from the official DrugProt website and unzip it into $DEV_DIR.

export DRUGPROT_EVAL_LIB=${HOME}/github
export DEV_DIR=<PATH TO DEV FILES>/drugprot-gs-training-development/development

python ${DRUGPROT_EVAL_LIB}/drugprot-evaluation-library/src/main.py \
 -g $DEV_DIR/drugprot_development_relations.tsv \
 -p $OUTPUT_DIR/pred_relations.tsv \
 -e $DEV_DIR/drugprot_development_entities.tsv \
 --pmids $DEV_DIR/pmids.txt 2>&1 | tee -a $OUTPUT_DIR/BC7DP_pyTorch_score_2021Aug_total.log
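
The evaluation script above assumes that the DrugProt evaluation library is checked out under ${DRUGPROT_EVAL_LIB}. If you have not set it up yet, something along the following lines should work (a sketch; the clone URL is left as a placeholder, use the repository provided by the DrugProt organizers):

mkdir -p ${HOME}/github
cd ${HOME}/github
git clone <URL of drugprot-evaluation-library>   # official evaluation library from the DrugProt organizers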

For the predicted large-track data

Database of automatic predictions of Drug-Protein relations: please check here


Citation info

Our main paper, entitled Biomedical relation extraction with knowledge base refined weak-supervision, is under review at the DATABASE journal (BioCreative special issue).

Until the main paper is available as a journal article, please cite our short technical description, which has been accepted and included in the BioCreative VII workshop proceedings.

@inproceedings{yoon2021using,
  title={Using knowledge base to refine data augmentation for biomedical relation extraction},
  author={Yoon, Wonjin and Yi, Sean and Jackson, Richard and Kim, Hyunjae and Kim, Sunkyu and Kang, Jaewoo},
  booktitle={Proceedings of the BioCreative VII challenge evaluation workshop},
  pages={31--35},
  year={2021}
}

We will update the citation information once the journal article is published.


For inquiries, please contact wjyoon (_at_) korea.ac.kr.
