Code repository for Type-supervised sequence labeling based on the heterogeneous star graph for named entity recognition

Rosenberg37/GraphNER

Code for "Type-supervised sequence labeling based on the heterogeneous star graph for named entity recognition". For details, see the paper (arXiv:2210.10240).

Setup

Requirements

Create the conda environment as follows:

conda create --name GraphNER python=3.9.13
conda activate GraphNER
pip install -r requirements.txt

Alternatively, import the conda environment directly on Windows:

conda env create -f windows.yaml

or on Linux:

conda env create -f linux.yaml

Datasets

Original source of datasets:

You can download our processed datasets from here.

Data format:

{
  "tokens": [
    "IL-2",
    "gene",
    "expression",
    "and",
    "NF-kappa",
    "B",
    "activation",
    "through",
    "CD28",
    "requires",
    "reactive",
    "oxygen",
    "production",
    "by",
    "5-lipoxygenase",
    "."
  ],
  "entities": [
    {
      "start": 14,
      "end": 15,
      "type": "protein"
    },
    {
      "start": 4,
      "end": 6,
      "type": "protein"
    },
    {
      "start": 0,
      "end": 2,
      "type": "DNA"
    },
    {
      "start": 8,
      "end": 9,
      "type": "protein"
    }
  ],
  "relations": {},
  "org_id": "ge/train/0001",
  "pos": [
    "PROPN",
    "NOUN",
    "NOUN",
    "CCONJ",
    "PROPN",
    "PROPN",
    "NOUN",
    "ADP",
    "PROPN",
    "VERB",
    "ADJ",
    "NOUN",
    "NOUN",
    "ADP",
    "NUM",
    "."
  ],
  "ltokens": [],
  "rtokens": []
}

The ltokens field contains the tokens of the previous sentence, and the rtokens field contains the tokens of the next sentence.
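The span convention is not stated explicitly, but the sample above is consistent with `start` being inclusive and `end` exclusive over `tokens` (for example, `start: 4, end: 6` covers "NF-kappa B"). The following is a minimal sketch under that assumption; `entity_spans` is a hypothetical helper for illustration and is not part of this repository.

```python
def entity_spans(record):
    """Return (surface text, type) pairs for one record.

    Assumes `start` is inclusive and `end` is exclusive, which matches
    the sample record but is not confirmed by the repository.
    """
    tokens = record["tokens"]
    ordered = sorted(record["entities"], key=lambda e: e["start"])
    return [(" ".join(tokens[e["start"]:e["end"]]), e["type"]) for e in ordered]


# The sample record from above, reduced to the fields used here.
record = {
    "tokens": [
        "IL-2", "gene", "expression", "and", "NF-kappa", "B", "activation",
        "through", "CD28", "requires", "reactive", "oxygen", "production",
        "by", "5-lipoxygenase", ".",
    ],
    "entities": [
        {"start": 14, "end": 15, "type": "protein"},
        {"start": 4, "end": 6, "type": "protein"},
        {"start": 0, "end": 2, "type": "DNA"},
        {"start": 8, "end": 9, "type": "protein"},
    ],
}

for text, entity_type in entity_spans(record):
    print(f"{entity_type}: {text}")
```

Sorting by `start` is only for readable output; the entities in the data file are not stored in sentence order.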

Word vectors

The word vectors used (Chinese word2vec, GloVe and Bio-word2vec) can be downloaded from here.

Run

You can run the experiment on the GENIA dataset as follows:

python main.py --dataset_name=genia --evaluate=test --concat --pretrain_select=dmis-lab/biobert-base-cased-v1.2 --word2vec_select=bio --batch_size=4 --epochs=5 --max_length=128 --pos_dim=50 --char_dim=50

You can run the experiment on the weiboNER dataset as follows:

python main.py --dataset_name=weiboNER --evaluate=dev --evaluate=test --pretrain_select=bert-base-chinese --word2vec_select=chinese --batch_size=4 --epochs=5 --max_length=64

You can run the experiment on the CoNLL-2003 dataset as follows:

python main.py --dataset_name=conll2003 --evaluate=test --concat --pretrain_select=bert-base-cased --word2vec_select=glove --batch_size=4 --epochs=5 --max_length=128 --pos_dim=50 --char_dim=50
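The three commands above share most of their flags. As a convenience, they can be assembled from a small settings table. This is only a sketch: `EXPERIMENTS`, `COMMON` and `build_command` are hypothetical names, the flag values are copied from the commands above, and it assumes that `main.py` does not depend on the order of its flags.

```python
import shlex

# Flags shared by all three commands above.
COMMON = {"batch_size": 4, "epochs": 5}

# Per-dataset flags, copied from the commands above.
EXPERIMENTS = {
    "genia": {
        "evaluate": ["test"],
        "concat": True,
        "pretrain_select": "dmis-lab/biobert-base-cased-v1.2",
        "word2vec_select": "bio",
        "max_length": 128,
        "pos_dim": 50,
        "char_dim": 50,
    },
    "weiboNER": {
        "evaluate": ["dev", "test"],
        "pretrain_select": "bert-base-chinese",
        "word2vec_select": "chinese",
        "max_length": 64,
    },
    "conll2003": {
        "evaluate": ["test"],
        "concat": True,
        "pretrain_select": "bert-base-cased",
        "word2vec_select": "glove",
        "max_length": 128,
        "pos_dim": 50,
        "char_dim": 50,
    },
}


def build_command(dataset_name):
    """Build the argument list for one experiment."""
    options = {**EXPERIMENTS[dataset_name], **COMMON}
    args = ["python", "main.py", f"--dataset_name={dataset_name}"]
    for key, value in options.items():
        if value is True:
            args.append(f"--{key}")  # bare switch such as --concat
        elif isinstance(value, list):
            args.extend(f"--{key}={item}" for item in value)  # repeated flag
        else:
            args.append(f"--{key}={value}")
    return args


for name in EXPERIMENTS:
    print(shlex.join(build_command(name)))
```

The printed lines can be pasted into a shell, or the list returned by `build_command` can be passed to `subprocess.run` from the repository root.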

Reference

If you have any questions about the code, the paper or the copyright, please email wenxr2119@mails.jlu.edu.cn. We would appreciate it if you cite our paper as follows:

@article{wen2022graph,
  title={Type-supervised sequence labeling based on the heterogeneous star graph for named entity recognition},
  author={Xueru Wen and Changjiang Zhou and Haotian Tang and Luguang Liang and Yu Jiang and Hong Qi},
  journal={arXiv preprint arXiv:2210.10240},
  year={2022}
}
