Multilingual Named Entity Recognition


Content

A multilingual named entity classifier that performs named entity recognition (NER) on two datasets: CoNLL 2003 (English) and Weibo (Chinese). We tested the current state-of-the-art model on the CoNLL++ dataset and achieved an F1-score of 94.3% with pooled embeddings. Without pooled embeddings or CrossWeigh, and training for at most 50 epochs instead of 150, we reach a micro F1-score of 93.5%, within 0.7 percentage points of the SOTA.

  • Data: We used the CoNLL 2003 dataset (train, dev) combined with a manually corrected (improved/cleaned) test set from the CrossWeigh paper, called CoNLL++, for the English corpus, and the Weibo dataset for the Chinese corpus. We then removed stop words and tokenized with the BERT tokenizer.
  • Results: Testing the current state-of-the-art model on the CoNLL++ dataset, we achieved an F1-score of 94.3% with pooled embeddings. Without pooled embeddings or CrossWeigh, and training for at most 50 epochs instead of 150, we reach a micro F1-score of 93.5%, within 0.7 percentage points of the SOTA.

The notebook is structured as follows:

  • Setting up the GPU Environment
  • Getting Data
  • Training and Testing the Model
  • Using the Model (Running Inference)
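
As a minimal sketch of the first step, setting up the GPU environment (assuming PyTorch and Flair are installed; the device index is illustrative):

```python
import torch
import flair

# Use the first CUDA device if one is available, otherwise fall back to CPU.
flair.device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Flair will run on: {flair.device}")
```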

Task Description

Named-entity recognition is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.


Literature review: models that can do NER

(figure: overview of models from the literature that can perform NER)

Dataset

  • Chinese corpus: flair.datasets.NER_CHINESE_WEIBO
  • English corpus: CoNLL++

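A sketch of how the two corpora can be loaded with Flair (NER_CHINESE_WEIBO is a built-in loader; for CoNLL++ we assume the column files were downloaded manually, so the paths and file names below are placeholders):

```python
from flair.datasets import NER_CHINESE_WEIBO, ColumnCorpus

# Chinese corpus: built-in Flair loader for the Weibo NER dataset.
weibo_corpus = NER_CHINESE_WEIBO()

# English corpus: CoNLL 2003 train/dev plus the corrected CoNLL++ test set.
columns = {0: "text", 1: "pos", 2: "chunk", 3: "ner"}
english_corpus = ColumnCorpus(
    "data/conllpp",          # placeholder path to the manually downloaded files
    columns,
    train_file="train.txt",
    dev_file="dev.txt",
    test_file="conllpp_test.txt",
)
print(english_corpus)
```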

Data prep

Since our dataset is already relatively clean, we only removed stop words and tokenized with the BERT tokenizer.
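
A minimal sketch of that preprocessing, assuming the Hugging Face transformers and NLTK packages (the checkpoint name and stop-word list are illustrative):

```python
from nltk.corpus import stopwords      # requires: nltk.download("stopwords")
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")  # illustrative checkpoint
stop_words = set(stopwords.words("english"))

def preprocess(sentence: str) -> list[str]:
    """Drop stop words, then tokenize with the BERT tokenizer."""
    kept = [word for word in sentence.split() if word.lower() not in stop_words]
    return tokenizer.tokenize(" ".join(kept))

print(preprocess("George Washington went to Washington"))
```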


The ground truth data is in the format shown below: in English, each word token is assigned an NER tag; in Chinese, each character is assigned an NER tag.

(figure: ground-truth tagging format for the English and Chinese datasets)
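
For reference, the English data follows the standard CoNLL 2003 column layout (word, POS tag, chunk tag, NER tag), one token per line:

```
U.N.      NNP  I-NP  I-ORG
official  NN   I-NP  O
Ekeus     NNP  I-NP  I-PER
heads     VBZ  I-VP  O
for       IN   I-PP  O
Baghdad   NNP  I-NP  I-LOC
.         .    O     O
```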

Our Model: BiLSTM-CRF

We combined a bidirectional LSTM with a conditional random field, which gives better results than a BiLSTM alone.

A deep-learning NER pipeline usually has three steps:

  1. Get a representation for the input. We use pretrained BERT word embeddings combined with character-level embeddings.
  2. Encode the context of each token, here with a bidirectional LSTM.
  3. Decode the tag sequence, here with a CRF output layer.

In a sequence tagging task we have access to both past and future input features at a given time step, so a bidirectional LSTM network can efficiently use past features (via forward states) and future features (via backward states). For the output layer, there are two ways to use neighboring tag information when predicting the current tag. The first is to predict a distribution over tags at each time step and then use beam-search-like decoding to find the optimal tag sequence. The second is to model the whole sentence jointly rather than each position independently, which leads to Conditional Random Field (CRF) models.

(figure: BiLSTM-CRF architecture)
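
A sketch of the model definition and training in Flair, reusing the english_corpus from the loading sketch above (API shown for a recent Flair release; the checkpoint, hidden size, and output path are illustrative, and the pooled variant would swap in PooledFlairEmbeddings):

```python
from flair.embeddings import CharacterEmbeddings, StackedEmbeddings, TransformerWordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# 1. Input representation: BERT word embeddings stacked with character embeddings.
embeddings = StackedEmbeddings([
    TransformerWordEmbeddings("bert-base-cased"),  # illustrative checkpoint
    CharacterEmbeddings(),
])

# 2 + 3. BiLSTM context encoder with a CRF output layer.
tag_dictionary = english_corpus.make_label_dictionary(label_type="ner")
tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_crf=True,   # CRF decoding instead of a per-token softmax
    use_rnn=True,   # bidirectional LSTM encoder
)

# Train the lighter configuration: at most 50 epochs.
trainer = ModelTrainer(tagger, english_corpus)
trainer.train("resources/taggers/conllpp-ner", max_epochs=50)
```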

Result Evaluation

GloVe is an unsupervised learning algorithm for obtaining vector representations for words. Training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space.

Our model beats the GloVe baseline on precision, recall, and F1-score for the organization, location, person, and misc categories (misc covers names that do not fall into the other categories), with the single exception of recall for organization.

(figure: per-category precision/recall/F1 against the GloVe baseline)
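
In Flair, the baseline amounts to swapping the stacked BERT and character embeddings for plain GloVe vectors (a sketch):

```python
from flair.embeddings import WordEmbeddings

# Baseline representation: pre-trained GloVe vectors in place of BERT + characters.
glove_embeddings = WordEmbeddings("glove")
```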

Sample results for English Sentences

Here are some results for recognizing named entities in English sentences.

We chose deliberately confusing sentences, such as "George Washington went to Washington to study in University of Washington."

(figure: tagged English example sentences)
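
Inference with a trained Flair tagger looks roughly like this (the model path carries over from the training sketch above and is an assumption):

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Load the model saved by the training sketch above.
tagger = SequenceTagger.load("resources/taggers/conllpp-ner/final-model.pt")

sentence = Sentence("George Washington went to Washington to study in University of Washington .")
tagger.predict(sentence)

# Print every recognized entity span with its predicted label.
for entity in sentence.get_spans("ner"):
    print(entity)
```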

Sample results for Chinese Sentences

And here are some results for Chinese sentences.

(figure: tagged Chinese example sentences)
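
The Chinese tagger is used the same way; here we assume a model trained on the Weibo corpus was saved under resources/taggers/weibo-ner/ (path and example sentence are illustrative):

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Assumed path of a tagger trained on the Weibo corpus in the same way.
zh_tagger = SequenceTagger.load("resources/taggers/weibo-ner/final-model.pt")

# The Weibo data is character-level, so we pass space-separated characters and
# disable Flair's tokenizer. (Example: "Jack Ma founded Alibaba in Hangzhou.")
sentence = Sentence("马 云 在 杭 州 创 办 了 阿 里 巴 巴", use_tokenizer=False)
zh_tagger.predict(sentence)
for entity in sentence.get_spans("ner"):
    print(entity)
```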

Limitations

Since the CoNLL dataset we used is from 2003 and thus fairly outdated, we tried running inference on newer named entities that may be unseen in the training data. The results (shown below) are encouraging:

  1. Although Johnny Depp was already famous in 2003, Amber Heard was largely unknown at the time, yet our model recognizes her name.
  2. Similarly, Trump was not yet so prominent then, and Twitter did not even exist.
  3. The Boys is a fairly new TV series, first aired in 2019.
  4. Finally, a model trained on a 2003 dataset correctly picks up COVID-19.

In general, our model performs quite well at recognizing new named entities.

(figure: inference results on post-2003 named entities)

Challenges & Future Work

There are still failure cases. For example, when we ran Chinese inference on a sentence containing "Tesla", the model could not detect it. Future work therefore includes:

(figure: planned future work)
