COVID-19 Arabic Word Embeddings is a domain-specific, pre-trained distributed word representation of COVID-19 tweets that aims to provide the Arabic NLP research community with free-to-use, powerful word embedding models.

BatoolHamawi/COVID-19WordEmbeddings

COVID-19 Arabic Word Embeddings

We built word vector models from our entire COVID-19 dataset, collected from January 2020 to April 2020 (link). After removing retweets and duplicate tweets, we ended up with 2,821,940 tweets. We consider two well-known methods for generating word embeddings: word2vec and FastText. Because these pre-trained models are domain-specific (COVID-19), we expect them to be more accurate than generic pre-trained word embeddings on COVID-19-related AI tasks.
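The filtering step described above can be illustrated with a minimal sketch. It is not the pipeline used to build the dataset: the function name is hypothetical, and it assumes tweets arrive as plain strings, that retweets start with the conventional `RT @` prefix, and that duplicates are tweets whose text matches exactly after whitespace is collapsed.

```python
def drop_retweets_and_duplicates(tweets):
    """Keep the first copy of each original tweet and skip retweets.

    Assumptions (not confirmed by this repository): tweets are plain strings,
    and a retweet is any tweet whose text starts with "RT @".
    """
    seen = set()
    kept = []
    for text in tweets:
        normalized = " ".join(text.split())  # collapse runs of whitespace
        if normalized.startswith("RT @"):    # assumed retweet marker
            continue
        if normalized in seen:               # exact duplicate already kept
            continue
        seen.add(normalized)
        kept.append(normalized)
    return kept
```

Exact matching is the simplest possible choice; near-duplicates (for example, the same text with a different trailing URL) would need a looser comparison.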

COVID19 Twitter Word2Vec models

We release four embedding models

| Model | Vocabulary size | Vector size | Download |
|---|---|---|---|
| Word2Vec-Twitter-SkipGram | 262,715 | 200 | Download URL |
| Word2Vec-Twitter-CBOW | 262,715 | 200 | Download URL |
| Word2Vec-Twitter-SkipGram | 262,715 | 300 | Download URL |
| Word2Vec-Twitter-CBOW | 262,715 | 300 | Download URL |

COVID19 Twitter FastText models

We release two embedding models

| Model | Vocabulary size | Vector size | Download |
|---|---|---|---|
| FastText-Twitter-SkipGram | 262,715 | 200 | Download URL |
| FastText-Twitter-SkipGram | 262,715 | 300 | Download URL |

Word Embeddings in 2D using T-SNE

Below are T-SNE visualisations of the word embeddings in 2D. They were produced with the "Embedding Projector" using 3,500 iterations and a perplexity of 15.

  • T-SNE visualisation of the model trained with Continuous Bag of Words (CBOW), dimension 300 (figure: Tsne_CBOW300)

  • T-SNE visualisation of the FastText Skip-Gram model, dimension 300 (figure: Tsne_modelSkipGramFast300)
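To reproduce a similar view, the Embedding Projector can load a tab-separated file of vectors plus a matching metadata file of labels. The helper below is a minimal sketch of that export, not the procedure used for the figures above; the function name `export_for_projector` and the default file names are hypothetical.

```python
def export_for_projector(vectors, vec_path="vectors.tsv", meta_path="metadata.tsv"):
    """Write word vectors as two TSV files: one vector per line, one label per line.

    `vectors` is a mapping, or an iterable of pairs, of word -> sequence of numbers.
    Returns the number of words written.
    """
    items = vectors.items() if hasattr(vectors, "items") else vectors
    count = 0
    with open(vec_path, "w", encoding="utf-8") as vec_file, \
         open(meta_path, "w", encoding="utf-8") as meta_file:
        for word, vector in items:
            # Line i of the vectors file corresponds to line i of the metadata file.
            vec_file.write("\t".join(str(float(x)) for x in vector) + "\n")
            meta_file.write(word + "\n")
            count += 1
    return count
```

With a loaded gensim model, the pairs could be built from `model.wv`, for example by iterating over its vocabulary list (`index2word` in gensim 3.x, `index_to_key` in gensim 4.x) and looking up each word's vector. Exporting only the most frequent few thousand words keeps the projection responsive.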


How to use

These models were built using the gensim Python library. To load and use one of the models, first install gensim and nltk:

  • install gensim >= 3.4 and nltk >= 3.2 using either pip or conda

pip install gensim nltk

conda install gensim nltk
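Once the packages are installed, loading a model and querying it might look like the sketch below. It assumes the downloaded files are in gensim's native format (as written by `Word2Vec.save` or `FastText.save`); this is not confirmed here, so check the download before relying on it. The helper names, the file name in the commented usage, and the query word are placeholders.

```python
def load_model(path, fasttext=False):
    """Load a model saved in gensim's native format (an assumption about the files)."""
    # Imported here so the helpers below can be used without gensim installed.
    from gensim.models import FastText, Word2Vec
    model_class = FastText if fasttext else Word2Vec
    return model_class.load(path)


def nearest_words(model, word, topn=10):
    """Return (word, cosine similarity) pairs, or [] if the word is unknown."""
    vectors = model.wv  # KeyedVectors holding the trained embeddings
    if word not in vectors:
        return []
    return vectors.most_similar(word, topn=topn)


# Example usage (hypothetical file name; adjust to the file you downloaded):
# model = load_model("Word2Vec-Twitter-SkipGram-300")
# for token, score in nearest_words(model, "كورونا", topn=5):
#     print(token, round(score, 3))
```

The membership check avoids the `KeyError` that a plain word2vec model raises for out-of-vocabulary words. FastText models can often still return a vector for unseen words by composing character n-grams.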


References

If you use these models, please cite this work using the following BibTeX entries:

@inproceedings{hamoui2020covid,
  title={COVID-19: What Are Arabic Tweeters Talking About?},
  author={Hamoui, Btool and Alashaikh, Abdulaziz and Alanazi, Eisa},
  booktitle={International Conference on Computational Data and Social Networks},
  pages={425--436},
  year={2020},
  organization={Springer}
}
@article{alqurashi2021eating,
  title={Eating Garlic Prevents COVID-19 Infection: Detecting Misinformation on the Arabic Content of Twitter},
  author={Alqurashi, Sarah and Hamoui, Btool and Alashaikh, Abdulaziz and Alhindi, Ahmad and Alanazi, Eisa},
  journal={arXiv preprint arXiv:2101.05626},
  year={2021}
}
