
Source code for the paper "Large-scale, Language-agnostic Discourse Classification of Tweets During COVID-19" (2020)



Timeline figure: https://github.com/ogencoglu/Language-agnostic_BERT_COVID19_Twitter/blob/master/media/timeline.png

This repository provides the full implementation in Python 3.7. A Twitter developer account is required for hydrating tweets.

Main Idea

Utilizing Language-agnostic BERT Sentence Embeddings (LaBSE) to analyze 28 million tweets in 109 languages related to COVID-19

Reproduction of Results

Follow steps 1-5 below.

1 - Get the Data

See directory_info in the data directory for the expected files.

1.1 - Download the 30+ million tweet IDs and hydrate them into timestamps and tweet texts (requires a Twitter developer account).

Jan 17,tweet_text_string
Jan 27,tweet_text_string
...

Once tweets.csv is in the example format above, preprocess by running:

python3.7 preprocess.py
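
Hydration itself is not handled by this repository. As one possible route (an assumption, not part of the repo), the tweet IDs can be hydrated with a tool such as twarc and then flattened into the tweets.csv layout shown above with a short script along these lines:

# hydrate_to_csv.py -- illustrative helper, not part of this repository.
# Assumes the IDs were already hydrated, e.g. `twarc hydrate tweet_ids.txt > hydrated.jsonl`.
import csv
import json
from datetime import datetime

with open("hydrated.jsonl") as src, open("data/tweets.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for line in src:
        tweet = json.loads(line)
        # Twitter API v1.1 timestamps look like "Fri Jan 17 10:20:30 +0000 2020"
        created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
        text = tweet.get("full_text", tweet.get("text", ""))
        writer.writerow([created.strftime("%b %d"), text.replace("\n", " ")])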

1.2 - Download Intent and Questions datasets

--Intent Dataset: Link
--Questions Dataset: Link

2 - Extract Tweet Embeddings

2.1 - BERT

python3.7 extract_BERT_embeddings.py -m intent
python3.7 extract_BERT_embeddings.py -m questions

2.2 - Language-agnostic BERT Sentence Embeddings (LaBSE)

python3.7 extract_LaBSE_embeddings.py -m tweets
python3.7 extract_LaBSE_embeddings.py -m intent
python3.7 extract_LaBSE_embeddings.py -m questions

Relevant configurations are defined in configs.py, e.g.:

--model_url 'https://tfhub.dev/google/LaBSE/1'
--max_seq_length 128
--bert_model 'bert-base-multilingual-uncased'
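
configs.py points the repository at the TF Hub LaBSE/1 model with a maximum sequence length of 128. Purely to illustrate the output of this step (not the repository's code path), the sentence-transformers port of LaBSE produces equivalent l2-normalized sentence vectors:

# Rough illustration only; the repository loads LaBSE from TF Hub as configured above.
# The sentence-transformers port used here is an assumption, not a repo dependency.
from sentence_transformers import SentenceTransformer

labse = SentenceTransformer("sentence-transformers/LaBSE")
tweets = ["Stay home, stay safe.", "Quédate en casa."]
embeddings = labse.encode(tweets, normalize_embeddings=True)   # shape (2, 768)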

3 - Cross-validation and Bayesian Hyperparameter Optimization

python3.7 train.py -m hyper_opt -c "model_identifier" -e "embeddings_identifier"
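
The actual search space, classifier, and fold definitions live in train.py and configs.py. The sketch below only illustrates the general pattern of Bayesian hyperparameter optimization over cross-validated scores; the hyperopt library, the logistic-regression model, and the file names are assumptions:

# Illustrative sketch of Bayesian hyperparameter search with cross-validation.
import numpy as np
from hyperopt import Trials, fmin, hp, tpe
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.load("labeled_embeddings.npy")   # hypothetical path to saved embeddings
y = np.load("labels.npy")               # hypothetical path to labels

def objective(params):
    clf = LogisticRegression(C=params["C"], max_iter=1000)
    # hyperopt minimizes, so return the negative mean CV accuracy
    return -cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

space = {"C": hp.loguniform("C", np.log(1e-3), np.log(1e2))}
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=Trials())
print(best)   # best-found hyperparameters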

4 - Train

python3.7 train.py -m train -c "model_identifier"

5 - Inference

python3.7 inference.py -c "model_identifier"
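
Steps 4 and 5 are implemented in train.py and inference.py. The sketch below is only a schematic of the overall train-then-predict flow on precomputed embeddings; the classifier choice and file names are assumptions:

# Schematic of steps 4-5; not the repository's train.py / inference.py.
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.load("labeled_embeddings.npy")      # hypothetical file names
y_train = np.load("labels.npy")
X_tweets = np.load("tweet_embeddings.npy")       # tweet embeddings from step 2

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
joblib.dump(clf, "model_identifier.joblib")      # persist the trained model

predictions = clf.predict(X_tweets)              # one discourse label per tweet
np.save("tweet_predictions.npy", predictions)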

source directory tree:

├── configs.py
├── extract_BERT_embeddings.py
├── extract_LaBSE_embeddings.py
├── inference.py
├── LaBSE.py
├── preprocess.py
├── train.py
├── umap_vis.py
└── utils.py
Citation

@article{gencoglu2020large,
  title={Large-scale, Language-agnostic Discourse Classification of Tweets During COVID-19},
  author={Gencoglu, Oguzhan},
  journal={Machine Learning and Knowledge Extraction},
  volume={2},
  number={4},
  pages={603--616},
  year={2020},
  doi={10.3390/make2040032}
}

Or

Gencoglu, Oguzhan. "Large-scale, Language-agnostic Discourse Classification of Tweets During COVID-19." Machine Learning and Knowledge Extraction. 2020; 2(4):603-616.