
Source code for the paper "Large-scale, Language-agnostic Discourse Classification of Tweets During COVID-19" (2020)



Timeline figure: https://github.com/ogencoglu/Language-agnostic_BERT_COVID19_Twitter/blob/master/media/timeline.png

This repository provides the full implementation in Python 3.7. A Twitter developer account is required for hydrating tweets.

Main Idea

Utilizing Language-agnostic BERT Sentence Embeddings (LaBSE) to analyze 28 million tweets in 109 languages related to COVID-19

Reproduction of Results

Follow steps 1-5 below.

1 - Get the Data

See directory_info in the data directory for the expected files.

1.1 - Download the 30+ million tweet IDs and hydrate them into timestamps and tweet texts (requires a Twitter developer account).

Jan 17,tweet_text_string
Jan 27,tweet_text_string
...

Once tweets.csv is in the example format above, preprocess by running:

python3.7 preprocess.py
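
Hydration itself is not handled by this repository. As one possible route (an assumption, not part of the repo), the tweet IDs can be hydrated with a tool such as twarc and then flattened into the tweets.csv layout shown above with a short script along these lines:

# hydrate_to_csv.py -- illustrative helper, not part of this repository.
# Assumes the IDs were already hydrated, e.g. `twarc hydrate tweet_ids.txt > hydrated.jsonl`.
import csv
import json
from datetime import datetime

with open("hydrated.jsonl") as src, open("data/tweets.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for line in src:
        tweet = json.loads(line)
        # Twitter API v1.1 timestamps look like "Fri Jan 17 10:20:30 +0000 2020"
        created = datetime.strptime(tweet["created_at"], "%a %b %d %H:%M:%S %z %Y")
        text = tweet.get("full_text", tweet.get("text", ""))
        writer.writerow([created.strftime("%b %d"), text.replace("\n", " ")])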

1.2 - Download Intent and Questions datasets

--Intent Dataset: Link
--Questions Dataset: Link

2 - Extract Tweet Embeddings

2.1 - BERT

python3.7 extract_BERT_embeddings.py -m intent
python3.7 extract_BERT_embeddings.py -m questions

2.2 - Language-agnostic BERT Sentence Embeddings (LaBSE)

python3.7 extract_LaBSE_embeddings.py -m tweets
python3.7 extract_LaBSE_embeddings.py -m intent
python3.7 extract_LaBSE_embeddings.py -m questions

Relevant configurations are defined in configs.py, e.g.:

--model_url 'https://tfhub.dev/google/LaBSE/1'
--max_seq_length 128
--bert_model 'bert-base-multilingual-uncased'
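
configs.py points the repository at the TF Hub LaBSE/1 model with a maximum sequence length of 128. Purely to illustrate the output of this step (not the repository's code path), the sentence-transformers port of LaBSE produces equivalent l2-normalized sentence vectors:

# Rough illustration only; the repository loads LaBSE from TF Hub as configured above.
# The sentence-transformers port used here is an assumption, not a repo dependency.
from sentence_transformers import SentenceTransformer

labse = SentenceTransformer("sentence-transformers/LaBSE")
tweets = ["Stay home, stay safe.", "Quédate en casa."]
embeddings = labse.encode(tweets, normalize_embeddings=True)   # shape (2, 768)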

3 - Cross-validation and Bayesian Hyperparameter Optimization

python3.7 train.py -m hyper_opt -c "model_identifier" -e "embeddings_identifier"
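
The actual search space, classifier, and fold definitions live in train.py and configs.py. The sketch below only illustrates the general pattern of Bayesian hyperparameter optimization over cross-validated scores; the hyperopt library, the logistic-regression model, and the file names are assumptions:

# Illustrative sketch of Bayesian hyperparameter search with cross-validation.
import numpy as np
from hyperopt import Trials, fmin, hp, tpe
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X = np.load("labeled_embeddings.npy")   # hypothetical path to saved embeddings
y = np.load("labels.npy")               # hypothetical path to labels

def objective(params):
    clf = LogisticRegression(C=params["C"], max_iter=1000)
    # hyperopt minimizes, so return the negative mean CV accuracy
    return -cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

space = {"C": hp.loguniform("C", np.log(1e-3), np.log(1e2))}
best = fmin(objective, space, algo=tpe.suggest, max_evals=50, trials=Trials())
print(best)   # best-found hyperparameters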

4 - Train

python3.7 train.py -m train -c "model_identifier"

5 - Inference

python3.7 inference.py -c "model_identifier"
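
Steps 4 and 5 are implemented in train.py and inference.py. The sketch below is only a schematic of the overall train-then-predict flow on precomputed embeddings; the classifier choice and file names are assumptions:

# Schematic of steps 4-5; not the repository's train.py / inference.py.
import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.load("labeled_embeddings.npy")      # hypothetical file names
y_train = np.load("labels.npy")
X_tweets = np.load("tweet_embeddings.npy")       # tweet embeddings from step 2

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
joblib.dump(clf, "model_identifier.joblib")      # persist the trained model

predictions = clf.predict(X_tweets)              # one discourse label per tweet
np.save("tweet_predictions.npy", predictions)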

source directory tree:

├── configs.py
├── extract_BERT_embeddings.py
├── extract_LaBSE_embeddings.py
├── inference.py
├── LaBSE.py
├── preprocess.py
├── train.py
├── umap_vis.py
└── utils.py
Citation

@article{gencoglu2020large,
  title={Large-scale, Language-agnostic Discourse Classification of Tweets During COVID-19},
  author={Gencoglu, Oguzhan},
  journal={Machine Learning and Knowledge Extraction},
  volume={2},
  number={4},
  pages={603--616},
  year={2020},
  doi={10.3390/make2040032}
}

Or

Gencoglu, Oguzhan. "Large-scale, Language-agnostic Discourse Classification of Tweets During COVID-19." Machine Learning and Knowledge Extraction. 2020; 2(4):603-616.