Contextual word embeddings and Bayesian clustering from transformer models

Python code and environment files to reproduce the process used in the paper "A new method for computational cultural cartography: From neural word embeddings to transformers and Bayesian mixture models".

The crs_corpus directory contains a 2% sample of the JSTOR corpus used in the paper, for each decade of publication. The sample is limited to 2% both to comply with the JSTOR terms of use and to stay within GitHub's 100 MB limit on individual files.

NOTE: This repository is not a software release and is intended only for reproducing the process. Very limited support is available (i.e., only fixes for bugs that we can reproduce with the data sample provided). The pipeline should work with any text corpus, but many of the steps would need considerable adaptation. Similarly, DistilBERT fine-tuning was performed on a GPU. That code may be adaptable for CPU, but the processing time for even the 2% sample would increase substantially.

It should also be possible to do a near-reproduction of the full paper results by making a request to JSTOR using the following parameters:

JSTOR Data for Research search URL: https://www.jstor.org/dfr/results?searchType=facetSearch&sd=1900&ed=&Query=democracy+OR+democr*+OR+autocracy+OR+autocr*+OR+authoritarianism+OR+authoritar*+OR+populism+OR+populis*&acc=dfr

OCR Full Text: Yes

Limit to these publication dates: 1900 to July 7th, 2020
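To make the request parameters easier to read, the search URL above can be decoded with the Python standard library. This is only an illustrative sketch: the function name decode_dfr_query is hypothetical, and the code just parses the URL string as given; it does not contact JSTOR.

```python
from urllib.parse import urlsplit, parse_qs

# The search URL exactly as listed in this README.
DFR_URL = (
    "https://www.jstor.org/dfr/results?searchType=facetSearch&sd=1900&ed="
    "&Query=democracy+OR+democr*+OR+autocracy+OR+autocr*+OR+authoritarianism"
    "+OR+authoritar*+OR+populism+OR+populis*&acc=dfr"
)


def decode_dfr_query(url):
    """Return (start_year, search_terms) decoded from a search URL string."""
    # keep_blank_values=True keeps the empty "ed" field in the result.
    params = parse_qs(urlsplit(url).query, keep_blank_values=True)
    start_year = params["sd"][0]
    # "+" decodes to a space, so the terms are separated by " OR ".
    terms = params["Query"][0].split(" OR ")
    return start_year, terms


start_year, terms = decode_dfr_query(DFR_URL)
print(start_year)
for term in terms:
    print(term)
```

The decoded list shows that each concept is searched both as a full word and as a wildcard stem (for example, democracy and democr*).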

The pipeline expects text files with one sentence per line and lowercase English alphabet characters only.
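For readers adapting another corpus, the input format described above might be produced along these lines. This is a minimal sketch, not the preprocessing used for the paper: the function name normalize_text is hypothetical, the sentence splitter is a naive regular expression, and it assumes words within a line are separated by single spaces.

```python
import re


def normalize_text(raw):
    """Split raw text into sentences and keep only lowercase a-z words."""
    # Naive split after ., ! or ? followed by whitespace (an assumption;
    # a proper sentence tokenizer would handle abbreviations better).
    sentences = re.split(r"(?<=[.!?])\s+", raw.strip())
    lines = []
    for sentence in sentences:
        # Lowercase, then replace every run of non a-z characters with a space.
        cleaned = re.sub(r"[^a-z]+", " ", sentence.lower()).strip()
        if cleaned:
            lines.append(cleaned)
    return lines


sample = "Democracy spread after 1945. Populism, however, re-emerged!"
for line in normalize_text(sample):
    print(line)
```

Each returned string can then be written as one line of a text file. Note that digits and punctuation are dropped entirely, so "re-emerged" becomes two words.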

Processing environment

Ubuntu 20.04
Python 3.9

Dell Precision T7600 Workstation
2x Intel Xeon E5-2690
256GB DDR3 ECC RAM
2TB XPG SX8200 Pro NVMe M.2 SSD
GeForce RTX 2070 SUPER

Python environment setup

On Linux, use conda or mamba to install an exact duplicate environment from env_linux-64.txt. Replace "crs_netlab_2022" with any environment name. Confirmed working on Ubuntu 20.04 as of June 16, 2023.

mamba create --name crs_netlab_2022 --file env_linux-64.txt

An environment YAML file is also provided. It has been tested only on Ubuntu 20.04 and may or may not work on Windows or macOS.

mamba env create -f env_platform_independent_untested.yml
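After creating the environment, a quick check like the following can confirm that the interpreter and key libraries are visible. This is a sketch under assumptions: the package names torch, tensorflow and transformers are inferred from the scripts mentioned in this README, not read from the environment files, and check_environment is a hypothetical helper.

```python
import importlib.util
import sys

# Assumed package names, inferred from the scripts this README mentions.
EXPECTED_PACKAGES = ("torch", "tensorflow", "transformers")


def check_environment(packages=EXPECTED_PACKAGES):
    """Report the Python version and whether each package can be found."""
    report = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in packages:
        # find_spec locates a top-level package without importing it.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for key, value in check_environment().items():
    print(key, value)
```

Run it with the new environment activated. A value of False for a package suggests the environment was not created or activated as expected.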

Two additional external scripts are included to ensure their availability. The first is slightly modified for compatibility with TensorFlow 2.

dpgmm_vi.py from https://github.com/mcusi/tf_dpgmm/tree/master/diagonal

run_mlm.py from https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py

Steps to reproduce data pipeline

Clone the entire repository.
Execute crs_finetune_all_models.sh.
Follow IPython Notebooks 1-4 in their numbered order. Unfortunately, these contain few code comments to help with adaptation.
Notebook 5 uses some included results data from the full corpus; it would not execute otherwise.
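The order of the steps above can be sketched as a small driver script. This is an assumption-laden illustration rather than the authors' workflow: the notebooks are meant to be followed interactively, and running them through jupyter nbconvert assumes that tool is installed in the environment. The helper names build_commands and run_pipeline are hypothetical. By default the script only prints the commands (a dry run).

```python
import subprocess

FINETUNE_SCRIPT = "crs_finetune_all_models.sh"
NOTEBOOKS = [
    "1_gather_embeddings.ipynb",
    "2_clusters.ipynb",
    "3_attention.ipynb",
    "4_analysis.ipynb",
]


def build_commands():
    """Return the commands in the order given in the steps above."""
    commands = [["bash", FINETUNE_SCRIPT]]
    for notebook in NOTEBOOKS:
        # Assumption: execute each notebook in place with nbconvert.
        commands.append(
            ["jupyter", "nbconvert", "--to", "notebook",
             "--execute", "--inplace", notebook]
        )
    return commands


def run_pipeline(dry_run=True):
    """Print each command; run it only when dry_run is False."""
    for command in build_commands():
        print(" ".join(command))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(command, check=True)


run_pipeline(dry_run=True)
```

Call run_pipeline(dry_run=False) from the repository root to actually execute the steps. Fine-tuning is expected to need a GPU, as noted earlier.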

Folders and files

crs_agg_vectors
crs_attention_words
crs_bgmm_latents
crs_corpus
crs_embeds
crs_full_results_for_figures_tables
crs_models
crs_sents
.gitignore
1_gather_embeddings.ipynb
2_clusters.ipynb
3_attention.ipynb
4_analysis.ipynb
5_figures_tables.ipynb
LICENSE
README.md
crs_finetune_all_models.sh
dpgmm_vi.py
env_linux-64.txt
env_platform_independent_untested.yml
run_mlm.py

Repository: UWNETLAB/supplement_crs_transformers_dpbgmms