
Transparency at the Source

Repo for the EMNLP 2023 Findings paper "Transparency at the Source: Evaluating and Interpreting Language Models With Access to the True Distribution".

Our pipeline consists of a mix of Java and Python code: grammar induction is done with the original Java code of Petrov et al. (2006), and language model training is done with the transformers library in Python.

Grammar induction

To induce a state-split PCFG from a single-file treebank (passed via $path), run:

java -Xmx32g -cp CustomBerkeley.jar edu.berkeley.nlp.PCFGLA.GrammarTrainer -path $path -out $save_dir/stage -treebank SINGLEFILE -mergingPercentage 0.5 -filter 1.0e-8 -SMcycles 5
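The induced grammar is then used to generate the training corpus. As a rough illustration of that step (not the repository's actual sampling code), a minimal top-down PCFG sampler in Python might look as follows, with the toy rules standing in for the induced grammar:

import random

# Toy PCFG standing in for the induced grammar; a real run would load
# the GrammarTrainer output instead (hypothetical toy rules below).
RULES = {
    "S":  [(["NP", "VP"], 1.0)],
    "NP": [(["she"], 0.6), (["the", "dog"], 0.4)],
    "VP": [(["sleeps"], 0.7), (["sees", "NP"], 0.3)],
}

def sample(symbol):
    # Terminals have no rules; return them as-is.
    if symbol not in RULES:
        return [symbol]
    expansions, probs = zip(*RULES[symbol])
    chosen = random.choices(expansions, weights=probs, k=1)[0]
    return [tok for child in chosen for tok in sample(child)]

print(" ".join(sample("S")))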

Masked token PCFG probabilities

Here, $grammarfile should point to the grammar archive file that can be found in the Google Drive resources (500k).

java -cp CustomBerkeley.jar edu.berkeley.nlp.PCFGLA.BerkeleyParser -gr $grammarfile -inputFile $inputfile
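For comparison with these PCFG distributions, the corresponding masked-token distribution of a trained MLM can be read off with the standard transformers API. A minimal sketch, where the checkpoint path and example sentence are placeholders:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Hypothetical path; point this at the trained model directory ($save_dir).
tokenizer = AutoTokenizer.from_pretrained("path/to/checkpoint")
model = AutoModelForMaskedLM.from_pretrained("path/to/checkpoint")
model.eval()

sentence = f"the dog {tokenizer.mask_token} in the park"
inputs = tokenizer(sentence, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits
probs = logits[0, mask_pos].softmax(dim=-1)  # distribution over the vocabulary

top = probs.topk(5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx)):>12s}  {p.item():.4f}")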

EarleyX causal PCFG probabilities

To compute causal (left-to-right) token probabilities under the PCFG with EarleyX, run:

java -Xms32768M -classpath "earleyx_fast.jar:lib/*" parser.Main -in data/eval_subset_100.txt -grammar grammars/earleyx.grammar -out results -verbose 1 -thread 1
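The EarleyX output can then be compared against the per-token probabilities of a trained causal LM. A minimal sketch with transformers (the checkpoint path and sentence are placeholders):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical path to a causal LM trained on the PCFG corpus.
tokenizer = AutoTokenizer.from_pretrained("path/to/causal-checkpoint")
model = AutoModelForCausalLM.from_pretrained("path/to/causal-checkpoint")
model.eval()

ids = tokenizer("the dog sleeps", return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits

# log P(token_i | tokens_<i): predictions at position i-1 score token i.
log_probs = logits[:, :-1].log_softmax(-1)
token_lp = log_probs.gather(-1, ids[:, 1:, None]).squeeze(-1)

for tok, lp in zip(ids[0, 1:], token_lp[0]):
    print(f"{tokenizer.decode(int(tok)):>10s}  {lp.item():.3f}")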

Language model training

Language models are trained with the following command (here a masked DeBERTa-base model):

python3 main_multi.py \
  --model.model_type microsoft/deberta-base \
  --model.is_mlm \
  --tokenizer.path tokenizers/added_tokens.json \
  --data.data_dir corpora \
  --data.train_file train.txt \
  --trainer.output_dir $save_dir
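Because the true PCFG token distributions are available, a trained model can be scored against them directly, for instance via the KL divergence between the PCFG distribution and the model distribution at each masked position. A sketch, assuming both distributions have been aligned to the same vocabulary:

import torch

def kl_divergence(pcfg_probs, model_probs):
    # KL(pcfg || model) per position; inputs are (positions, vocab) probabilities.
    eps = 1e-12  # guard against log(0)
    return (pcfg_probs * (pcfg_probs.add(eps).log() - model_probs.add(eps).log())).sum(-1)

# Toy example: two positions over a 3-word vocabulary.
pcfg = torch.tensor([[0.7, 0.2, 0.1], [0.5, 0.25, 0.25]])
model = torch.tensor([[0.6, 0.3, 0.1], [0.4, 0.3, 0.3]])
print(kl_divergence(pcfg, model))  # lower = closer to the true distribution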
