Training on the Test Task

Code to reproduce the experiments, figures, and tables of the paper "Training on the Test Task Confounds Evaluation and Emergence".

The folder experiments/ contains the code to fine-tune models on the task-relevant datasets considered in the paper, and to evaluate the models using the LM Evaluation Harness library.
The folder notebooks/evaluations/ contains the model evaluation files.
The Jupyter notebook notebooks/figures.ipynb reproduces the figures and tables in the paper.
The fine-tuned models are currently being uploaded here.