This repo contains the code associated with the paper: Process-BERT: A Framework for Representation Learning on Educational Process Data
This includes:
- Our model, which implements BERT-like (per-event) pre-training objectives, and multiple versions of the transfer function for downstream prediction.
- An IRT model with a term derived from our model's transfer function.
- An implementation of the Clickstream Knowledge Tracing (CKT) encoder.
- Code to visualize latent vectors generated by the per-student and IRT models.
- Code to process the NAEP 2019 competition data, including deriving correctness.
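For context on the IRT component above, the base model it extends is the standard Rasch formulation. The sketch below is illustrative only (function name and symbols are not taken from the repo's code): the probability of a correct response is the sigmoid of ability minus item difficulty, and the behavior-enhanced variant adds a term derived from the transfer function to this logit.

```python
import math

# Illustrative sketch of base (Rasch) IRT, which the behavior-enhanced
# variant extends; theta = student ability, b = item difficulty.
def rasch_prob(theta: float, b: float) -> float:
    """P(correct) = sigmoid(theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(round(rasch_prob(0.5, 0.5), 3))  # 0.5: ability equals difficulty
```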
This work will be presented at EDM 2022: The 15th International Conference on Educational Data Mining.
If you find this code useful for your research, please cite:
@misc{https://doi.org/10.48550/arxiv.2204.13607,
doi = {10.48550/ARXIV.2204.13607},
url = {https://arxiv.org/abs/2204.13607},
author = {Scarlatos, Alexander and Brinton, Christopher and Lan, Andrew},
keywords = {Machine Learning (cs.LG), FOS: Computer and information sciences},
title = {Process-BERT: A Framework for Representation Learning on Educational Process Data},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
Install Python 3.8+
Create a virtual environment
python3 -m venv <env_name>
source <env_name>/bin/activate
Install dependencies
python3 -m pip install -r requirements.txt
For reproducibility, ensure PyTorch uses only deterministic algorithms
export CUBLAS_WORKSPACE_CONFIG=:16:8
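For reference, here is a minimal sketch of the kind of seeding/determinism setup that this environment variable supports; the helper name and seed value are illustrative, not taken from the repo's code.

```python
import os
import random

import numpy as np

def set_deterministic(seed: int = 0) -> None:
    """Illustrative helper: fix RNG seeds and set the cuBLAS workspace
    config that PyTorch's deterministic mode requires on GPU."""
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
    random.seed(seed)
    np.random.seed(seed)
    # With PyTorch available, one would also call:
    # torch.manual_seed(seed); torch.use_deterministic_algorithms(True)

set_deterministic(221)
```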
While we can't put the data in this repo directly, a form to request the competition dataset is available at the competition site: https://sites.google.com/view/dataminingcompetition2019/home
Keep in mind that the data available for the competition is not the full dataset. With the restricted data you can run our code on the competition label, as well as run IRT with half as much data. To test the score label, the per-question label, or IRT on all the data, you will need the full dataset (all of blocks A and B).
mkdir -p data
mkdir -p models
To run the experiments, you'll need to create processed data and label files. Run the following commands exactly as shown, except that the .csv and .txt files may be in any directory (and the .txt file may have a different name).
Train/test data for per-student tasks
python3 __main__.py --process_data data_a_train.csv --block A --out data/train_data_30.json
python3 __main__.py --process_data data_a_hidden_30.csv --block A --out data/test_data_30.json
Full dataset
python3 __main__.py --process_data data_all.txt --out data/data_all.json
Per-student labels
# Competition
python3 __main__.py --task comp --labels data_train_label.csv --out data/train_labels.json
python3 __main__.py --task comp --labels hidden_label.csv --out data/test_labels.json
# Score
python3 __main__.py --task score --labels data/data_all.json --out data/label_score.json
# Per-Question
python3 __main__.py --task q_stats --labels data/data_all.json --out data/label_q_stats.json
There are two main cross-validation pipelines: one for per-student labels and one for IRT. They are highly customizable via command-line options, and all paper results were collected by running such commands. For cross-validation, training hyperparameters (e.g. learning rate, epochs) and data sources are hard-coded, whereas model architecture options (e.g. pred_state) can be configured. See python3 __main__.py --help
for a full list of options. A single cross-validation experiment should take a few hours to complete on a GPU.
Per-Student Label
python3 __main__.py --full_pipeline --name <experiment_name> --task <task_id> [--ckt]
Pass --ckt to use CKT instead of our model.
IRT
python3 __main__.py --irt --name <experiment_name> [--use_behavior_model] [--ckt]
Pass --use_behavior_model to use behavior-enhanced IRT instead of base IRT, and --ckt to use CKT instead of our model.
The vector visualizations can be generated using the --cluster and --cluster_irt options, while referencing the model (--name) and data source (--data_src) that the vectors are drawn from.
These visualizations can be configured, but doing so requires code changes. See the cluster and cluster_irt functions in analysis.py; examples of alternate configurations are commented out in those functions.
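For context, these visualizations project high-dimensional latent vectors down to 2-D for plotting. The self-contained sketch below shows one common way to do that (PCA via SVD); it is an assumption for illustration, and the repo's actual projection method may differ.

```python
import numpy as np

# Illustrative sketch: project latent vectors to 2-D with PCA,
# computed via SVD of the mean-centered data matrix.
def project_2d(vectors: np.ndarray) -> np.ndarray:
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are principal directions; keep the top two.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Example: 100 fake "latent vectors" of dimension 32.
rng = np.random.default_rng(0)
points = project_2d(rng.normal(size=(100, 32)))
print(points.shape)  # (100, 2)
```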