
Process-BERT

This repo contains the code associated with the paper: Process-BERT: A Framework for Representation Learning on Educational Process Data

This includes:

  • Our model, which implements BERT-like (per-event) pre-training objectives and multiple versions of the transfer function for downstream prediction.
  • An IRT model with a term derived from our model's transfer function.
  • An implementation of the Clickstream Knowledge Tracing encoder.
  • Code to visualize latent vectors generated by the per-student and IRT models.
  • Code to process the NAEP 2019 competition data, including deriving correctness.
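
For context, the base IRT model that the behavior-enhanced variant builds on can be written (in its simplest 1PL form; the paper's exact parameterization may differ) as

p(Y_ij = 1) = sigma(theta_j - b_i)

where theta_j is the ability of student j, b_i is the difficulty of question i, and sigma is the logistic function. The behavior-enhanced variant adds a term derived from our model's transfer function.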

This work will be presented at EDM 2022: The 15th International Conference on Educational Data Mining

If you find this code useful for your research, please cite:

@misc{https://doi.org/10.48550/arxiv.2204.13607,
  doi = {10.48550/ARXIV.2204.13607},
  url = {https://arxiv.org/abs/2204.13607},
  author = {Scarlatos, Alexander and Brinton, Christopher and Lan, Andrew},
  keywords = {Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
  title = {Process-BERT: A Framework for Representation Learning on Educational Process Data},
  publisher = {arXiv},
  year = {2022},
  copyright = {Creative Commons Attribution 4.0 International}
}

Setup

Install Python 3.8+

Create a virtual environment

python3 -m venv <env_name>
source <env_name>/bin/activate

Install dependencies

python3 -m pip install -r requirements.txt

For reproducibility, ensure PyTorch only uses deterministic algorithms

export CUBLAS_WORKSPACE_CONFIG=:16:8

Data Preparation

While we can't put the data in this repo directly, a form to request the competition dataset is available at the competition site: https://sites.google.com/view/dataminingcompetition2019/home

Keep in mind that the data available for the competition is not the full dataset. With the restricted data, you will be able to run our code on the competition label, as well as run IRT with half as much data. To test the score label, the per-question label, or IRT with all of the data, you will need the full dataset (all of blocks A and B).

Create directories for data and models

mkdir -p data
mkdir -p models

Process data files

To run the experiments, you'll need to create processed data and label files. The following commands should be run exactly as written, except that the .csv and .txt files may be located in any directory (and the .txt file may have a different name).

Train/test data for per-student tasks

python3 __main__.py --process_data data_a_train.csv --block A --out data/train_data_30.json
python3 __main__.py --process_data data_a_hidden_30.csv --block A --out data/test_data_30.json

Full dataset

python3 __main__.py --process_data data_all.txt --out data/data_all.json

Per-student labels

# Competition
python3 __main__.py --task comp --labels data_train_label.csv --out data/train_labels.json
python3 __main__.py --task comp --labels hidden_label.csv --out data/test_labels.json

# Score
python3 __main__.py --task score --labels data/data_all.json --out data/label_score.json

# Per-Question
python3 __main__.py --task q_stats --labels data/data_all.json --out data/label_q_stats.json

Experiments

There are two main cross-validation pipelines: one for per-student labels and one for IRT. They are highly customizable via command line options, and all paper results were collected by running such commands. For cross-validation, training hyperparameters (e.g., learning rate, number of epochs) and data sources are hard-coded, whereas model architecture options (e.g., pred_state) can be configured. See python3 __main__.py --help for a list of options. A single cross-validation experiment should take a few hours to complete on a GPU.

Per-Student Label

python3 __main__.py --full_pipeline --name <experiment_name> --task <task_id> [--ckt]

Pass --ckt to use CKT instead of our model.
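
For example, to run the full pipeline on the competition label (the experiment name here is arbitrary; comp is the task id used for the competition label in the label-processing step above):

python3 __main__.py --full_pipeline --name comp_baseline --task comp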

IRT

python3 __main__.py --irt --name <experiment_name> [--use_behavior_model] [--ckt]

Pass --use_behavior_model to use behavior-enhanced IRT instead of base IRT, and --ckt to use CKT instead of our model.

Visualizations

The vector visualizations can be generated using the --cluster and --cluster_irt options, specifying the model (--name) and data source (--data_src) from which the vectors are drawn.

These visualizations can be configured, but code changes are required to do so. See the cluster and cluster_irt functions in analysis.py. Examples of alternate configurations are commented out in those functions.
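
For example, the per-student and IRT vector visualizations might be generated as follows (the experiment name is a placeholder, and the exact --data_src value depends on which processed data file the vectors should be drawn from):

python3 __main__.py --cluster --name <experiment_name> --data_src data/train_data_30.json
python3 __main__.py --cluster_irt --name <experiment_name> --data_src data/train_data_30.json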
