This repo contains the code associated with the paper: Process-BERT: A Framework for Representation Learning on Educational Process Data
This includes:
- Our model, which implements BERT-like (per-event) pre-training objectives, and multiple versions of the transfer function for downstream prediction.
- An IRT model with a term derived from our model's transfer function.
- An implementation of the Clickstream Knowledge Tracing (CKT) encoder.
- Code to visualize latent vectors generated by the per-student and IRT models.
- Code to process the NAEP 2019 competition data, including deriving correctness.
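For context on the IRT component above, the base model it extends is the standard Rasch formulation. The sketch below is illustrative only (function name and symbols are not taken from the repo's code): the probability of a correct response is the sigmoid of ability minus item difficulty, and the behavior-enhanced variant adds a term derived from the transfer function to this logit.

```python
import math

# Illustrative sketch of base (Rasch) IRT, which the behavior-enhanced
# variant extends; theta = student ability, b = item difficulty.
def rasch_prob(theta: float, b: float) -> float:
    """P(correct) = sigmoid(theta - b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

print(round(rasch_prob(0.5, 0.5), 3))  # 0.5: ability equals difficulty
```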
This work will be presented at EDM 2022: The 15th International Conference on Educational Data Mining.
If you find this code useful for your research, please cite:
@misc{https://doi.org/10.48550/arxiv.2204.13607,
doi = {10.48550/ARXIV.2204.13607},
url = {https://arxiv.org/abs/2204.13607},
author = {Scarlatos, Alexander and Brinton, Christopher and Lan, Andrew},
keywords = {Machine Learning (cs.LG), FOS: Computer and information sciences},
title = {Process-BERT: A Framework for Representation Learning on Educational Process Data},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
Install Python 3.8+
Create a virtual environment
python3 -m venv <env_name>
source <env_name>/bin/activate
Install dependencies
python3 -m pip install -r requirements.txt
For reproducibility, ensure PyTorch uses only deterministic algorithms
export CUBLAS_WORKSPACE_CONFIG=:16:8
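For reference, here is a minimal sketch of the kind of seeding/determinism setup that this environment variable supports; the helper name and seed value are illustrative, not taken from the repo's code.

```python
import os
import random

import numpy as np

def set_deterministic(seed: int = 0) -> None:
    """Illustrative helper: fix RNG seeds and set the cuBLAS workspace
    config that PyTorch's deterministic mode requires on GPU."""
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8"
    random.seed(seed)
    np.random.seed(seed)
    # With PyTorch available, one would also call:
    # torch.manual_seed(seed); torch.use_deterministic_algorithms(True)

set_deterministic(221)
```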
While we can't put the data in this repo directly, a form to request the competition dataset is available at the competition site: https://sites.google.com/view/dataminingcompetition2019/home
Keep in mind that the data available for the competition is not the full dataset. With the restricted data you can run our code on the competition label, as well as run IRT with half as much data. To test the score label, the per-question label, or IRT on all the data, you will need the full dataset (all of blocks A and B).
mkdir -p data
mkdir -p models
To run the experiments, you'll need to create processed data and label files. Run the following commands exactly as shown, except that the .csv and .txt files may be in any directory (and the .txt file may have a different name).
Train/test data for per-student tasks
python3 __main__.py --process_data data_a_train.csv --block A --out data/train_data_30.json
python3 __main__.py --process_data data_a_hidden_30.csv --block A --out data/test_data_30.json
Full dataset
python3 __main__.py --process_data data_all.txt --out data/data_all.json
Per-student labels
# Competition
python3 __main__.py --task comp --labels data_train_label.csv --out data/train_labels.json
python3 __main__.py --task comp --labels hidden_label.csv --out data/test_labels.json
# Score
python3 __main__.py --task score --labels data/data_all.json --out data/label_score.json
# Per-Question
python3 __main__.py --task q_stats --labels data/data_all.json --out data/label_q_stats.json
There are two main cross-validation pipelines: one for per-student labels and one for IRT. They are highly customizable via command-line options, and all paper results were collected by running such commands. For cross-validation, training hyperparameters (e.g. learning rate, epochs) and data sources are hard-coded, whereas model architecture options (e.g. pred_state) can be configured. See python3 __main__.py --help
for a full list of options. A single cross-validation experiment should take a few hours to complete on a GPU.
Per-Student Label
python3 __main__.py --full_pipeline --name <experiment_name> --task <task_id> [--ckt]
Pass --ckt to use CKT instead of our model.
IRT
python3 __main__.py --irt --name <experiment_name> [--use_behavior_model] [--ckt]
Pass --use_behavior_model to use behavior-enhanced IRT instead of base IRT, and --ckt to use CKT instead of our model.
The vector visualizations can be generated using the --cluster and --cluster_irt options, while referencing the model (--name) and data source (--data_src) that the vectors are drawn from.
These visualizations can be configured, but doing so requires code changes. See the cluster and cluster_irt functions in analysis.py; examples of alternate configurations are commented out in those functions.
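For context, these visualizations project high-dimensional latent vectors down to 2-D for plotting. The self-contained sketch below shows one common way to do that (PCA via SVD); it is an assumption for illustration, and the repo's actual projection method may differ.

```python
import numpy as np

# Illustrative sketch: project latent vectors to 2-D with PCA,
# computed via SVD of the mean-centered data matrix.
def project_2d(vectors: np.ndarray) -> np.ndarray:
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are principal directions; keep the top two.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Example: 100 fake "latent vectors" of dimension 32.
rng = np.random.default_rng(0)
points = project_2d(rng.normal(size=(100, 32)))
print(points.shape)  # (100, 2)
```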