
An LSTM model implemented with PyTorch

This repo reproduces the GlueX tracking algorithm with PyTorch; the algorithm was originally implemented with TensorFlow Keras here. It aims at a future integration with phasm.

To the best of my knowledge, this repo mimics everything in the original Keras notebook, including the same:

  • Shuffled training dataset;
  • Batch size, epochs, NN size, loss function, optimizer, clip value;
  • Learning rate scheduler (see the training-step sketch right after this list).
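
As a rough illustration only, the sketch below shows how a clip value and a learning-rate scheduler typically enter a PyTorch training step. The optimizer, loss function, clip value, and scheduler type here are placeholders, not the repo's actual choices; those are defined in LSTM_training.py.

import torch
import torch.nn as nn

# Hypothetical training-step sketch; hyper-parameters below are placeholders.
model = nn.Linear(6, 6)                                             # stand-in for the LSTM model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)           # optimizer type assumed
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)   # scheduler type assumed
loss_fn = nn.MSELoss()                                              # loss function assumed

def training_step(x, y):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    # Keras' clipvalue maps to clip_grad_value_ in PyTorch (value assumed here)
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
    optimizer.step()
    return loss.item()

# once per epoch, after validation:
# scheduler.step(val_loss)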

Python code structure

  • utils.py: defines the NN structure and wraps the datasets into batched PyTorch dataloaders.
  • LSTM_training.py: trains the NN on the whole training dataset for 100 epochs and saves the trained model as a TorchScript module (a hedged sketch of this round trip follows the list).
  • validation_processing.py: loads the model from the TorchScript and validates it on a validation dataset of shape (661644, 6).
  • submit-training-job.slurm: a one-step Slurm script to run the training job on the farm.
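
A minimal sketch of the TorchScript save/load round trip described above; the model class, file name, and shapes are illustrative placeholders, and the real code is in LSTM_training.py and validation_processing.py.

import torch
import torch.nn as nn

# Hypothetical stand-in model; the real network is defined in utils.py.
class TinyLSTM(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(input_size=6, hidden_size=32, batch_first=True)
        self.fc = nn.Linear(32, 6)

    def forward(self, x):
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])       # keep only the last time step

# LSTM_training.py side: script the trained model and save it to disk.
scripted = torch.jit.script(TinyLSTM())
scripted.save("lstm_model.pt")              # file name is illustrative

# validation_processing.py side: load the TorchScript module and run inference.
loaded = torch.jit.load("lstm_model.pt")
loaded.eval()
with torch.no_grad():
    pred = loaded(torch.randn(1256, 7, 6))  # (batch_size, seq_len, features)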

Configurations

Conda PyTorch environment

In my experience, a bare-metal python3.9+pip3+cudnn8.6 installation always fails on the JLab ifarm A100 GPU nodes because of mismatched cudnn/torch versions. This is solved by installing the latest PyTorch (as of Nov-28-2022) in a conda virtual environment as guided here. A conda environment file is provided to document my environment configuration.

# install pytorch via conda
conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia
# create conda env from yml file
conda env create -f environment.yml
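
A quick way to confirm that the resulting environment actually sees the GPU (plain PyTorch calls, not part of the repo's scripts):

import torch

# sanity check: torch build, its CUDA version, and GPU visibility on the node
print(torch.__version__, torch.version.cuda)
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no GPU found")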

NN definition

Table: the LSTM network definition, where batch_size=1256 and seq_len=7.

Layer    Input size                   Output size                  Param #
LSTM_1   (batch_size, seq_len, 6)     (batch_size, seq_len, 128)   69120
LSTM_2   (batch_size, seq_len, 128)   (batch_size, seq_len, 64)    49408
LSTM_3   (batch_size, seq_len, 64)    (batch_size, 32)             12416
Linear   (batch_size, 32)             (batch_size, 6)              198

The parameter counts of the layers are taken from the original Keras model.summary().
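
For reference, a minimal PyTorch module matching the shapes in the table might look like the sketch below (the class name is illustrative; the actual definition lives in utils.py and may differ in detail). Note that torch.nn.LSTM keeps two bias vectors per gate, so its parameter counts come out slightly larger than the Keras numbers quoted above.

import torch
import torch.nn as nn

class GlueXTrackingLSTM(nn.Module):    # illustrative name; see utils.py for the real class
    def __init__(self):
        super().__init__()
        self.lstm1 = nn.LSTM(input_size=6, hidden_size=128, batch_first=True)
        self.lstm2 = nn.LSTM(input_size=128, hidden_size=64, batch_first=True)
        self.lstm3 = nn.LSTM(input_size=64, hidden_size=32, batch_first=True)
        self.fc = nn.Linear(32, 6)

    def forward(self, x):              # x: (batch_size, seq_len, 6)
        x, _ = self.lstm1(x)           # (batch_size, seq_len, 128)
        x, _ = self.lstm2(x)           # (batch_size, seq_len, 64)
        x, _ = self.lstm3(x)           # (batch_size, seq_len, 32)
        return self.fc(x[:, -1, :])    # last time step only -> (batch_size, 6)

model = GlueXTrackingLSTM()
out = model(torch.randn(1256, 7, 6))   # -> torch.Size([1256, 6])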

Dataset

Download the training dataset as follows.

wget https://halldweb.jlab.org/talks/ML_lunch/Sep2019/MLchallenge2_training.csv
mv MLchallenge2_training.csv train_data.csv

Compared to the dataset available at the time the Keras notebook was run, the new dataset is about 38.5% larger (2646573 vs. 1910698 sequences).

After sequencing, the dimension of the whole training dataset (as of 10/20/2022) is (2646573, 7, 6), with each epoch containing ~2108 batches. We train for 100 epochs in total.
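
The actual sequencing and batching live in the repo (utils.py); the sketch below only illustrates the general sliding-window idea behind the (2646573, 7, 6) shape and the ~2108 batches per epoch. The column layout, windowing across track boundaries, and variable names are assumptions.

import numpy as np
import pandas as pd
import torch
from torch.utils.data import DataLoader, TensorDataset

SEQ_LEN = 7        # seq_len from the NN definition table
BATCH_SIZE = 1256

# Hypothetical sliding-window sequencing; the repo's preprocessing may group
# rows per track before windowing, which this sketch ignores.
features = pd.read_csv("train_data.csv").to_numpy(dtype=np.float32)   # roughly (n_rows, 6)
X = np.stack([features[i:i + SEQ_LEN] for i in range(len(features) - SEQ_LEN)])
y = features[SEQ_LEN:]                         # next-step targets, (n_windows, 6)

dataset = TensorDataset(torch.from_numpy(X), torch.from_numpy(y))
loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
print(X.shape, len(loader))                    # roughly (2646573, 7, 6) and ~2108 batches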

Results

Table: results after 100 training epochs

Exp                 loss    mse         val_loss  val_mse     lr          Time      Training X size
Keras+TitanRTX*2    0.0015  6.8281e-06  0.0018    7.2508e-06  3.7715e-05  ~20 mins  (1910698, 7, 6)
PyTorch+TitanRTX    0.0015  2.0858e-05  0.0015    2.0803e-05  3.7715e-05  ~55 mins  (2646573, 7, 6)
PyTorch+T4          0.0012  8.0547e-06  0.0012    7.6509e-06  4.4371e-05  ~65 mins  (2646573, 7, 6)
PyTorch+A100        0.0010  2.6062e-06  0.0010    2.5379e-06  5.2201e-05  ~45 mins  (2646573, 7, 6)

The code is tested on a single ifarm TitanRTX/T4/A100 GPU. Results are available at:

  • ./res/training-loss: images of the losses over the course of training.
  • ./res/job-logs: the detailed job logs. An example of how the losses change across epochs, batches, and time is here.
  • ./res/evaluation: images of the evaluation results, including a comparison of the evaluation errors between Epochs=1 and Epochs=100.

References


Last updated on 02/01/2023 by xmei@jlab.org
