
Reproducibility of two YOLOv5 identical train jobs #31

Closed
valentinitnelav opened this issue Jul 8, 2022 · 5 comments
@valentinitnelav (Collaborator) commented on Jul 8, 2022:

Hi @stark-t, I ran two identical nano models on the Clara cluster and the results are slightly different.
Below are the confusion matrices on the validation dataset; the results.csv for each run is linked at the bottom of this comment.

I personally do not like seeing these differences between two identical nano runs (but I can learn to accept it :D ). I'm not sure how to set a seed for YOLOv5 so that two runs of the same model are identical, or whether that is even possible with the current configuration. Unfortunately, we haven't yet found any argparse parameter that accepts a seed. There is a discussion at ultralytics/yolov5#1222 pointing to the PyTorch notes on reproducibility: https://pytorch.org/docs/stable/notes/randomness.html (a minimal seeding sketch follows after the quotes below).

The main takeaways are:

Completely reproducible results are not guaranteed across PyTorch releases, individual commits, or different platforms. Furthermore, results may not be reproducible between CPU and GPU executions, even when using identical seeds.

The only method I'm aware of that might guarantee you identical results might be to train on CPU with --workers 0, but this is impractical naturally, so you simply need to adapt your workflow to accommodate minor variations in final model results.
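For reference, the usual seeding recipe from those PyTorch notes looks roughly like the sketch below. This is only an illustration of the general PyTorch approach, not something YOLOv5 exposed at the time, and the DataLoader workers would additionally need a seeded generator / worker_init_fn. Even with all of this, the notes do not guarantee identical results across releases, platforms, or multi-GPU runs.

```python
# Minimal seeding sketch based on the PyTorch reproducibility notes linked above.
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    random.seed(seed)                          # Python RNG
    np.random.seed(seed)                       # NumPy RNG
    torch.manual_seed(seed)                    # CPU and all CUDA devices
    torch.backends.cudnn.deterministic = True  # pick deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable the non-deterministic auto-tuner
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required by some deterministic CUDA ops
    torch.use_deterministic_algorithms(True, warn_only=True)

set_seed(0)
```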

nano model n1: confusion_matrix [image]

nano model n2: confusion_matrix [image]

small model s: confusion_matrix [image]

Results csv files

nano model n1: results.csv

nano model n2: results.csv

small model s: results.csv

@valentinitnelav (Collaborator, Author) commented:

FYI

Hi @stark-t, I just realized that YOLOv5 published some updates (release v6.2), and one interesting aspect is reproducibility via a --seed argument. I just read this:

Training Reproducibility: Single-GPU YOLOv5 training with torch>=1.12.0 is now fully reproducible, and a new --seed argument can be used (default seed=0) (ultralytics/yolov5#8213 by @AyushExel).

https://github.com/ultralytics/yolov5/releases/tag/v6.2

I haven't checked all the details yet, but this may not work for the parallel GPUs we use on the cluster; it sounds like it works only for a single GPU.
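If I read the release notes correctly, a single-GPU training call with the new option would look roughly like the sketch below. This is untested on our setup; the dataset config, weights, image size, and epoch count are placeholders rather than the values we actually use.

```python
# Hypothetical single-GPU YOLOv5 v6.2 training call using the new seed option;
# data/weights/epochs below are placeholders, not the values we actually used.
import train  # YOLOv5's train.py, run from the yolov5 repository root

train.run(
    data="our_dataset.yaml",  # placeholder dataset config
    weights="yolov5n.pt",     # nano model
    imgsz=640,
    epochs=100,
    device="0",               # single GPU only, per the release notes
    workers=8,
    seed=0,                   # default is already 0
)
```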

@valentinitnelav (Collaborator, Author) commented:

Hi @stark-t ,

On the project with Malika I have run into this reproducibility problem again, and I'm not sure what I'm doing wrong or how to solve it :/
I opened an issue on the GitHub repository of YOLOv7: WongKinYiu/yolov7#1144
In the meantime, if you know what might cause this, please let me know.

If running two identical models on multiple GPUs is inherently not reproducible and the results can differ this much, then I'm not sure how to properly compare different model architectures or parameters.

FYI, I found these posts interesting to read:

Could machine learning fuel a reproducibility crisis in science?; Nature, 26 July 2022

Artificial intelligence faces reproducibility crisis; Science, 16 Feb 2018

@valentinitnelav (Collaborator, Author) commented:

This seems to be an issue with detectron2 as well: facebookresearch/detectron2#4260

@valentinitnelav (Collaborator, Author) commented:

Hi @stark-t ,

I ran multiple tests and discovered that with release v6.2 of YOLOv5 one can get reproducible results, but only when using a single GPU.
YOLOv7 has a reproducibility problem both when using a single GPU and when using multiple GPUs at once.

Since we compare different models, some trained with YOLOv5 and some with YOLOv7, I'm not sure how to make the comparison fully reproducible.

An alternative is to train each model, say, 5 times, so that we have 5 values for each metric, and then compare the averages of these values (a quick sketch is below). Does that make sense?
What do you think?
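If we went that route, the comparison itself would be straightforward, e.g. averaging the last-epoch value of a metric over each run's results.csv. A rough sketch, assuming hypothetical run directories and YOLOv5's results.csv layout (where, as far as I can tell, column names are padded with spaces, hence the strip):

```python
# Rough sketch: average a final-epoch metric over several repeated runs.
# Run directories are placeholders; the metric column follows YOLOv5's results.csv.
import pandas as pd

run_dirs = ["runs/train/exp1", "runs/train/exp2", "runs/train/exp3"]  # placeholder paths
metric = "metrics/mAP_0.5"

finals = []
for d in run_dirs:
    df = pd.read_csv(f"{d}/results.csv")
    df.columns = df.columns.str.strip()  # results.csv column names are padded with spaces
    finals.append(df[metric].iloc[-1])   # metric value at the last epoch

s = pd.Series(finals)
print(f"{metric}: mean = {s.mean():.4f}, sd = {s.std():.4f}")
```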

@valentinitnelav (Collaborator, Author) commented:

For the purposes of this paper, it would be too computationally expensive to train the models several times. I'll close this issue here. Hopefully YOLOv5 and YOLOv7 will allow reproducibility in the future when training on multiple GPUs as well.
