Dataloader shuffling bug with multiple workers #4960

Closed · kimbayashanologatiniburgettoilalalourvr opened this issue Sep 28, 2021 · 6 comments · Fixed by #5623

@kimbayashanologatiniburgettoilalalourvr

Hi there,

I wanted to report, and also solicit some thoughts on, a dataloader bug that occurs when spinning up more than one worker. Because the parent Python process is forked, all of its PRNG state is copied into every worker. Multiple workers are almost a requirement for training at a decent pace, since the data-loading work gets spread across several CPU worker processes.

Additionally, if the user has configured the workers to restart at every epoch, the issue is made worse: the generators are restarted from the same state again, potentially the same seed/state as in the last epoch, if no random calls are made in the parent process before forking again.

This issue is detailed further in https://tanelp.github.io/posts/a-bug-that-plagues-thousands-of-open-source-ml-projects/. A fix would be to pass a worker_init_fn at the code line below. PyTorch Lightning has a great worker init function that you may reference if interested, licensing issues aside. There is also some work in an earlier upstream thread that led to the PL implementation, but I've found the PL implementation easier to understand and use because it works around some of the in-practice 'gotchas' encountered along the way.

collate_fn=LoadImagesAndLabels.collate_fn4 if quad else LoadImagesAndLabels.collate_fn)
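
For concreteness, a minimal sketch of what such a worker_init_fn could look like (the name seed_worker is mine, and the seeding scheme follows the pattern from the PyTorch DataLoader docs rather than the exact PL implementation):

import random

import numpy as np
import torch

def seed_worker(worker_id):
    # worker_id is supplied by the DataLoader; we rely on torch.initial_seed()
    # instead, which already differs per worker and per DataLoader iterator.
    # Fold it to 32 bits and use it to re-seed the generators that fork would
    # otherwise leave identical across workers.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

It would then be passed to the DataLoader constructor alongside num_workers, e.g. worker_init_fn=seed_worker.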

This issue gets worse the more workers someone has open, because it produces augmentations that are not properly random across workers. It can also be particularly difficult to debug, since multiple batches in a row are not something that gets visualized in TensorBoard. It is not as bad as having shuffling off by default, but it can still cause issues.

@glenn-jocher

glenn-jocher commented Sep 28, 2021

@kimbayashanologatiniburgettoilalalourvr hi, thank you for your feature suggestion on how to improve YOLOv5 🚀!

The fastest and easiest way to incorporate your ideas into the official codebase is to submit a Pull Request (PR) implementing your idea, and if applicable providing before and after profiling/inference/training results to help us understand the improvement your feature provides. This allows us to directly see the changes in the code and to understand how they affect workflows and performance.

Please see our ✅ Contributing Guide to get started.

To investigate, I plotted batch 0 for epochs 0, 1 and 2. I don't see anything that would indicate a problem.

batch 0, epoch 0: [image: train_batch0_epoch0]

batch 0, epoch 1: [image: train_batch0_epoch1]

batch 0, epoch 2: [image: train_batch0_epoch2]

@kimbayashanologatiniburgettoilalalourvr

Hi Glenn,

Many thanks for taking a look into this.

Would you share the config you used to produce these images? If I have time, I'll try to reproduce on my end and see whether the reported bug comes up.

Best,

kimbaya&etc

@kimbayashanologatiniburgettoilalalourvr

Alrighty,

Taking a closer look through the code, I think there may be two things going on here. One is that random.random is used far more often than np.random, which I think is a good thing here: my understanding is that Python's random generator gets re-seeded per worker, so only the np.random calls are causing issues.

The other is that the prefetch factor is set to 2 batches per worker by default, so some batches may get prefetched out of order or end up out of order due to race conditions.

The HSV augmentation not being entirely random, relative to everything else going on, probably isn't worth a deep investigation on its own. From our end, though, because someone could add an np.random-style call to our code in the future, we'll probably keep a good worker init function around just to guard against that.
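
For anyone who wants to see the np.random half of this in isolation, here is a tiny toy reproduction (the dataset below is made up for illustration, not YOLOv5 code, and the duplicated output assumes the default fork start method on Linux):

import numpy as np
from torch.utils.data import DataLoader, Dataset

class RandomDraws(Dataset):
    # Toy dataset: each item is a fresh np.random draw, standing in for an
    # augmentation parameter such as an HSV gain.
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return np.random.random()

if __name__ == "__main__":
    loader = DataLoader(RandomDraws(), batch_size=1, num_workers=2)
    print([round(float(x), 4) for x in loader])
    # With two forked workers the values typically come out in duplicated
    # pairs (a, a, b, b, ...) because both workers copied the same NumPy
    # state; adding a worker_init_fn that re-seeds np.random removes this.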

@glenn-jocher

@kimbayashanologatiniburgettoilalalourvr hi there! Pinging you to see if you had come up with a solution to this or if it was still an issue.

I think we'll want to implement shuffle in master under some conditions. In particular, we are looking at adding an option to update the dataloader/augmentation during late training in #5046, and that will run into the same unshuffled-data issue you noticed, which will be more pronounced if mosaic is reduced or eliminated.

@github-actions

github-actions bot commented Nov 5, 2021

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

github-actions bot added the Stale label Nov 5, 2021
glenn-jocher linked a pull request Nov 13, 2021 that will close this issue
@glenn-jocher

@kimbayashanologatiniburgettoilalalourvr good news 😃! Your original issue may now be fixed ✅ in PR #5623 by @werner-duvaud. This PR turns on shuffling in the YOLOv5 training DataLoader by default, which was missing until now. This works for all training formats: CPU, Single-GPU, Multi-GPU DDP.

train_loader, dataset = create_dataloader(train_path, imgsz, batch_size // WORLD_SIZE, gs, single_cls,
                                          hyp=hyp, augment=True, cache=opt.cache, rect=opt.rect, rank=LOCAL_RANK,
                                          workers=workers, image_weights=opt.image_weights, quad=opt.quad,
                                          prefix=colorstr('train: '), shuffle=True)  # <--- NEW
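
For readers wondering how this behaves under DDP, here is a simplified sketch of how a shuffle flag is typically wired through a loader builder; it is illustrative only, not the exact create_dataloader implementation:

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_loader(dataset, batch_size, rank=-1, workers=8, shuffle=True):
    # Under DDP, shuffling is delegated to DistributedSampler (which needs
    # sampler.set_epoch(epoch) each epoch to reshuffle); in single-process
    # training the DataLoader shuffles directly. The sampler and shuffle
    # arguments are mutually exclusive, hence the 'and sampler is None' guard.
    sampler = None if rank == -1 else DistributedSampler(dataset, shuffle=shuffle)
    return DataLoader(dataset,
                      batch_size=batch_size,
                      shuffle=shuffle and sampler is None,
                      sampler=sampler,
                      num_workers=workers,
                      pin_memory=True)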

I evaluated this PR against master on VOC finetuning for 50 epochs, and the results show a slight improvement in most metrics and losses, particularly in objectness loss and mAP@0.5, perhaps indicating that the shuffle addition may help delay overtraining.

https://wandb.ai/glenn-jocher/VOC
[Screenshot: W&B results comparison, 2021-11-13 13:03]

To receive this update:

  • Git – git pull from within your yolov5/ directory or git clone https://github.com/ultralytics/yolov5 again
  • PyTorch Hub – Force-reload with model = torch.hub.load('ultralytics/yolov5', 'yolov5s', force_reload=True)
  • Notebooks – View updated notebooks (Open In Colab, Open In Kaggle)
  • Docker – sudo docker pull ultralytics/yolov5:latest to update your image

Thank you for spotting this issue and informing us of the problem. Please let us know if this update resolves the issue for you, and feel free to inform us of any other issues you discover or feature requests that come to mind. Happy trainings with YOLOv5 🚀!
