
Bugfix: update dataloaders.py to fix Multi-GPU DDP RAM multiple-cache issue #10383

Merged Β· 11 commits Β· Jan 3, 2024

Conversation

@davidsvaughn (Contributor) commented Dec 2, 2022

This is to address (and hopefully fix) this issue: Multi-GPU DDP RAM multiple-cache bug (#3818). This was a very serious and "blocking" issue until I could figure out what was going on. The problem was especially bad when running multi-GPU jobs with 8 GPUs: RAM usage was 8x higher than expected, causing repeated OOM failures. Hopefully this fix will help others.

DDP causes each RANK to launch its own process (one per GPU) with its own trainloader and its own RAM image cache. The DistributedSampler used by DDP (https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py) feeds only a subset of images (1/WORLD_SIZE) to each available GPU on each epoch, but since the images are shuffled between epochs, each GPU process must still cache all images. So I created a subclass of DistributedSampler called SmartDistributedSampler that forces each GPU process to always sample the same subset (using modulo arithmetic with RANK and WORLD_SIZE) while still allowing random shuffling between epochs. I don't believe this disrupts the overall "randomness" of the sampling, and I haven't noticed any performance degradation.

Signed-off-by: davidsvaughn davidsvaughn@gmail.com
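
For readers skimming the thread, here is a minimal sketch of the idea described above, assuming the stock DistributedSampler attributes (rank, num_replicas, epoch, seed, shuffle). It is an illustration of the approach, not the code merged in this PR:

```python
# Minimal sketch of the SmartDistributedSampler idea (illustration only, not the merged code).
import torch
from torch.utils.data.distributed import DistributedSampler


class SmartDistributedSampler(DistributedSampler):
    def __iter__(self):
        # This rank's fixed subset: rank, rank + world_size, rank + 2 * world_size, ...
        # so its RAM cache only ever needs ~len(dataset) / num_replicas images.
        idx = torch.arange(self.rank, len(self.dataset), self.num_replicas)

        if self.shuffle:
            # Deterministic, epoch-dependent reshuffle *within* the fixed subset
            g = torch.Generator()
            g.manual_seed(self.seed + self.epoch)
            idx = idx[torch.randperm(len(idx), generator=g)]

        return iter(idx.tolist())

    def __len__(self):
        return len(range(self.rank, len(self.dataset), self.num_replicas))
```

As with the stock sampler, train.py would still call sampler.set_epoch(epoch) every epoch so that the within-subset shuffle changes between epochs.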

πŸ› οΈ PR Summary

Made with ❀️ by Ultralytics Actions

🌟 Summary

Enhanced data loading for improved distributed training support in YOLOv5 πŸš€

πŸ“Š Key Changes

  • πŸ†• Added WORLD_SIZE variable for better control in distributed environments.
  • πŸ‘¨β€πŸ’» Created SmartDistributedSampler, a subclass of PyTorch's DistributedSampler that keeps each rank sampling from a fixed subset while still shuffling between epochs.
  • πŸ”„ Modified data loaders and dataloading methods to use SmartDistributedSampler, adding rank and seed parameters.
  • πŸ”’ Changed how indices are calculated and used in various methods for improved distributed training consistency.
  • πŸ”© Adjusted image caching logic to consider WORLD_SIZE in RAM usage calculations.

🎯 Purpose & Impact

  • πŸ“ˆ Improves model training efficiency and effectiveness in multi-GPU setups.
  • 🀝 Ensures consistent data sampling across different processes for better distributed training.
  • πŸ›  Offers a more predictable and systematic approach to data shuffling, enhancing training stability.
  • πŸ” Provides finer memory usage control, leading to better resource management during training.
  • These updates are likely to benefit users training models on clusters or with multiple GPUs, leading to potential improvements in model performance and training speed.

davidsvaughn and others added 2 commits December 2, 2022 11:50

@github-actions github-actions bot left a comment


πŸ‘‹ Hello @davidsvaughn, thank you for submitting a YOLOv5 πŸš€ PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • βœ… Verify your PR is up-to-date with ultralytics/yolov5 master branch. If your PR is behind you can update your code by clicking the 'Update branch' button or by running git pull and git merge master locally.
  • βœ… Verify all YOLOv5 Continuous Integration (CI) checks are passing.
  • βœ… Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." β€” Bruce Lee

davidsvaughn and others added 3 commits December 2, 2022 12:44
move extra parameter (rank) to end so won't mess up pre-existing positional args
removing extra '#'
@glenn-jocher (Member)

@davidsvaughn thanks for the PR! This would be a great fix if it works. Unfortunately with these sorts of changes we should test on a full COCO training to make sure DDP shuffling/sampling are not affected (changes would be visible as reduced mAP). I'll add a TODO to prioritize this so we can run these tests once we free up GPUs in a few days, or do you have any of your own results vs master?

@glenn-jocher glenn-jocher self-assigned this Dec 3, 2022
@glenn-jocher (Member) commented Dec 3, 2022

@davidsvaughn after looking at the code I'm a little confused, as it seems that the first image in the mosaic is pulled from self.indices, which span the entire dataset, but the remaining 3 are pulled from self.idx, which is only a subset of the dataset. Since the first image is pulled from the entire dataset would the entire dataset need to be cached into RAM for that GPU?

@davidsvaughn (Contributor, Author) commented Dec 5, 2022

@glenn-jocher I don't blame you for being confused... the way I did this is a little confusing. Sorry about that! Originally I didn't notice that the mosaic code needed tweaking. I was trying to preserve as much of the original code structure as possible by just tweaking the indices coming out of the DistributedSampler (now SmartDistributedSampler), which acts as a sort of "global" coordinator (i.e. global to all GPU processes) by sending the correct data indices to each GPU process based on its RANK (indices which all reference the "global" self.indices array). But the other 3 (or other 8) samples in the mosaic also need to be drawn from the "correct" subset of indices (based on RANK), and that happens outside the machinery of the (Smart)DistributedSampler... so I needed to sample from self.idx directly.

To say this another way... yes, the first image looks like it is pulled from the entire dataset (unlike the other 3) because it references the "global" self.indices array... but it's actually sampling from the same subset that is in self.idx. It's just that in the case of the first image, this smart sub-sampling is being controlled by the index coming out of SmartDistributedSampler.

Yes, I think it would be cleaner if everything were sampled from the same array (self.idx seems like the natural choice)... but this would mean more of an overhaul of the whole DistributedSampler machinery. I guess the extra array wouldn't really be needed then, because the first image in the mosaic would be sampled the same way as the other 3, with each GPU process just sampling values from self.idx (instead of some values in self.indices). Sorry, I hope that made sense (indices of indices... always fun!). I can try to make this less confusing, but maybe I'll wait for your feedback before proceeding.

BTW... I also realized whatever change is made also needs to be duplicated in segment/dataloaders.py. I'm still getting used to the new segmentation code...
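
To make the two sampling paths concrete, here is a small self-contained sketch; the function and variable names (pick_mosaic_indices, rank_indices) are illustrative, not YOLOv5 code:

```python
# Illustrative sketch of the mosaic index selection discussed above; names are hypothetical.
import random


def pick_mosaic_indices(sampler_index, rank_indices, tiles=4):
    # sampler_index arrives from SmartDistributedSampler, so it already lies inside this
    # rank's subset; the remaining tiles must be drawn from the same rank-local subset,
    # otherwise this rank would load (and cache) images belonging to other ranks.
    return [sampler_index] + random.choices(rank_indices, k=tiles - 1)


# Example: with WORLD_SIZE=4, rank 1 owns indices 1, 5, 9, ... (before any permutation)
rank_indices = list(range(1, 128, 4))
print(pick_mosaic_indices(rank_indices[0], rank_indices))
```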

davidsvaughn and others added 2 commits December 5, 2022 12:46
sample from DDP index array (self.idx) in mixup mosaic
@glenn-jocher (Member)

@davidsvaughn ok got it. If the two attributes can be merged then yes let's do it please. Thanks!

davidsvaughn and others added 2 commits December 7, 2022 10:41
Merging self.indices and self.idx (DDP indices) into single attribute (self.indices).

Also adding SmartDistributedSampler to segmentation dataloader
@davidsvaughn (Contributor, Author) commented Dec 7, 2022

@glenn-jocher okay, I merged the two attributes into one (self.indices and self.idx -> self.indices). When using a single GPU (or CPU), self.indices is the same as before (1..n). But with multi-GPU (DDP), self.indices contains only the subset of indices partitioned to the given GPU process.

FYI, the point of the random permutation here is to make the partitioning of data across the GPUs random. The SmartDistributedSampler only randomizes the sampling (if shuffle=True); without also randomizing the partitioning, the RANK=0 process (with WORLD_SIZE=4) would always get self.indices=[0,4,8,12,...], the RANK=1 process would get self.indices=[1,5,9,13,...], and so on, and this affects which images are grouped together in mosaics and mixup. It seems like this should be randomized as well, in case there is some inherent order in the dataset.
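
A tiny sketch of that partitioning step (variable names and the seed value are illustrative, not the merged code): every rank computes the same seeded permutation, then takes its modulo slice, so the partition stays consistent across processes while the assignment of images to ranks is randomized.

```python
# Illustrative sketch of randomized-but-consistent DDP partitioning.
import numpy as np

n, rank, world_size, seed = 12, 1, 4, 0
perm = np.random.RandomState(seed).permutation(n)  # identical on every rank (shared seed)
indices = perm[rank::world_size]                   # this rank's fixed, randomized subset
print(perm)     # the full permutation of 0..n-1
print(indices)  # the 1/world_size slice owned by this rank
```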

I also made necessary changes to utils/segment/dataloaders.py so the same behavior is implemented for segmentation. This change was minor since most behavior is inherited from utils/dataloaders.py. However, I didn't make any changes to the classification pipeline, since that is different enough that it should probably be a separate PR.

I ran COCO from-scratch training for 10 epochs, before and after the change, as a sanity check to show there's no effect on mAP. I realize you'll probably need to train for longer, but I'll just show my results (using 4Γ— Tesla V100):

torchrun --standalone --nnodes=1 --nproc_per_node=4 train.py --data coco.yaml --weights '' --cfg yolov5s.yaml --batch-size 256 --cache --epochs 10

BEFORE: [screenshot: ram_before]

Validating runs/train/exp1/weights/best.pt...
Fusing layers... 
YOLOv5s summary: 157 layers, 7225885 parameters, 0 gradients, 16.4 GFLOPs
     Class     Images  Instances          P          R      mAP50   mAP50-95
       all       5000      36335      0.413      0.286      0.272      0.147
...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.149
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.277
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.145
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.074
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.169
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.185
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.180
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.334
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.383
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.197
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.430
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.493

AFTER: [screenshot: ram_after]

Validating runs/train/exp1/weights/best.pt...
Fusing layers... 
YOLOv5s summary: 157 layers, 7225885 parameters, 0 gradients, 16.4 GFLOPs
     Class     Images  Instances          P          R      mAP50   mAP50-95
       all       5000      36335      0.436      0.286      0.277       0.15
...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.152
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.282
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.148
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.074
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.169
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.194
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.180
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.336
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.388
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.196
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.432
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.509

I should mention one side effect: when caching images before training, since only the caching stats for the RANK=0 process are displayed, the readout now (correctly) shows only a fraction (1/WORLD_SIZE) of the total data. Not sure if this should be changed too (multiply by WORLD_SIZE?). I can see pros and cons either way...

BEFORE: [screenshot: cache_before]
AFTER: [screenshot: cache_after]

@glenn-jocher (Member)

@davidsvaughn very cool! Results look pretty good. I just made a small change to display Rank 0 GB * WORLD_SIZE as an approximation of total RAM used. I will try to test this week, but your results seem perfectly fine.
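
For reference, a hedged sketch of what such a display tweak could look like (names and message format are illustrative, not the merged code):

```python
# Rank 0 caches 1/WORLD_SIZE of the images, so multiplying its cache size by
# WORLD_SIZE approximates the total RAM used across all ranks.
import os

WORLD_SIZE = int(os.getenv("WORLD_SIZE", 1))


def cache_message(rank0_cache_bytes):
    gb = rank0_cache_bytes * WORLD_SIZE / 1e9
    return f"Caching images ({gb:.1f}GB RAM, approximate total across {WORLD_SIZE} ranks)"


print(cache_message(3.2e9))
```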

@github-actions github-actions bot (Contributor)

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions to YOLOv5 πŸš€ and Vision AI ⭐.

@github-actions github-actions bot added the Stale label Mar 22, 2023
@kbegiedza

F5

What do you think @glenn-jocher ?
Looks nice to me

@github-actions github-actions bot removed the Stale label Mar 23, 2023
@davidsvaughn (Contributor, Author)

@glenn-jocher Hey again - I'm just wondering if you know whether this bug exists in yolov8, or if it was fixed there?
Thanks!

@glenn-jocher (Member)

@davidsvaughn hi there! I'm not sure about the specific bug you're referring to in YOLOv8, as I don't have the details at the moment. However, I can tell you that YOLOv8 is a different version from YOLOv5, so it's possible that any bugs present in YOLOv5 may not exist in YOLOv8, or vice versa. It would be best to check the documentation or official sources for YOLOv8 to see if there are any known bug reports or updates regarding the specific issue you're facing.

@github-actions github-actions bot (Contributor) commented Dec 6, 2023

πŸ‘‹ Hello there! We wanted to let you know that we've decided to close this pull request due to inactivity. We appreciate the effort you put into contributing to our project, but unfortunately, not all contributions are suitable or aligned with our product roadmap.

We hope you understand our decision, and please don't let it discourage you from contributing to open source projects in the future. We value all of our community members and their contributions, and we encourage you to keep exploring new projects and ways to get involved.

Thank you for your contributions to YOLO πŸš€ and Vision AI ⭐

@github-actions github-actions bot added the Stale label Dec 6, 2023
@glenn-jocher glenn-jocher merged commit 46ae996 into ultralytics:master Jan 3, 2024
@glenn-jocher
Copy link
Member

@davidsvaughn PR merged, thank you for your contributions!

@glenn-jocher glenn-jocher removed the TODO label Jan 3, 2024
pleb631 pushed a commit to pleb631/yolov5 that referenced this pull request Jan 6, 2024
Bugfix: update dataloaders.py to fix Multi-GPU DDP RAM multiple-cache issue (ultralytics#10383)

* Update dataloaders.py (see the PR description above)
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Update dataloaders.py: move extra parameter (rank) to end so won't mess up pre-existing positional args
* Update dataloaders.py: removing extra '#'
* Update dataloaders.py: sample from DDP index array (self.idx) in mixup mosaic
* Merging self.indices and self.idx (DDP indices) into single attribute (self.indices). Also adding SmartDistributedSampler to segmentation dataloader
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Multiply GB displayed by WORLD_SIZE

---------

Signed-off-by: davidsvaughn <davidsvaughn@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>