
Bugfix: update dataloaders.py to fix Multi-GPU DDP RAM multiple-cache issue #10383

Merged Β· 11 commits Β· Jan 3, 2024

Conversation

@davidsvaughn (Contributor) commented Dec 2, 2022

This is to address (and hopefully fix) this issue: Multi-GPU DDP RAM multiple-cache bug (#3818). This was a very serious and "blocking" issue until I could figure out what was going on. The problem was especially bad when running multi-GPU jobs with 8 GPUs: RAM usage was 8x higher than expected, causing repeated OOM failures. Hopefully this fix will help others.

DDP causes each RANK to launch its own process (one per GPU) with its own trainloader and its own RAM image cache. The DistributedSampler used by DDP (https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py) feeds only a subset of images (1/WORLD_SIZE) to each available GPU on each epoch, but since the images are shuffled between epochs, each GPU process must still cache all images. So I created a subclass of DistributedSampler called SmartDistributedSampler that forces each GPU process to always sample the same subset (using modulo arithmetic with RANK and WORLD_SIZE) while still allowing random shuffling between epochs. I don't believe this disrupts the overall "randomness" of the sampling, and I haven't noticed any performance degradation.

Signed-off-by: davidsvaughn davidsvaughn@gmail.com
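
For readers skimming the thread, here is a minimal sketch of the idea described above, assuming the stock DistributedSampler attributes (rank, num_replicas, epoch, seed, shuffle). It is an illustration of the approach, not the code merged in this PR:

```python
# Minimal sketch of the SmartDistributedSampler idea (illustration only, not the merged code).
import torch
from torch.utils.data.distributed import DistributedSampler


class SmartDistributedSampler(DistributedSampler):
    def __iter__(self):
        # This rank's fixed subset: rank, rank + world_size, rank + 2 * world_size, ...
        # so its RAM cache only ever needs ~len(dataset) / num_replicas images.
        idx = torch.arange(self.rank, len(self.dataset), self.num_replicas)

        if self.shuffle:
            # Deterministic, epoch-dependent reshuffle *within* the fixed subset
            g = torch.Generator()
            g.manual_seed(self.seed + self.epoch)
            idx = idx[torch.randperm(len(idx), generator=g)]

        return iter(idx.tolist())

    def __len__(self):
        return len(range(self.rank, len(self.dataset), self.num_replicas))
```

As with the stock sampler, train.py would still call sampler.set_epoch(epoch) every epoch so that the within-subset shuffle changes between epochs.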

πŸ› οΈ PR Summary

Made with ❀️ by Ultralytics Actions

🌟 Summary

Enhanced data loading for improved distributed training support in YOLOv5 πŸš€

πŸ“Š Key Changes

  • πŸ†• Added WORLD_SIZE variable for better control in distributed environments.
  • πŸ‘¨β€πŸ’» Created SmartDistributedSampler, a subclass of PyTorch's DistributedSampler that keeps each rank sampling from a fixed subset while still shuffling between epochs.
  • πŸ”„ Modified data loaders and dataloading methods to use SmartDistributedSampler, adding rank and seed parameters.
  • πŸ”’ Changed how indices are calculated and used in various methods for improved distributed training consistency.
  • πŸ”© Adjusted image caching logic to consider WORLD_SIZE in RAM usage calculations.

🎯 Purpose & Impact

  • πŸ“ˆ Improves model training efficiency and effectiveness in multi-GPU setups.
  • 🀝 Ensures consistent data sampling across different processes for better distributed training.
  • πŸ›  Offers a more predictable and systematic approach to data shuffling, enhancing training stability.
  • πŸ” Provides finer memory usage control, leading to better resource management during training.
  • These updates are likely to benefit users training models on clusters or with multiple GPUs, leading to potential improvements in model performance and training speed.

davidsvaughn and others added 2 commits December 2, 2022 11:50

@github-actions github-actions bot left a comment


πŸ‘‹ Hello @davidsvaughn, thank you for submitting a YOLOv5 πŸš€ PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • βœ… Verify your PR is up-to-date with ultralytics/yolov5 master branch. If your PR is behind you can update your code by clicking the 'Update branch' button or by running git pull and git merge master locally.
  • βœ… Verify all YOLOv5 Continuous Integration (CI) checks are passing.
  • βœ… Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." β€” Bruce Lee

davidsvaughn and others added 3 commits December 2, 2022 12:44
move extra parameter (rank) to end so won't mess up pre-existing positional args
removing extra '#'
@glenn-jocher (Member)

@davidsvaughn thanks for the PR! This would be a great fix if it works. Unfortunately with these sorts of changes we should test on a full COCO training to make sure DDP shuffling/sampling are not affected (changes would be visible as reduced mAP). I'll add a TODO to prioritize this so we can run these tests once we free up GPUs in a few days, or do you have any of your own results vs master?

@glenn-jocher glenn-jocher self-assigned this Dec 3, 2022
@glenn-jocher (Member) commented Dec 3, 2022

@davidsvaughn after looking at the code I'm a little confused, as it seems that the first image in the mosaic is pulled from self.indices, which span the entire dataset, but the remaining 3 are pulled from self.idx, which is only a subset of the dataset. Since the first image is pulled from the entire dataset would the entire dataset need to be cached into RAM for that GPU?

@davidsvaughn (Contributor, Author) commented Dec 5, 2022

@glenn-jocher I don't blame you for being confused... the way I did this is a little confusing. Sorry about that! Originally I didn't notice that the mosaic code needed tweaking. I was trying to preserve as much of the original code structure as possible by just tweaking the indices coming out of the DistributedSampler (now SmartDistributedSampler), which acts as a sort of "global" coordinator (i.e. global to all GPU processes) by sending the correct data indices to each GPU process based on its RANK (indices which all reference the "global" self.indices array). But the other 3 (or other 8) samples in the mosaic also need to be drawn from the "correct" subset of indices (based on RANK), and that happens outside the machinery of the (Smart)DistributedSampler... so I needed to sample from self.idx directly.

To say this another way... yes, the first image looks like it is pulled from the entire dataset (unlike the other 3) because it references the "global" self.indices array... but it's actually sampling from the same subset that is in self.idx. It's just that in the case of the first image, this smart sub-sampling is being controlled by the index coming out of SmartDistributedSampler.

Yes, I think it would be cleaner if everything were sampled from the same array (self.idx seems like the natural choice)... but this would mean more of an overhaul of the whole DistributedSampler machinery. I guess the extra array wouldn't really be needed then, because the first image in the mosaic would be sampled the same way as the other 3, with each GPU process just sampling values from self.idx (instead of some values in self.indices). Sorry, I hope that made sense (indices of indices... always fun!). I can try to make this less confusing, but maybe I'll wait for your feedback before proceeding.

BTW... I also realized whatever change is made also needs to be duplicated in segment/dataloaders.py. I'm still getting used to the new segmentation code...
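
To make the two sampling paths concrete, here is a small self-contained sketch; the function and variable names (pick_mosaic_indices, rank_indices) are illustrative, not YOLOv5 code:

```python
# Illustrative sketch of the mosaic index selection discussed above; names are hypothetical.
import random


def pick_mosaic_indices(sampler_index, rank_indices, tiles=4):
    # sampler_index arrives from SmartDistributedSampler, so it already lies inside this
    # rank's subset; the remaining tiles must be drawn from the same rank-local subset,
    # otherwise this rank would load (and cache) images belonging to other ranks.
    return [sampler_index] + random.choices(rank_indices, k=tiles - 1)


# Example: with WORLD_SIZE=4, rank 1 owns indices 1, 5, 9, ... (before any permutation)
rank_indices = list(range(1, 128, 4))
print(pick_mosaic_indices(rank_indices[0], rank_indices))
```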

davidsvaughn and others added 2 commits December 5, 2022 12:46
sample from DDP index array (self.idx) in mixup mosaic
@glenn-jocher (Member)

@davidsvaughn ok got it. If the two attributes can be merged then yes let's do it please. Thanks!

davidsvaughn and others added 2 commits December 7, 2022 10:41
Merging self.indices and self.idx (DDP indices) into single attribute (self.indices).

Also adding SmartDistributedSampler to segmentation dataloader
@davidsvaughn (Contributor, Author) commented Dec 7, 2022

@glenn-jocher okay, I merged the two attributes into one (self.indices and self.idx -> self.indices). When using a single GPU (or CPU), self.indices is the same as before (1..n). But with multi-GPU (DDP), self.indices contains only the subset of indices partitioned to the given GPU process.

FYI, the point of the random permutation here is to make the partitioning of data across the GPUs random. The SmartDistributedSampler only randomizes the sampling (if shuffle=True); without also randomizing the partitioning, the RANK=0 process (with WORLD_SIZE=4) would always get self.indices=[0,4,8,12,...], the RANK=1 process would get self.indices=[1,5,9,13,...], and so on, and this affects which images are grouped together in mosaics and mixup. It seems like this should be randomized as well, in case there is some inherent order in the dataset.
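
A tiny sketch of that partitioning step (variable names and the seed value are illustrative, not the merged code): every rank computes the same seeded permutation, then takes its modulo slice, so the partition stays consistent across processes while the assignment of images to ranks is randomized.

```python
# Illustrative sketch of randomized-but-consistent DDP partitioning.
import numpy as np

n, rank, world_size, seed = 12, 1, 4, 0
perm = np.random.RandomState(seed).permutation(n)  # identical on every rank (shared seed)
indices = perm[rank::world_size]                   # this rank's fixed, randomized subset
print(perm)     # the full permutation of 0..n-1
print(indices)  # the 1/world_size slice owned by this rank
```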

I also made necessary changes to utils/segment/dataloaders.py so the same behavior is implemented for segmentation. This change was minor since most behavior is inherited from utils/dataloaders.py. However, I didn't make any changes to the classification pipeline, since that is different enough that it should probably be a separate PR.

I ran COCO from-scratch training for 10 epochs, before and after the change, as a sanity check to show there's no effect on mAP. I realize you'll probably need to train for longer, but I'll just show my results (using 4Γ— Tesla V100):

torchrun --standalone --nnodes=1 --nproc_per_node=4 train.py --data coco.yaml --weights '' --cfg yolov5s.yaml --batch-size 256 --cache --epochs 10

BEFORE: [screenshot: ram_before]

Validating runs/train/exp1/weights/best.pt...
Fusing layers... 
YOLOv5s summary: 157 layers, 7225885 parameters, 0 gradients, 16.4 GFLOPs
     Class     Images  Instances          P          R      mAP50   mAP50-95
       all       5000      36335      0.413      0.286      0.272      0.147
...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.149
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.277
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.145
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.074
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.169
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.185
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.180
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.334
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.383
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.197
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.430
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.493

AFTER: [screenshot: ram_after]

Validating runs/train/exp1/weights/best.pt...
Fusing layers... 
YOLOv5s summary: 157 layers, 7225885 parameters, 0 gradients, 16.4 GFLOPs
     Class     Images  Instances          P          R      mAP50   mAP50-95
       all       5000      36335      0.436      0.286      0.277       0.15
...
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.152
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.282
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.148
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.074
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.169
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.194
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.180
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.336
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.388
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.196
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.432
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.509

I should mention one side effect: when caching images before training, since only the caching stats for the RANK=0 process are displayed, the readout now (correctly) shows only a fraction (1/WORLD_SIZE) of the total data. Not sure if this should be changed too (multiply by WORLD_SIZE?). I can see pros and cons either way...

BEFORE: [screenshot: cache_before]
AFTER: [screenshot: cache_after]

@glenn-jocher (Member)

@davidsvaughn very cool! Results look pretty good. I just made a small change to display Rank 0 GB * WORLD_SIZE as an approximation of total RAM used. I will try to test this week, but your results seem perfectly fine.
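
For reference, a hedged sketch of what such a display tweak could look like (names and message format are illustrative, not the merged code):

```python
# Rank 0 caches 1/WORLD_SIZE of the images, so multiplying its cache size by
# WORLD_SIZE approximates the total RAM used across all ranks.
import os

WORLD_SIZE = int(os.getenv("WORLD_SIZE", 1))


def cache_message(rank0_cache_bytes):
    gb = rank0_cache_bytes * WORLD_SIZE / 1e9
    return f"Caching images ({gb:.1f}GB RAM, approximate total across {WORLD_SIZE} ranks)"


print(cache_message(3.2e9))
```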

@github-actions github-actions bot (Contributor)

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions to YOLOv5 πŸš€ and Vision AI ⭐.

@github-actions github-actions bot added the Stale label Mar 22, 2023
@kbegiedza

F5

What do you think @glenn-jocher ?
Looks nice to me

@github-actions github-actions bot removed the Stale label Mar 23, 2023
@davidsvaughn (Contributor, Author)

@glenn-jocher Hey again - I'm just wondering if you know whether this bug exists in yolov8, or if it was fixed there?
Thanks!

@glenn-jocher (Member)

@davidsvaughn hi there! I'm not sure about the specific bug you're referring to in YOLOv8, as I don't have the details at the moment. However, I can tell you that YOLOv8 is a different version from YOLOv5, so it's possible that any bugs present in YOLOv5 may not exist in YOLOv8, or vice versa. It would be best to check the documentation or official sources for YOLOv8 to see if there are any known bug reports or updates regarding the specific issue you're facing.

@github-actions github-actions bot (Contributor) commented Dec 6, 2023

πŸ‘‹ Hello there! We wanted to let you know that we've decided to close this pull request due to inactivity. We appreciate the effort you put into contributing to our project, but unfortunately, not all contributions are suitable or aligned with our product roadmap.

We hope you understand our decision, and please don't let it discourage you from contributing to open source projects in the future. We value all of our community members and their contributions, and we encourage you to keep exploring new projects and ways to get involved.

Thank you for your contributions to YOLO πŸš€ and Vision AI ⭐

@github-actions github-actions bot added the Stale label Dec 6, 2023
@glenn-jocher glenn-jocher merged commit 46ae996 into ultralytics:master Jan 3, 2024
@glenn-jocher
Copy link
Member

@davidsvaughn PR merged, thank you for your contributions!

@glenn-jocher glenn-jocher removed the TODO label Jan 3, 2024
pleb631 pushed a commit to pleb631/yolov5 that referenced this pull request Jan 6, 2024
Bugfix: update dataloaders.py to fix Multi-GPU DDP RAM multiple-cache issue (ultralytics#10383)

* Update dataloaders.py (see the PR description above)
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Update dataloaders.py: move extra parameter (rank) to end so won't mess up pre-existing positional args
* Update dataloaders.py: removing extra '#'
* Update dataloaders.py: sample from DDP index array (self.idx) in mixup mosaic
* Merging self.indices and self.idx (DDP indices) into single attribute (self.indices). Also adding SmartDistributedSampler to segmentation dataloader
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Multiply GB displayed by WORLD_SIZE

---------

Signed-off-by: davidsvaughn <davidsvaughn@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>