Bugfix: update dataloaders.py to fix Multi-GPU DDP RAM multiple-cache issue #10383
Conversation
This is to address (and hopefully fix) this issue: Multi-GPU DDP RAM multiple-cache bug ultralytics#3818 (ultralytics#3818). This was a very serious and "blocking" issue until I could figure out what was going on. The problem was especially bad when running Multi-GPU jobs with 8 GPUs: RAM usage was 8x higher than expected (!), causing repeated OOM failures. Hopefully this fix will help others. DDP causes each RANK to launch its own process (one for each GPU) with its own trainloader, and its own RAM image cache. The DistributedSampler used by DDP (https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py) will feed only a subset of images (1/WORLD_SIZE) to each available GPU on each epoch, but since the images are shuffled between epochs, each GPU process must still cache all images. So I created a subclass of DistributedSampler called SmartDistributedSampler that forces each GPU process to always sample the same subset (using modulo arithmetic with RANK and WORLD_SIZE) while still allowing random shuffling between epochs. I don't believe this disrupts the overall "randomness" of the sampling, and I haven't noticed any performance degradation. Signed-off-by: davidsvaughn <davidsvaughn@gmail.com>
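The sampling scheme described above can be sketched in plain Python (a minimal illustration of the modulo-partitioning idea, not the actual PR code; the function name and signature are hypothetical):

```python
import random


def rank_sample_order(n, rank, world_size, seed, epoch, shuffle=True):
    """Indices one rank visits in one epoch.

    The subset {i : i % world_size == rank} is fixed across epochs, so each
    GPU process only ever needs to cache 1/world_size of the images, while
    the visiting order still changes from epoch to epoch.
    """
    subset = list(range(rank, n, world_size))  # fixed modulo subset for this rank
    if shuffle:
        # Deterministic per-epoch shuffle: every epoch reorders the SAME subset
        random.Random(seed + epoch).shuffle(subset)
    return subset
```

Because the subset never changes between epochs, each process's image cache is bounded at its 1/WORLD_SIZE share, which is what removes the 8x RAM blow-up with 8 GPUs.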
[pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
👋 Hello @davidsvaughn, thank you for submitting a YOLOv5 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:
- ✅ Verify your PR is up-to-date with ultralytics/yolov5 master branch. If your PR is behind you can update your code by clicking the 'Update branch' button or by running git pull and git merge master locally.
- ✅ Verify all YOLOv5 Continuous Integration (CI) checks are passing.
- ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." - Bruce Lee
move extra parameter (rank) to end so it won't mess up pre-existing positional args
removing extra '#'
@davidsvaughn thanks for the PR! This would be a great fix if it works. Unfortunately with these sorts of changes we should test on a full COCO training to make sure DDP shuffling/sampling are not affected (changes would be visible as reduced mAP). I'll add a TODO to prioritize this so we can run these tests once we free up GPUs in a few days, or do you have any of your own results vs master?
@davidsvaughn after looking at the code I'm a little confused, as it seems that the first image in the mosaic is pulled from self.indices, which span the entire dataset, but the remaining 3 are pulled from self.idx, which is only a subset of the dataset. Since the first image is pulled from the entire dataset, would the entire dataset need to be cached into RAM for that GPU?
@glenn-jocher I don't blame you for being confused... the way I did this is a little confusing. Sorry about that! Originally I didn't notice the mosaic code needed tweaking. I was trying to preserve as much as possible of the original code structure, by just tweaking the indices coming out of the DistributedSampler (now SmartDistributedSampler), which acts as a sort of "global" coordinator (i.e. global to all GPU processes) by sending the correct data indices to each GPU process based on its RANK (indices which all reference the "global" self.indices array). But when I realized the other 3 (or other 8) samples in the mosaic also needed to be sampled from the "correct" subset of indices (based on RANK), that is done outside the machinery of the (Smart)DistributedSampler... so I needed to just sample from self.idx directly. To say this another way: yes, the first image looks like it is pulled from the entire dataset (unlike the other 3) because it references the "global" self.indices array, but it's actually sampling from the same subset that is in self.idx. It's just that in the case of the first image, this smart sub-sampling is being controlled by the index coming out of SmartDistributedSampler. Yes, I think it would be cleaner if everything was sampled from the same array (self.idx seems like the natural choice), but this would mean more of an overhaul to the whole DistributedSampler thing. I guess it wouldn't really be needed, because then the first image in the mosaic would be sampled the same way as the other 3, where each GPU process just samples all values in self.idx (instead of some values in self.indices). Sorry, I hope that made sense (indices of indices... always fun!). I can try to make this less confusing, but maybe I'll wait for your feedback before proceeding. BTW... I also realized whatever change is made also needs to be duplicated in segment/dataloaders.py. I'm still getting used to the new segmentation code...
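The indices-of-indices situation can be made concrete with a toy example (hypothetical variable names mirroring the comment, not the actual dataloader code):

```python
import random

n, world_size, rank = 12, 4, 1
indices = list(range(n))             # "global" array over the whole dataset
idx = indices[rank::world_size]      # this rank's fixed subset: [1, 5, 9]

# First mosaic image: its index arrives from SmartDistributedSampler, which
# only ever emits positions belonging to this rank's subset...
i = random.choice(idx)
first = indices[i]                   # looks "global", but i was pre-restricted

# ...while the other 3 mosaic images are drawn from idx directly.
others = random.choices(idx, k=3)

# Either way, every image in the mosaic lands in this rank's partition,
# so only 1/world_size of the dataset ever needs caching here.
mosaic = [first] + others
```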
sample from DDP index array (self.idx) in mixup mosaic
@davidsvaughn ok got it. If the two attributes can be merged then yes let's do it please. Thanks!
Merging self.indices and self.idx (DDP indices) into single attribute (self.indices). Also adding SmartDistributedSampler to segmentation dataloader
[pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
@glenn-jocher okay, I merged the two attributes into one (self.indices and self.idx -> self.indices). When using a single GPU (or CPU), self.indices is the same as before (1..n). But with Multi-GPU (DDP), self.indices will only contain the subset of indices that are partitioned for the given GPU process. FYI, the point of the random permutation here is to make the partitioning of data across the GPUs random. The SmartDistributedSampler only randomizes the sampling (if shuffle=True); without also randomizing the partitioning, then (with WORLD_SIZE=4) the RANK=0 process would get self.indices=[0,4,8,12,...], the RANK=1 process would get self.indices=[1,5,9,13,...], and so on, and this affects which images are grouped together in mosaics and mixup. It seems like this should be randomized as well, in case there is some inherent order in the dataset. I also made the necessary changes to utils/segment/dataloaders.py so the same behavior is implemented for segmentation. This change was minor since most behavior is inherited from utils/dataloaders.py. However, I didn't make any changes to the classification pipeline, since that is different enough that it should probably be a separate PR. I ran COCO from-scratch training for 10 epochs, before and after the change, as a sanity check to show there's no effect on mAP. I realize you'll probably need to train for longer, but I'll just show my results (using 4 x Tesla V100):
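The randomized partitioning can be sketched as follows (a toy illustration assuming every rank applies the same seeded permutation before taking its modulo slice; names are hypothetical, not the actual PR code):

```python
import random


def rank_partition(n, rank, world_size, seed=0, randomize=True):
    """The merged self.indices for one rank: a fixed 1/world_size slice."""
    order = list(range(n))
    if randomize:
        # Same seed on every rank => identical permutation, disjoint slices
        random.Random(seed).shuffle(order)
    return order[rank::world_size]


# Without the permutation, the split is strided and follows dataset order:
# RANK=0 -> [0, 4, ...], RANK=1 -> [1, 5, ...], etc.
strided_r0 = rank_partition(8, 0, 4, randomize=False)
strided_r1 = rank_partition(8, 1, 4, randomize=False)

# With it, ranks still get disjoint slices covering the whole dataset,
# but mosaic/mixup groupings are no longer tied to any inherent ordering.
parts = [rank_partition(8, r, 4, seed=42) for r in range(4)]
```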
I should mention that one side effect is that when caching images before training, since what is displayed is only the caching stats for the RANK=0 process, this now (correctly) shows only a fraction (1/WORLD_SIZE) of the total data. Not sure if this should be changed too (multiply by WORLD_SIZE?). I can see pros and cons either way...
@davidsvaughn very cool! Results look pretty good. I just made a small change to display Rank 0 GB * WORLD_SIZE as an approximation of total RAM used. I will try to test this week, but your results seem perfectly fine.
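The display tweak amounts to scaling rank 0's cache size, under the assumption that every rank caches a roughly equal 1/WORLD_SIZE share (function name hypothetical):

```python
def approx_total_cache_gb(rank0_cache_bytes, world_size):
    """Approximate RAM used by ALL DDP processes from rank 0's cache alone.

    Only rank 0 prints caching stats, and after this PR it caches just its
    own 1/world_size share, so multiplying recovers an estimated total.
    """
    return rank0_cache_bytes * world_size / 1e9


# e.g. rank 0 reports a 2 GB cache with 4 GPUs -> ~8 GB total across ranks
total = approx_total_cache_gb(2e9, 4)
```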
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐.
What do you think @glenn-jocher?
@glenn-jocher Hey again - I'm just wondering if you know whether this bug exists in yolov8, or if it was fixed there?
@davidsvaughn hi there! I'm not sure about the specific bug you're referring to in YOLOv8, as I don't have the details at the moment. However, I can tell you that YOLOv8 is a different version from YOLOv5, so it's possible that any bugs present in YOLOv5 may not exist in YOLOv8, or vice versa. It would be best to check the documentation or official sources for YOLOv8 to see if there are any known bug reports or updates regarding the specific issue you're facing.
👋 Hello there! We wanted to let you know that we've decided to close this pull request due to inactivity. We appreciate the effort you put into contributing to our project, but unfortunately, not all contributions are suitable or aligned with our product roadmap. We hope you understand our decision, and please don't let it discourage you from contributing to open source projects in the future. We value all of our community members and their contributions, and we encourage you to keep exploring new projects and ways to get involved. For additional resources and information, please see the links below:
Thank you for your contributions to YOLO 🚀 and Vision AI ⭐
@davidsvaughn PR merged, thank you for your contributions!
Bugfix: update dataloaders.py to fix Multi-GPU DDP RAM multiple-cache issue (ultralytics#10383)

* Update dataloaders.py: add SmartDistributedSampler to fix Multi-GPU DDP RAM multiple-cache bug ultralytics#3818 (see PR description above)
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Update dataloaders.py: move extra parameter (rank) to end so it won't mess up pre-existing positional args
* Update dataloaders.py: removing extra '#'
* Update dataloaders.py: sample from DDP index array (self.idx) in mixup mosaic
* Merging self.indices and self.idx (DDP indices) into single attribute (self.indices). Also adding SmartDistributedSampler to segmentation dataloader
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Multiply GB displayed by WORLD_SIZE

Signed-off-by: davidsvaughn <davidsvaughn@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary
Enhanced data loading for improved distributed training support in YOLOv5 🚀

📊 Key Changes
- `WORLD_SIZE` variable for better control in distributed environments.
- `SmartDistributedSampler`, an iterator overriding PyTorch's `DistributedSampler` for more efficient data shuffling.
- `SmartDistributedSampler`, adding `rank` and `seed` parameters.
- `WORLD_SIZE` in RAM usage calculations.

🎯 Purpose & Impact