
Bugfix: update dataloaders.py to fix Multi-GPU DDP RAM multiple-cache issue #10383

Merged

merged 11 commits on Jan 3, 2024

Commits on Dec 2, 2022

  1. Update dataloaders.py

    This addresses (and hopefully fixes) the Multi-GPU DDP RAM multiple-cache bug ultralytics#3818 (ultralytics#3818). This was a serious, blocking issue until I could figure out what was going on. The problem was especially bad when running Multi-GPU jobs with 8 GPUs: RAM usage was 8x higher than expected, causing repeated OOM failures. Hopefully this fix will help others.
    DDP causes each RANK to launch its own process (one per GPU) with its own trainloader and its own RAM image cache. The DistributedSampler used by DDP (https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py) feeds only a subset of images (1/WORLD_SIZE) to each GPU on each epoch, but because the images are reshuffled between epochs, each GPU process must still cache all images. So I created a subclass of DistributedSampler called SmartDistributedSampler that forces each GPU process to always sample the same subset (using modulo arithmetic with RANK and WORLD_SIZE) while still allowing random shuffling between epochs; a sketch of this approach follows the Dec 2 commit list below. I don't believe this disrupts the overall randomness of the sampling, and I haven't noticed any performance degradation.
    
    Signed-off-by: davidsvaughn <davidsvaughn@gmail.com>
    davidsvaughn committed Dec 2, 2022
    Commit 5488bd5
  2. Commit 5721f2e
  3. Update dataloaders.py

    Moved the extra parameter (rank) to the end of the signature so it won't break pre-existing positional args.
    davidsvaughn committed Dec 2, 2022
    Commit be8594f
  4. Update dataloaders.py

    removing extra '#'
    davidsvaughn committed Dec 2, 2022
    Commit 3e818a1
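
A minimal sketch of the sampler idea from the first Dec 2 commit, assuming a standard PyTorch Dataset. The class name SmartDistributedSampler matches the PR, but the body below is illustrative rather than the merged implementation (it ignores the drop_last/padding details the real code has to handle).

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset


class SmartDistributedSampler(DistributedSampler):
    """DistributedSampler variant that pins every rank to a fixed modulo slice.

    Rank r only ever yields indices i with i % num_replicas == r, so under DDP each
    process only needs to cache ~1/WORLD_SIZE of the images, while its slice is
    still reshuffled every epoch.
    """

    def __iter__(self):
        # Deterministic shuffle: a different permutation each epoch, same on every rank.
        g = torch.Generator()
        g.manual_seed(self.seed + self.epoch)

        # This rank's fixed slice: rank, rank + num_replicas, rank + 2 * num_replicas, ...
        own = torch.arange(self.rank, len(self.dataset), self.num_replicas)
        if self.shuffle:
            own = own[torch.randperm(len(own), generator=g)]
        return iter(own.tolist())


if __name__ == "__main__":
    # Tiny CPU-only demo: 10 samples split across 2 "ranks".
    data = TensorDataset(torch.arange(10))
    for rank in range(2):
        sampler = SmartDistributedSampler(data, num_replicas=2, rank=rank, shuffle=True, seed=0)
        sampler.set_epoch(0)
        print(rank, list(sampler))  # rank 0 sees only even indices, rank 1 only odd
```

Because a given rank can only ever be asked for indices from its own slice, its image cache never has to grow past roughly len(dataset)/WORLD_SIZE entries, which is what removes the 8x RAM blow-up described in the first commit message.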

Commits on Dec 3, 2022

  1. Commit 998687d

Commits on Dec 5, 2022

  1. Update dataloaders.py

    Sample from the DDP index array (self.idx) in the mixup and mosaic augmentations (see the sketch after the Dec 5 commit list).
    davidsvaughn committed Dec 5, 2022
    Commit b0a30eb
  2. Commit 7f308e7
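
A small sketch of what the Dec 5 change does, under the assumption that the dataset keeps this rank's sampler indices in a per-rank index array (what the commit calls self.idx); the helper name below is hypothetical.

```python
import random


def pick_mosaic_indices(anchor, rank_indices, k=3):
    # One anchor image plus k random partner tiles for a mosaic (or a mixup partner
    # when k=1). Drawing the partners from this rank's own index array, instead of
    # from range(len(dataset)), keeps every DDP rank inside its own cached subset.
    return [anchor] + random.choices(rank_indices, k=k)


# Example: rank 1 of 2 on a 10-image dataset only ever mixes images 1, 3, 5, 7, 9.
print(pick_mosaic_indices(3, list(range(1, 10, 2))))
```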

Commits on Dec 7, 2022

  1. Merge self.indices and self.idx (the DDP indices) into a single attribute (self.indices).

    Also add SmartDistributedSampler to the segmentation dataloader (see the wiring sketch after the Dec 7 commit list).
    davidsvaughn committed Dec 7, 2022
    Commit e2991ea
  2. Commit 9ba2a0f
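
A hedged sketch of how the sampler can be wired into both the detection and the segmentation dataloaders. The create_dataloader name and its parameters here are illustrative, and SmartDistributedSampler refers to the sketch after the Dec 2 commits above.

```python
from torch.utils.data import DataLoader


def create_dataloader(dataset, batch_size, rank=-1, world_size=1, shuffle=True, workers=8):
    # Single-process training (rank == -1) shuffles normally; DDP training gets a
    # SmartDistributedSampler so each rank stays on its own modulo slice of the data.
    sampler = None if rank == -1 else SmartDistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=shuffle
    )
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle and sampler is None,  # DataLoader's own shuffle must be off when a sampler is set
        sampler=sampler,
        num_workers=workers,
        pin_memory=True,
    )
```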

Commits on Dec 9, 2022

  1. Commit 4ea372a
  2. Commit 092b944