Commit
Bugfix: update dataloaders.py to fix Multi-GPU DDP RAM multiple-cache issue (#10383)

* Update dataloaders.py

  This is to address (and hopefully fix) this issue: Multi-GPU DDP RAM multiple-cache bug (#3818). This was a very serious and "blocking" issue until I could figure out what was going on. The problem was especially bad when running Multi-GPU jobs with 8 GPUs: RAM usage was 8x higher than expected (!), causing repeated OOM failures. Hopefully this fix will help others.

  DDP causes each RANK to launch its own process (one for each GPU) with its own trainloader and its own RAM image cache. The DistributedSampler used by DDP (https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py) feeds only a subset of images (1/WORLD_SIZE) to each available GPU on each epoch, but since the images are shuffled between epochs, a different subset reaches each GPU every time, so each GPU process must still cache all images.

  So I created a subclass of DistributedSampler called SmartDistributedSampler that forces each GPU process to always sample the same subset (using modulo arithmetic with RANK and WORLD_SIZE) while still allowing random shuffling between epochs. I don't believe this disrupts the overall "randomness" of the sampling, and I haven't noticed any performance degradation.

  Signed-off-by: davidsvaughn <davidsvaughn@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* Update dataloaders.py

  Move the extra parameter (rank) to the end so it won't mess up pre-existing positional args.

* Update dataloaders.py

  Remove an extra '#'.

* Update dataloaders.py

  Sample from the DDP index array (self.idx) in mixup mosaic.

* Merging self.indices and self.idx (DDP indices) into single attribute (self.indices). Also adding SmartDistributedSampler to segmentation dataloader

* [pre-commit.ci] auto fixes from pre-commit.com hooks

  for more information, see https://pre-commit.ci

* Multiply GB displayed by WORLD_SIZE

---------

Signed-off-by: davidsvaughn <davidsvaughn@gmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Glenn Jocher <glenn.jocher@ultralytics.com>
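
The fixed-subset idea described in the commit message can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code from dataloaders.py: the function names `rank_indices` and `epoch_order` and the `seed` parameter are hypothetical, and the real SmartDistributedSampler is a subclass of PyTorch's DistributedSampler rather than a pair of free functions. The sketch only shows why modulo partitioning lets each rank cache a fixed 1/WORLD_SIZE share of the images while the visiting order still changes every epoch.

```python
import random


def rank_indices(dataset_len, rank, world_size):
    """Return the fixed subset of dataset indices owned by one rank.

    An index i belongs to `rank` when i % world_size == rank, so the
    subset never changes between epochs and only these images need to
    be cached by that rank's process. (Hypothetical helper.)
    """
    return list(range(rank, dataset_len, world_size))


def epoch_order(dataset_len, rank, world_size, epoch, seed=0):
    """Return the rank's fixed subset in a shuffled order for one epoch.

    Only the order depends on the epoch; the set of indices does not.
    (Hypothetical helper; `seed` is an assumed parameter.)
    """
    idx = rank_indices(dataset_len, rank, world_size)
    rng = random.Random(seed + epoch)  # deterministic per-epoch shuffle
    rng.shuffle(idx)
    return idx


if __name__ == "__main__":
    world_size, dataset_len = 4, 10
    print(rank_indices(dataset_len, 1, world_size))  # [1, 5, 9]
    for epoch in range(2):
        for rank in range(world_size):
            order = epoch_order(dataset_len, rank, world_size, epoch)
            print(f"epoch {epoch} rank {rank}: {order}")
```

Under these assumptions, a job with 8 ranks would have each process cache roughly one eighth of the dataset instead of all of it, which is the saving the commit message describes.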