
Bugfix: update dataloaders.py to fix Multi-GPU DDP RAM multiple-cache issue #10383

Merged

merged 11 commits on Jan 3, 2024

Commits on Dec 2, 2022

  1. Update dataloaders.py

    This addresses (and hopefully fixes) the Multi-GPU DDP RAM multiple-cache bug ultralytics#3818 (ultralytics#3818). This was a serious, blocking issue until I could figure out what was going on. The problem was especially bad when running Multi-GPU jobs with 8 GPUs: RAM usage was 8x higher than expected, causing repeated OOM failures. Hopefully this fix will help others.
    DDP causes each RANK to launch its own process (one per GPU) with its own trainloader and its own RAM image cache. The DistributedSampler used by DDP (https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py) feeds only a subset of images (1/WORLD_SIZE) to each GPU on each epoch, but because the images are reshuffled between epochs, each GPU process must still cache all images. So I created a subclass of DistributedSampler called SmartDistributedSampler that forces each GPU process to always sample the same subset (using modulo arithmetic with RANK and WORLD_SIZE) while still allowing random shuffling between epochs; a sketch of this approach follows the Dec 2 commit list below. I don't believe this disrupts the overall randomness of the sampling, and I haven't noticed any performance degradation.
    
    Signed-off-by: davidsvaughn <davidsvaughn@gmail.com>
    davidsvaughn committed Dec 2, 2022
    Commit 5488bd5
  2. Commit 5721f2e
  3. Update dataloaders.py

    Moved the extra parameter (rank) to the end of the signature so it won't break pre-existing positional args.
    davidsvaughn committed Dec 2, 2022
    Commit be8594f
  4. Update dataloaders.py

    removing extra '#'
    davidsvaughn committed Dec 2, 2022
    Commit 3e818a1
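
A minimal sketch of the sampler idea from the first Dec 2 commit, assuming a standard PyTorch Dataset. The class name SmartDistributedSampler matches the PR, but the body below is illustrative rather than the merged implementation (it ignores the drop_last/padding details the real code has to handle).

```python
import torch
from torch.utils.data import DistributedSampler, TensorDataset


class SmartDistributedSampler(DistributedSampler):
    """DistributedSampler variant that pins every rank to a fixed modulo slice.

    Rank r only ever yields indices i with i % num_replicas == r, so under DDP each
    process only needs to cache ~1/WORLD_SIZE of the images, while its slice is
    still reshuffled every epoch.
    """

    def __iter__(self):
        # Deterministic shuffle: a different permutation each epoch, same on every rank.
        g = torch.Generator()
        g.manual_seed(self.seed + self.epoch)

        # This rank's fixed slice: rank, rank + num_replicas, rank + 2 * num_replicas, ...
        own = torch.arange(self.rank, len(self.dataset), self.num_replicas)
        if self.shuffle:
            own = own[torch.randperm(len(own), generator=g)]
        return iter(own.tolist())


if __name__ == "__main__":
    # Tiny CPU-only demo: 10 samples split across 2 "ranks".
    data = TensorDataset(torch.arange(10))
    for rank in range(2):
        sampler = SmartDistributedSampler(data, num_replicas=2, rank=rank, shuffle=True, seed=0)
        sampler.set_epoch(0)
        print(rank, list(sampler))  # rank 0 sees only even indices, rank 1 only odd
```

Because a given rank can only ever be asked for indices from its own slice, its image cache never has to grow past roughly len(dataset)/WORLD_SIZE entries, which is what removes the 8x RAM blow-up described in the first commit message.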

Commits on Dec 3, 2022

  1. Commit 998687d

Commits on Dec 5, 2022

  1. Update dataloaders.py

    Sample from the DDP index array (self.idx) in the mixup and mosaic augmentations (see the sketch after the Dec 5 commit list).
    davidsvaughn committed Dec 5, 2022
    Commit b0a30eb
  2. Commit 7f308e7
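
A small sketch of what the Dec 5 change does, under the assumption that the dataset keeps this rank's sampler indices in a per-rank index array (what the commit calls self.idx); the helper name below is hypothetical.

```python
import random


def pick_mosaic_indices(anchor, rank_indices, k=3):
    # One anchor image plus k random partner tiles for a mosaic (or a mixup partner
    # when k=1). Drawing the partners from this rank's own index array, instead of
    # from range(len(dataset)), keeps every DDP rank inside its own cached subset.
    return [anchor] + random.choices(rank_indices, k=k)


# Example: rank 1 of 2 on a 10-image dataset only ever mixes images 1, 3, 5, 7, 9.
print(pick_mosaic_indices(3, list(range(1, 10, 2))))
```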

Commits on Dec 7, 2022

  1. Merge self.indices and self.idx (the DDP indices) into a single attribute (self.indices).

    Also add SmartDistributedSampler to the segmentation dataloader (see the wiring sketch after the Dec 7 commit list).
    davidsvaughn committed Dec 7, 2022
    Commit e2991ea
  2. Commit 9ba2a0f
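
A hedged sketch of how the sampler can be wired into both the detection and the segmentation dataloaders. The create_dataloader name and its parameters here are illustrative, and SmartDistributedSampler refers to the sketch after the Dec 2 commits above.

```python
from torch.utils.data import DataLoader


def create_dataloader(dataset, batch_size, rank=-1, world_size=1, shuffle=True, workers=8):
    # Single-process training (rank == -1) shuffles normally; DDP training gets a
    # SmartDistributedSampler so each rank stays on its own modulo slice of the data.
    sampler = None if rank == -1 else SmartDistributedSampler(
        dataset, num_replicas=world_size, rank=rank, shuffle=shuffle
    )
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=shuffle and sampler is None,  # DataLoader's own shuffle must be off when a sampler is set
        sampler=sampler,
        num_workers=workers,
        pin_memory=True,
    )
```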

Commits on Dec 9, 2022

  1. Commit 4ea372a
  2. Commit 092b944