
generator seed fix for DDP mAP drop #9545

Merged: 8 commits, Sep 24, 2022

Conversation

@Forever518 (Contributor) commented Sep 22, 2022

The mAP drops severely in DDP training mode. The chart below shows the training results of a single GPU without DDP versus 4-GPU DDP.

[chart: mAP curves for single-GPU (no DDP) vs 4-GPU DDP training]

I logged the random seed of each RANK and its workers. The current version generates the same seed for every RANK; only the workers within a process get different seeds. I suspect this affects the data augmentation, which is why the mAP drops.

Following the PyTorch guide for torch.Generator.manual_seed (link is here), I set the seed to a number with a balanced count of 0 and 1 bits, offset by RANK. The seeds of each RANK are now different, and the mAP of DDP training recovers to a normal level. You can see the result above or here.
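
Here is a minimal sketch of the seeding change, not the exact YOLOv5 code: the toy dataset and the seed_worker helper below are only for illustration.

import os
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

RANK = int(os.getenv('RANK', -1))  # set per process by torch.distributed.run


def seed_worker(worker_id):
    # derive per-worker numpy/random seeds from the torch worker seed
    worker_seed = torch.initial_seed() % 2 ** 32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


generator = torch.Generator()
generator.manual_seed(6148914691236517205 + RANK)  # 0x5555555555555555 + RANK: balanced bits, unique per RANK

dataset = TensorDataset(torch.zeros(8, 3, 64, 64))  # placeholder dataset
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2,
                    worker_init_fn=seed_worker, generator=generator)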

The remaining questions are:

  • The reason for the unreachable mAP: I used the same hyps as yours, but the mAP still cannot reach the official 37.4.
  • About training reproducibility: I am running another test right now, and the DDP result seems to be reproducible, at least on the same node with 4 identical GPUs. However, the result will necessarily differ between single-GPU and multi-GPU training because the seed is associated with RANK, and possibly with the type of machine. I have no solution for this yet.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced consistency in activation layers and improved data loader seeding for distributed training in the YOLOv5 model.

📊 Key Changes

  • 🛠 Refactored the activation class variable in Conv to be default_act for clearer code interpretation.
  • 🔄 Updated the activation function redefinition in yolo.py to match the new class variable name.
  • 🔀 Introduced RANK environment variable to ensure unique random seed generation in distributed settings for DataLoader and InfiniteDataLoader.

🎯 Purpose & Impact

  • 💡 Renaming the Conv activation variable improves code readability and maintainability, making it easier for developers to understand the default behavior.
  • 🌐 By incorporating the RANK environment variable, the change ensures that workers in different distributed training processes have varying random seeds, thereby enhancing reproducibility and reducing potential data overlap during training.
  • 🤖 These updates will benefit developers maintaining the codebase and users leveraging the YOLOv5 model for scalable, distributed training settings.

@github-actions (bot) commented:

👋 Hello @Forever518, thank you for submitting a YOLOv5 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • ✅ Verify your PR is up-to-date with ultralytics/yolov5 master branch. If your PR is behind you can update your code by clicking the 'Update branch' button or by running git pull and git merge master locally.
  • ✅ Verify all YOLOv5 Continuous Integration (CI) checks are passing.
  • ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." — Bruce Lee

@glenn-jocher (Member) commented:

@Forever518 thanks for the PR!

A DDP mAP drop sounds like a serious problem. Are you sure you're able to reproduce it with the latest torch and YOLOv5 on a common dataset like COCO128 or VOC?

@Forever518 (Contributor, Author) commented Sep 23, 2022

@glenn-jocher Sure, I trained YOLOv5s with PyTorch 1.12.1+cu116 and the latest YOLOv5 on the COCO dataset. The chart clearly shows that with DDP enabled the mAP drops from the very beginning, and after 300 epochs the DDP result is about 2% lower than both the official result and my fixed version. (repro-1 and repro-2 are runs with my PR fix; the other two use the latest YOLOv5.)

It's really a strange problem because I didn't see any drop in your wandb runs with multi-GPU.

COCO128 is too small for DDP training, and due to my network I have not tested the VOC dataset yet.

python -m torch.distributed.run --nproc_per_node 4 train.py --data data/coco.yaml --img 640 --epoch 300 --device 0,1,2,3 --hyp data/hyps/hyp.scratch-low.yaml --cfg models/yolov5s.yaml --weights '' --batch 128 --project yolov5-ddp-seed --name repro-1

[charts: COCO mAP curves comparing repro-1/repro-2 (with the fix) against the unmodified DDP runs]

@Forever518 (Contributor, Author) commented:

@glenn-jocher I found another bug when trying to change the act field of Conv: when I set self.act to another activation function, reading self.act still returns nn.SiLU(). Because act is a class attribute on an nn.Module subclass, nn.Module.__setattr__ stores the new module in self._modules, but attribute lookup finds the class attribute first, so the assignment to self.act in __init__ has no visible effect. You can run the code below to reproduce it. On my machine it outputs SiLU() twice.

import torch.nn as nn


def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p


class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    act = nn.SiLU()  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        print(self.act)  # prints SiLU(): resolved from the class attribute
        self.act = nn.Identity()  # nn.Module.__setattr__ stores this in self._modules['act']
        print(self.act)  # still prints SiLU(): the class attribute shadows the registered submodule

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))


if __name__ == '__main__':
    c = Conv(3, 64)
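
The rename to default_act described in the PR summary avoids this shadowing: with no class attribute named act, self.act resolves through nn.Module's registered submodules, so the assignment in __init__ takes effect. Below is a minimal sketch of that approach, reusing autopad and nn from the snippet above; it is illustrative, not the exact merged code.

class ConvFixed(nn.Module):
    # Sketch of the renamed-attribute approach
    default_act = nn.SiLU()  # default activation, no longer shadows self.act

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        # self.act now resolves through nn.Module's _modules, so this assignment is visible
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


print(ConvFixed(3, 64, act=nn.Identity()).act)  # Identity(), as expected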

@glenn-jocher (Member) commented:

@Forever518 ok, I ran an experiment and am able to reproduce the mAP drop on COCO. I'll go ahead and merge this fix, but what's the reason for the large number in the seed init? Could this also work with just RANK?

- generator.manual_seed(6148914691236517205 + RANK)
+ generator.manual_seed(RANK)  # can we use this instead?

glenn-jocher linked an issue on Sep 24, 2022 that may be closed by this pull request.
@Forever518 (Contributor, Author) commented:

@glenn-jocher The PyTorch documentation suggests using a large number with balanced 0 and 1 bits as the seed; see the torch.Generator.manual_seed usage. I just set it to the number whose hex representation is 0x5555555555555555. Of course the seed could be some other value such as 2147483647, just not RANK alone, whose binary representation is almost all zeros.
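
For illustration, a quick check of the bit balance in plain Python (not part of the PR):

>>> hex(6148914691236517205)
'0x5555555555555555'
>>> bin(6148914691236517205).count('1')  # 32 of the 64 bits are set: balanced 0s and 1s
32
>>> bin(3)  # a bare RANK value such as 3 is almost all zero bits
'0b11'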

Actually the result on my machine still cannot reach 37.4 with the latest version even after this fix; it is just under 0.4 below the official benchmark... I am afraid something about the training settings is still missing, just as you mentioned in PR #8602.

[charts: final COCO results on my machine after the fix]

@glenn-jocher (Member) commented:

@Forever518 ah, if you are trying to reproduce the official results then you should download the COCO-segments dataset, i.e.:

bash data/scripts/get_coco.sh --train --val --segments

Also note that YOLOv5s results are from single-GPU training; DDP may reduce mAP slightly.

@glenn-jocher (Member) commented:

@Forever518 I'm going to merge, but first I need to expand this solution to all 3 dataloaders with generators: detection, classification, and segmentation.

@glenn-jocher (Member) commented:

[screenshot attached]

@Forever518 (Contributor, Author) commented:

@glenn-jocher

@Forever518 I'm going to merge, but first I need to expand this solution to all 3 dataloaders with generators: detection, classification, and segmentation.

Haha, maybe a shared framework is needed to support the 3 tasks at the same time, so the code doesn't have to be changed in 3 places every time. 🤣

glenn-jocher changed the title from "Try to fix DDP mAP drop" to "generator seed fix for DDP mAP drop" on Sep 24, 2022
@glenn-jocher (Member) commented:

@Forever518 yes definitely. This triple maintenance is not optimal.

@AyushExel @Laughing-q this should resolve generator seed issues across tasks, and may lead to DDP reproducibility

glenn-jocher merged commit f11a8a6 into ultralytics:master on Sep 24, 2022
@glenn-jocher (Member) commented:

@Forever518 PR is merged. Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐

Can you verify that the DDP mAP drop issue is now resolved in master?

@Forever518 (Contributor, Author) commented Sep 24, 2022

@Forever518 PR is merged. Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐

Can you verify that the DDP mAP drop issue is now resolved in master?

@glenn-jocher OK, I will try it later. Due to computing resource constraints it may take some time...

@glenn-jocher (Member) commented:

@Forever518 great, thank you!

@glenn-jocher (Member) commented Sep 24, 2022

@Forever518 @AyushExel @Laughing-q I can confirm that the DetectionModel DDP low-mAP training bug is now fixed, and training is now reproducible across DDP runs as well (first time ever, awesome!).

Tested with 4x A100 on YOLOv5m6 COCO twice vs baseline.

@AyushExel (Contributor) commented:

Awesome!!! 🚀

Successfully merging this pull request may close these issues.

Latest code can't reproduce YOLOv5s on coco