
generator seed fix for DDP mAP drop #9545

Merged: 8 commits, Sep 24, 2022

Conversation

@Forever518 (Contributor) commented Sep 22, 2022

The mAP drops severely in DDP training mode. The chart below shows the training results of a single GPU without DDP versus 4-GPU DDP.

[chart: mAP curves for single-GPU (no DDP) vs 4-GPU DDP training]

I logged the random seed of each RANK and its workers. The current version generates the same seed for every RANK; only the workers within a process get different seeds. I suspect this affects the data augmentation, which is why the mAP drops.

Following the PyTorch guide for torch.Generator.manual_seed (link is here), I set the seed to a number with a balanced count of 0 and 1 bits, offset by RANK. The seeds of each RANK are now different, and the mAP of DDP training recovers to a normal level. You can see the result above or here.
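
Here is a minimal sketch of the seeding change, not the exact YOLOv5 code: the toy dataset and the seed_worker helper below are only for illustration.

import os
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

RANK = int(os.getenv('RANK', -1))  # set per process by torch.distributed.run


def seed_worker(worker_id):
    # derive per-worker numpy/random seeds from the torch worker seed
    worker_seed = torch.initial_seed() % 2 ** 32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


generator = torch.Generator()
generator.manual_seed(6148914691236517205 + RANK)  # 0x5555555555555555 + RANK: balanced bits, unique per RANK

dataset = TensorDataset(torch.zeros(8, 3, 64, 64))  # placeholder dataset
loader = DataLoader(dataset, batch_size=4, shuffle=True, num_workers=2,
                    worker_init_fn=seed_worker, generator=generator)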

The remaining questions are:

  • The reason for the unreachable mAP: I used the same hyps as yours, but the mAP still cannot reach the official 37.4.
  • About training reproducibility: I am running another test right now, and the DDP result seems to be reproducible, at least on the same node with 4 identical GPUs. However, the result will necessarily differ between single-GPU and multi-GPU training because the seed is associated with RANK, and possibly with the type of machine. I have no solution for this yet.

🛠️ PR Summary

Made with ❤️ by Ultralytics Actions

🌟 Summary

Enhanced consistency in activation layers and improved data loader seeding for distributed training in the YOLOv5 model.

📊 Key Changes

  • 🛠 Refactored the activation class variable in Conv to be default_act for clearer code interpretation.
  • 🔄 Updated the activation function redefinition in yolo.py to match the new class variable name.
  • 🔀 Introduced RANK environment variable to ensure unique random seed generation in distributed settings for DataLoader and InfiniteDataLoader.

🎯 Purpose & Impact

  • 💡 Renaming the Conv activation variable improves code readability and maintainability, making it easier for developers to understand the default behavior.
  • 🌐 By incorporating the RANK environment variable, the change ensures that workers in different distributed training processes have varying random seeds, thereby enhancing reproducibility and reducing potential data overlap during training.
  • 🤖 These updates will benefit developers maintaining the codebase and users leveraging the YOLOv5 model for scalable, distributed training settings.

@github-actions (bot) commented:

👋 Hello @Forever518, thank you for submitting a YOLOv5 🚀 PR! To allow your work to be integrated as seamlessly as possible, we advise you to:

  • ✅ Verify your PR is up-to-date with ultralytics/yolov5 master branch. If your PR is behind you can update your code by clicking the 'Update branch' button or by running git pull and git merge master locally.
  • ✅ Verify all YOLOv5 Continuous Integration (CI) checks are passing.
  • ✅ Reduce changes to the absolute minimum required for your bug fix or feature addition. "It is not daily increase but daily decrease, hack away the unessential. The closer to the source, the less wastage there is." — Bruce Lee

@glenn-jocher (Member) commented:

@Forever518 thanks for the PR!

A DDP mAP drop sounds like a serious problem. Are you sure you're able to reproduce it with the latest torch and YOLOv5 on a common dataset like COCO128 or VOC?

@Forever518 (Contributor, Author) commented Sep 23, 2022

@glenn-jocher Sure, I trained YOLOv5s with PyTorch 1.12.1+cu116 and the latest YOLOv5 on the COCO dataset. The chart clearly shows that with DDP enabled the mAP drops from the very beginning, and after 300 epochs the DDP result is about 2% lower than both the official result and my fixed version. (repro-1 and repro-2 are runs with my PR fix; the other two use the latest YOLOv5.)

It's really a strange problem because I didn't see any drop in your wandb runs with multi-GPU.

COCO128 is too small for DDP training, and due to my network I have not tested the VOC dataset yet.

python -m torch.distributed.run --nproc_per_node 4 train.py --data data/coco.yaml --img 640 --epoch 300 --device 0,1,2,3 --hyp data/hyps/hyp.scratch-low.yaml --cfg models/yolov5s.yaml --weights '' --batch 128 --project yolov5-ddp-seed --name repro-1

[charts: COCO mAP curves comparing repro-1/repro-2 (with the fix) against the unmodified DDP runs]

@Forever518 (Contributor, Author) commented:

@glenn-jocher I found another bug when trying to change the act field of Conv: when I set self.act to another activation function, reading self.act still returns nn.SiLU(). Because act is a class attribute on an nn.Module subclass, nn.Module.__setattr__ stores the new module in self._modules, but attribute lookup finds the class attribute first, so the assignment to self.act in __init__ has no visible effect. You can run the code below to reproduce it. On my machine it outputs SiLU() twice.

import torch.nn as nn


def autopad(k, p=None, d=1):  # kernel, padding, dilation
    # Pad to 'same' shape outputs
    if d > 1:
        k = d * (k - 1) + 1 if isinstance(k, int) else [d * (x - 1) + 1 for x in k]  # actual kernel-size
    if p is None:
        p = k // 2 if isinstance(k, int) else [x // 2 for x in k]  # auto-pad
    return p


class Conv(nn.Module):
    # Standard convolution with args(ch_in, ch_out, kernel, stride, padding, groups, dilation, activation)
    act = nn.SiLU()  # default activation

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        print(self.act)  # prints SiLU(): resolved from the class attribute
        self.act = nn.Identity()  # nn.Module.__setattr__ stores this in self._modules['act']
        print(self.act)  # still prints SiLU(): the class attribute shadows the registered submodule

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def forward_fuse(self, x):
        return self.act(self.conv(x))


if __name__ == '__main__':
    c = Conv(3, 64)
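
The rename to default_act described in the PR summary avoids this shadowing: with no class attribute named act, self.act resolves through nn.Module's registered submodules, so the assignment in __init__ takes effect. Below is a minimal sketch of that approach, reusing autopad and nn from the snippet above; it is illustrative, not the exact merged code.

class ConvFixed(nn.Module):
    # Sketch of the renamed-attribute approach
    default_act = nn.SiLU()  # default activation, no longer shadows self.act

    def __init__(self, c1, c2, k=1, s=1, p=None, g=1, d=1, act=True):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, k, s, autopad(k, p, d), groups=g, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        # self.act now resolves through nn.Module's _modules, so this assignment is visible
        self.act = self.default_act if act is True else act if isinstance(act, nn.Module) else nn.Identity()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


print(ConvFixed(3, 64, act=nn.Identity()).act)  # Identity(), as expected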

@glenn-jocher (Member) commented:

@Forever518 ok, I ran an experiment and am able to reproduce the mAP drop on COCO. I'll go ahead and merge this fix, but what's the reason for the large number in the seed init? Could this also work with just RANK?

- generator.manual_seed(6148914691236517205 + RANK)
+ generator.manual_seed(RANK)  # can we use this instead?

glenn-jocher linked an issue on Sep 24, 2022 that may be closed by this pull request.
@Forever518 (Contributor, Author) commented:

@glenn-jocher The PyTorch documentation suggests using a large number with balanced 0 and 1 bits as the seed; see the torch.Generator.manual_seed usage. I just set it to the number whose hex representation is 0x5555555555555555. Of course the seed could be some other value such as 2147483647, just not RANK alone, whose binary representation is almost all zeros.
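
For illustration, a quick check of the bit balance in plain Python (not part of the PR):

>>> hex(6148914691236517205)
'0x5555555555555555'
>>> bin(6148914691236517205).count('1')  # 32 of the 64 bits are set: balanced 0s and 1s
32
>>> bin(3)  # a bare RANK value such as 3 is almost all zero bits
'0b11'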

Actually the result on my machine still cannot reach 37.4 with the latest version even after this fix; it is just under 0.4 below the official benchmark... I am afraid something about the training settings is still missing, just as you mentioned in PR #8602.

[charts: final COCO results on my machine after the fix]

@glenn-jocher (Member) commented:

@Forever518 ah, if you are trying to reproduce the official results then you should download the COCO-segments dataset, i.e.:

bash data/scripts/get_coco.sh --train --val --segments

Also note that YOLOv5s results are from single-GPU training; DDP may reduce mAP slightly.

@glenn-jocher (Member) commented:

@Forever518 I'm going to merge, but first I need to expand this solution to all 3 dataloaders with generators: detection, classification, and segmentation.

@glenn-jocher (Member) commented:

[screenshot attached]

@Forever518 (Contributor, Author) commented:

@glenn-jocher

@Forever518 I'm going to merge, but first I need to expand this solution to all 3 dataloaders with generators: detection, classification, and segmentation.

Haha, maybe a shared framework is needed to support the 3 tasks at the same time, so the code doesn't have to be changed in 3 places every time. 🤣

glenn-jocher changed the title from "Try to fix DDP mAP drop" to "generator seed fix for DDP mAP drop" on Sep 24, 2022
@glenn-jocher (Member) commented:

@Forever518 yes definitely. This triple maintenance is not optimal.

@AyushExel @Laughing-q this should resolve generator seed issues across tasks, and may lead to DDP reproducibility

glenn-jocher merged commit f11a8a6 into ultralytics:master on Sep 24, 2022
@glenn-jocher (Member) commented:

@Forever518 PR is merged. Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐

Can you verify that the DDP mAP drop issue is now resolved in master?

@Forever518 (Contributor, Author) commented Sep 24, 2022

@Forever518 PR is merged. Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐

Can you verify that the DDP mAP drop issue is now resolved in master?

@glenn-jocher OK, I will try it later. Due to computing resource constraints it may take some time...

@glenn-jocher (Member) commented:

@Forever518 great, thank you!

@glenn-jocher (Member) commented Sep 24, 2022

@Forever518 @AyushExel @Laughing-q I can confirm that the DetectionModel DDP low-mAP training bug is now fixed, and training is now reproducible across DDP runs as well (first time ever, awesome!).

Tested with 4x A100 on YOLOv5m6 COCO twice vs baseline.

@AyushExel (Contributor) commented:

Awesome!!! 🚀

Successfully merging this pull request may close these issues.

Latest code can't reproduce YOLOv5s on coco