
fix performance issue in convnext DDP train #1098

Merged
merged 1 commit into from
Oct 17, 2022

Conversation

cybergeek2077

@cybergeek2077 cybergeek2077 commented Oct 16, 2022

Thanks for your contribution and we appreciate it a lot. The following instructions will make your pull request healthier and help it get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

Motivation

The following warning appears when using dist_train.sh to train ConvNeXt:

[W reducer.cpp:347] Warning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.

Modification

Call contiguous() after permute() in LayerNorm. In my test, the performance actually improved 3x.
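A minimal sketch of the kind of change described, assuming a channels-first LayerNorm like the one used in ConvNeXt (the class name `LayerNorm2d` here is illustrative, not necessarily the identifier in the repository):

```python
import torch
import torch.nn as nn


class LayerNorm2d(nn.LayerNorm):
    """LayerNorm over the channel dimension of an NCHW tensor."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Permute to NHWC so nn.LayerNorm normalizes the last (channel) dim.
        x = x.permute(0, 2, 3, 1)
        x = super().forward(x)
        x = x.permute(0, 3, 1, 2)
        # permute() only changes strides, so the result (and its gradient) is
        # non-contiguous. Calling .contiguous() materializes a dense NCHW
        # layout, so grad strides match DDP's bucket views and the
        # "Grad strides do not match bucket view strides" warning goes away.
        return x.contiguous()
```

Without the final `.contiguous()`, DDP must copy each non-contiguous gradient into its flat communication buckets, which is the performance cost the warning refers to.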

BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

Checklist

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues.
  • Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
  • The modification is covered by complete unit tests. If not, please add more unit test to ensure the correctness.
  • The documentation has been modified accordingly, like docstring or example tutorials.

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects, like MMDet or MMSeg.
  • CLA has been signed and all committers have signed the CLA in this PR.

 to fix performance issue in convnext DDP train
@CLAassistant

CLAassistant commented Oct 16, 2022

CLA assistant check
All committers have signed the CLA.

Member

@mzr1996 mzr1996 left a comment


LGTM, I have tested it and this modification can slightly accelerate the training.

@mzr1996 mzr1996 changed the base branch from master to dev October 17, 2022 02:09
@mzr1996 mzr1996 merged commit 38040d5 into open-mmlab:dev Oct 17, 2022
@kamzero

kamzero commented Feb 7, 2023

the performance actually improved 3x in my test

Hi! May I ask whether this warning only affects the training speed and convergence speed, or does it also affect accuracy?

@cybergeek2077
Author

the performance actually improved 3x in my test

Hi! May I ask whether this warning only affects the training speed and convergence speed, or does it also affect accuracy?

I did not run an ablation experiment, but the model I trained before fixing the bug works normally, so I think it only affects the training speed.

@OpenMMLab-Assistant005

Hi @790475019! First of all, we want to express our gratitude for your significant PR in this project. Your contribution is highly appreciated, and we are grateful for your efforts in helping improve this open-source project during your personal time. We believe that many developers will benefit from your PR.

We would also like to invite you to join our Special Interest Group (SIG) private channel on Discord, where you can share your experiences, ideas, and build connections with like-minded peers. To join the SIG channel, simply message the moderator OpenMMLab on Discord, or briefly share your open-source contributions in the #introductions channel and we will assist you. We look forward to seeing you there! Join us: https://discord.gg/UjgXkPWNqA

If you have a WeChat account, welcome to join our community on WeChat by adding our assistant: openmmlabwx. Please add "mmsig + GitHub ID" as a remark when adding friends :)
Thank you again for your contribution ❤
