
Numerical Stability #11

Open
jonmorton opened this issue Sep 23, 2022 · 1 comment

Comments


jonmorton commented Sep 23, 2022

Hi, I'm wondering if you've run into any issues with numerical stability or know what may be the cause.

With normal RepVGG, I get differences as high as 4e-4 when comparing outputs before and after switching to deploy mode. After changing the first conv to OREPA_LargeConv, I get errors as high as 2e-3. After also changing the 1x1 conv in the RepVGG block to OREPA_1x1, I get differences as high as 0.1.

It seems numerical stability makes it challenging to use identity + OREPA_1x1 + OREPA_3x3 branches in a RepVGG-style model. Any thoughts on why this happens?
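
For reference, a minimal sketch of the before/after comparison described above, assuming the blocks expose the usual `switch_to_deploy()` hook; `build_model()` is just a placeholder for however the network is actually constructed:

```python
import copy
import torch

# build_model() is a placeholder; the real constructor depends on the repo.
model = build_model()
model.eval()

x = torch.randn(4, 3, 224, 224)
with torch.no_grad():
    y_train = model(x)

# Fuse each block's branches into a single conv, RepVGG/OREPA-style.
deploy_model = copy.deepcopy(model)
for m in deploy_model.modules():
    if hasattr(m, 'switch_to_deploy'):
        m.switch_to_deploy()

with torch.no_grad():
    y_deploy = deploy_model(x)

# The "differences as high as ..." figures above are this max-abs gap.
print('max abs diff:', (y_train - y_deploy).abs().max().item())
```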

JUGGHM (Owner) commented Sep 23, 2022

Thanks for your interest, Jon! This is quite weird but interesting. After two hours of investigation, we found that the cause is that computation on CPU and GPU produces slightly different results. Note that the weight re-parameterization procedure in convert.py runs on the CPU. If you add a .cuda() to the end of the 17th line of convert.py, the numerical gap shrinks considerably.
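
A rough sketch of what that change amounts to, assuming convert.py loads the training-time checkpoint and then fuses the weights; `build_model` and `repvgg_model_convert` below are placeholders for whatever names convert.py actually uses:

```python
import torch

# Placeholder names: build_model / repvgg_model_convert stand in for
# whatever convert.py actually defines.
train_model = build_model(deploy=False)
ckpt = torch.load('checkpoint.pth', map_location='cpu')
train_model.load_state_dict(ckpt)

# The suggested fix: move the model to GPU *before* re-parameterizing,
# so the fused weights are computed with the same numerics used at inference.
train_model = train_model.cuda()

deploy_model = repvgg_model_convert(train_model, save_path='deploy.pth')
```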

Even so, we are still not able to shrink the gap to exactly zero. I am not sure whether the remaining difference is caused by non-deterministic behavior in the different convolution implementations inside torch.
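
One way to check whether convolution algorithm selection is responsible (a sketch, untested on this repo) is to force deterministic kernels before running the comparison:

```python
import torch

# Force cuDNN to pick deterministic convolution kernels so the fused and
# unfused paths use the same algorithm; any gap that remains is ordinary
# floating-point rounding from different accumulation orders.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Optionally make PyTorch raise an error on any remaining nondeterministic op:
# torch.use_deterministic_algorithms(True)
```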
