[PyTorch 2.2.2-1.5.11-SNAPSHOT] Training produces poor MNIST model on Windows #1503

Closed
haifengl opened this issue May 20, 2024 · 7 comments · Fixed by #1510
@haifengl

On macOS and Linux, PyTorch 2.2.2 training on MNIST produces good models with training accuracy > 90%. However, the same code reports very low accuracy (about 11%) on Windows. The same code works fine with 2.2.1 on Windows. You can run your sample code to reproduce the issue. The code below calculates the accuracy:

        net.eval();  // switch the network to evaluation mode
        int correct = 0;
        int n = 0;
        for (ExampleIterator it = dataLoader.begin(); !it.equals(dataLoader.end()); it = it.increment()) {
            Example batch = it.access();
            var output = net.forward(batch.data());
            // Predicted class = index of the largest logit along dimension 1
            var prediction = output.argmax(new LongOptional(1), false);
            correct += prediction.eq(batch.target()).sum().item_int();
            n += batch.target().size(0);
            // Explicitly free native memory
            batch.close();
        }
        System.out.format("Training accuracy = %.2f%%\n", 100.0 * correct / n);
@saudet
Member

saudet commented May 20, 2024

Is it the same with 2.3.0-1.5.11-SNAPSHOT?

@haifengl
Author

2.3.0 doesn't work on Windows: it fails to load jnitorch.dll. I think it is the same issue as #1500.
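
For reference, a minimal sketch of how the failing dependency can be narrowed down with JavaCPP's debug logging; the org.bytedeco.javacpp.logger.debug property and the explicit Loader.load() call are standard JavaCPP, but the exact output depends on the presets build:

    import org.bytedeco.javacpp.Loader;

    public class LoadDebug {
        public static void main(String[] args) {
            // Log every native library JavaCPP tries to extract and load, so the
            // DLL that actually fails (jnitorch.dll or one of its dependencies) shows up.
            System.setProperty("org.bytedeco.javacpp.logger.debug", "true");
            // Force loading of the PyTorch presets and print where the JNI library was cached.
            String path = Loader.load(org.bytedeco.pytorch.global.torch.class);
            System.out.println("Loaded: " + path);
        }
    }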

@HGuillemet
Collaborator

That's strange: the missing library was added to JavaCPP, not to the PyTorch presets.
Anyway, I see that convergence is still abnormal on Windows with 2.3.0, just like with 2.2.2. It was good with 2.2.1.
I'll try to find out what's happening.

@haifengl
Author

[Screenshot: libomp]

See the screenshot: PyTorch 2.3.0 cannot find libomp.dll.

@HGuillemet
Collaborator

No idea why PyTorch 2.2.2 would find libomp140 and not PyTorch 2.3.0.
Anyway, there is obviously a problem with this library: the sample MNIST code gives sensible results only if we set OMP_NUM_THREADS to 1.
I see that the PyTorch team recommends the Intel version (iomp), and that's what they ship with the official libtorch.
So I think we must tweak the Windows build to link with iomp instead of what CMake finds by default.
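
For anyone hitting this before a fixed build is published, a minimal sketch of the workaround check. The variable itself must be set in the shell before the JVM starts (Java cannot modify its own environment), and the torch.set_num_threads alternative mentioned in the comments below is an unverified assumption about the presets:

    // Guard that warns if the OMP_NUM_THREADS=1 workaround is not active.
    // The variable must be set before the JVM (and thus the OpenMP runtime) starts,
    // e.g. "set OMP_NUM_THREADS=1" in the Windows shell; this snippet only checks it.
    String omp = System.getenv("OMP_NUM_THREADS");
    if (!"1".equals(omp)) {
        System.err.println("Warning: OMP_NUM_THREADS=" + omp
                + "; MNIST training may not converge on Windows with 2.2.2/2.3.0.");
    }
    // If the presets map ATen's at::set_num_threads (unverified assumption), then
    // torch.set_num_threads(1) could be an in-process alternative to the environment variable.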

@HGuillemet
Collaborator

So, after many experiments and code investigation, it turns out that when the GitHub runner was upgraded to a new version of Visual Studio (about two months ago, when we merged PyTorch 2.2.2), the Windows build of libtorch ended up linked to both the legacy vcomp and the newer (SIMD-compatible) libomp libraries. This is the reason for the wrong computation results.

I included a fix in PR #1510 that consists of removing, on Windows, PyTorch's FindOpenMP.cmake adaptation so that the stock CMake module is used instead. As a result the binary links with the legacy vcomp only. This works but probably doesn't give the best performance.

The official build uses MKL, which includes OpenMP. We could do the same (linking dynamically instead of statically), but this would require adding a dependency on MKL, even for people using the GPU only. And PyTorch uses a 2022 version of MKL; I'm not sure it would work with the 2024 version in the current MKL presets.

I also realized that OpenBLAS is not detected by the PyTorch build. I don't know if or how the build is supposed to find it.
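
As a rough way to see which OpenMP runtime a given snapshot actually bundles, one can list what JavaCPP extracts into its cache after loading the presets; a sketch (system-provided runtimes such as vcomp140.dll come from the VC redistributable and will not appear here):

    import java.io.File;
    import java.nio.file.Files;
    import org.bytedeco.javacpp.Loader;

    public class OmpCheck {
        public static void main(String[] args) throws Exception {
            // Load the PyTorch presets so their native libraries get extracted.
            Loader.load(org.bytedeco.pytorch.global.torch.class);
            // Walk the JavaCPP cache and print every bundled library whose name
            // mentions an OpenMP runtime (libomp, libiomp, vcomp).
            File cache = Loader.getCacheDir();
            Files.walk(cache.toPath())
                 .map(p -> p.getFileName().toString())
                 .filter(n -> n.toLowerCase().matches(".*(iomp|libomp|vcomp).*"))
                 .distinct()
                 .forEach(System.out::println);
        }
    }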

@haifengl
Author

Thanks for the hard work!
