[PyTorch 2.2.2-1.5.11-SNAPSHOT] Training produces poor MNIST model on Windows #1503

Closed
haifengl opened this issue May 20, 2024 · 7 comments · Fixed by #1510
@haifengl

On macOS and Linux, PyTorch 2.2.2 training on MNIST produces good models with training accuracy > 90%. However, the same code reports very low accuracy (about 11%) on Windows. The same code works fine with 2.2.1 on Windows. You can run your sample code to reproduce the issue. The code below calculates the accuracy:

        net.eval();  // switch the network to evaluation mode
        int correct = 0;
        int n = 0;
        for (ExampleIterator it = dataLoader.begin(); !it.equals(dataLoader.end()); it = it.increment()) {
            Example batch = it.access();
            var output = net.forward(batch.data());
            // Predicted class = index of the largest logit along dimension 1
            var prediction = output.argmax(new LongOptional(1), false);
            correct += prediction.eq(batch.target()).sum().item_int();
            n += batch.target().size(0);
            // Explicitly free native memory
            batch.close();
        }
        System.out.format("Training accuracy = %.2f%%\n", 100.0 * correct / n);
@saudet
Member

saudet commented May 20, 2024

Is it the same with 2.3.0-1.5.11-SNAPSHOT?

@haifengl
Author

2.3.0 doesn't work on Windows: it fails to load jnitorch.dll. I think it is the same issue as #1500.
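
For reference, a minimal sketch of how the failing dependency can be narrowed down with JavaCPP's debug logging; the org.bytedeco.javacpp.logger.debug property and the explicit Loader.load() call are standard JavaCPP, but the exact output depends on the presets build:

    import org.bytedeco.javacpp.Loader;

    public class LoadDebug {
        public static void main(String[] args) {
            // Log every native library JavaCPP tries to extract and load, so the
            // DLL that actually fails (jnitorch.dll or one of its dependencies) shows up.
            System.setProperty("org.bytedeco.javacpp.logger.debug", "true");
            // Force loading of the PyTorch presets and print where the JNI library was cached.
            String path = Loader.load(org.bytedeco.pytorch.global.torch.class);
            System.out.println("Loaded: " + path);
        }
    }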

@HGuillemet
Collaborator

That's strange: the missing library was added to JavaCPP, not to the PyTorch presets.
Anyway, I see that convergence is still abnormal on Windows with 2.3.0, just like with 2.2.2. It was good with 2.2.1.
I'll try to find out what's happening.

@haifengl
Author

[Screenshot: libomp]

See the screenshot: PyTorch 2.3.0 cannot find libomp.dll.

@HGuillemet
Collaborator

No idea why PyTorch 2.2.2 would find libomp140 and not PyTorch 2.3.0.
Anyway, there is obviously a problem with this library: the sample MNIST code gives sensible results only if we set OMP_NUM_THREADS to 1.
I see that the PyTorch team recommends the Intel version (iomp), and that's what they ship with the official libtorch.
So I think we must tweak the Windows build to link with iomp instead of what CMake finds by default.
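
For anyone hitting this before a fixed build is published, a minimal sketch of the workaround check. The variable itself must be set in the shell before the JVM starts (Java cannot modify its own environment), and the torch.set_num_threads alternative mentioned in the comments below is an unverified assumption about the presets:

    // Guard that warns if the OMP_NUM_THREADS=1 workaround is not active.
    // The variable must be set before the JVM (and thus the OpenMP runtime) starts,
    // e.g. "set OMP_NUM_THREADS=1" in the Windows shell; this snippet only checks it.
    String omp = System.getenv("OMP_NUM_THREADS");
    if (!"1".equals(omp)) {
        System.err.println("Warning: OMP_NUM_THREADS=" + omp
                + "; MNIST training may not converge on Windows with 2.2.2/2.3.0.");
    }
    // If the presets map ATen's at::set_num_threads (unverified assumption), then
    // torch.set_num_threads(1) could be an in-process alternative to the environment variable.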

@HGuillemet
Collaborator

So, after many experiments and code investigation, it turns out that when the GitHub runner was upgraded to a new version of Visual Studio (about two months ago, when we merged PyTorch 2.2.2), the Windows build of libtorch ended up linked to both the legacy vcomp and the newer (SIMD-compatible) libomp libraries. This is the reason for the wrong computation results.

I included a fix in PR #1510 that consists of removing, on Windows, PyTorch's FindOpenMP.cmake adaptation so that the stock CMake module is used instead. As a result the binary links with the legacy vcomp only. This works but probably doesn't give the best performance.

The official build uses MKL, which includes OpenMP. We could do the same (linking dynamically instead of statically), but this would require adding a dependency on MKL, even for people using the GPU only. And PyTorch uses a 2022 version of MKL; I'm not sure it would work with the 2024 version in the current MKL presets.

I also realized that OpenBLAS is not detected by the PyTorch build. I don't know if or how the build is supposed to find it.
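
As a rough way to see which OpenMP runtime a given snapshot actually bundles, one can list what JavaCPP extracts into its cache after loading the presets; a sketch (system-provided runtimes such as vcomp140.dll come from the VC redistributable and will not appear here):

    import java.io.File;
    import java.nio.file.Files;
    import org.bytedeco.javacpp.Loader;

    public class OmpCheck {
        public static void main(String[] args) throws Exception {
            // Load the PyTorch presets so their native libraries get extracted.
            Loader.load(org.bytedeco.pytorch.global.torch.class);
            // Walk the JavaCPP cache and print every bundled library whose name
            // mentions an OpenMP runtime (libomp, libiomp, vcomp).
            File cache = Loader.getCacheDir();
            Files.walk(cache.toPath())
                 .map(p -> p.getFileName().toString())
                 .filter(n -> n.toLowerCase().matches(".*(iomp|libomp|vcomp).*"))
                 .distinct()
                 .forEach(System.out::println);
        }
    }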

@haifengl
Author

Thanks for the hard work!
