Ginkgo 1.7.0 tests capture stderr and fail due to different number of mpirun warnings #1567

Open
lahwaacz opened this issue Mar 9, 2024 · 1 comment

Comments

@lahwaacz
Contributor

lahwaacz commented Mar 9, 2024

Hi,

I'm creating a stable ginkgo-hpc package for Arch Linux and I'm getting some issues. Besides #1564, #1566 and #1143, there are some tests that fail with the following error:

281/285 Test #283: benchmark_multi_vector_distributed .......................***Failed    1.27 sec
TEST: '/usr/bin/mpiexec' '-n' '3' '/build/ginkgo-hpc/src/build/benchmark/blas/distributed/multi_vector_distributed' '-input' '[{"n": 100}]'
FAIL: stderr differs
---

+++

@@ -1,3 +1,6 @@

+[arch-nspawn-268570:99043] No HIP capabale device found. Disabling component.
+[arch-nspawn-268570:99045] No HIP capabale device found. Disabling component.
+[arch-nspawn-268570:99044] No HIP capabale device found. Disabling component.
 This is Ginkgo 1.7.0 (master)
     running with core module 1.7.0 (master)
 Running on reference(0)

282/285 Test #284: benchmark_spmv_distributed ...............................***Failed    1.27 sec
TEST: '/usr/bin/mpiexec' '-n' '3' '/build/ginkgo-hpc/src/build/benchmark/spmv/distributed/spmv_distributed' '-input' '[{"size": 100, "stencil": "7pt", "comm_pattern": "stencil"}]'
FAIL: stderr differs
---

+++

@@ -1,3 +1,6 @@

+[arch-nspawn-268570:99066] No HIP capabale device found. Disabling component.
+[arch-nspawn-268570:99065] No HIP capabale device found. Disabling component.
+[arch-nspawn-268570:99064] No HIP capabale device found. Disabling component.
 This is Ginkgo 1.7.0 (master)
     running with core module 1.7.0 (master)
 Running on reference(0)

283/285 Test #285: benchmark_solver_distributed .............................***Failed    1.21 sec
TEST: '/build/ginkgo-hpc/src/build/benchmark/solver/distributed/solver_distributed' '-input' '[{"size": 100, "stencil": "7pt", "comm_pattern": "stencil", "optimal": {"spmv": "csr-csr"}}]'
FAIL: stderr differs
---

+++

@@ -1,3 +1,4 @@

+[arch-nspawn-268570:99060] No HIP capabale device found. Disabling component.
 This is Ginkgo 1.7.0 (master)
     running with core module 1.7.0 (master)
 Running on reference(0)

The build system has no GPU, but ROCm/HIP is installed for building the -hip variant of the package. However, these tests are built with -DGINKGO_BUILD_HIP=OFF (I know it is pointless to run HIP tests without a GPU).

Arch Linux ships a ROCm-aware OpenMPI 5.0, which is responsible for printing the message "No HIP capabale device found. Disabling component." from each rank. Hence, if you compare the output of a serial test with that of a test run through mpirun, there will necessarily be a difference. The tests should be designed better: assuming that the MPI library itself does not print anything is rather naive.
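To illustrate one possible direction, here is a minimal sketch of a stderr comparison that ignores lines emitted by the MPI runtime. It is not Ginkgo's actual test code: the function names `strip_mpi_runtime_lines` and `stderr_matches` are hypothetical, and the regular expression assumes that runtime messages always carry the `[hostname:pid] ` prefix seen in the logs above.

```python
import re

# Assumption: MPI runtime messages start with "[hostname:pid] ", as in the
# logs above. Other MPI implementations may use a different prefix.
MPI_RUNTIME_LINE = re.compile(r"^\[[^\]\s:]+:\d+\] ")


def strip_mpi_runtime_lines(stderr_text: str) -> str:
    """Drop lines that look like MPI runtime messages (hypothetical helper)."""
    kept = [
        line
        for line in stderr_text.splitlines()
        if not MPI_RUNTIME_LINE.match(line)
    ]
    return "\n".join(kept)


def stderr_matches(expected: str, actual: str) -> bool:
    """Compare two stderr captures after filtering runtime messages."""
    return strip_mpi_runtime_lines(expected) == strip_mpi_runtime_lines(actual)


# Reference stderr, copied from the unchanged lines of the diff above.
expected = (
    "This is Ginkgo 1.7.0 (master)\n"
    "    running with core module 1.7.0 (master)\n"
    "Running on reference(0)"
)

# Captured stderr with one runtime message per rank, as in the report.
actual = (
    "[arch-nspawn-268570:99043] No HIP capabale device found. Disabling component.\n"
    "[arch-nspawn-268570:99045] No HIP capabale device found. Disabling component.\n"
    "[arch-nspawn-268570:99044] No HIP capabale device found. Disabling component.\n"
    + expected
)

print(stderr_matches(expected, actual))  # prints True
```

A filter like this would make the comparison independent of the number of ranks, at the cost of also hiding any genuine benchmark output that happens to use the same prefix.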

@upsj
Member

upsj commented Mar 10, 2024

I would suggest disabling the corresponding tests using ctest -E benchmark_.*_distributed in the short term. Changing this behavior would require some refactoring of the benchmark code that we can't prioritize immediately. The benchmarks are not designed for easy testability; the tests were added after the fact to enable some refactoring, so they are mainly intended for us developers.
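As a quick sanity check of the exclusion pattern, the sketch below applies it to the three failing test names from the report. It uses Python's `re` module rather than CTest's own regex engine, on the assumption that a pattern this simple behaves the same in both. The names in `others` are hypothetical and only illustrate tests that would still run.

```python
import re

# Pattern taken from the suggested `ctest -E` invocation.
EXCLUDE = re.compile(r"benchmark_.*_distributed")

# Test names copied from the failing tests in the report.
failing = [
    "benchmark_multi_vector_distributed",
    "benchmark_spmv_distributed",
    "benchmark_solver_distributed",
]

# Hypothetical names, for illustration only.
others = ["benchmark_spmv", "benchmark_solver"]


def is_excluded(name: str) -> bool:
    """Return True if the pattern matches anywhere in the test name."""
    return EXCLUDE.search(name) is not None


for name in failing + others:
    status = "excluded" if is_excluded(name) else "kept"
    print(f"{name}: {status}")
```

When running ctest from a shell, it is safer to quote the pattern so that the shell does not try to expand the `*`.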
