Merge branch 'ndk/machinefiles/nersc-use-kdreg2' into next (PR #6687)
With the new Slingshot software (s2.2 h11.0.1) now installed on Perlmutter, some cases hung in init at higher node counts. Setting the environment variable FI_MR_CACHE_MONITOR=kdreg2 has avoided the hangs in all cases tried so far.

kdreg2 is another option for memory cache monitoring. It is an open-source Linux kernel module that ships with the HPE Slingshot host software distribution (optionally installed) and may one day be the default.

Performance appears to be about the same. For one HR F-case at 256 nodes, using kdreg2 was about 1% slower.

Fixes #6655

I also found some older issues (some at lower node counts) that this fixes:
Fixes #6516
Fixes #6451
Fixes #6521

[bfb]
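For anyone who wants to try the same setting outside the machine files, the sketch below shows one way to apply it to a job's environment. It is only an illustration: the helper name `with_kdreg2` is hypothetical and is not part of this change. The only detail taken from the commit message is the variable name and value, FI_MR_CACHE_MONITOR=kdreg2.

```python
import os


def with_kdreg2(env=None):
    """Return a copy of env with FI_MR_CACHE_MONITOR set to kdreg2.

    Hypothetical helper for illustration. If env is None, the current
    process environment is copied. The input mapping is not modified.
    """
    new_env = dict(os.environ if env is None else env)
    # Variable name and value as given in the commit message.
    new_env["FI_MR_CACHE_MONITOR"] = "kdreg2"
    return new_env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching a job, so the setting applies to that launch only and leaves the parent shell untouched.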