Merge branch 'ndk/machinefiles/nersc-use-kdreg2' into next (PR #6687)
With the new Slingshot software (s2.2 h11.0.1) now installed on Perlmutter, certain cases hung during init
at higher node counts. Setting the environment variable FI_MR_CACHE_MONITOR=kdreg2 has avoided the hangs so far.
kdreg2 is another option for memory cache monitoring: a Linux kernel module released under an open-source license.
It ships with the HPE Slingshot host software distribution (optionally installed) and may one day be the default.
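
For orientation, the setting is an ordinary `<env>` entry in the machine file. The fragment below is only a sketch: the `<environment_variables>` wrapper is assumed from the usual layout of CIME machine files and is not copied from this commit, and the real blocks in config_machines.xml carry many more entries.

```xml
<!-- Sketch only: the wrapper element is an assumption based on the usual
     CIME machine-file layout; the real blocks carry many more entries. -->
<environment_variables>
  <!-- Use the kdreg2 kernel module as the memory cache monitor. -->
  <env name="FI_MR_CACHE_MONITOR">kdreg2</env>
</environment_variables>
```

Removing the `<env>` line from a block would leave that block on whatever monitor the software selects by default.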

Performance appears about the same: for one HR F-case at 256 nodes, using kdreg2 was about 1% slower.

Fixes #6655

I also found some older issues (some at lower node counts) that this fixes:
Fixes #6516
Fixes #6451
Fixes #6521

[bfb]
ndkeen committed Oct 19, 2024
2 parents f0150c3 + e1dff58 commit 4b0a4aa
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions cime_config/machines/config_machines.xml
@@ -278,6 +278,7 @@
<env name="HDF5_USE_FILE_LOCKING">FALSE</env>
<env name="PERL5LIB">/global/cfs/cdirs/e3sm/perl/lib/perl5-only-switch</env>
<env name="FI_CXI_RX_MATCH_MODE">software</env>
<env name="FI_MR_CACHE_MONITOR">kdreg2</env>
<env name="MPICH_COLL_SYNC">MPI_Bcast</env>
<env name="NETCDF_PATH">$ENV{CRAY_NETCDF_HDF5PARALLEL_PREFIX}</env>
<env name="PNETCDF_PATH">$ENV{CRAY_PARALLEL_NETCDF_PREFIX}</env>
@@ -445,6 +446,7 @@
<env name="OMP_PLACES">threads</env>
<env name="HDF5_USE_FILE_LOCKING">FALSE</env>
<env name="PERL5LIB">/global/cfs/cdirs/e3sm/perl/lib/perl5-only-switch</env>
<env name="FI_MR_CACHE_MONITOR">kdreg2</env>
<env name="MPICH_COLL_SYNC">MPI_Bcast</env>
<env name="NETCDF_PATH">$ENV{CRAY_NETCDF_HDF5PARALLEL_PREFIX}</env>
<env name="PNETCDF_PATH">$ENV{CRAY_PARALLEL_NETCDF_PREFIX}</env>
@@ -591,6 +593,7 @@
<env name="HDF5_USE_FILE_LOCKING">FALSE</env>
<env name="PERL5LIB">/global/cfs/cdirs/e3sm/perl/lib/perl5-only-switch</env>
<env name="FI_CXI_RX_MATCH_MODE">software</env>
<env name="FI_MR_CACHE_MONITOR">kdreg2</env>
<env name="MPICH_COLL_SYNC">MPI_Bcast</env>
<env name="Albany_ROOT">$SHELL{if [ -z "$Albany_ROOT" ]; then echo /global/common/software/e3sm/mali_tpls/albany-e3sm-serial-release-gcc; else echo "$Albany_ROOT"; fi}</env>
<env name="Trilinos_ROOT">$SHELL{if [ -z "$Trilinos_ROOT" ]; then echo /global/common/software/e3sm/mali_tpls/trilinos-e3sm-serial-release-gcc; else echo "$Trilinos_ROOT"; fi}</env>
@@ -752,6 +755,7 @@
<env name="OMP_PLACES">threads</env>
<env name="HDF5_USE_FILE_LOCKING">FALSE</env>
<env name="PERL5LIB">/global/cfs/cdirs/e3sm/perl/lib/perl5-only-switch</env>
<env name="FI_MR_CACHE_MONITOR">kdreg2</env>
<env name="MPICH_COLL_SYNC">MPI_Bcast</env>
<env name="NETCDF_PATH">$ENV{CRAY_NETCDF_HDF5PARALLEL_PREFIX}</env>
<env name="PNETCDF_PATH">$ENV{CRAY_PARALLEL_NETCDF_PREFIX}</env>
@@ -894,6 +898,7 @@
<env name="HDF5_USE_FILE_LOCKING">FALSE</env>
<env name="PERL5LIB">/global/cfs/cdirs/e3sm/perl/lib/perl5-only-switch</env>
<env name="FI_CXI_RX_MATCH_MODE">software</env>
<env name="FI_MR_CACHE_MONITOR">kdreg2</env>
<env name="MPICH_COLL_SYNC">MPI_Bcast</env>
<env name="NETCDF_PATH">$ENV{CRAY_NETCDF_HDF5PARALLEL_PREFIX}</env>
<env name="PNETCDF_PATH">$ENV{CRAY_PARALLEL_NETCDF_PREFIX}</env>