
GSI fails to run using newer Intel compilers (i.e., Intel 2022) #447

Closed
MichaelLueken opened this issue Jul 29, 2022 · 133 comments · Fixed by #571

@MichaelLueken
Contributor

The GSI and EnKF compile without issue using Intel 2022 and the hpc-stack built Intel 2022 libraries. However, the jobs fail while running the executables built with Intel 2022:

    Start 1: global_T62
1/1 Test #1: global_T62 .................***Failed  240.51 sec

Looking at the output:

[h1c05:304703:0:304703] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
[h1c05:304708:0:304708] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))

From the traceback:

==== backtrace (tid: 147167) ====
 0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
 1 0x000000000150cd0d rad_setup_mp_setuprad_()  /scratch1/NCEPDEV/da/Michael.Lueken/test/src/gsi/setuprad.f90:555
 2 0x000000000113c3da gsi_radoper_mp_setup__()  /scratch1/NCEPDEV/da/Michael.Lueken/test/src/gsi/gsi_radOper.F90:100
 3 0x0000000000b0a770 setuprhsall_()  /scratch1/NCEPDEV/da/Michael.Lueken/test/src/gsi/setuprhsall.f90:490
 4 0x000000000111ad0d glbsoi_()  /scratch1/NCEPDEV/da/Michael.Lueken/test/src/gsi/glbsoi.f90:323
 5 0x000000000062e6f7 gsisub_()  /scratch1/NCEPDEV/da/Michael.Lueken/test/src/gsi/gsisub.F90:200
 6 0x000000000040d00d gsimod_mp_gsimain_run_()  /scratch1/NCEPDEV/da/Michael.Lueken/test/src/gsi/gsimod.F90:2209
 7 0x000000000040cf4f MAIN__()  /scratch1/NCEPDEV/da/Michael.Lueken/test/src/gsi/gsimain.f90:631
 8 0x000000000040cee2 main()  ???:0
 9 0x0000000000022555 __libc_start_main()  ???:0
10 0x000000000040cde9 _start()  ???:0
=================================

Line 555 in src/gsi/setuprad.f90 is:

if (iuse_rad(j)==4) predx(i,j)=zero

If the code is compiled with -O0 (no optimization), or if iuse_rad and predx are written out in setuprad, the same test runs to completion without a segmentation fault.
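
For context, the failing statement sits inside the per-channel predictor loop in setuprad.f90 (surrounding lines paraphrased from the source; variable names as in the file):

        ich(jc)=j
        do i=1,npred
           if (iuse_rad(j)==4) predx(i,j)=zero   ! line 555, where the segfault occurs
           predchan(i,jc)=predx(i,j)
        end do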

To compile using Intel 2022:

  • modulefiles/gsi_hera.intel.lua
    • Replace local hpc_intel_ver=os.getenv("hpc_intel_ver") or "18.0.5.274"
      with local hpc_intel_ver=os.getenv("hpc_intel_ver") or "2022.1.2"
    • Replace local hpc_impi_ver=os.getenv("hpc_impi_ver") or "2018.0.4"
      with local hpc_impi_ver=os.getenv("hpc_impi_ver") or "2022.1.2"
  • modulefiles/gsi_common.lua
    • Add local w3nco_ver=os.getenv("w3nco_ver") or "2.4.1"
    • Add load(pathJoin("w3nco", w3nco_ver))
@DavidHuber-NOAA
Collaborator

This crash was replicated with the Intel 2022.3.0 compiler after compiling a fresh hpc-stack. Enabling various debug options (e.g. -check bounds,uninit,contiguous,pointers,shape,stack,teams,udio_iostat -fpe0 -fp-stack-check) failed to find any culprit.

Running this test with valgrind (with options --error-limit=no --track-origins=yes), intel 2022.3.0, and -O3 without any debug options did find many instances of invalid reads and writes (summarized below), but the program ended up crashing at setupw.f90:347 (backtrace at the end). Note that getting to this point requires a walltime of at least 2 hours. It may be beneficial to run this same test with intel 2018.4 just to see if similar issues arise.

Conditional jump or move locations:

  • mpeu_util.F90:2012
  • bufr:icbfms.f:68

Invalid read/write locations:

  • read_atms.f90: 1, 2, 151, 168, 169, 170, 173-186, (and many others)
  • read_obs.f90: 1700
  • satthin.f90: 463, 476, 1499, 1534, 1535, 1579
  • radinfo.f90: 1412
  • observer.f90: 311
  • general_sub2grid_mod.f90: 1421, 1428, 1456, 1605, 1611, 1618
  • setupps.f90: 229, 276, 303-305,314
  • setupw.f90: 347

Uninitialized value created by a stack allocation:

  • bufr:icbfms.f:31

Crash backtrace:

==== backtrace (tid:  76028) ====
 0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
 1 0x00000000053f42c9 w_setup_mp_setupw_()  /scratch1/NESDIS/nesdis-rdo2/David.Huber/gsi_test/src/gsi/setupw.f90:347
 2 0x0000000003cc1bcf gsi_woper_mp_setup__()  /scratch1/NESDIS/nesdis-rdo2/David.Huber/gsi_test/src/gsi/gsi_wOper.F90:103
 3 0x0000000002348ba0 setuprhsall_()  /scratch1/NESDIS/nesdis-rdo2/David.Huber/gsi_test/src/gsi/setuprhsall.f90:490
 4 0x0000000003bf406d glbsoi_()  /scratch1/NESDIS/nesdis-rdo2/David.Huber/gsi_test/src/gsi/glbsoi.f90:323
 5 0x0000000000e75678 gsisub_()  /scratch1/NESDIS/nesdis-rdo2/David.Huber/gsi_test/src/gsi/gsisub.F90:200
 6 0x000000000042a21b gsimod_mp_gsimain_run_()  /scratch1/NESDIS/nesdis-rdo2/David.Huber/gsi_test/src/gsi/gsimod.F90:2230
 7 0x0000000000412973 MAIN__()  /scratch1/NESDIS/nesdis-rdo2/David.Huber/gsi_test/src/gsi/gsimain.f90:631
 8 0x00000000004127dd main()  ???:0
 9 0x0000000000022555 __libc_start_main()  ???:0
10 0x00000000004126f6 _start()  ???:0
=================================

@aerorahul
Contributor

@NOAA-EMC/gsi-admins
Upcoming implementations of GSI will need to use Intel compilers newer than 2020.
This is also preventing implementation of the build.ver and run.ver NCO EE2 requirement in the develop branch of the GFS global-workflow.

@RussTreadon-NOAA
Contributor

Thank you, @aerorahul , for reminding us of this known issue. Since we no longer have a GSI code manager I do not know who will work on this issue. Adding @dtkleist for awareness.

@DavidHuber-NOAA
Collaborator

@aerorahul @RussTreadon-NOAA I have reached out to Intel via Google to help look into this. Depending on suggestions from Intel, it may be beneficial to work with a developer. I'll keep you posted.

@arunchawla-NOAA

Hi @DavidHuber-NOAA, @aerorahul and @RussTreadon-NOAA

We are trying to get a developer to work on this. Can we get a little summary of (1) what compilers it has been tested with, (2) which platform to test this on (I would think Hera, but want to confirm), and (3) what the test case to run is? Should the GSI regression tests be enough for this?

@DavidHuber-NOAA
Collaborator

DavidHuber-NOAA commented Jan 24, 2023

@arunchawla-NOAA

  1. This has been tested with Intel 2022.1.2 and 2022.3.0 on Hera (as well as 2021.3.0 on a GCP instance)
  2. Hera would be fine, but it could probably also be tested on Orion or Jet
  3. The global_T62 test is adequate for testing -O3. If -O2 should also be tested (the global_T62 test passes at -O2), then additional tests may be required.
  4. I think the RTs should be enough. Real data cases have also all failed with -O3. However, I have not seen a real data case fail with -O2, so the RTs seem to be a better barometer for that optimization option.

If the developer is interested, I have an hpc-stack build devoted to this purpose (built with 2022.3.0) located here on Hera: /scratch1/NESDIS/nesdis-rdo2/David.Huber/hpc-stack_libs and the ported version of the GSI here: /scratch1/NESDIS/nesdis-rdo2/David.Huber/gsi_test.

@DavidHuber-NOAA
Collaborator

Some headway has been made on this issue. Building the hpc-stack suite with ifort and icc version 2022.3.0 and the GSI with icx and ifx version 2023.0.0 allowed some of the regression tests to run to completion, while those that fail do so in a different location. Attached are the logs for the global_3dvar_hiproc_updat (fails) and global_3dvar_loproc_updat (passes) tests. Note that icx and ifx version 2022.3.0 produce an ICE (internal compiler error) when compiling the GSI.
RT_output.zip

@climbfuji

As it stands, this issue is blocking the move to a unified spack environment and creating a lot of unnecessary work for the spack-stack team. Can we just lower the optimization for that one particular file for now, or put CPP pragmas around the code to prevent optimization of this section?
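
For illustration, one way to exempt just the suspect loop (a sketch only, assuming the miscompilation comes from vectorization of the inner predictor loop; the real culprit may be elsewhere) is an Intel compiler directive:

! Sketch: !DIR$ NOVECTOR tells Intel Fortran not to vectorize the DO loop that follows.
!DIR$ NOVECTOR
        do i=1,npred
           if (iuse_rad(j)==4) predx(i,j)=zero
           predchan(i,jc)=predx(i,j)
        end do

The coarser alternative is to compile setuprad.f90 alone at a lower optimization level through the build system.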

@climbfuji

@DavidHuber-NOAA @RussTreadon-NOAA Can you check if #533 fixes the problem?

@RussTreadon-NOAA
Contributor

Tests on Orion and Hera with intel/2022.1.2 indicate that the following code refactoring in setuprad.f90 yields a gsi.x which runs to completion.

diff --git a/src/gsi/setuprad.f90 b/src/gsi/setuprad.f90
index 8d0d343f..022c0ff9 100644
--- a/src/gsi/setuprad.f90
+++ b/src/gsi/setuprad.f90
@@ -551,10 +551,12 @@ contains
         end if

 !       Load channel numbers into local array based on satellite type
+        if (iuse_rad(j)==4) then
+           predx(:,j)=zero
+        endif

         ich(jc)=j
         do i=1,npred
-           if (iuse_rad(j)==4) predx(i,j)=zero
            predchan(i,jc)=predx(i,j)
         end do
 !

global_3dvar ctest results were not altered by this change. More extensive testing is necessary to ensure no adverse impacts with other configurations.

@arunchawla-NOAA

So this is solved? @RussTreadon-NOAA do you have a fork that we can test with the workflow? This looks like great news!!

@RussTreadon-NOAA
Contributor

@arunchawla-NOAA , additional testing is necessary to ensure that (a) the change really works and (b) it does not alter analysis results. Ideally, one would build NOAA-EMC/GSI gsi.x and enkf.x with all the supporting libraries built with the same intel/2023 or higher compiler. I only made the changes Mike outlines above: (1) change hpc_intel_ver, hpc_impi_ver, and (2) add w3nco_ver.

I shared my findings in the hope that they are useful to those actively working this issue.

@arunchawla-NOAA

Do you have a fork that can be tested?

RussTreadon-NOAA added a commit to RussTreadon-NOAA/GSI that referenced this issue Feb 16, 2023
@DavidHuber-NOAA
Collaborator

DavidHuber-NOAA commented Feb 16, 2023

@arunchawla-NOAA @RussTreadon-NOAA This looks great! I will give this a try with Intel 2022.3.0 and 2023.0.0 and run a full test suite.

Since 2023 is desired, I should note that ifort and icc will soon be deprecated, so Intel support has suggested using ifx and icx instead. Using these compilers (version 2023.0.0) did allow some regression tests to pass (like global_3dvar_loproc_updat), but others are still failing due to MPI errors (global_3dvar_hiproc_updat). Attached is the output from the global_3dvar_hiproc_updat test. The lines producing the errors are stpjo.f90:303-309. Given that, I will try out Russ's fix with both mpiifort and ifx to see how they both perform.
3dvar_hiproc (1).stdout.txt

@climbfuji

Intel's current position is to use icx+icpx+ifort for production, and icx+icpx+ifx for testing and debugging only. ifx is not yet ready for production.

@DavidHuber-NOAA
Collaborator

Noted, thanks @climbfuji

DavidHuber-NOAA pushed a commit to DavidHuber-NOAA/GSI that referenced this issue Feb 16, 2023
@DavidHuber-NOAA
Collaborator

@arunchawla-NOAA @RussTreadon-NOAA Using Intel version 2022.3.0, this workaround did prevent the crash in setuprad.f90, but unfortunately, I am still getting the crash in read_files.f90:

 0 0x000000000004d455 ucs_debug_print_backtrace()  ???:0
 1 0x0000000001e9188c do_deallocate_all()  for_alloc_copy.c:0
 2 0x00000000012e2585 read_files_()  /scratch1/NESDIS/nesdis-rdo2/David.Huber/gsi_test/src/gsi/read_files.f90:724
 3 0x0000000000fa9bee gesinfo_()  /scratch1/NESDIS/nesdis-rdo2/David.Huber/gsi_test/src/gsi/gesinfo.F90:606
 4 0x000000000064644f gsisub_()  /scratch1/NESDIS/nesdis-rdo2/David.Huber/gsi_test/src/gsi/gsisub.F90:131
 5 0x00000000004164fd gsimod_mp_gsimain_run_()  /scratch1/NESDIS/nesdis-rdo2/David.Huber/gsi_test/src/gsi/gsimod.F90:2302
 6 0x000000000041643f MAIN__()  /scratch1/NESDIS/nesdis-rdo2/David.Huber/gsi_test/src/gsi/gsimain.f90:631
 7 0x00000000004163dd main()  ???:0
 8 0x0000000000022555 __libc_start_main()  ???:0
 9 0x00000000004162f6 _start()  ???:0

@RussTreadon-NOAA
Contributor

Not surprising. Thank you @DavidHuber-NOAA for running an intel/2022.3 test.

@DavidHuber-NOAA
Collaborator

Running with icx and mpiifort version 2023.0.0 with Russ's fix completes the global_3dvar_loproc_updat test, but crashes during the global_3dvar_hiproc_updat test as before.

@DavidHuber-NOAA
Collaborator

I have now tried running the RTs with -O0 optimization for all modules using Intel 2023.0.0, which still fails on the hiproc 3dvar test case. I believe this is an MPI bug and will report it to Intel as such.

@climbfuji

@DavidHuber-NOAA Here are a few ideas:

  1. Can you try to compile with optimization but with -no-vec or -fp-speculation=safe, or both? Less likely, does disabling fma help?
  2. Which -fp-model option is used when compiling with optimization? Is catching fp exceptions enabled (-f[no-]exceptions, enabled by default)?
  3. Are you trapping uninitialized variables with -ftrapuv? Or can you initialize everything (scalars, arrays) to NaN and then enable trapping of uninitialized values? Intel provides an option for that (see the sketch after this list).
  4. Is this a parallel run or a single/serial run? If parallel, can you run it as a single run and see if the problem goes away?
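
As a minimal illustration of the NaN approach in item 3 (the array name and sizes below are made up; ifort's -init=snan,arrays option does the same thing for REAL arrays without code changes):

program init_to_nan
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_signaling_nan
  implicit none
  real, allocatable :: predx(:,:)

  allocate(predx(8,4))
  ! Poison the array with signaling NaNs so that, with FP trapping enabled
  ! (e.g. -fpe0), any arithmetic use of an element that was never assigned
  ! aborts at the offending statement instead of silently corrupting results.
  predx = ieee_value(0.0, ieee_signaling_nan)

  predx(1,1) = 1.0
  print *, sum(predx(:,1))   ! traps here: elements 2:8 of column 1 were never assigned
end program init_to_nan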

DavidHuber-NOAA added a commit to DavidHuber-NOAA/GSI that referenced this issue Feb 28, 2023
@DavidHuber-NOAA
Collaborator

I spent some time moving/porting these tests (with restricted data replaced) over to S4 where I have substantially larger allocations. After doing so, I was able to fix an uninitialized variable bug in general_read_fv3atm.f90 and general_read_gfsatm.f90.

@climbfuji

  1. I tried -no-vec, -fp-speculation=safe, -no-fma, and -fexceptions, but when compiling with Intel 2023, the global_3dvar_hiproc_updat RT still crashes at the same line as before.

  2. To run with -fp-speculation=safe, I did not set -fp-model. Previously, it was set to strict, though I've considered source. -fexceptions is enabled by default.

  3. I have tried -ftrapuv, but that returns floating invalids from the CRTM modules as I recall. I can try this again and see what I can track down. I will look into the NaN option as well.

  4. There are two tests being run and they are both parallel. Both run 12 tasks per node, one runs over 5 nodes (loproc) while the other runs over 9 nodes (hiproc). The loproc case is successful, but the hiproc case crashes.

@climbfuji

This is a mystery! Thanks for all your debugging efforts.

@DavidHuber-NOAA
Collaborator

I've run the RTs with a number of -check flags enabled as well as -ftrapuv. Sure enough, -ftrapuv catches a divide by zero out of CRTM:

forrtl: error (73): floating divide by zero
Image              PC                Routine            Line        Source
libpthread-2.17.s  00002B04A24945D0  Unknown               Unknown  Unknown
gsi.x              000000000478D924  odps_coordinatema        1217  ODPS_CoordinateMapping.f90
gsi.x              00000000045A585C  odps_predictor_mp         306  ODPS_Predictor.f90
gsi.x              000000000450F841  crtm_predictor_mp         212  CRTM_Predictor.f90
gsi.x              00000000044FBC39  crtm_k_matrix_mod         693  CRTM_K_Matrix_Module.f90
gsi.x              0000000003793484  call_crtm                2157  crtm_interface.f90
gsi.x              0000000002F262AC  setuprad                  903  setuprad.f90
gsi.x              00000000022E5638  setup_                    100  gsi_radOper.F90
gsi.x              000000000176065D  setuprhsall               490  setuprhsall.f90
gsi.x              000000000229C3CB  setuprhsall_.t701           0  glbsoi.f90
gsi.x              000000000229A928  glbsoi                    323  glbsoi.f90
gsi.x              0000000000AC2E4B  glbsoi_.void                0  gsisub.F90
gsi.x              0000000000AC2D7D  gsisub                    200  gsisub.F90
gsi.x              000000000044925B  gsisub_.t10490p.t           0  gsimod.F90
gsi.x              000000000044901D  gsimain_run              2302  gsimod.F90
gsi.x              0000000000413B39  gsi                       631  gsimain.f90
gsi.x              0000000000413ADD  Unknown               Unknown  Unknown
libc-2.17.so       00002B04A2DAA495  __libc_start_main     Unknown  Unknown
gsi.x              00000000004139F6  Unknown               Unknown  Unknown

After encountering this, I disabled -ftrapuv and continued testing, which led to an array out of bounds at setupw.f90:1816:
call nc_diag_metadata("Mitigation_flag_AMVQ", sngl(data(iamvq,i)) )
The value of iamvq is 27, but the array is allocated to 26. The length of the array is determined by the file header from which data is read. For the time being, I'm going to comment out this line. I'm not sure if the file in the regression test needs to be updated or if the mitigation flag should be removed from the diagnostic metadata for setupw. Any thoughts on this, @RussTreadon-NOAA @climbfuji?
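
For reference, a defensive guard would look something like the following (this is not the change being made here, which is simply to comment the line out; it assumes size(data,1) reflects the allocated extent of 26):

       ! Only write the AMVQ mitigation flag when the obs array actually
       ! carries that column (here iamvq=27 while the array extent is 26).
       if (iamvq >= 1 .and. iamvq <= size(data,1)) then
          call nc_diag_metadata("Mitigation_flag_AMVQ", sngl(data(iamvq,i)) )
       end if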

@GeorgeVandenberghe-NOAA

@RussTreadon-NOAA
Contributor

Commit 06ed4d7 to hu5970/intel2022 adds version numbers to modulefiles/gsi_common_wcoss2.lua. Most version numbers and module names used in this file are identical to those specified in gsi_common.lua with the following exceptions:

library          gsi_common.lua   gsi_common_wcoss2.lua
netcdf           4.7.4            --
netcdf-c         --               4.9.2
netcdf-fortran   --               4.6.0
nemsio           2.5.4            2.5.2
ncdiag           1.1.1            --
gsi-ncdiag       --               1.1.1

@RussTreadon-NOAA
Contributor

Acorn spack-stack test

@arunchawla-NOAA requested that we test hu5970:intel2022 using spack-stack installation in /lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.4.1/envs/unified-env-intel22/install/modulefiles/Core

Since we have built and run hu5970:intel2022 using Mark's intel2021 spack-stack, start from this and simply update prepend_path in gsi_wcoss2.lua with the above path. Also use the gsi_common.lua and build.sh used to previously build hu5970:intel2022 with Mark's intel2021 spack-stack.

The first build attempt failed because bufr/11.7.0 could not be found. The new spack-stack references bufr/12.0.0, so update gsi_common.lua to bufr/12.0.0. The build again failed because GSI looks for bufr_d, which bufr/12.0.0 does not provide. Update src/gsi/CMakeLists.txt to reference bufr_4. As reported in issue #589, the update to bufr/12.0.0 requires source code changes in src/gsi/read_prepbufr.f90. These changes were made in the working copy of hu5970:intel2022. The second build attempt ran to completion.

Most ctests fail with a netcdf error. For example, the global_3dvar_loproc_updat run-time output contains

         -43 NetCDF: Attribute not found
99
nid001030.acorn.wcoss2.ncep.noaa.gov: rank 56 exited with code 99
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
gsi.x              0000000001AB988B  Unknown               Unknown  Unknown
libpthread-2.31.s  0000148E83E608C0  Unknown               Unknown  Unknown
libhdf5.so.103.2.  0000148E809F5B0B  H5CX_pop              Unknown  Unknown
libhdf5.so.103.2.  0000148E80B62A49  H5Pget_chunk          Unknown  Unknown
libnetcdf.so.18.0  0000148E852243FE  nc4_get_var_meta      Unknown  Unknown
libnetcdf.so.18.0  0000148E85222BCB  nc4_hdf5_find_grp     Unknown  Unknown
libnetcdf.so.18.0  0000148E8522AA2D  NC4_HDF5_inq_var_     Unknown  Unknown
libnetcdf.so.18.0  0000148E851AEF64  nc_inq_var            Unknown  Unknown
libnetcdf.so.18.0  0000148E851AEFCB  nc_inq_varndims       Unknown  Unknown
libnetcdff.so.7.0  0000148E854D1637  nf_inq_var_           Unknown  Unknown
libnetcdff.so.7.0  0000148E85512927  netcdf_mp_nf90_in     Unknown  Unknown
gsi.x              00000000016843B0  module_ncio_mp_op         499  module_ncio.f90
gsi.x              0000000000D6D14E  gesinfo_                  345  gesinfo.F90
gsi.x              00000000005CE5EF  gsisub_                   131  gsisub.F90

Notice that the build uses the production netcdf

-- FindNetCDF defines targets:
--   - NetCDF_VERSION [4.7.4]
--   - NetCDF_PARALLEL [FALSE]
--   - NetCDF_C_CONFIG_EXECUTABLE [/apps/prod/hpc-stack/intel-19.1.3.304/netcdf/4.7.4/bin/nc-config]
--   - NetCDF::NetCDF_C [SHARED] [Root: /apps/prod/hpc-stack/intel-19.1.3.304/netcdf/4.7.4] Lib: /apps/prod/hpc-stack/intel-19.1.3.304/netcdf/4.7.4/lib/libnetcdf.so
--   - NetCDF_Fortran_CONFIG_EXECUTABLE [/apps/prod/hpc-stack/intel-19.1.3.304/netcdf/4.7.4/bin/nf-config]
--   - NetCDF::NetCDF_Fortran [SHARED] [Root: /apps/prod/hpc-stack/intel-19.1.3.304/netcdf/4.7.4] Lib: /apps/prod/hpc-stack/intel-19.1.3.304/netcdf/4.7.4/lib/libnetcdff.so

This is odd since gsi_common.lua specifies

load("netcdf-c")
load("netcdf-fortran")

Further investigation is needed.

@arunchawla-NOAA

@AlexanderRichert-NOAA can you please take a look here? Thanks

@RussTreadon-NOAA
Contributor

Acorn spack-stack test (continued)

Comment out initial

load("netcdf-c")
load("netcdf-fortran")

in gsi_common.lua and add the following to the end of the file

unload("netcdf/4.7.4")
load("netcdf-c")
load("netcdf-fortran")

Rerun build.sh. This time see

-- FindNetCDF defines targets:
--   - NetCDF_VERSION [4.9.2]
--   - NetCDF_PARALLEL [TRUE]
--   - NetCDF_C_CONFIG_EXECUTABLE [/lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.4.1/envs/unified-env-intel22/install/intel/2022.0.2.262/netcdf-c-4.9.2-lnnrgek/bin/nc-config]
--   - NetCDF::NetCDF_C [SHARED] [Root: /lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.4.1/envs/unified-env-intel22/install/intel/2022.0.2.262/netcdf-c-4.9.2-lnnrgek] Lib: /lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.4.1/envs/unified-env-intel22/install/intel/2022.0.2.262/netcdf-c-4.9.2-lnnrgek/lib/libnetcdf.so
--   - NetCDF_Fortran_CONFIG_EXECUTABLE [/lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.4.1/envs/unified-env-intel22/install/intel/2022.0.2.262/netcdf-fortran-4.6.0-hzledc6/bin/nf-config]
--   - NetCDF::NetCDF_Fortran [SHARED] [Root: /lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.4.1/envs/unified-env-intel22/install/intel/2022.0.2.262/netcdf-fortran-4.6.0-hzledc6] Lib: /lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.4.1/envs/unified-env-intel22/install/intel/2022.0.2.262/netcdf-fortran-4.6.0-hzledc6/lib/libnetcdff.so

in the build log. It seems there are dependencies among the loaded modules which force the load of netcdf/4.7.4. The gsi_common.lua used to build with Mark's spack-stack does not specify versions for most modules. Adding version numbers back to gsi_common.lua could help sort out what's going on.

Given the successful build with the new spack-stack, rerun the ctests. All tests run to completion, but 6 out of 9 tests fail:

russ.treadon@alogin03:/lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr571_alex/build> ctest -j 9
Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr571_alex/build
    Start 1: global_3dvar
    Start 2: global_4dvar
    Start 3: global_4denvar
    Start 4: hwrf_nmm_d2
    Start 5: hwrf_nmm_d3
    Start 6: rtma
    Start 7: rrfs_3denvar_glbens
    Start 8: netcdf_fv3_regional
    Start 9: global_enkf
1/9 Test #8: netcdf_fv3_regional ..............***Failed  542.23 sec
2/9 Test #7: rrfs_3denvar_glbens ..............   Passed  664.38 sec
3/9 Test #9: global_enkf ......................   Passed  1087.75 sec
4/9 Test #5: hwrf_nmm_d3 ......................***Failed  1511.27 sec
5/9 Test #2: global_4dvar .....................***Failed  1861.54 sec
6/9 Test #1: global_3dvar .....................***Failed  1861.66 sec
7/9 Test #4: hwrf_nmm_d2 ......................***Failed  2045.72 sec
8/9 Test #3: global_4denvar ...................***Failed  2221.85 sec
9/9 Test #6: rtma .............................   Passed  2288.61 sec

33% tests passed, 6 tests failed out of 9

Total Test time (real) = 2288.62 sec

The following tests FAILED:
          1 - global_3dvar (Failed)
          2 - global_4dvar (Failed)
          3 - global_4denvar (Failed)
          4 - hwrf_nmm_d2 (Failed)
          5 - hwrf_nmm_d3 (Failed)
          8 - netcdf_fv3_regional (Failed)
Errors while running CTest

Examine the failed cases. For all failed cases except one, the initial total radiance penalties differ in the 12th, 13th, or 14th digit, depending on the test. The one exception is hwrf_nmm_d3. This test does not assimilate radiances, and the total initial penalties are identical between the control and update. Differences in the hwrf_nmm_d3 test first show up in the initial gradients, which differ in the 8th digit.

The radiance penalty differences are at round off level. The hwrf_nmm_d3 gradient differences are larger but may still reflect numerical round off differences. The control executables are built with intel19. The update executables are built with intel2021.

Bottom line: it is possible to build and run hu5970:intel2022 using modules from /lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.4.1/envs/unified-env-intel22/install/modulefiles

@arunchawla-NOAA

Thank you for all your help here, Russ!!!! So my take-home message is that if we build this spack-stack on Dogwood and Cactus, we will be OK to merge this PR back and start using spack-stack.

@AlexanderRichert-NOAA how different is this spack-stack from the unified environment that is being built across all the other RDHPCS platforms? Can we assume that this will work everywhere?

@bongi-NOAA

bongi-NOAA commented Aug 23, 2023 via email

@arunchawla-NOAA

@bongi-NOAA thanks for this test!! If there are 2022 hpc-stacks available on Dogwood / Cactus then we should just update the modules to use one of the 2022 stacks and proceed with this PR. We can switch to spack-stack once it is installed by NCO

@AlexanderRichert-NOAA
Contributor

@arunchawla-NOAA re: the spack-stack unified environment, I would expect it to be the same as 1.4.1 on all our other platforms (same package versions/build options).

@RussTreadon-NOAA
Contributor

CRTM coefficient problem

The crtm/2.4.0 coefficients associated with crtm-fix/2.4.0_emc from /lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.4.1/envs/unified-env-intel22/install/modulefiles/Core are not correct. While ctests run with these CRTM coefficients do not seg fault, the radiance penalties differ from develop in the 2nd digit.

The spack-stack crtm-fix/2.4.0_emc sets CRTM_FIX=/lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.4.1/envs/unified-env-intel22/install/intel/2022.0.2.262/crtm-fix-2.4.0_emc-2i4pkrg/fix.

The production crtm/2.4.0 sets CRTM_FIX=/apps/ops/prod/libs/intel/19.1.3.304/crtm/2.4.0/fix.

The spack-stack CRTM_FIX contains 1706 files. The production CRTM_FIX contains 1410 files. A diff of the two directories finds that 445 $sensor.TauCoeff.bin files differ.

The ctests reported above bypassed this problem by explicitly pointing at CRTM coefficients in /apps/ops/prod/libs/intel/19.1.3.304/crtm/2.4.0/fix.

We need to get the correct CRTM coefficients in the spack-stack CRTM_FIX.

@AlexanderRichert-NOAA
Contributor

Noted. I'll look into this.

Created JCSDA/spack-stack#735

@AlexanderRichert-NOAA
Contributor

@RussTreadon-NOAA it appears that the file ftp://ftp.ssec.wisc.edu/pub/s4/CRTM/fix_REL-2.4.0_emc.tgz, which is where spack-stack and hpc-stack get the fix files, has changed a couple of times in the last year or so. I don't know why they changed, but in any case I'm fairly certain that if the hpc-stack version were reinstalled, it would look like the current spack-stack one.

@arunchawla-NOAA

@RussTreadon-NOAA wow, that was a great catch! Who is managing these fix files?

@AlexanderRichert-NOAA
Contributor

I @'d the CRTM manager (Ben Johnson) in JCSDA/spack-stack#735

@RussTreadon-NOAA
Contributor

Acorn spack-stack questions

While the spack-stack in /lfs/h1/emc/nceplibs/noscrub/spack-stack/spack-stack-1.4.1/envs/unified-env-intel22/install/modulefiles/Core can be used to build and run hu5970:intel2022, it was done with the following modifications taken from Mark's intel/2021 spack-stack:

  1. ush/build.sh adds the lines
module unload craype-x86-rome
module unuse /apps/ops/prod/libs/modulefiles/compiler/intel/19.1.3.304

after module load gsi_$MACHINE_ID. Why do we need the unload and unuse lines? build.sh is used on various platforms. The above unload and unuse lines are not needed on other platforms.

  2. modulefiles/gsi_wcoss2.lua adds the lines
pushenv("CFLAGS", "-xHOST")
pushenv("FFLAGS", "-xHOST")

Why are these lines added?

  3. modulefiles/gsi_common_wcoss2.lua contains the following modifications
  • increment bufr_ver to 12.0.0
  • replace wrf_io with wrf-io for load
  • move netcdf-c and netcdf-fortran loads to the end of the file
    Why do we need to move the netcdf loads to the end of the file? If we don't place these module loads at the end of the file, we wind up with the production netcdf. Was changing the name of the wrf_io module to wrf-io intentional? Same question for bufr/12.0.0. GSI develop currently uses bufr/11.7.0. I'm ok with moving to 12.0.0, but is this required if we use spack-stack?
  4. module versions specified in gsi_common_wcoss2.lua differ from those in gsi_common.lua. We maintain gsi_$MACHINE_ID.lua and gsi_common.lua in modulefiles with the idea being that gsi_common.lua contains the same versioned modules used to build and run GSI on all platforms. Before this PR can be merged into develop we need to absorb gsi_common_wcoss2.lua into gsi_common.lua. This means we need to ensure other platforms have the versioned modules found in gsi_common_wcoss2.lua, or we need to get gsi_wcoss2.lua to work with the versioned modules in gsi_common.lua, or some combination of the two.

@AlexanderRichert-NOAA
Copy link
Contributor

AlexanderRichert-NOAA commented Aug 23, 2023

I can speak to some of these items:

1- My guess here is that it's to avoid production packages, but that's just a guess.
3- bufr: Ideally we'd like to get everyone using the latest versions of packages, but at least in principle, we could create a spack-stack environment using bufr 11.7.0 if absolutely necessary. Module load order: my two cents here is that it would probably be worth modifying MODULEPATH to get the production libraries out of the way, as opposed to relying on the load order to avoid them. This is what we're doing for UFS.
4- I'm not sure what the latest is on which library versions are available on WCOSS2; I'll tag @Hang-Lei-NOAA to be safe, but at least as far as UFS goes, I know there are a couple of libraries we're waiting on (namely zlib and jasper).

@RussTreadon-NOAA
Contributor

I'm fine with bufr/12.0.0 as long as it's available on WCOSS2 and NOAA RDHPCS machines. Moving to bufr/12.0.0 requires changes to src/gsi/read_prepbufr.f90.

I don't know what modifications to make to MODULEPATH to get the proper modules loaded. I'm an end user.

@arunchawla-NOAA

I have asked NCO to create a spack-stack for us in an experimental space. That will work while we get issues ironed out

RussTreadon-NOAA added a commit to hu5970/GSI that referenced this issue Sep 11, 2023
RussTreadon-NOAA added a commit to hu5970/GSI that referenced this issue Sep 12, 2023
DavidHuber-NOAA pushed a commit to DavidHuber-NOAA/GSI that referenced this issue Sep 15, 2023
DavidHuber-NOAA pushed a commit to DavidHuber-NOAA/GSI that referenced this issue Sep 15, 2023