
Bug for regional runs when dycore is compiled in double precision (64 bit) #24

Closed
climbfuji opened this issue Jun 24, 2020 · 2 comments
climbfuji commented Jun 24, 2020

I discovered a bug in the FV3 dycore that affects regional runs when the dycore is built in double precision (64-bit). The halo exchange routine exch_uv in model/fv_regional_bc.F90 uses MPI_REAL as the MPI datatype regardless of the precision of the actual Fortran variables. This leads to corrupt data being received at the other end.

When compiled with full optimization flags, this manifests itself as run-to-run differences in the results on Cheyenne with Intel 19 and SGI MPT. For reasons unknown to me, no such run-to-run differences occur on Hera; my assumption is that Intel MPI handles the mismatch differently. On all systems, however, when the code is compiled with debug flags it crashes with SIGFPE messages in the corresponding section of the code (the adiabatic init, i.e. calling fv_dynamics forward and backward).
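To see why the datatype mismatch corrupts data: with the 64-bit build each buffer element occupies 8 bytes, but MPI_REAL tells MPI to treat the buffer as 4-byte reals, so the receiver reinterprets the raw bytes. A minimal Python sketch of that reinterpretation (illustrative values only, not dycore code):

```python
import struct

# Pretend halo values, stored in double precision (8 bytes each).
halo = [1.0, 2.5, -3.25, 4.125]
wire = struct.pack("<4d", *halo)   # 32 bytes "on the wire"

# Correct interpretation: four 8-byte doubles round-trip exactly.
ok = struct.unpack("<4d", wire)
assert list(ok) == halo

# Buggy interpretation (the MPI_REAL case): the same bytes read as
# 4-byte single-precision reals yield garbage values.
bad = struct.unpack("<4f", wire[:16])
assert list(bad) != halo
```

The same reinterpretation happens inside MPI when the declared datatype does not match the kind of the Fortran buffer, which also explains the SIGFPE crashes in debug mode once the garbage values are used in floating-point arithmetic.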

This code can be fixed as follows:

```diff
+#ifdef OVERLOAD_R4
+#define _DYN_MPI_REAL MPI_REAL
+#else
+#define _DYN_MPI_REAL MPI_DOUBLE_PRECISION
+#endif

 ! Receive from north
     if( north_pe /= NULL_PE )then
-       call MPI_Irecv(buf1,ibufexch,MPI_REAL,north_pe,north_pe &
+       call MPI_Irecv(buf1,ibufexch,_DYN_MPI_REAL,north_pe,north_pe &
                      ,MPI_COMM_WORLD,ihandle1,irecv)
     endif
...
```

See PR #25.

Notes:

(1) With this bugfix, the debug jobs run further and, for the particular test that I am running, now crash later in the code in nh_utils.F90 or nh_core.F90, for both the 32-bit and the 64-bit dycore builds.

(2) With this bugfix, results on Cheyenne with Intel are still not bit-for-bit (b4b) reproducible from run to run. I do not yet know whether this is because of an issue with our test or a second bug in the dycore.

(3) While I am at it, I would also like to note that routine exch_uv still contains a ! FIXME: MPI_COMM_WORLD comment.

climbfuji commented Jun 25, 2020

Update: as it turns out, this is not the only fix required in this particular routine. With the additional changes in commit b2b0d33 (or the corrected version 2d0479d), results are now bit-for-bit identical from run to run on Cheyenne using Intel 19.1.1 and SGI MPT 2.19. Further, the runs no longer crash in DEBUG mode!

@climbfuji
Fixed in #25.
