
Bug for regional runs when dycore is compiled in double precision (64 bit) #24

Closed
climbfuji opened this issue Jun 24, 2020 · 2 comments
climbfuji commented Jun 24, 2020

I discovered a bug in the FV3 dycore that affects regional runs when the dycore is built in double precision (64-bit). The halo exchange routine exch_uv in model/fv_regional_bc.F90 uses MPI_REAL as the MPI datatype regardless of the precision of the actual Fortran variables. This leads to corrupt data being received at the other end.

When compiled with full optimization flags, this manifests itself as run-to-run differences in the results on Cheyenne with Intel 19 and SGI MPT. For reasons unknown to me, no such run-to-run differences occur on Hera; my assumption is that Intel MPI handles the mismatch differently. On all systems, however, when the code is compiled with debug flags it crashes with SIGFPE messages in the corresponding section of the code (the adiabatic init, i.e. calling fv_dynamics forward and backward).
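To see why the datatype mismatch corrupts data: with the 64-bit build each buffer element occupies 8 bytes, but MPI_REAL tells MPI to treat the buffer as 4-byte reals, so the receiver reinterprets the raw bytes. A minimal Python sketch of that reinterpretation (illustrative values only, not dycore code):

```python
import struct

# Pretend halo values, stored in double precision (8 bytes each).
halo = [1.0, 2.5, -3.25, 4.125]
wire = struct.pack("<4d", *halo)   # 32 bytes "on the wire"

# Correct interpretation: four 8-byte doubles round-trip exactly.
ok = struct.unpack("<4d", wire)
assert list(ok) == halo

# Buggy interpretation (the MPI_REAL case): the same bytes read as
# 4-byte single-precision reals yield garbage values.
bad = struct.unpack("<4f", wire[:16])
assert list(bad) != halo
```

The same reinterpretation happens inside MPI when the declared datatype does not match the kind of the Fortran buffer, which also explains the SIGFPE crashes in debug mode once the garbage values are used in floating-point arithmetic.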

This code can be fixed as follows:

```diff
+#ifdef OVERLOAD_R4
+#define _DYN_MPI_REAL MPI_REAL
+#else
+#define _DYN_MPI_REAL MPI_DOUBLE_PRECISION
+#endif

 ! Receive from north
     if( north_pe /= NULL_PE )then
-       call MPI_Irecv(buf1,ibufexch,MPI_REAL,north_pe,north_pe &
+       call MPI_Irecv(buf1,ibufexch,_DYN_MPI_REAL,north_pe,north_pe &
                      ,MPI_COMM_WORLD,ihandle1,irecv)
     endif
...
```

See PR #25.

Notes:

(1) With this bugfix, the debug jobs run further and, for the particular test that I am running, now crash later in the code in nh_utils.F90 or nh_core.F90, for both the 32-bit and the 64-bit dycore builds.

(2) With this bugfix, results on Cheyenne with Intel are still not bit-for-bit (b4b) reproducible from run to run. I do not yet know whether this is because of an issue with our test or a second bug in the dycore.

(3) While I am at it, I would also like to note that routine exch_uv still contains a ! FIXME: MPI_COMM_WORLD comment.

climbfuji commented Jun 25, 2020

Update: as it turns out, this is not the only fix required in this particular routine. With the additional changes in commit b2b0d33 (or the corrected version 2d0479d), results are now bit-for-bit identical from run to run on Cheyenne using Intel 19.1.1 and SGI MPT 2.19. Further, the runs no longer crash in DEBUG mode!

@climbfuji
Fixed in #25.
