
Test ERS_Lh11.C96.GFSv15p2.cheyenne_intel FAILS in restart comparison #62

Closed · jedwards4b opened this issue Jan 16, 2020 · 34 comments
Labels: bug (Something isn't working), critical

@jedwards4b (Collaborator)

This test indicates that restarts are not producing bit-for-bit (b4b) results under CIME testing.

The file comparisons show:
run/ERS_Lh11.C96.GFSv15p2.cheyenne_intel.20200116_085748_nk2uyu.ufsatm.atm.f011.nc.base.cprnc.out: of which 14 had non-zero differences
run/ERS_Lh11.C96.GFSv15p2.cheyenne_intel.20200116_085748_nk2uyu.ufsatm.sfc.f011.nc.base.cprnc.out: of which 125 had non-zero differences
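For context, these lines are grep-style hits in the cprnc comparison output that the ERS test writes; a sketch of reproducing the summary, assuming CIME's cprnc tool is on PATH (file names shortened and hypothetical):

cprnc ufsatm.atm.f011.nc ufsatm.atm.f011.nc.base > ufsatm.atm.f011.nc.base.cprnc.out
grep "non-zero differences" run/*.cprnc.out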

jedwards4b added the bug (Something isn't working) and critical labels on Jan 16, 2020
jedwards4b self-assigned this on Jan 16, 2020
@climbfuji (Collaborator)

I am currently working on getting the restart tests into the rt.sh regression test system, in part because this was reported independently in NOAA-EMC/fv3atm#42. It would make sense to wait for the rt.sh-based tests to be implemented before spending more time on this.

@pjpegion (Collaborator)

@climbfuji Please let me know if you want me to do anything.

@climbfuji (Collaborator)

I can confirm that with the namelist settings in the ufs_public_release branches for the GFS_v15p2 tests, the restarts do not work. I am now trying to fix this; I have a few ideas about what may differ from the tests that we know are b4b reproducible in restart runs.

@climbfuji (Collaborator)

@jedwards4b @mcgibbon I have a solution for this (tested on my Mac for GFSv15p2 thus far). The default namelist settings for both GFSv15p2 and GFSv16beta in the ufs_public_release branch of the ufs-weather-model repository turn on skeb, shum and sppt. The stochastic physics do not reproduce in restart runs, because the logic for dealing with restarts hasn't been implemented in the stochastic_physics repo (@pjpegion) and the model isn't writing those fields to the restart files (@DusanJovic-NOAA @junwang-noaa). My suggestion for the public release is to (a) turn off stochastic physics in the default namelists (Phil suggested this anyway, but I missed it) and (b) document that using the stochastic perturbations is an advanced feature that currently does not support b4b identical results through restarts (@ligiabernardet). For our development branches, we need to implement this capability in stochastic_physics and fv3atm in the near future. Any objections?
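For reference, a minimal sketch of the input.nml change proposed in (a); the do_sppt/do_shum/do_skeb switches are assumed to live in &gfs_physics_nml in this model version, so check the release namelists for the exact names:

&gfs_physics_nml
  do_sppt = .false.   ! stochastically perturbed physics tendencies off
  do_shum = .false.   ! stochastic humidity perturbations off
  do_skeb = .false.   ! stochastic kinetic energy backscatter off
/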

@jedwards4b (Collaborator, Author)

jedwards4b commented Jan 17, 2020 via email

@ligiabernardet (Collaborator)

The default configurations for this release have all stochastic processes turned off.
@climbfuji: can you produce b4b restarts with stochastics off?

@uturuncoglu (Collaborator)

@jedwards4b Yes, I confirm that. We turned off stochastic physics, but it was still not b4b.

@climbfuji (Collaborator)

On my Mac, I am getting b4b identical results w/o stochastic physics. Now testing on Cheyenne with Intel.

@climbfuji (Collaborator)

Just to make sure: are you modifying the nstf_name namelist entry as well for the restart runs? The usual regression tests for ufs-weather-model use 2,1,1,0,5 for cold starts. When restarting, one needs to set the second 1 to 0 (that is the NSST spinup flag, one of the "hidden features" - don't blame me). The input.nml we got from EMC uses 2,1,0,0,0 for cold starts. I am testing now whether 2,0,0,0,0 works for restarts or whether we need to switch to "2,1,1,0,5" and "2,0,1,0,5". Just be patient, please.
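For concreteness, the two variants side by side; nstf_name is assumed to sit in &gfs_physics_nml, and only the second entry (the NSST spinup flag) differs:

&gfs_physics_nml   ! cold start
  nstf_name = 2, 1, 1, 0, 5
/

&gfs_physics_nml   ! restart: NSST spinup flag set to 0
  nstf_name = 2, 0, 1, 0, 5
/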

@junwang-noaa
junwang-noaa commented Jan 17, 2020 via email

@climbfuji (Collaborator)

I agree, Jun, it is not hidden to people who have access to Vlab. I am not sure whether it is in the ufs-weather-model documentation for the release (I am lost wrt documentation), and I am not sure whether the CIME folks know about it ... let's wait to hear from them!

@uturuncoglu (Collaborator)

@climbfuji I have already done that. My previous tests are in:

Base run:
/glade/scratch/turuncu/ufs-mrweather-app-workflow.c96.base/run
nstf_name = 2, 1, 0, 0, 0

Restart run:
/glade/scratch/turuncu/ufs-mrweather-app-workflow.c96.rest/run
nstf_name = 2, 0, 0, 0, 0

By default, stochastic physics is off.

@climbfuji (Collaborator)

That's good to know, thanks. If that fails, I will test the default 2,1,1,0,5 settings. Just wait, please.

@mcgibbon
@climbfuji when you get things working, could you attach an input.nml which is working locally for you? I'd like to test it on my system. I would just ask which options disable skeb, shum, and sppt, but I can see those are disabled in my log file. I'd like to glance at whatever else might be different in my setup.

@climbfuji (Collaborator)

Sure. But note, everyone, that I will be taking this weekend off (definitely Sunday and Monday), so please don't expect any answers before Tuesday. Thanks ...

@climbfuji (Collaborator)

Everyone, please see NOAA-EMC/fv3atm#42 for the solution/namelists/... Thanks!

@jedwards4b (Collaborator, Author)

@climbfuji I tried the CIME test with these changes; it still fails.
I am using settings:
nstf_name = 2, 1, 1, 0, 5
for the initial run and
nstf_name = 2, 0, 1, 0, 5
for the restart run. I also updated the stochastic physics source.
My source tree is /glade/u/home/jedwards/sandboxes/ufs-mrweather-app
and the test is in /glade/scratch/jedwards/ERS_Lh11.C96.GFSv15p2.cheyenne_intel.20200119_103112_odlpjt

@climbfuji (Collaborator)

I don't think I have the time to look at the differences between your runs and mine today. Here is a copy of all the directories you need on Cheyenne:

/glade/work/heinzell/fv3/rundirs_for_cime_restart_issues/

You will be interested in the following directories:

fv3_ccpp_gfs_v15p2_prod # 0-48h fcst
fv3_ccpp_gfs_v15p2_coldstart_prod # 0-24h fcst
fv3_ccpp_gfs_v15p2_restart_prod # 24-48h fcst, restart files from coldstart run were copied into INPUT

fv3_ccpp_gfs_v16beta_prod # 0-48h fcst
fv3_ccpp_gfs_v16beta_coldstart_prod # 0-24h fcst
fv3_ccpp_gfs_v16beta_restart_prod # 24-48h fcst, restart files from coldstart run were copied into INPUT

@climbfuji (Collaborator)

I am beginning to wonder if this is related to the debug-run problems you have been seeing, i.e. the missing update to the ufs_release_v1.0 branch for chgres_cube from George Gayno and the missing compiler flags for the GNU compiler for this executable.

@jedwards4b (Collaborator, Author)

This test is using the Intel compiler, so I'm not sure what GNU would have to do with it. The biggest difference I see is that you are using cubed_sphere_grid for output_grid while I am using gaussian_grid. I'm looking into this now.
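For reference, a sketch of the setting in question; output_grid is assumed to live in model_configure (the write-component configuration) in this release:

# cime run in this test
output_grid: 'gaussian_grid'
# rt.sh runs referenced above
output_grid: 'cubed_sphere_grid'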

@climbfuji (Collaborator)

The same tests passed with the GNU compilers as well. They are identical except for the modules.fv3 files. I can rerun the tests on Cheyenne with GNU and keep the rundirs, but as I said the differences will be in modules.fv3 and in the actual model output.

@uturuncoglu (Collaborator)

uturuncoglu commented Jan 21, 2020

@jedwards4b I tested with output_grid = 'cubed_sphere_grid', but the restart still fails. I'll try to find other possible differences between the namelist files. I could also test using the input.nml and model_configure files from the following runs:

fv3_ccpp_gfs_v15p2_prod # 0-48h fcst
fv3_ccpp_gfs_v15p2_coldstart_prod # 0-24h fcst
fv3_ccpp_gfs_v15p2_restart_prod # 24-48h fcst, restart files from coldstart run were copied into INPUT

fv3_ccpp_gfs_v16beta_prod # 0-48h fcst
fv3_ccpp_gfs_v16beta_coldstart_prod # 0-24h fcst
fv3_ccpp_gfs_v16beta_restart_prod # 24-48h fcst, restart files from coldstart run were copied into INPUT

@uturuncoglu (Collaborator)

@climbfuji I tested your input.nml with the CIME-built model for v15p2 and we still have differences in the restart. So at least the problem is not related to input.nml. I'll continue to dig, but let me know if you have any other ideas. The runs are in

Base (48 hours): /glade/scratch/turuncu/ufs-mrweather-app-workflow.c96.jan16/run.base2
Restart (24+24 hours): /glade/scratch/turuncu/ufs-mrweather-app-workflow.c96.jan16/run.rest2

@climbfuji (Collaborator)

I can think of

  • differences in the compiler flags
  • differences in the NCEPLIBS (unlikely imo)
  • differences in the initial conditions (unlikely imo)

I need to get this CIME setup running myself. Will try tomorrow.
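One way to chase the first bullet, sketched with hypothetical log names: run both builds with verbose make, then diff the Fortran compile lines.

grep ifort cime_build.log | sort > cime_flags.txt
grep ifort rt_build.log | sort > rt_flags.txt
diff cime_flags.txt rt_flags.txt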

@uturuncoglu (Collaborator)

The initial documentation is at

https://ufs-mrapp.readthedocs.io/en/latest/index.html#

I am still working on it, but you can already find lots of information there, especially in the quick start guide.

@jedwards4b (Collaborator, Author)

@climbfuji I ran the CIME restart test with your executable and it passed. This points to a difference in the build, perhaps in the build flags, but I also noticed that you were not using the latest model version: 6a93463

@climbfuji (Collaborator)

> @climbfuji I ran the CIME restart test with your executable and it passed. This points to a difference in the build, perhaps in the build flags, but I also noticed that you were not using the latest model version: 6a93463

Yes, the code I had used for the testing didn't include the last PR. But the current PR I have open, for which I reran the restart tests, does include it (ufs-community/ufs-weather-model#33).

@jedwards4b (Collaborator, Author)

I built using src/model/tests/compile_cmake.sh and it also passed the restart test; I've been studying the build since then and still cannot pinpoint the difference.

@climbfuji (Collaborator)

If you send me the build logs (cmake and make; you may have to add VERBOSE=1 to the make calls), then I can take a look. Maybe something will come to mind wrt which files to look at when I stare at this long enough. Thanks ...
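For reference, with CMake's Makefile generator a verbose log can be captured like this (the log file name is just an example):

make VERBOSE=1 2>&1 | tee make_verbose.log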

@jedwards4b (Collaborator, Author)

/glade/scratch/jedwards/ERS_Lh11.C96.GFSv15p2.cheyenne_intel.try/bld/atm.bldlog.200121-200946.gz

@jedwards4b (Collaborator, Author)

This problem is fixed. The build flags for libfv3core.a were different.

@climbfuji (Collaborator)

Yeah! Thanks for figuring this out; I was struggling all day to find time to look at your compile logs.

@mcgibbon
Can you please elaborate on the fix @jedwards4b? I'm having the same issue with a different build system.

@jedwards4b (Collaborator, Author)

@mcgibbon I found that the NOAA build was using the flag -fp-model consistent, while the CIME build was using -fp-model source, when compiling the fv3core library. Changing the CIME compile to match the NOAA compile solved the problem.
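For anyone reproducing the fix in another build system, a sketch of the change (Makefile-style fragment; the surrounding flags are elided):

# before: cime compile of libfv3core.a (restart comparison fails)
FFLAGS = ... -fp-model source ...
# after: match the noaa build (b4b restarts)
FFLAGS = ... -fp-model consistent ...

With the Intel compilers, -fp-model consistent is the stricter setting aimed at reproducible floating-point results; -fp-model source mainly constrains value-changing optimizations to source precision, so the two can generate different code.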
