
Update for latest version of weather model (hash 2f1c8e1), add RRFS_NA_3km pre-defined domain, update timestep and MPI settings #492

Merged: 6 commits merged into ufs-community:develop on May 25, 2021

Conversation

@mkavulich (Collaborator) commented May 12, 2021

DESCRIPTION OF CHANGES:

This PR accomplishes three things:

  1. A new pre-defined domain (RRFS_NA_3km) has been added to the SRW App. Node/core settings must be modified for chgres_cube and post due to the size of this domain. A WE2E test was added, and more information on all of these settings can be found within the related config.sh script (tests/baseline_configs/config.grid_RRFS_NA_3km.sh).
  2. The default k_split value is updated for a faster model integration. With k_split=2, model integration is close to X% faster than with the previous settings for the same weather model hash. This will not affect physics suites that specify other k_split values.
  3. In order to properly run the above domain with the intended FV3_RRFS_v1alpha physics suite, the weather model needed to be updated to a more recent hash. This newer weather model version also renames the FV3_GFS_v16beta suite to FV3_GFS_v16, which required a number of changes to the workflow and end-to-end tests. In addition, this PR changes several default settings: the k/n_split values in the namelist template are updated to optimize run time, and CPUS_PER_TASK_RUN_FCST is changed from "4" to "2". Setting this field to "4" doubled the requested nodes for the run_fcst task. For example, a 3-km CONUS run that normally requests 25 nodes (based on the predefined layout_x/y values) was asking for 50, simply because CPUS_PER_TASK_RUN_FCST=4, which was unacceptable. When it is set to "2", the number of nodes remains unchanged, in line with the layout_x/y values (see the node-count sketch after this list). EMC is using CPUS_PER_TASK_RUN_FCST=2 for their runs, so this should be uncontroversial.
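
As a rough illustration of the node-count arithmetic described in item 3 above, here is a back-of-the-envelope sketch; the cores-per-node and MPI-task figures are illustrative assumptions, not the actual RRFS_CONUS_3km values.

```sh
#!/bin/bash
# Back-of-the-envelope node-count arithmetic for the run_fcst task.
# CORES_PER_NODE and NUM_MPI_TASKS are illustrative assumptions; the real
# values come from the platform and the predefined layout_x/layout_y.
CORES_PER_NODE=40
NUM_MPI_TASKS=500
for CPUS_PER_TASK in 2 4; do
  SLOTS=$(( NUM_MPI_TASKS * CPUS_PER_TASK ))
  NODES=$(( (SLOTS + CORES_PER_NODE - 1) / CORES_PER_NODE ))  # ceiling division
  echo "CPUS_PER_TASK_RUN_FCST=${CPUS_PER_TASK} -> ${NODES} nodes"
done
# With these placeholder numbers: 2 -> 25 nodes, 4 -> 50 nodes, mirroring the
# doubling described above.
```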

Note that the changes for the new RRFS_NA_3km domain and default decomposition/CPU settings were originally proposed by @JeffBeck-NOAA in PR #480; however, since that PR requires the weather model updates in order to be tested with the intended physics suite, I have incorporated those changes into mine. As Jeff is on vacation right now, I hope he won't mind me stealing his thunder :)

This PR will need to be accompanied by changes in the ufs-srweather-app to update the weather model hash and incorporate some necessary build changes (including compiling with 32-bit reals by default); that PR has been created (ufs-community/ufs-srweather-app#140), but it is still a draft pending some platform-specific fixes and the merging of this PR.

TESTS CONDUCTED:

For the initial changes for PR #480, multiple tests on Hera were run, including a full 36-hr forecast here: /scratch2/BMC/det/beck/FV3-LAM/expt_dirs/test_RRFS_NA_3km_36hr

With the additional changes and updates to the weather model, and updates to the Hera environment file, all end-to-end tests (aside from the nco tests) were run on Hera (Intel). There were a few pre-existing failures, and aside from an occasional GST failure due to wallclock time issues (see #490), the only new failures were for grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_GSD_SAR, grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_HRRR, and grid_RRFS_CONUS_25km_ics_NAM_lbcs_NAM_suite_RRFS_v1beta, all of which failed in make_ics and make_lbcs. I am currently investigating this issue, though it is almost certainly related to the build environments, which need to be addressed in ufs-community/ufs-srweather-app#140.

Tests are ongoing on Cheyenne, Jet, and Orion; I will update with the results once those are complete. However, I believe the Hera tests are sufficient for this PR, as any failures specific to those platforms will need to be addressed at the app level.

CONTRIBUTORS:

@JeffBeck-NOAA authored the portion of these changes originating from #480 and offered the following credits on his original PR:

Thanks are due to @JamesAbeles-NOAA for his recommendations for build/namelist changes and help troubleshooting run times. Thanks to @BenjaminBlake-NOAA and @JacobCarley-NOAA for their help with the domain configuration.

@mkavulich changed the title to "Update for latest version of weather model (hash 2f1c8e1), add RRFS_NA_3km pre-defined domain, update timestep and MPI settings" May 13, 2021
# 2) 1) with k_split=5/n_split=2, each time step takes ~15 seconds
# 3) 2) with 32-bit build, each time step takes ~6 seconds

NNODES_MAKE_ICS="12"
Collaborator:

Are these settings specific to Hera? If we need to change these according to platform, we can put that kind of code in run_experiments.sh. That's where I've been doing platform-specific differentiation of settings.

Collaborator Author:

@JeffBeck-NOAA can better answer these questions, since these changes came from him.

I think the problem is specific to the domain, which is very large and high-resolution, requiring more nodes for make_ics and make_lbcs for memory reasons. I assume that would mean this is required on all platforms.

I do think that these settings are less than ideal. Making PPN_MAKE_ICS and PPN_MAKE_LBCS lower means we will be under-utilizing the nodes.

We do need to have a separate conversation about making PPN_RUN_FCST (and maybe every PPN setting) platform-specific in the defaults. Right now we are under-utilizing nodes on most platforms with the default settings. Maybe this can be rolled into issue #452, since the OMP settings will have to be taken into account as well?

Collaborator:

These settings are required for all platforms due to the domain size and resolution. The PPN_MAKE_ICS and PPN_MAKE_LBCS values are a specific chgres_cube requirement for large domains (need fewer processes but massive amounts of memory) due to an ESMF limitation/memory bug.
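
To put rough numbers on that memory argument, here is an illustrative sketch; the per-node memory and PPN values are assumptions for a typical HPC node, not figures from this PR.

```sh
#!/bin/bash
# Why fewer ranks per node helps chgres_cube on this domain: each MPI rank
# gets a larger share of node memory. 96 GB/node and the PPN values are
# assumed, illustrative numbers.
NODE_MEM_GB=96
PPN_FULL=40      # filling every core on the node
PPN_REDUCED=4    # the kind of reduced PPN discussed in this thread
echo "mem per rank at PPN=${PPN_FULL}:    $(( NODE_MEM_GB / PPN_FULL )) GB"
echo "mem per rank at PPN=${PPN_REDUCED}: $(( NODE_MEM_GB / PPN_REDUCED )) GB"
```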

Collaborator Author:

@JeffBeck-NOAA Thanks for the clarification. In that case, should these be handled somewhere in the generate-workflow calling tree so that these settings become the default for this domain specifically? Maybe that's something best handled in an issue and a follow-up PR in the future?

Collaborator:

@mkavulich, yeah, this is where things can get really thorny. Do we want to fill the generate script with nested if/case statements for different domains, resolutions, and platforms just to change PPN for individual tasks? I wasn't sure if we wanted to go that route or just have users source the specific WE2E config.sh file when they want to run this domain. I think this definitely deserves an issue and a potential follow-up PR.

Collaborator:

There's likely some variability between platforms, but as a first approximation, chgres_cube needs more nodes and fewer cores per node for memory management on this domain across all platforms. Since there are no NNODES_* or PPN_* settings currently defined in the set_predef_grid_params.sh script, I left the changes in the WE2E test. The test is designed for Hera, but could be run on any platform.
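
If such defaults were ever moved into set_predef_grid_params.sh, one possibility is a per-grid branch like the sketch below. This is hypothetical: the script does not currently set any NNODES_*/PPN_* values, and apart from NNODES_MAKE_ICS="12" (from the WE2E test) the numbers are placeholders.

```sh
# Hypothetical sketch of grid-specific chgres_cube resources inside
# set_predef_grid_params.sh; not part of the current script.
case "${PREDEF_GRID_NAME}" in
  "RRFS_NA_3km")
    NNODES_MAKE_ICS="12"     # value used in the WE2E test added by this PR
    PPN_MAKE_ICS="4"         # placeholder: few ranks, more memory per rank
    NNODES_MAKE_LBCS="12"    # placeholder
    PPN_MAKE_LBCS="4"        # placeholder
    ;;
  *)
    # other grids keep the workflow-wide defaults
    ;;
esac
```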

Collaborator Author:

@gsketefian I was able to run this test on Cheyenne as well (but the forecast step did not complete due to wallclock time, so I can't be sure if it would have fully run successfully).

I agree that k_split and n_split should likely be handled by domain; however, they are currently also set according to the physics scheme, which complicates things. I believe they should also be configurable via config.sh for maximum flexibility. But again, that complicates things.

I can take this change out of the current PR. More discussion definitely needs to happen; the question is whether we keep this change for now or not.

Collaborator:

@mkavulich, if the NA 3-km WE2E test runs with the old k/n_split settings, then let's stick with them and deal with domain-, physics-, and platform-dependent settings in a later PR.

Collaborator:

Sounds good to me.

Contributor:

On AWS for the RRFS work, we run chgres on a single node that has 768 GB of memory. I wonder if there is a way to specify large-memory nodes if they are available, but I will leave that to you.
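
For what it's worth, one generic way a scheduler could be pointed at large-memory nodes is sketched below. This is a plain Slurm example, not how the workflow's job cards are generated; the partition name and memory request are site-specific assumptions.

```sh
#!/bin/bash
#SBATCH --job-name=make_ics
#SBATCH --nodes=1
#SBATCH --partition=bigmem     # hypothetical large-memory partition name
#SBATCH --mem=700G             # request most of a 768 GB node
# Run only a few MPI ranks so each keeps a large share of the node's memory.
srun -n 4 ./chgres_cube
```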

@@ -85,15 +85,15 @@
   hord_vt = 6
   hydrostatic = .false.
   io_layout = 1,1
-  k_split = 4
+  k_split = 5
Collaborator:

So the new k/n_split values are better (or at least not worse) for all grids? They are not at all grid-dependent, or maybe even suite-dependent?

Collaborator Author:

@JeffBeck-NOAA will be able to answer this better, but I did notice that the Graduate Student Test now takes longer to run, often hitting the wallclock limit (#490). This may be due to the different k_split and n_split settings, but I did not do a rigorous comparison between the current develop branch and my changes.

I think it is definitely necessary to address these concerns but I do not know enough about the inner workings of the dycore to have an educated opinion.

Collaborator:

@gsketefian, yes these k/n_split values are better for all grids and are the recommended values from EMC. I'm not sure why the GST would take longer to run. Was #490 a problem before this PR?

Collaborator:

@gsketefian, @mkavulich, I have confirmed that the n_split/k_split values that I received need to be flipped. So it should be k_split=2/n_split=5 and not k_split=5/n_split=2. This can go in a future PR, but @mkavulich, this is likely why your GST run is slower. If we want, we can keep the previous values for now (k_split=4/n_split=5).

Collaborator Author:

I think I'd rather fix it here if this is what needs to go in. I'll swap those values and see what the result is. Assuming it is not slower, I'll apply those changes to this PR.

I am actually not sure if #490 was a specific problem before making these changes, but I did notice that my unmodified runs were quite close to reaching the wallclock limit (within a few minutes).

Contributor:

I can't see how k_split=5, n_split=2 can be faster than k_split=2, n_split=5. There is more expensive computation done using the former (more calls done with k_split=5).
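
For readers following the thread, the timestep arithmetic behind that statement is sketched below; DT_ATMOS is an assumed value, not one taken from this PR.

```sh
#!/bin/bash
# How k_split/n_split subdivide the physics timestep in FV3: k_split is the
# number of vertical-remapping loops per DT_ATMOS, and n_split is the number
# of acoustic substeps per remapping loop. DT_ATMOS=40 s is an assumption.
DT_ATMOS=40
for SPLITS in "2 5" "5 2"; do
  set -- $SPLITS; K_SPLIT=$1; N_SPLIT=$2
  echo "k_split=${K_SPLIT}, n_split=${N_SPLIT}:" \
       "remap every $(( DT_ATMOS / K_SPLIT )) s," \
       "acoustic substep $(( DT_ATMOS / (K_SPLIT * N_SPLIT) )) s"
done
# Both combinations give the same acoustic substep (2*5 = 5*2 = 10 per physics
# step), but k_split=5 performs the vertical remapping 5 times per step instead
# of 2, which is the extra cost noted above.
```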

Collaborator Author:

@JeffBeck-NOAA @JamesAbeles-NOAA Sorry, I just read my initial message and I agree it was super-unclear.

The new test (k_split=2 and n_split=5) is the faster one, approximately 20% faster than (k_split=5 and n_split=2). So it sounds like this is good and expected.

Collaborator:

@mkavulich, Ah! That makes much more sense. I also couldn't understand how more vertical remappings were resulting in a faster run. In that case, are we ready for a merge with those new settings?

Collaborator Author:

I pinged Gerard to make sure I had addressed all of his concerns before merging. I'm also hoping to finish up some follow-up tests with the old settings so I can be sure I have the right figures in the commit message, but if that doesn't finish before Gerard gets back to me I'll merge anyway; this PR needs to get in.

Collaborator:

My original concern was that with the new k_split/n_split values, although we're improving the computational efficiency (and maybe also the forecast quality?) for one grid (e.g. CONUS 3km), we might be degrading the forecast quality for another and not know it -- maybe the Alaska grids, where the vertical remapping may have to be done more frequently because of the steep terrain. This is probably beyond the scope of this PR, but we should keep it in mind. Maybe the bigger question is whether k_split and n_split should be the same for all grids or should be grid-dependent.

@mkavulich merged commit 60e633a into ufs-community:develop May 25, 2021
mkavulich referenced this pull request in ufs-community/ufs-srweather-app Jun 2, 2021
…l (hash 2f1c8e1) (#140)

## DESCRIPTION OF CHANGES: 
This PR (along with https://github.com/NOAA-EMC/regional_workflow/pull/492) updates the UFS SRW App to work with a more up-to-date hash of the ufs-weather-model (ufs-community/ufs-weather-model@2f1c8e1).

## TESTS CONDUCTED: 
Ran the full suite of tests on Hera (aside from the nco tests) with updated environment files (/scratch2/BMC/det/kavulich/workdir/update_app_master/step-by-step/expt_dirs/). The only new failures are for older versions of NAM input; this is due to a change in the weather model and will need to be handled in a separate PR.

Ran several end-to-end tests on Cheyenne (Intel 19.1.1) and Jet. Also ran the Graduate Student Test case on Orion. No failures outside of those outlined above and a few previously-known failures.

**Tests have not been run on WCOSS platforms; these will likely fail without being updated to the latest ESMF modules, but DTC does not have access to update and test.**

## ISSUE: 
Fixes issue #134
shoyokota pushed a commit to shoyokota/regional_workflow that referenced this pull request May 31, 2023