
Use MPI_Bcast instead of multiple p2p messages to update nest from parent #272

Merged: 14 commits into NOAA-GFDL:dev/emc on Jan 22, 2024

Conversation

@dkokron commented May 15, 2023

Performance profiling of a HAFS case on NOAA systems revealed that a significant amount of time is spent in fill_nested_grid_cpl() from FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90. This routine gathers a global SST field (6,336,000 bytes) onto rank 0, then broadcasts that field to all ranks in the nest using point-to-point (p2p) messages (Isend/Recv) from rank 0 to the nest ranks. This communication pattern maxes out the Slingshot-10 link on the first node, resulting in a 0.15 s hit every fifth time step, which works out to 189 seconds for a full 5-day simulation. These timings are from the 26-node HAFS case on acorn.

Given the way it is currently coded, I expect the bottleneck to grow with the number of ranks in the nest. The table below shows timings from a 47-node production case run for a 24-hour simulation on cactus, with and without the code changes contained in this PR.
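
For illustration, here is a minimal sketch of the two communication patterns. This is a toy subroutine, not the actual atmosphere.F90 code; the names (bcast_sketch, nest_ranks, bcast_comm, and so on) are hypothetical.

    ! Sketch only: not the PR's code; all names are made up for illustration.
    subroutine bcast_sketch(g_dat, nest_ranks, sending_rank, fcst_comm, bcast_comm, use_bcast)
      use mpi
      implicit none
      real,    intent(inout) :: g_dat(:,:,:)   ! gathered global SST field
      integer, intent(in)    :: nest_ranks(:)  ! ranks of the nest PEs in fcst_comm
      integer, intent(in)    :: sending_rank   ! rank that holds the gathered field
      integer, intent(in)    :: fcst_comm      ! forecast communicator (all PEs)
      integer, intent(in)    :: bcast_comm     ! pre-built communicator: sending rank + nest ranks
      logical, intent(in)    :: use_bcast
      integer :: n, me, ierr
      integer, allocatable :: reqs(:)

      call MPI_Comm_rank(fcst_comm, me, ierr)

      if (.not. use_bcast) then
        ! Current pattern: the sending rank pushes the same field to every nest rank with
        ! p2p messages, so all of the traffic leaves one node and can saturate its link.
        if (me == sending_rank) then
          allocate(reqs(size(nest_ranks)))
          do n = 1, size(nest_ranks)
            call MPI_Isend(g_dat, size(g_dat), MPI_REAL, nest_ranks(n), 0, fcst_comm, reqs(n), ierr)
          enddo
          call MPI_Waitall(size(nest_ranks), reqs, MPI_STATUSES_IGNORE, ierr)
        else if (any(me == nest_ranks)) then
          call MPI_Recv(g_dat, size(g_dat), MPI_REAL, sending_rank, 0, fcst_comm, MPI_STATUS_IGNORE, ierr)
        endif
      else
        ! Proposed pattern: one collective over bcast_comm, in which the sending rank is rank 0,
        ! letting the MPI library use a tree/pipelined algorithm instead of a single fan-out.
        if (me == sending_rank .or. any(me == nest_ranks)) then
          call MPI_Bcast(g_dat, size(g_dat), MPI_REAL, 0, bcast_comm, ierr)
        endif
      endif
    end subroutine bcast_sketch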

An offline reproducer showed that setting the following environment variables improved the performance of using MPI_Bcast in this scenario.

export MPICH_BCAST_ONLY_TREE=0
export MPICH_COLL_OPT_OFF=mpi_bcast

Scenarios:

  1. Unmodified code (Baseline)
  2. Unmodified code with above environment settings
  3. Code modified following this PR with environment settings
  4. Code modified to use MPI_Bcast directly with above environment settings
  5. Code modified to use MPI_Bcast directly without environment settings

Performance metric: add up the phase1 and phase2 timings printed in the output listing. The units are seconds.

    grep PASS stdout | awk '{t+=$10;print t}' | tail -1

Scenario   Trial1   Trial2   Trial3   Trial4   Trial5   Trial6    Mean
1           966.2    960.9        -        -        -        -       -
2           964.9    966.9    986.6    976.4        x        x   973.7
3           971.3    955.8    934.4    941.6    944.8    936.7   947.4
4           899.3    896.5    915.6    911.3    909.6    912.0   907.3
5           895.1    899.9        -        -        -        -       -

Based on these timings, I expect savings from scenario 3 (this PR) of ~(26s)*5=131 seconds for a full 5-day simulation.
Based on these timings, I expect savings from scenario 4 (previous version of this PR) of ~(66s)*5=332 seconds for a full 5-day simulation.

This PR does not address any open issues.

I ran the UFS regression suite on acorn and cactus. Both runs resulted in "REGRESSION TEST WAS SUCCESSFUL"

Checklist:

Please check all items, whether they apply or not:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

@@ -452,6 +456,9 @@ subroutine fv_control_init(Atm, dt_atmos, this_grid, grids_on_this_pe, p_split,
allocate(global_pelist(npes))
call mpp_get_current_pelist(global_pelist, commID=global_commID) ! for commID

! Need the fcst_mpi_comm in order to construct the communicators used by MPI_Bcast in fill_nested_grid_cpl()
call ESMF_VMGetCurrent(vm=vm,rc=rc)
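
For context, a hedged sketch of how the forecast MPI communicator might be pulled out of the ESMF VM at this point. Only the ESMF_VMGetCurrent call appears in the diff above; the ESMF_VMGet call and the fcst_mpi_comm name are assumptions for illustration.

    type(ESMF_VM) :: vm
    integer       :: fcst_mpi_comm, rc

    ! Get the current ESMF virtual machine, then ask it for the underlying MPI communicator,
    ! from which the Bcast sub-communicators can later be built.
    call ESMF_VMGetCurrent(vm=vm, rc=rc)
    call ESMF_VMGet(vm, mpiCommunicator=fcst_mpi_comm, rc=rc)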
Contributor

Thank you for bringing attention to this problem; glad to see that there is a solution. This idea could be useful for a lot of the tasks in creating and filling grids, especially nests.

I believe there should be FMS calls that will allow us to do the communications directly between processors instead of using low-level MPI routines or ESMF calls, which are not part of FV3. @bensonr or @laurenchilutti will know for sure; if we don't have these, please open an issue in FMS.

@dkokron (Author) commented May 16, 2023

FMS does have a similar call (mpp_broadcast), but, at least the way I read the code, it creates the communicator on the fly every time it's called. My testing shows that the creation of the communicator takes about the same amount of time as the actual Bcast. That's why I create the communicator in fv_control_init(). Some numbers from an offline reproducer run on acorn follow. The offline code creates the MPI communicator every iteration. The first run takes a long time (in cycles), probably due to MPI initialization. "setup subcomm" refers to the call to MPI_Comm_split()

Cycles to setup subcomm: 845895398
Cycles Bcast with a comm: 352677465

Subsequent calls within the same execution settle into a nice pattern.
Cycles to setup subcomm: 35054707
Cycles Bcast with a comm: 3219772

Cycles to setup subcomm: 1966522
Cycles Bcast with a comm: 3727102

Cycles to setup subcomm: 2671380
Cycles Bcast with a comm: 3211785

Cycles to setup subcomm: 1973183
Cycles Bcast with a comm: 3195247

Cycles to setup subcomm: 1884240
Cycles Bcast with a comm: 3205283

Cycles to setup subcomm: 2135385
Cycles Bcast with a comm: 3193492

Cycles to setup subcomm: 2105010
Cycles Bcast with a comm: 3203685
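
For reference, a minimal sketch of the create-once-and-reuse idea these numbers are measuring. The names here are hypothetical; the PR itself stores the resulting communicator in the fv_atmos_type structure.

    ! One-time setup, e.g. in fv_control_init(): split the forecast communicator so that the
    ! sending rank plus the nest ranks form their own communicator.
    color = MPI_UNDEFINED
    if (i_am_sender .or. i_am_nest_rank) color = 1
    key = merge(0, 1, i_am_sender)   ! make the sending rank rank 0 in the new communicator
    call MPI_Comm_split(fcst_mpi_comm, color, key, bcast_comm, ierr)

    ! Every coupling step: reuse the stored communicator.  Per the cycle counts above, creating
    ! the communicator on every call would cost roughly as much as the broadcast itself.
    if (bcast_comm /= MPI_COMM_NULL) then
      call MPI_Bcast(g_dat, size(g_dat), MPI_REAL, 0, bcast_comm, ierr)
    endif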

Contributor

@dkokron - please send me your test program so I can confirm your findings with mpp_broadcast. As I look at the code, if the pelist has been defined previously, it should be using a predefined communicator. A new communicator is defined only when a new, unique pelist is encountered.

Author

It's on cactus. I will upload when cactus comes back from dedicated time.

Perhaps then it's enough to create a pelist with the appropriate PEs in fv_control_init() instead of an MPI_Comm.

Contributor

Thanks for letting me know about the cactus downtime.

There should be no need to create the pelist in fv_control. It can be defined/used within the fill_nested_grid_cpl function. The call to get_peset here will execute this code in mpp_util_mpi.inc. If the pelist is recognized, it will return the mpp-managed internal list location and mpp_broadcast will use the communicator associated with that list entry. If the pelist isn't recognized, it will add it to the linked list of known pelists and create a communicator via MPI_Comm_create_group. If you want to look over the code with me, let me know and we can schedule it.
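
In code, the suggestion amounts to something like the following sketch, with hypothetical names; nest_pelist would hold the sending PE plus the nest PEs, and the PR's eventual diff further down does essentially this.

    use mpp_mod, only: mpp_broadcast, mpp_pe

    ! On the first call with this pelist, get_peset() creates and caches a communicator;
    ! subsequent calls with the same pelist reuse the cached one.
    if (any(mpp_pe() == nest_pelist)) then
      call mpp_broadcast(g_dat, size(g_dat), sending_pe, nest_pelist)
    endif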

@dkokron (Author) commented May 16, 2023

I see your point about being able to use get_peset() to create the communicator once, store it in the peset structure, and then have mpp_broadcast() reuse that communicator. My implementation does essentially the same thing while storing the communicator in the fv_atmos_type structure.

I'm a little worried about the extra work done in the MPP approach. Every call to get_peset() will check for monotonicity and check for a match to an existing peset. I'll see if I can code up the MPP approach in my unit tester and determine if my worry is justified.

Author

My worry about the extra work was unfounded. I will re-code the PR using mpp_broadcast.

You can find my unit testers under clogin02:/u/daniel.kokron/Bcast/
The testers simulate what happens in a HAFS case with one nest run on 736 ranks.

Some timings:
grep Cycles Bcast.o58131230

Using MPI_Bcast with a pre-made communicator (C implementation)
Cycles to setup subcomm: 789298313
Cycles Bcast with a comm: 350650485
Cycles Bcast with a comm: 96031373
Cycles Bcast with a comm: 3138885
Cycles Bcast with a comm: 3160485
Cycles Bcast with a comm: 3150698
Cycles Bcast with a comm: 3127387
Cycles Bcast with a comm: 3128850
Cycles Bcast with a comm: 3114293
Cycles Bcast with a comm: 3123540
Cycles Bcast with a comm: 3147165
Cycles Bcast with a comm: 3136793
Cycles Bcast with a comm: 3124845
Cycles Bcast with a comm: 3137737
Cycles Bcast with a comm: 3132450
Cycles Bcast with a comm: 3156840

Using MPI_Bcast with a pre-made communicator (Fortran implementation)
Cycles to setup subcomm 136254240
Cycles rank/thread 332684775
Cycles rank/thread 4624537
Cycles rank/thread 3276248
Cycles rank/thread 3257415
Cycles rank/thread 3283357
Cycles rank/thread 3271545
Cycles rank/thread 3424860
Cycles rank/thread 3280073
Cycles rank/thread 3279398
Cycles rank/thread 3288915
Cycles rank/thread 3359723
Cycles rank/thread 3296475
Cycles rank/thread 3246457
Cycles rank/thread 3312000
Cycles rank/thread 3278633

Using mpp_broadcast()
Cycles rank/thread 548630797
Cycles rank/thread 3106642
Cycles rank/thread 3273503
Cycles rank/thread 3308108
Cycles rank/thread 3264412
Cycles rank/thread 3284077
Cycles rank/thread 3473212
Cycles rank/thread 3260768
Cycles rank/thread 3294653
Cycles rank/thread 3311325
Cycles rank/thread 3297983
Cycles rank/thread 3296722
Cycles rank/thread 3306937
Cycles rank/thread 3325117
Cycles rank/thread 3270375

Contributor

@dkokron - I appreciate your willingness to work with us on this.

Author

Never easy.

Swapping in mpp_broadcast in place of MPI_Bcast in fill_nested_grid_cpl() results in a runtime failure.

   subroutine fill_nested_grid_cpl(this_grid, proc_in)
-    use mpi
+    !use mpi
+    use mpp_mod, only: mpp_broadcast
     integer, intent(in) :: this_grid
     logical, intent(in), optional :: proc_in
 
@@ -2460,7 +2461,8 @@ contains
                             g_dat(isg:,jsg:,1), position=CENTER)
     endif
     if(any(mpp_pe() == Atm(this_grid)%Bcast_ranks)) then
-      call MPI_Bcast(g_dat, size(g_dat), MPI_REAL, 0, Atm(this_grid)%Bcast_comm, ierr) ! root==0 because sending rank (Atm(this_grid)%sending) is rank zero in Bcast_comm
+      !call MPI_Bcast(g_dat, size(g_dat), MPI_REAL, 0, Atm(this_grid)%Bcast_comm, ierr) ! root==0 because sending rank (Atm(this_grid)%sending) is rank zero in Bcast_comm
+      call mpp_broadcast(g_dat, size(g_dat),Atm(this_grid)%sending_proc,Atm(this_grid)%Bcast_ranks)
     endif
     call timing_off('COMM_TOTAL')
     if (process) then

The stack trace points to a part of the code I didn't change.

nid001001.acorn.wcoss2.ncep.noaa.gov 0: NOTE from PE 0: eval_move_nest: move_cd_x= 0 move_cd_y= 0 do_move= F
nid001001.acorn.wcoss2.ncep.noaa.gov 0: NOTE from PE 0: Opened BC file: INPUT/gfs_bndy.tile7.003.nc
nid001001.acorn.wcoss2.ncep.noaa.gov 0: forrtl: error (78): process killed (SIGTERM)
nid001001.acorn.wcoss2.ncep.noaa.gov 0: Image PC Routine Line Source
nid001001.acorn.wcoss2.ncep.noaa.gov 0: fv3_32bit.exe 000000000258B397 fv_regional_mod_m 3387 fv_regional_bc.F90
nid001001.acorn.wcoss2.ncep.noaa.gov 0: fv3_32bit.exe 000000000257BCD5 fv_regional_mod_m 1792 fv_regional_bc.F90
nid001001.acorn.wcoss2.ncep.noaa.gov 0: fv3_32bit.exe 000000000259E347 fv_regional_mod_m 1592 fv_regional_bc.F90
nid001001.acorn.wcoss2.ncep.noaa.gov 0: fv3_32bit.exe 000000000291AABA atmosphere_mod_mp 673 atmosphere.F90

@lharris4 (Contributor) commented May 23, 2023 via email

@dkokron (Author) commented May 23, 2023

I was able to get it working with mpp_broadcast() by playing games with mpp_set_current_pelist(). I'm running the regression suite now and will redo some of the performance testing to make sure it still helps.
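
For the record, a guess at what that workaround might look like; this is not the author's actual change, and the pelist names are hypothetical.

    use mpp_mod, only: mpp_broadcast, mpp_pe, mpp_npes, &
                       mpp_get_current_pelist, mpp_set_current_pelist

    integer, allocatable :: save_pelist(:)

    ! Remember whichever pelist is currently active.
    allocate(save_pelist(mpp_npes()))
    call mpp_get_current_pelist(save_pelist)

    ! Only the PEs in the broadcast group switch pelists, broadcast, and then switch back.
    ! no_sync avoids a barrier that would wait on PEs that never switch.
    if (any(mpp_pe() == bcast_pelist)) then
      call mpp_set_current_pelist(bcast_pelist, no_sync=.true.)
      call mpp_broadcast(g_dat, size(g_dat), sending_pe, bcast_pelist)
      call mpp_set_current_pelist(save_pelist, no_sync=.true.)
    endif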

@lharris4 (Contributor) commented May 23, 2023 via email

@dkokron (Author) commented May 24, 2023

I gathered more timings and updated the table. Using MPI_Bcast directly is significantly faster than using mpp_broadcast(). Please review the code and let me know if I'm using MPP incorrectly.

@bensonr (Contributor) commented May 24, 2023

@dkokron - what timings did you get using mpp_broadcast, for comparison with the data in the PR description?

@dkokron (Author) commented May 24, 2023 via email

@dkokron (Author) commented May 29, 2023

Given the significant difference in timings between scenarios 3 (mpp_broadcast) and 4 (MPI_Bcast), which proposal should we move forward with?

@lharris4 (Contributor)

Hello @dkokron. @bensonr and I are discussing how to move forward. There may be an underlying problem in the way the SST is being gathered in the first place that isn't addressed by this fix.

@dkokron (Author) commented May 31, 2023 via email

@bensonr (Contributor) commented Jun 1, 2023 via email

@dkokron (Author) commented Jun 1, 2023 via email

@bensonr (Contributor) commented Jun 1, 2023 via email

@dkokron (Author) commented Jun 1, 2023 via email

@bensonr (Contributor) commented Jun 2, 2023 via email

@dkokron (Author) commented Jun 2, 2023 via email

@dkokron (Author) commented Jun 2, 2023 via email

@bensonr (Contributor) commented Jun 2, 2023 via email

@dkokron (Author) commented Jun 2, 2023 via email

@bensonr (Contributor) commented Jun 2, 2023

Unfortunately, GitHub blocked the addresses you just included. I was planning to send it to your NOAA account, if that's okay with you.

@dkokron (Author) commented Jun 2, 2023 via email

@dkokron (Author) commented Jun 2, 2023 via email

@dkokron (Author) commented Jun 6, 2023

Submitted NOAA-GFDL/FMS#1246 to address the failure seen above when using mpp_broadcast().

Please consider using the pure MPI approach to address this performance issue.
We can revisit mpp if that PR is accepted, released, and approved by NOAA.

@uturuncoglu

@junwang-noaa We are not using FMS for I/O anymore. We switched to ESMF multi-tile support for I/O. This will be available once we merge ufs-community/ufs-weather-model#1845.

@bensonr (Contributor) commented Aug 15, 2023

@junwang-noaa - I know this is a little off topic for the dycore repo here... While you are working through that list, you still have the ability to compile in the deprecated I/O modules via the -Duse_deprecated_io flag (see the FMS release notes). We test MOM6 as part of our FMS release testing, and there are some updates from the main trunk that will need to be pulled into the ufs branch to resolve any issues.

@junwang-noaa (Collaborator)

Rusty, thanks for the information!

@dkokron (Author) commented Aug 23, 2023

As far as GFDL_atmos_cubed_sphere is concerned, is this PR fit to move forward?

@laurenchilutti (Contributor)

As long as FMS is compiled with the flag -Duse_deprecated_io, this PR is fit to move forward. @jkbk2004, as UFS code manager, when do you think we can have this PR merged?

@jkbk2004

We are working on the spack-stack PR. For FMS 2023.02, the JEDI team needs to use -Duse_deprecated_io. A bit of coordination and testing is needed to confirm the test results. I think we can schedule the commit for next week or so, but I will keep checking as testing goes on.

@dkokron (Author) commented Aug 23, 2023

In that case, should I extract the FMS-2023.02 porting modifications from my branch?

@laurenchilutti (Contributor)

In that case, should I extract the FMS-2023.02 porting modifications from my branch?

I see no harm in keeping these changes in.

@@ -1843,6 +1848,7 @@ end subroutine allocate_fv_atmos_type

!>@brief The subroutine 'deallocate_fv_atmos_type' deallocates the fv_atmos_type.
subroutine deallocate_fv_atmos_type(Atm)
use mpi
Contributor

Is this still needed?

Author

Not needed anymore since we are using mpp_broadcast() now instead of MPI_Bcast().

Contributor

Would like to see this use statement removed prior to accepting.

Author

Removed unneeded use of MPI module in model/fv_arrays.F90

Resolved review threads: tools/fv_grid_tools.F90, model/fv_control.F90
@dkokron (Author) commented Sep 29, 2023 via email

@laurenchilutti (Contributor)

@bensonr are there any more changes that you would like to see in this PR?

@dkokron (Author) commented Nov 7, 2023

The FMS modification that supports this PR was merged. See details at NOAA-GFDL/FMS#1246

@bensonr (Contributor) commented Nov 8, 2023

@dkokron - at the bi-weekly UFS code managers meeting we have been telling them this is ready to merge; they have not yet blessed it.

@laurenchilutti (Contributor)

The UFS code managers will be ready to merge this change in the coming weeks. @dkokron could you create a fv3atm PR and UFS-weather-model PR for this change?

@dkokron (Author) commented Dec 15, 2023

@laurenchilutti Will do.

@dkokron (Author) commented Dec 20, 2023

Opened PR against fv3atm
NOAA-EMC/fv3atm#743

@dkokron (Author) commented Dec 20, 2023

Opened PR with UFS
ufs-community/ufs-weather-model#2059

…stedGridCpl

Using MPI broadcast instead of multiple point-to-point messages to improve performance
@FernandoAndrade-NOAA

Testing on #2059 has completed successfully; please proceed with the merge process for this PR.

@zach1221

Testing on #2059 has completed successfully; please proceed with the merge process for this PR.

Copying in @laurenchilutti as well for this request.

@jkbk2004

@bensonr can you merge this PR?

@bensonr merged commit 65301dd into NOAA-GFDL:dev/emc on Jan 22, 2024