From 9c342a5f5841656629d85589fec5507657a561f7 Mon Sep 17 00:00:00 2001 From: "Timothy S. Sliwinski" Date: Tue, 15 Aug 2023 00:23:30 +0000 Subject: [PATCH 1/3] Adding OpenACC statements to accelerate MYNN surface scheme performance through GPU offloading Overview: --------- With very minimal changes to the original code of the scheme, the MYNN surface scheme has been enhanced with OpenACC statements which introduce the capability for offloading computational execution to OpenACC-supported accelerator devices (e.g. Nvidia GPUs). Since the scheme operates by looping multiple times over independent vertical columns, the overall computational strategy maps well to GPU hardware where multiple iterations of each loop can be run in parallel with SIMD methods. Data movement has been optimized to ensure data transfers from host memory to device memory are limited as data movement is a significant source of latency when performing offloading to accelerator devices. Performance increases on a GPU ranged from a 3.3x slowdown to a 41.9x speedup versus CPU execution (See the Performance section for more information). MYNN Scheme Code Changes: ------------------------- A few minor code changes were unavoidable due to certain limitations on what OpenACC is able to execute on the accelerator within kernel and parallel blocks. A complete list of these changes is below: 1. Adding preprocessor directives to disable multiple standard output statements, including those used for debug output. The challenges of these are different depending on the view from the host or accelerator. When run in parallel on the accelerator, these statements are not guaranteed to be presented to the user in-order. Also, in limited cases, these statements would have to output variables that were not transferred to the GPU because they were not necessary for computation, introducing additional transfer overhead to ensure they were present only for these output statements. Further, with hundreds of threads executing at once, the output could be quite large and unwieldy. That said, some of these statements could have been run on the host to alleviate the problems introduced by parallelization on the device. However, this would have necessitated device-to-host transfers of variables to ensure values being output were correct while introducing additional transfer overhead costs to performance. Disabling these for accelerators only seemed the best course of action. These are disabled based on the presence of the __OPENACC compile time variable to ensure these are only disabled when the code is compiled for accelerator usage and does not affect CPU execution. 2. Changing the CCPP errmsg variable declaration on line 349 of module_sf_mynn.F90 to be a fixed 200 length character array. Since this variable is set at times in the middle of accelerator kernel regions, it must be present on the accelerator. However, when defined with "len=*", it is an assumed-size array, which OpenACC does not support on the accelerator. Rather than disable this variable completely, changing it to a fixed length allows it to be transferred to/from the accelerator and used. This change is enforced by preprocessor directives based on the presence of the __OPENACC compile time variable and ensures this change only occurs when the code is compiled for accelerator usage, therefore it does not affect CPU execution. 3. Adding preprocessor directives to "move" return statement on line 1399 of module_sf_mynn.F90 out of the main i-loop and instead execute it at line 2006 if errflg is set to 1. This change is necessary as OpenACC accelerator code cannot execute branching such as this, so this conditional return statement can only be executed by the host. This change is enforced by preprocessor directives based on the presence of the __OPENACC compile time variable and ensures this change only occurs when the code is compiled for accelerator usage, therefore it does not affect CPU execution. 4. Commenting out the zLhux local variable in the zolri function over lines 3671 to 3724. The zLhux variable appears to have been used only to capture values of zolri over multiple iterations, but is never used or passed along after this collection is completed. Since this array would be an assumed-size array based on the value of nmax at runtime, it would have been unsupported by OpenACC. But, since it is unused, the choice was made to simply comment out the variable and all lines related to it, allow the remaining code of the function to executed on the accelerator. Performance: ------------ Performance testing was performed on a single Nvidia P100 GPU versus a single 10-core Haswell CPU on Hera. Since the MYNN Surface scheme is a serial code, parallelization on the 10-core Haswell was performed using simple data partitioning across the 10 cores using OpenMP threads such that each thread received a near equal amount of data. When data movement was fully optimized for the accelerator -- meaning all CCPP Physics input variables were pre-loaded on the GPU as they would be when the CCPP infrastructure fully supports accelerator offloading -- GPU performance speedups range between 11.8X and 41.8X over the 10-core Haswell when the number of vertical columns (i) was varied between 150k and 750k, respectively. Performance Timings (optimized data movement) Columns (i) \ Compute | CPU | GPU | GPU Speedup | --------------------------------------------------------------------------- 150,000 | 263 ms | 22 ms | 11.9x | --------------------------------------------------------------------------- 450,000 | 766 ms | 28 ms | 27.0x | --------------------------------------------------------------------------- 750,000 | 1314 ms | 31 ms | 41.9x | --------------------------------------------------------------------------- However, standalone performance -- meaning all CCPP Physics input variables were initially loaded onto the GPU only after being declared in the MYNN subroutine calls -- was slightly less performant than the 10-core Haswell due to the additional overhead incurred by the data transfers. In this case, the decreasing performance lag for the GPU behind the CPU as the number of columns increases is due to the GPU performing better with more data (i.e. higher computational throughput) than the CPU despite more data needing to be transferred to the device. Performance Timings (standalone) Columns (i) \ Compute | CPU | GPU | GPU Speedup | ------------------------------------------------------------------------------- 150,000 | 263 ms | 862 ms | -3.3x | ------------------------------------------------------------------------------- 450,000 | 766 ms | 1767 ms | -2.3x | ------------------------------------------------------------------------------- 750,000 | 1314 ms | 2776 ms | -2.1x | ------------------------------------------------------------------------------- With these results, it is clear that this scheme will perform at its best on accelerators once the CCPP infrastructure also supports OpenACC. Contact Information: -------------------- This enhancement was performed by Timothy Sliwinski at NOAA GSL. Questions regarding these changes should be directed to timothy.s.sliwinski@noaa.gov --- physics/module_sf_mynn.F90 | 331 ++++++++++++++++++++++++++++++++++-- physics/mynnsfc_wrapper.F90 | 31 ++++ 2 files changed, 351 insertions(+), 11 deletions(-) diff --git a/physics/module_sf_mynn.F90 b/physics/module_sf_mynn.F90 index c60247cf6..399b1ee83 100644 --- a/physics/module_sf_mynn.F90 +++ b/physics/module_sf_mynn.F90 @@ -106,6 +106,7 @@ MODULE module_sf_mynn REAL(kind_phys), DIMENSION(0:1000 ),SAVE :: psim_stab,psim_unstab, & psih_stab,psih_unstab +!$acc declare create(psim_stab, psim_unstab, psih_stab, psih_unstab) CONTAINS @@ -344,7 +345,12 @@ SUBROUTINE SFCLAY_mynn( & & qsfc_wat, qsfc_lnd, qsfc_ice ! CCPP error handling +#ifndef _OPENACC character(len=*), intent(inout) :: errmsg +#else +!Necessary since OpenACC does not support assumed-size arrays + character(len=200), intent(inout) :: errmsg +#endif integer, intent(inout) :: errflg !ADDITIONAL OUTPUT @@ -371,6 +377,20 @@ SUBROUTINE SFCLAY_mynn( & errflg = 0 errmsg = '' +!$acc enter data copyin( dz8w,U3D,V3D,QV3D,QC3D,P3D,T3D, & +!$acc pattern_spp_sfc, errmsg) + +!$acc enter data copyin( UST_WAT(:), UST_LND(:), UST_ICE(:), & +!$acc MOL(:), QFLX(:), HFLX(:), & +!$acc QSFC(:), QSFC_WAT(:), QSFC_LND(:), & +!$acc QSFC_ICE(:)) + +!$acc enter data create( dz8w1d(:), dz2w1d(:), U1D(:), & +!$acc V1D(:), U1D2(:), V1D2(:), & +!$acc QV1D(:), QC1D(:), P1D(:), & +!$acc T1D(:), rstoch1D(:), qstar(:)) + + IF (debug_code >= 1) THEN write(*,*)"======= printing of constants:" write(*,*)"cp=", cp," g=", grav @@ -382,6 +402,10 @@ SUBROUTINE SFCLAY_mynn( & itf=ite !MIN0(ite,ide-1) ktf=kte !MIN0(kte,kde-1) +!$acc parallel loop present(dz8w,U3D,V3D,QV3D,QC3D,P3D,T3D, & +!$acc pattern_spp_sfc,dz8w1d,dz2w1d,U1D, & +!$acc V1D,U1D2,V1D2,QV1D,QC1D,P1D,T1D, & +!$acc rstoch1D,qstar) DO i=its,ite dz8w1d(I) = dz8w(i,kts) dz2w1d(I) = dz8w(i,kts+1) @@ -403,6 +427,9 @@ SUBROUTINE SFCLAY_mynn( & ENDDO IF (itimestep==1 .AND. iter==1) THEN +!$acc parallel loop present(U1D,V1D,UST_WAT,UST_LND,UST_ICE,MOL, & +!$acc QFLX,HFLX,QV3D,QSFC,QSFC_WAT, & +!$acc QSFC_LND,QSFC_ICE) DO i=its,ite IF (.not. flag_restart) THEN !Everything here is used before calculated @@ -432,6 +459,9 @@ SUBROUTINE SFCLAY_mynn( & ENDDO ENDIF +!$acc exit data delete( dz8w,U3D,V3D,QV3D,QC3D,P3D,T3D, & +!$acc pattern_spp_sfc, QC1D) + CALL SFCLAY1D_mynn(flag_iter, & J,U1D,V1D,T1D,QV1D,P1D,dz8w1d, & U1D2,V1D2,dz2w1d, & @@ -471,6 +501,16 @@ SUBROUTINE SFCLAY_mynn( & its,ite, jts,jte, kts,kte, & errmsg, errflg ) +!$acc exit data copyout( UST_WAT(:), UST_LND(:), UST_ICE(:), & +!$acc MOL(:), QFLX(:), HFLX(:), & +!$acc QSFC(:), QSFC_WAT(:), QSFC_LND(:), & +!$acc QSFC_ICE(:), errmsg) + +!$acc exit data delete( dz8w1d(:), dz2w1d(:), U1D(:), & +!$acc V1D(:), U1D2(:), V1D2(:), & +!$acc QV1D(:), T1D(:), P1D(:), & +!$acc rstoch1D(:), qstar(:)) + END SUBROUTINE SFCLAY_MYNN !------------------------------------------------------------------- @@ -626,7 +666,12 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & !JOE-end ! CCPP error handling +#ifndef _OPENACC character(len=*), intent(inout) :: errmsg +#else +! Necessary since OpenACC does not support assumed-size arrays + character(len=200), intent(inout) :: errmsg +#endif integer, intent(inout) :: errflg !---------------------------------------------------------------- @@ -679,6 +724,58 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & errflg = 0 errmsg = '' !------------------------------------------------------------------- +!$acc update device(psim_stab, psim_unstab, psih_stab, psih_unstab) + +!$acc enter data create( ZA, ZA2, THV1D, TH1D, TC1D, TV1D, & +!$acc RHO1D, QVSH, PSIH2, PSIM10, PSIH10, WSPDI, & +!$acc GOVRTH, PSFC, THCON, & +!$acc zratio_lnd, zratio_ice, zratio_wat, & +!$acc TSK_lnd, TSK_ice, TSK_wat, & +!$acc THSK_lnd, THSK_ice, THSK_wat, & +!$acc THVSK_lnd, THVSK_ice, THVSK_wat, & +!$acc GZ1OZ0_lnd, GZ1OZ0_ice, GZ1OZ0_wat, & +!$acc GZ1OZt_lnd, GZ1OZt_ice, GZ1OZt_wat, & +!$acc GZ2OZ0_lnd, GZ2OZ0_ice, GZ2OZ0_wat, & +!$acc GZ2OZt_lnd, GZ2OZt_ice, GZ2OZt_wat, & +!$acc GZ10OZ0_lnd, GZ10OZ0_ice, GZ10OZ0_wat, & +!$acc GZ10OZt_lnd, GZ10OZt_ice, GZ10OZt_wat, & +!$acc ZNTstoch_lnd, ZNTstoch_ice, ZNTstoch_wat, & +!$acc ZT_lnd, ZT_ice, ZT_wat, & +!$acc ZQ_lnd, ZQ_ice, ZQ_wat, & +!$acc PSIQ_lnd, PSIQ_ice, PSIQ_wat, & +!$acc PSIQ2_lnd, PSIQ2_ice, PSIQ2_wat, & +!$acc QSFCMR_lnd, QSFCMR_ice, QSFCMR_wat ) + +!$acc enter data copyin(flag_iter, dry, wet, icy, CPM, MAVAIL, & +!$acc QFX, FLHC, FLQC, CHS, CH, CHS2, CQS2, USTM, & +!$acc HFX, LH, wstar, qstar, PBLH, ZOL, MOL, RMOL, & +!$acc T2, TH2, Q2, QV1D, PSFCPA, & +!$acc WSPD, U10, V10, U1D, V1D, U1D2, V1D2, & +!$acc T1D, P1D, rstoch1D, sigmaf, & +!$acc shdmax, vegtype, z0pert, ztpert, dx, QGH, & +!$acc dz2w1d, dz8w1d, & +!$acc stress_wat, stress_lnd, stress_ice, & +!$acc rb_wat, rb_lnd, rb_ice, & +!$acc tskin_wat, tskin_lnd, tskin_ice, & +!$acc tsurf_wat, tsurf_lnd, tsurf_ice, & +!$acc psim, psih, & +!$acc UST_wat, UST_lnd, UST_ice, & +!$acc ZNT_wat, ZNT_lnd, ZNT_ice, & +!$acc QSFC, QSFC_lnd, QSFC_wat, QSFC_ice, & +!$acc QFLX, QFLX_lnd, QFLX_wat, QFLX_ice, & +!$acc HFLX, HFLX_lnd, HFLX_wat, HFLX_ice, & +!$acc PSIX_wat, PSIX_lnd, PSIX_ice, & +!$acc PSIX10_wat, PSIX10_lnd, PSIX10_ice, & +!$acc PSIT2_lnd, PSIT2_wat, PSIT2_ice, & +!$acc PSIT_lnd, PSIT_wat, PSIT_ice, & +!$acc ch_lnd, ch_wat, ch_ice, & +!$acc cm_lnd, cm_wat, cm_ice, & +!$acc snowh_lnd, errmsg) + +!$acc parallel loop present(PSFCPA, PSFC, QSFC, T1D, flag_iter, & +!$acc QSFC_wat, QSFCMR_wat, wet, TSK_wat, tskin_wat, & +!$acc QSFC_lnd, QSFCMR_lnd, dry, TSK_lnd, tskin_lnd, & +!$acc QSFC_ice, QSFCMR_ice, icy, TSK_ice, tskin_ice) DO I=its,ite ! PSFC ( in cmb) is used later in saturation checks @@ -700,7 +797,9 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ENDIF QSFC_wat(I)=EP2*E1/(PSFC(I)-ep3*E1) !specific humidity QSFCMR_wat(I)=EP2*E1/(PSFC(I)-E1) !mixing ratio +#ifndef _OPENACC IF(QSFC_wat(I)>1..or.QSFC_wat(I)<0.) print *,' QSFC_wat(I)',itimestep,i,QSFC_wat(I),TSK_wat(i) +#endif ENDIF IF (dry(i)) THEN TSK_lnd(I) = tskin_lnd(i) @@ -720,7 +819,9 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & QSFC_lnd(I)=0.5*(QSFC_lnd(I) + QSFC(I)) QSFCMR_lnd(I)=QSFC_lnd(I)/(1.-QSFC_lnd(I)) !mixing ratio endif ! lsm +#ifndef _OPENACC IF(QSFC_lnd(I)>1..or.QSFC_lnd(I)<0.) print *,' QSFC_lnd(I)',itimestep,i,QSFC_lnd(I),Tskin_lnd(i),tsurf_lnd(i),qsfc(i) +#endif ENDIF IF (icy(i)) THEN TSK_ice(I) = tskin_ice(i) @@ -738,7 +839,9 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & QSFC_ice(I)=EP2*E1/(PSFC(I)-ep3*E1) !specific humidity QSFCMR_ice(I)=EP2*E1/(PSFC(I)-E1) !mixing ratio endif ! lsm +#ifndef _OPENACC IF(QSFC_ice(I)>1..or.QSFC_ice(I)<0.) print *,' QSFC_ice(I)',itimestep,i,QSFC_ice(I),TSK_ice(i) +#endif ENDIF ELSE @@ -791,6 +894,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & endif ! flag_iter ENDDO +#ifndef _OPENACC IF (debug_code >= 1) THEN write(0,*)"ITIMESTEP=",ITIMESTEP," iter=",iter DO I=its,ite @@ -815,7 +919,12 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ENDIF ENDDO ENDIF +#endif +!$acc parallel loop present(PSFC, PSFCPA, QVSH, QV1D, THCON, flag_iter, & +!$acc dry, tskin_lnd, TSK_lnd, tsurf_lnd, THSK_lnd, THVSK_lnd, qsfc_lnd, & +!$acc icy, tskin_ice, TSK_ice, tsurf_ice, THSK_ice, THVSK_ice, qsfc_ice, & +!$acc wet, tskin_wat, TSK_wat, tsurf_wat, THSK_wat, THVSK_wat) DO I=its,ite ! PSFC ( in cmb) is used later in saturation checks PSFC(I)=PSFCPA(I)/1000. @@ -829,8 +938,10 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ! CONVERT SKIN TEMPERATURES TO POTENTIAL TEMPERATURE: THSK_lnd(I) = TSK_lnd(I)*THCON(I) !(K) THVSK_lnd(I) = THSK_lnd(I)*(1.+EP1*qsfc_lnd(I)) +#ifndef _OPENACC if(THVSK_lnd(I) < 170. .or. THVSK_lnd(I) > 360.) & print *,'THVSK_lnd(I)',itimestep,i,THVSK_lnd(I),THSK_lnd(i),tsurf_lnd(i),tskin_lnd(i),qsfc_lnd(i) +#endif endif if(icy(i)) then TSK_ice(I) = tskin_ice(i) @@ -838,8 +949,10 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ! CONVERT SKIN TEMPERATURES TO POTENTIAL TEMPERATURE: THSK_ice(I) = TSK_ice(I)*THCON(I) !(K) THVSK_ice(I) = THSK_ice(I)*(1.+EP1*qsfc_ice(I)) !(K) +#ifndef _OPENACC if(THVSK_ice(I) < 170. .or. THVSK_ice(I) > 360.) & print *,'THVSK_ice(I)',itimestep,i,THVSK_ice(I),THSK_ice(i),tsurf_ice(i),tskin_ice(i),qsfc_ice(i) +#endif endif if(wet(i)) then TSK_wat(I) = tskin_wat(i) @@ -847,24 +960,29 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ! CONVERT SKIN TEMPERATURES TO POTENTIAL TEMPERATURE: THSK_wat(I) = TSK_wat(I)*THCON(I) !(K) THVSK_wat(I) = THSK_wat(I)*(1.+EP1*QVSH(I)) !(K) +#ifndef _OPENACC if(THVSK_wat(I) < 170. .or. THVSK_wat(I) > 360.) & print *,'THVSK_wat(I)',i,THVSK_wat(I),THSK_wat(i),tsurf_wat(i),tskin_wat(i),qsfc_wat(i) +#endif endif endif ! flag_iter ENDDO +!$acc parallel loop present(TH1D, T1D, P1D, TC1D) DO I=its,ite ! CONVERT LOWEST LAYER TEMPERATURE TO POTENTIAL TEMPERATURE: TH1D(I)=T1D(I)*(100000./P1D(I))**ROVCP !(Theta, K) TC1D(I)=T1D(I)-273.15 !(T, Celsius) ENDDO +!$acc parallel loop present(THV1D, TH1D, QVSH, TV1D, T1D) DO I=its,ite ! CONVERT TO VIRTUAL TEMPERATURE THV1D(I)=TH1D(I)*(1.+EP1*QVSH(I)) !(K) TV1D(I)=T1D(I)*(1.+EP1*QVSH(I)) !(K) ENDDO +!$acc parallel loop present(RHO1D, P1D, TV1D, TH1D, ZA, ZA2, dz2w1d, dz8w1d, GOVRTH) DO I=its,ite RHO1D(I)=P1D(I)/(Rd*TV1D(I)) !now using value calculated in sfc driver ZA(I)=0.5*dz8w1d(I) !height of first half-sigma level @@ -873,11 +991,13 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ENDDO !tgs - should QFX and HFX be separate for land, ice and water? +!$acc parallel loop present(QFX, QFLX, RHO1D, HFX, HFLX) DO I=its,ite QFX(i)=QFLX(i)*RHO1D(I) HFX(i)=HFLX(i)*RHO1D(I)*cp ENDDO +#ifndef _OPENACC IF (debug_code ==2) THEN !write(*,*)"ITIMESTEP=",ITIMESTEP DO I=its,ite @@ -890,7 +1010,9 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & write(*,*)"RHO1D=", RHO1D(i)," GOVRTH=",GOVRTH(i) ENDDO ENDIF +#endif +!$acc parallel loop present(T1D,P1D,QGH,QV1D,CPM) DO I=its,ite ! QGH CHANGED TO USE LOWEST-LEVEL AIR TEMP ! Q2SAT = QGH IN LSM @@ -908,6 +1030,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & CPM(I)=CP*(1.+0.84*QV1D(I)) ENDDO +#ifndef _OPENACC IF (debug_code == 2) THEN write(*,*)"ITIMESTEP=",ITIMESTEP DO I=its,ite @@ -925,7 +1048,13 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & endif ENDDO ENDIF +#endif +!$acc parallel loop present(flag_iter,U1D,V1D,WSPD,wet,dry,icy, & +!$acc THV1D,THVSK_wat,THVSK_lnd,THVSK_ice, & +!$acc hfx,RHO1D,qfx,WSTAR,pblh,dx,GOVRTH,ZA, & +!$acc TSK_wat,TSK_lnd,TSK_ice, & +!$acc rb_wat,rb_lnd,rb_ice) DO I=its,ite if( flag_iter(i) ) then ! DH* 20200401 - note. A weird bug in Intel 18 on hera prevents using the @@ -1041,6 +1170,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & WSPD(I) = MAX(WSPD_ice,WSPD_wat) WSPD(I) = MAX(WSPD_lnd,WSPD(I)) +#ifndef _OPENACC IF (debug_code == 2) THEN write(*,*)"===== After rb calc in mynn sfc layer:" write(*,*)"ITIMESTEP=",ITIMESTEP @@ -1049,6 +1179,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & IF (wet(i))write(*,*)"rb_wat=", rb_wat(I)," DTHVDZ=",DTHVDZ IF (dry(i))write(*,*)"rb_lnd=", rb_lnd(I)," DTHVDZ=",DTHVDZ ENDIF +#endif ! IF PREVIOUSLY UNSTABLE, DO NOT LET INTO REGIMES 1 AND 2 (STABLE) !if (itimestep .GT. 1) THEN @@ -1067,6 +1198,29 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & !-------------------------------------------------------------------- !-------------------------------------------------------------------- +!$acc parallel loop present(flag_iter, errmsg, & +!$acc wet, dry, icy, & +!$acc ZT_wat, ZT_lnd, ZT_ice, & +!$acc ZNT_wat, ZNT_lnd, ZNT_ice, & +!$acc ZNTstoch_wat, ZNTstoch_lnd, ZNTstoch_ice, & +!$acc UST_wat, UST_lnd, UST_ice, & +!$acc ZQ_wat, ZQ_lnd, ZQ_ice, & +!$acc snowh_lnd, & +!$acc THVSK_wat, THVSK_lnd, THVSK_ice, & +!$acc qsfc_wat, qsfc_lnd, qsfc_ice, & +!$acc GZ1OZ0_wat, GZ1OZt_wat, GZ2OZ0_wat, GZ2OZt_wat, GZ10OZ0_wat, GZ10OZt_wat, & +!$acc GZ1OZ0_lnd, GZ1OZt_lnd, GZ2OZ0_lnd, GZ2OZt_lnd, GZ10OZ0_lnd, GZ10OZt_lnd, & +!$acc GZ1OZ0_ice, GZ1OZt_ice, GZ2OZ0_ice, GZ2OZt_ice, GZ10OZ0_ice, GZ10OZt_ice, & +!$acc zratio_wat, zratio_lnd, zratio_ice, & +!$acc stress_wat, stress_lnd, stress_ice, & +!$acc rb_wat, rb_lnd, rb_ice, & +!$acc psim, psih, psim10, psih10, psih2, & +!$acc psix_wat, psix10_wat, psit_wat, psit2_wat, psiq_wat, psiq2_wat, & +!$acc psix_lnd, psix10_lnd, psit_lnd, psit2_lnd, psiq_lnd, psiq2_lnd, & +!$acc psix_ice, psix10_ice, psit_ice, psit2_ice, psiq_ice, psiq2_ice, & +!$acc WSPD, WSPDI, U1D, V1D, TC1D, THV1D, rstoch1D, USTM, ZA, ZOL, QVSH, & +!$acc shdmax, vegtype, z0pert, ztpert, mol, rmol, qstar, sigmaf) + DO I=its,ite if( flag_iter(i) ) then @@ -1082,10 +1236,12 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & if (sfc_z0_type >= 0) then ! Avoid calculation is using wave model ! CALCULATE z0 (znt) !-------------------------------------- +#ifndef _OPENACC IF (debug_code == 2) THEN write(*,*)"=============Input to ZNT over water:" write(*,*)"u*:",UST_wat(i)," wspd=",WSPD(i)," visc=",visc," za=",ZA(I) ENDIF +#endif IF ( PRESENT(ISFTCFLX) ) THEN IF ( ISFTCFLX .EQ. 0 ) THEN IF (COARE_OPT .EQ. 3.0) THEN @@ -1122,10 +1278,12 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ZNTstoch_wat(I) = ZNT_wat(I) endif +#ifndef _OPENACC IF (debug_code > 1) THEN write(*,*)"==========Output ZNT over water:" write(*,*)"ZNT:",ZNTstoch_wat(i) ENDIF +#endif !COMPUTE ROUGHNESS REYNOLDS NUMBER (restar) USING NEW ZNT ! AHW: Garrattt formula: Calculate roughness Reynolds number @@ -1136,10 +1294,12 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & !-------------------------------------- !CALCULATE z_t and z_q !-------------------------------------- +#ifndef _OPENACC IF (debug_code > 1) THEN write(*,*)"=============Input to ZT over water:" write(*,*)"u*:",UST_wat(i)," restar=",restar," visc=",visc ENDIF +#endif IF ( PRESENT(ISFTCFLX) ) THEN IF ( ISFTCFLX .EQ. 0 ) THEN @@ -1183,10 +1343,12 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & rstoch1D(i),spp_sfc) ENDIF ENDIF +#ifndef _OPENACC IF (debug_code > 1) THEN write(*,*)"=============Output ZT & ZQ over water:" write(*,*)"ZT:",ZT_wat(i)," ZQ:",ZQ_wat(i) ENDIF +#endif GZ1OZ0_wat(I)= LOG((ZA(I)+ZNTstoch_wat(I))/ZNTstoch_wat(I)) GZ1OZt_wat(I)= LOG((ZA(I)+ZNTstoch_wat(i))/ZT_wat(i)) @@ -1232,7 +1394,10 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ! or initialized to zero, but certainly not set correctly errmsg = 'Logic error: qstar is not set correctly when calling Yang_2008' errflg = 1 +#ifndef _OPENACC +! Necessary since OpenACC does not support branching in parallel code return +#endif CALL Yang_2008(ZNTSTOCH_lnd(i),ZT_lnd(i),ZQ_lnd(i),UST_lnd(i),MOL(I),& qstar(I),restar,visc) ELSEIF ( IZ0TLND .EQ. 3 ) THEN @@ -1249,6 +1414,8 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & UST_lnd(I),KARMAN,1.0_kind_phys,0,spp_sfc,rstoch1D(i)) ENDIF ENDIF + +#ifndef _OPENACC IF (ZNTstoch_lnd(i) < 1E-8 .OR. Zt_lnd(i) < 1E-10) THEN write(0,*)"===(land) capture bad input in mynn sfc layer, i=:",i write(0,*)" ZNT=", ZNTstoch_lnd(i)," ZT=",Zt_lnd(i) @@ -1257,7 +1424,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & " ust=", ust_lnd(i)," snowh=", snowh_lnd(i),"psfcpa=",PSFCPA(i), & " dz=",dz8w1d(i)," qflx=",qflx_lnd(i)," hflx=",hflx_lnd(i)," hpbl=",pblh(i) ENDIF - +#endif GZ1OZ0_lnd(I)= LOG((ZA(I)+ZNTstoch_lnd(I))/ZNTstoch_lnd(I)) GZ1OZt_lnd(I)= LOG((ZA(I)+ZNTstoch_lnd(i))/ZT_lnd(i)) @@ -1323,6 +1490,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ZOL(I)=MAX(ZOL(I),0.0_kind_phys) ZOL(I)=MIN(ZOL(I),20._kind_phys) +#ifndef _OPENACC IF (debug_code >= 1) THEN IF (ZNTstoch_wat(i) < 1E-8 .OR. Zt_wat(i) < 1E-10) THEN write(0,*)"===(wet) capture bad input in mynn sfc layer, i=:",i @@ -1333,6 +1501,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & " dz=",dz8w1d(i)," qflx=",qflx(i)," hflx=",hflx(i)," hpbl=",pblh(i) ENDIF ENDIF +#endif !Use Pedros iterative function to find z/L !zol(I)=zolri(rb_wat(I),ZA(I),ZNTstoch_wat(I),ZT_wat(I),ZOL(I),psi_opt) @@ -1390,6 +1559,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ZOL(I)=MAX(ZOL(I),-20.0_kind_phys) ZOL(I)=MIN(ZOL(I),0.0_kind_phys) +#ifndef _OPENACC IF (debug_code >= 1) THEN IF (ZNTstoch_wat(i) < 1E-8 .OR. Zt_wat(i) < 1E-10) THEN write(0,*)"===(wet) capture bad input in mynn sfc layer, i=:",i @@ -1400,6 +1570,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & " dz=",dz8w1d(i)," qflx=",qflx(i)," hflx=",hflx(i)," hpbl=",pblh(i) ENDIF ENDIF +#endif !Use Pedros iterative function to find z/L !zol(I)=zolri(rb_wat(I),ZA(I),ZNTstoch_wat(I),ZT_wat(I),ZOL(I),psi_opt) @@ -1460,6 +1631,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ZOL(I)=MAX(ZOL(I),0.0_kind_phys) ZOL(I)=MIN(ZOL(I),20._kind_phys) +#ifndef _OPENACC IF (debug_code >= 1) THEN IF (ZNTstoch_lnd(i) < 1E-8 .OR. Zt_lnd(i) < 1E-10) THEN write(0,*)"===(land) capture bad input in mynn sfc layer, i=:",i @@ -1470,6 +1642,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & " dz=",dz8w1d(i)," qflx=",qflx(i)," hflx=",hflx(i)," hpbl=",pblh(i) ENDIF ENDIF +#endif !Use Pedros iterative function to find z/L !zol(I)=zolri(rb_lnd(I),ZA(I),ZNTstoch_lnd(I),ZT_lnd(I),ZOL(I),psi_opt) @@ -1526,6 +1699,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ZOL(I)=MAX(ZOL(I),-20.0_kind_phys) ZOL(I)=MIN(ZOL(I),0.0_kind_phys) +#ifndef _OPENACC IF (debug_code >= 1) THEN IF (ZNTstoch_lnd(i) < 1E-8 .OR. Zt_lnd(i) < 1E-10) THEN write(0,*)"===(land) capture bad input in mynn sfc layer, i=:",i @@ -1536,6 +1710,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & " dz=",dz8w1d(i)," qflx=",qflx(i)," hflx=",hflx(i)," hpbl=",pblh(i) ENDIF ENDIF +#endif !Use Pedros iterative function to find z/L !zol(I)=zolri(rb_lnd(I),ZA(I),ZNTstoch_lnd(I),ZT_lnd(I),ZOL(I),psi_opt) @@ -1595,6 +1770,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ZOL(I)=MAX(ZOL(I),0.0_kind_phys) ZOL(I)=MIN(ZOL(I),20._kind_phys) +#ifndef _OPENACC IF (debug_code >= 1) THEN IF (ZNTstoch_ice(i) < 1E-8 .OR. Zt_ice(i) < 1E-10) THEN write(0,*)"===(ice) capture bad input in mynn sfc layer, i=:",i @@ -1605,6 +1781,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & " dz=",dz8w1d(i)," qflx=",qflx(i)," hflx=",hflx(i)," hpbl=",pblh(i) ENDIF ENDIF +#endif !Use Pedros iterative function to find z/L !zol(I)=zolri(rb_ice(I),ZA(I),ZNTstoch_ice(I),ZT_ice(I),ZOL(I),psi_opt) @@ -1661,6 +1838,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ZOL(I)=MAX(ZOL(I),-20.0_kind_phys) ZOL(I)=MIN(ZOL(I),0.0_kind_phys) +#ifndef _OPENACC IF (debug_code >= 1) THEN IF (ZNTstoch_ice(i) < 1E-8 .OR. Zt_ice(i) < 1E-10) THEN write(0,*)"===(ice) capture bad input in mynn sfc layer, i=:",i @@ -1671,6 +1849,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & " dz=",dz8w1d(i)," qflx=",qflx(i)," hflx=",hflx(i)," hpbl=",pblh(i) ENDIF ENDIF +#endif !Use Pedros iterative function to find z/L !zol(I)=zolri(rb_ice(I),ZA(I),ZNTstoch_ice(I),ZT_ice(I),ZOL(I),psi_opt) @@ -1821,6 +2000,14 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & endif ! flag_iter ENDDO ! end i-loop +#ifdef _OPENACC + ! Necessary since OpenACC does not support branching in parallel code + IF (errflg == 1) THEN + return + ENDIF +#endif + +#ifndef _OPENACC IF (debug_code == 2) THEN DO I=its,ite IF(wet(i))write(*,*)"==== AT END OF MAIN LOOP, i=",i, "(wet)" @@ -1841,10 +2028,29 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & write(*,*)"=============================================" ENDDO ! end i-loop ENDIF +#endif !---------------------------------------------------------- ! COMPUTE SURFACE HEAT AND MOISTURE FLUXES !---------------------------------------------------------- +!$acc parallel loop present(flag_iter, dry, wet, icy, & +!$acc QFX, HFX, FLHC, FLQC, LH, CHS, CH, CHS2, CQS2, & +!$acc RHO1D, MAVAIL, USTM, & +!$acc UST_lnd, UST_wat, UST_ice, & +!$acc PSIQ_lnd, PSIT_lnd, PSIX_lnd, & +!$acc PSIQ_wat, PSIT_wat, PSIX_wat, & +!$acc PSIQ_ice, PSIT_ice, PSIX_ice, & +!$acc PSIQ2_lnd, PSIT2_lnd, & +!$acc PSIQ2_wat, PSIT2_wat, & +!$acc PSIQ2_ice, PSIT2_ice, & +!$acc QSFC, QSFC_lnd, QSFC_wat, QSFC_ice, & +!$acc QFLX, QFLX_lnd, QFLX_wat, QFLX_ice, & +!$acc HFLX, HFLX_lnd, HFLX_wat, HFLX_ice, & +!$acc QSFCMR_lnd, QSFCMR_wat, QSFCMR_ice, & +!$acc QV1D, WSPD, WSPDI, CPM, TH1D, & +!$acc THSK_lnd, THSK_wat, THSK_ice, & +!$acc ch_lnd, ch_wat, ch_ice, & +!$acc cm_lnd, cm_wat, cm_ice) DO I=its,ite if( flag_iter(i) ) then @@ -2008,12 +2214,14 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ENDIF +#ifndef _OPENACC IF (debug_code > 1) THEN write(*,*)"QFX=",QFX(I),"FLQC=",FLQC(I) if(icy(i))write(*,*)"ice, MAVAIL:",MAVAIL(I)," u*=",UST_ice(I)," psiq=",PSIQ_ice(i) if(dry(i))write(*,*)"lnd, MAVAIL:",MAVAIL(I)," u*=",UST_lnd(I)," psiq=",PSIQ_lnd(i) if(wet(i))write(*,*)"ocn, MAVAIL:",MAVAIL(I)," u*=",UST_wat(I)," psiq=",PSIQ_wat(i) ENDIF +#endif ! The exchange coefficient for cloud water is assumed to be the ! same as that for heat. CH is multiplied by WSPD. @@ -2040,6 +2248,18 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ENDDO ! end i-loop IF (compute_diag) then + !$acc parallel loop present(flag_iter, dry, wet, icy, & + !$acc ZA, ZA2, T2, TH2, TH1D, Q2, QV1D, PSFCPA, & + !$acc THSK_lnd, THSK_wat, THSK_ice, & + !$acc QSFC_lnd, QSFC_wat, QSFC_ice, & + !$acc U10, V10, U1D, V1D, U1D2, V1D2, & + !$acc ZNTstoch_lnd, ZNTstoch_lnd, ZNTstoch_ice, & + !$acc PSIX_lnd, PSIX_wat, PSIX_ice, & + !$acc PSIX10_lnd, PSIX10_wat, PSIX10_ice, & + !$acc PSIT2_lnd, PSIT2_wat, PSIT2_ice, & + !$acc PSIT_lnd, PSIT_wat, PSIT_ice, & + !$acc PSIQ2_lnd, PSIQ2_wat, PSIQ2_ice, & + !$acc PSIQ_lnd, PSIQ_wat, PSIQ_ice) DO I=its,ite if( flag_iter(i) ) then !----------------------------------------------------- @@ -2154,6 +2374,16 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ! DEBUG - SUSPICIOUS VALUES !----------------------------------------------------- IF ( debug_code == 2) THEN + !$acc parallel loop present(dry, wet, icy, CPM, MAVAIL, & + !$acc HFX, LH, wstar, RHO1D, PBLH, ZOL, ZA, MOL, & + !$acc PSIM, PSIH, WSTAR, T1D, TH1D, THV1D, QVSH, & + !$acc UST_wat, UST_lnd, UST_ice, & + !$acc THSK_wat, THSK_lnd, THSK_ice, & + !$acc THVSK_wat, THVSK_lnd, THVSK_ice, & + !$acc ZNTstoch_wat, ZNTstoch_lnd, ZNTstoch_ice, & + !$acc ZT_wat, ZT_lnd, ZT_ice, & + !$acc QSFC_wat, QSFC_lnd, QSFC_ice, & + !$acc PSIX_wat, PSIX_lnd, PSIX_ice) DO I=its,ite yesno = 0 IF (compute_flux) THEN @@ -2258,6 +2488,54 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ENDDO ! end i-loop ENDIF ! end debug option +!$acc exit data copyout(CPM, FLHC, FLQC, CHS, CH, CHS2, CQS2,& +!$acc USTM, wstar, qstar, ZOL, MOL, RMOL, & +!$acc HFX, QFX, LH, QSFC, QFLX, HFLX, & +!$acc T2, TH2, Q2, WSPD, U10, V10, & +!$acc QGH, psim, psih, & +!$acc stress_wat, stress_lnd, stress_ice, & +!$acc rb_wat, rb_lnd, rb_ice, & +!$acc UST_wat, UST_lnd, UST_ice, & +!$acc ZNT_wat, ZNT_lnd, ZNT_ice, & +!$acc QSFC_lnd, QSFC_wat, QSFC_ice, & +!$acc QFLX_lnd, QFLX_wat, QFLX_ice, & +!$acc HFLX_lnd, HFLX_wat, HFLX_ice, & +!$acc PSIX_wat, PSIX_lnd, PSIX_ice, & +!$acc PSIX10_wat, PSIX10_lnd, PSIX10_ice, & +!$acc PSIT2_lnd, PSIT2_wat, PSIT2_ice, & +!$acc PSIT_lnd, PSIT_wat, PSIT_ice, & +!$acc ch_lnd, ch_wat, ch_ice, & +!$acc cm_lnd, cm_wat, cm_ice, & +!$acc errmsg) + +!$acc exit data delete( flag_iter, dry, wet, icy, dx, & +!$acc MAVAIL, PBLH, PSFCPA, z0pert, ztpert, & +!$acc QV1D, U1D, V1D, U1D2, V1D2, T1D, P1D, & +!$acc rstoch1D, sigmaf, shdmax, vegtype, & +!$acc dz2w1d, dz8w1d, snowh_lnd, & +!$acc tskin_wat, tskin_lnd, tskin_ice, & +!$acc tsurf_wat, tsurf_lnd, tsurf_ice) + +!$acc exit data delete( ZA, ZA2, THV1D, TH1D, TC1D, TV1D, & +!$acc RHO1D, QVSH, PSIH2, PSIM10, PSIH10, WSPDI, & +!$acc GOVRTH, PSFC, THCON, & +!$acc zratio_lnd, zratio_ice, zratio_wat, & +!$acc TSK_lnd, TSK_ice, TSK_wat, & +!$acc THSK_lnd, THSK_ice, THSK_wat, & +!$acc THVSK_lnd, THVSK_ice, THVSK_wat, & +!$acc GZ1OZ0_lnd, GZ1OZ0_ice, GZ1OZ0_wat, & +!$acc GZ1OZt_lnd, GZ1OZt_ice, GZ1OZt_wat, & +!$acc GZ2OZ0_lnd, GZ2OZ0_ice, GZ2OZ0_wat, & +!$acc GZ2OZt_lnd, GZ2OZt_ice, GZ2OZt_wat, & +!$acc GZ10OZ0_lnd, GZ10OZ0_ice, GZ10OZ0_wat, & +!$acc GZ10OZt_lnd, GZ10OZt_ice, GZ10OZt_wat, & +!$acc ZNTstoch_lnd, ZNTstoch_ice, ZNTstoch_wat, & +!$acc ZT_lnd, ZT_ice, ZT_wat, & +!$acc ZQ_lnd, ZQ_ice, ZQ_wat, & +!$acc PSIQ_lnd, PSIQ_ice, PSIQ_wat, & +!$acc PSIQ2_lnd, PSIQ2_ice, PSIQ2_wat, & +!$acc QSFCMR_lnd, QSFCMR_ice, QSFCMR_wat ) + END SUBROUTINE SFCLAY1D_mynn !------------------------------------------------------------------- !>\ingroup mynn_sfc @@ -2272,6 +2550,7 @@ END SUBROUTINE SFCLAY1D_mynn SUBROUTINE zilitinkevich_1995(Z_0,Zt,Zq,restar,ustar,KARMAN,& & landsea,IZ0TLND2,spp_sfc,rstoch) + !$acc routine seq IMPLICIT NONE REAL(kind_phys), INTENT(IN) :: Z_0,restar,ustar,KARMAN,landsea INTEGER, OPTIONAL, INTENT(IN) :: IZ0TLND2 @@ -2341,6 +2620,7 @@ SUBROUTINE davis_etal_2008(Z_0,ustar) !This is an update version from Davis et al. 2008, which !corrects a small-bias in Z_0 (AHW real-time 2012). + !$acc routine seq IMPLICIT NONE REAL(kind_phys), INTENT(IN) :: ustar REAL(kind_phys), INTENT(OUT) :: Z_0 @@ -2368,7 +2648,7 @@ END SUBROUTINE davis_etal_2008 !>This formulation for roughness length was designed account for. !!wave steepness. SUBROUTINE Taylor_Yelland_2001(Z_0,ustar,wsp10) - + !$acc routine seq IMPLICIT NONE REAL(kind_phys), INTENT(IN) :: ustar,wsp10 REAL(kind_phys), INTENT(OUT) :: Z_0 @@ -2396,7 +2676,7 @@ END SUBROUTINE Taylor_Yelland_2001 !! The Charnock parameter CZC is varied from .011 to .018. !! between 10-m wsp = 10 and 18.. SUBROUTINE charnock_1955(Z_0,ustar,wsp10,visc,zu) - + !$acc routine seq IMPLICIT NONE REAL(kind_phys), INTENT(IN) :: ustar, visc, wsp10, zu REAL(kind_phys), INTENT(OUT) :: Z_0 @@ -2421,7 +2701,7 @@ END SUBROUTINE charnock_1955 !!The Charnock parameter CZC is varied from about .005 to .028 !!between 10-m wind speeds of 6 and 19 m/s. SUBROUTINE edson_etal_2013(Z_0,ustar,wsp10,visc,zu) - + !$acc routine seq IMPLICIT NONE REAL(kind_phys), INTENT(IN) :: ustar, visc, wsp10, zu REAL(kind_phys), INTENT(OUT) :: Z_0 @@ -2450,7 +2730,7 @@ END SUBROUTINE edson_etal_2013 !!data. The formula for land uses a constant ratio (Z_0/7.4) taken !!from Garratt (1992). SUBROUTINE garratt_1992(Zt,Zq,Z_0,Ren,landsea) - + !$acc routine seq IMPLICIT NONE REAL(kind_phys), INTENT(IN) :: Ren, Z_0,landsea REAL(kind_phys), INTENT(OUT) :: Zt,Zq @@ -2486,7 +2766,7 @@ END SUBROUTINE garratt_1992 !! !!This is for use over water only. SUBROUTINE fairall_etal_2003(Zt,Zq,Ren,ustar,visc,rstoch,spp_sfc) - + !$acc routine seq IMPLICIT NONE REAL(kind_phys), INTENT(IN) :: Ren,ustar,visc,rstoch INTEGER, INTENT(IN) :: spp_sfc @@ -2530,7 +2810,7 @@ END SUBROUTINE fairall_etal_2003 !! The actual reference is unknown. This was passed along by Jim Edson (personal communication). !! This is for use over water only, preferably open ocean. SUBROUTINE fairall_etal_2014(Zt,Zq,Ren,ustar,visc,rstoch,spp_sfc) - + !$acc routine seq IMPLICIT NONE REAL(kind_phys), INTENT(IN) :: Ren,ustar,visc,rstoch INTEGER, INTENT(IN) :: spp_sfc @@ -2578,6 +2858,7 @@ END SUBROUTINE fairall_etal_2014 !!This should only be used over land! SUBROUTINE Yang_2008(Z_0,Zt,Zq,ustar,tstar,qst,Ren,visc) + !$acc routine seq IMPLICIT NONE REAL(kind_phys), INTENT(IN) :: Z_0, Ren, ustar, tstar, qst, visc REAL(kind_phys) :: ht, &! roughness height at critical Reynolds number @@ -2613,6 +2894,7 @@ END SUBROUTINE Yang_2008 !>\ingroup mynn_sfc SUBROUTINE GFS_z0_lnd(z0max,shdmax,z1,vegtype,ivegsrc,z0pert) + !$acc routine seq REAL(kind_phys), INTENT(OUT) :: z0max REAL(kind_phys), INTENT(IN) :: shdmax,z1,z0pert INTEGER, INTENT(IN) :: vegtype,ivegsrc @@ -2673,6 +2955,7 @@ END SUBROUTINE GFS_z0_lnd !>\ingroup mynn_sfc SUBROUTINE GFS_zt_lnd(ztmax,z0max,sigmaf,ztpert,ustar_lnd) + !$acc routine seq REAL(kind_phys), INTENT(OUT) :: ztmax REAL(kind_phys), INTENT(IN) :: z0max,sigmaf,ztpert,ustar_lnd REAL(kind_phys) :: czilc, tem1, tem2 @@ -2701,6 +2984,7 @@ END SUBROUTINE GFS_zt_lnd !>\ingroup mynn_sfc SUBROUTINE GFS_z0_wat(z0rl_wat,ustar_wat,WSPD,z1,sfc_z0_type,redrag) + !$acc routine seq REAL(kind_phys), INTENT(OUT) :: z0rl_wat REAL(kind_phys), INTENT(INOUT):: ustar_wat REAL(kind_phys), INTENT(IN) :: wspd,z1 @@ -2753,11 +3037,16 @@ END SUBROUTINE GFS_z0_wat !-------------------------------------------------------------------- !>\ingroup mynn_sfc SUBROUTINE GFS_zt_wat(ztmax,z0rl_wat,restar,WSPD,z1,sfc_z0_type,errmsg,errflg) - + !$acc routine seq real(kind_phys), INTENT(OUT) :: ztmax real(kind_phys), INTENT(IN) :: wspd,z1,z0rl_wat,restar INTEGER, INTENT(IN) :: sfc_z0_type +#ifndef _OPENACC character(len=*), intent(out) :: errmsg +#else +! Necessary since OpenACC does not support assumed-size arrays + character(len=200), intent(out) :: errmsg +#endif integer, intent(out) :: errflg real(kind_phys) :: z0,z0max,wind10m,rat,ustar_wat real(kind_phys), PARAMETER :: charnock = 0.014, z0s_max=.317e-2 @@ -2798,6 +3087,7 @@ SUBROUTINE GFS_zt_wat(ztmax,z0rl_wat,restar,WSPD,z1,sfc_z0_type,errmsg,errflg) errflg = 1 errmsg = 'ERROR(GFS_zt_wat): sfc_z0_type not valid.' return + endif END SUBROUTINE GFS_zt_wat @@ -2807,6 +3097,7 @@ END SUBROUTINE GFS_zt_wat !! Weiguo Wang, 2019-0425 SUBROUTINE znot_m_v6(uref, znotm) + !$acc routine seq use machine , only : kind_phys IMPLICIT NONE ! Calculate areodynamical roughness over water with input 10-m wind @@ -2856,6 +3147,7 @@ END SUBROUTINE znot_m_v6 !! SUBROUTINE znot_t_v6(uref, znott) + !$acc routine seq IMPLICIT NONE !> Calculate scalar roughness over water with input 10-m wind !! For low-to-moderate winds, try to match the Ck-U10 relationship from COARE algorithm @@ -2922,6 +3214,7 @@ END SUBROUTINE znot_t_v6 !! SUBROUTINE znot_m_v7(uref, znotm) + !$acc routine seq IMPLICIT NONE !> Calculate areodynamical roughness over water with input 10-m wind !! For low-to-moderate winds, try to match the Cd-U10 relationship from COARE V3.5 (Edson et al. 2013) @@ -2971,6 +3264,7 @@ END SUBROUTINE znot_m_v7 !! SUBROUTINE znot_t_v7(uref, znott) + !$acc routine seq IMPLICIT NONE !> Calculate scalar roughness over water with input 10-m wind !! For low-to-moderate winds, try to match the Ck-U10 relationship from COARE algorithm @@ -3040,6 +3334,7 @@ END SUBROUTINE znot_t_v7 !! This should only be used over snow/ice! SUBROUTINE Andreas_2002(Z_0,bvisc,ustar,Zt,Zq) + !$acc routine seq IMPLICIT NONE REAL(kind_phys), INTENT(IN) :: Z_0, bvisc, ustar REAL(kind_phys), INTENT(OUT) :: Zt, Zq @@ -3313,6 +3608,7 @@ END SUBROUTINE PSI_CB2005 !! and Holtslag (1991) for stable conditions. SUBROUTINE Li_etal_2010(zL, Rib, zaz0, z0zt) + !$acc routine seq IMPLICIT NONE REAL(kind_phys), INTENT(OUT) :: zL REAL(kind_phys), INTENT(IN) :: Rib, zaz0, z0zt @@ -3471,6 +3767,7 @@ REAL(kind_phys) function zolri2(zol2,ri2,za,z0,zt,psi_opt) REAL(kind_phys) function zolrib(ri,za,z0,zt,logz0,logzt,zol1,psi_opt) + !$acc routine seq ! This iterative algorithm to compute z/L from bulk-Ri IMPLICIT NONE @@ -3480,7 +3777,7 @@ REAL(kind_phys) function zolrib(ri,za,z0,zt,logz0,logzt,zol1,psi_opt) REAL(kind_phys) :: zol20,zol3,zolt,zolold INTEGER :: n INTEGER, PARAMETER :: nmax = 20 - REAL(kind_phys), DIMENSION(nmax):: zLhux + !REAL(kind_phys), DIMENSION(nmax):: zLhux REAL(kind_phys) :: psit2,psix2 !print*,"+++++++INCOMING: z/L=",zol1," ri=",ri @@ -3522,7 +3819,7 @@ REAL(kind_phys) function zolrib(ri,za,z0,zt,logz0,logzt,zol1,psi_opt) endif !print*,"n=",n," psit2=",psit2," psix2=",psix2 zolrib=ri*psix2**2/psit2 - zLhux(n)=zolrib + !zLhux(n)=zolrib n=n+1 enddo @@ -3530,7 +3827,7 @@ REAL(kind_phys) function zolrib(ri,za,z0,zt,logz0,logzt,zol1,psi_opt) !print*,"iter FAIL, n=",n," Ri=",ri," z/L=",zolri !if convergence fails, use approximate values: CALL Li_etal_2010(zolrib, ri, za/z0, z0/zt) - zLhux(n)=zolrib + !zLhux(n)=zolrib !print*,"FAILED, n=",n," Ri=",ri," z0=",z0 !print*,"z/L=",zLhux(1:nmax) else @@ -3595,6 +3892,7 @@ END SUBROUTINE psi_init ! !>\ingroup mynn_sfc real(kind_phys) function psim_stable_full(zolf) + !$acc routine seq real(kind_phys) :: zolf !psim_stable_full=-6.1*log(zolf+(1+zolf**2.5)**(1./2.5)) @@ -3605,6 +3903,7 @@ real(kind_phys) function psim_stable_full(zolf) !>\ingroup mynn_sfc real(kind_phys) function psih_stable_full(zolf) + !$acc routine seq real(kind_phys) :: zolf !psih_stable_full=-5.3*log(zolf+(1+zolf**1.1)**(1./1.1)) @@ -3615,6 +3914,7 @@ real(kind_phys) function psih_stable_full(zolf) !>\ingroup mynn_sfc real(kind_phys) function psim_unstable_full(zolf) + !$acc routine seq real(kind_phys) :: zolf,x,ym,psimc,psimk x=(1.-16.*zolf)**.25 @@ -3633,6 +3933,7 @@ real(kind_phys) function psim_unstable_full(zolf) !>\ingroup mynn_sfc real(kind_phys) function psih_unstable_full(zolf) + !$acc routine seq real(kind_phys) :: zolf,y,yh,psihc,psihk y=(1.-16.*zolf)**.5 @@ -3654,6 +3955,7 @@ real(kind_phys) function psih_unstable_full(zolf) !>\ingroup mynn_sfc !! REAL(kind_phys) function psim_stable_full_gfs(zolf) + !$acc routine seq REAL(kind_phys) :: zolf REAL(kind_phys), PARAMETER :: alpha4 = 20. REAL(kind_phys) :: aa @@ -3667,6 +3969,7 @@ REAL(kind_phys) function psim_stable_full_gfs(zolf) !>\ingroup mynn_sfc !! real(kind_phys) function psih_stable_full_gfs(zolf) + !$acc routine seq real(kind_phys) :: zolf real(kind_phys), PARAMETER :: alpha4 = 20. real(kind_phys) :: bb @@ -3680,6 +3983,7 @@ real(kind_phys) function psih_stable_full_gfs(zolf) !>\ingroup mynn_sfc !! real(kind_phys) function psim_unstable_full_gfs(zolf) + !$acc routine seq real(kind_phys) :: zolf real(kind_phys) :: hl1,tem1 real(kind_phys), PARAMETER :: a0=-3.975, a1=12.32, & @@ -3700,6 +4004,7 @@ real(kind_phys) function psim_unstable_full_gfs(zolf) !>\ingroup mynn_sfc !! real(kind_phys) function psih_unstable_full_gfs(zolf) + !$acc routine seq real(kind_phys) :: zolf real(kind_phys) :: hl1,tem1 real(kind_phys), PARAMETER :: a0p=-7.941, a1p=24.75, & @@ -3720,6 +4025,7 @@ real(kind_phys) function psih_unstable_full_gfs(zolf) !>\ingroup mynn_sfc !! look-up table functions - or, if beyond -10 < z/L < 10, recalculate real(kind_phys) function psim_stable(zolf,psi_opt) + !$acc routine seq integer :: nzol,psi_opt real(kind_phys) :: rzol,zolf @@ -3740,6 +4046,7 @@ real(kind_phys) function psim_stable(zolf,psi_opt) !>\ingroup mynn_sfc real(kind_phys) function psih_stable(zolf,psi_opt) + !$acc routine seq integer :: nzol,psi_opt real(kind_phys) :: rzol,zolf @@ -3760,6 +4067,7 @@ real(kind_phys) function psih_stable(zolf,psi_opt) !>\ingroup mynn_sfc real(kind_phys) function psim_unstable(zolf,psi_opt) + !$acc routine seq integer :: nzol,psi_opt real(kind_phys) :: rzol,zolf @@ -3780,6 +4088,7 @@ real(kind_phys) function psim_unstable(zolf,psi_opt) !>\ingroup mynn_sfc real(kind_phys) function psih_unstable(zolf,psi_opt) + !$acc routine seq integer :: nzol,psi_opt real(kind_phys) :: rzol,zolf diff --git a/physics/mynnsfc_wrapper.F90 b/physics/mynnsfc_wrapper.F90 index 1a970c9f4..3c033e65e 100644 --- a/physics/mynnsfc_wrapper.F90 +++ b/physics/mynnsfc_wrapper.F90 @@ -191,6 +191,16 @@ SUBROUTINE mynnsfc_wrapper_run( & & IMS,IME,JMS,JME,KMS,KME, & & ITS,ITE,JTS,JTE,KTS,KTE +!$acc enter data create(hfx, znt, psim, psih, chs, & +!$acc mavail, xland, GZ1OZ0, cpm, qgh, & +!$acc qfx, snowh_wat) + +!$acc enter data create(dz, th, qv) + +!$acc enter data copyin(rmol, phii, t3d, exner, qvsh, slmsk, xland) + +!$acc enter data copyin(dry, wet, icy, znt_lnd, znt_wat, znt_ice, qsfc_lnd, qsfc_ice, qsfc_lnd_ruc, qsfc_ice_ruc) + ! Initialize CCPP error handling variables errmsg = '' errflg = 0 @@ -203,6 +213,7 @@ SUBROUTINE mynnsfc_wrapper_run( & ! write(0,*)"iter=",iter ! endif +!$acc kernels ! prep MYNN-only variables dz(:,:) = 0 th(:,:) = 0 @@ -210,6 +221,9 @@ SUBROUTINE mynnsfc_wrapper_run( & hfx(:) = 0 qfx(:) = 0 rmol(:) = 0 +!$acc end kernels + +!$acc parallel loop collapse(2) present(dz, phii, th, t3d, exner, qv, qvsh) do k=1,2 !levs do i=1,im dz(i,k)=(phii(i,k+1) - phii(i,k))*g_inv @@ -219,6 +233,7 @@ SUBROUTINE mynnsfc_wrapper_run( & enddo enddo +!$acc parallel loop present(slmsk, xland, qgh, mavail, cpm, snowh_wat) do i=1,im if (slmsk(i)==1. .or. slmsk(i)==2.)then !sea/land/ice mask (=0/1/2) in FV3 xland(i)=1.0 !but land/water = (1/2) in SFCLAY_mynn @@ -235,6 +250,7 @@ SUBROUTINE mynnsfc_wrapper_run( & snowh_wat(i) = 0.0 enddo +!$acc kernels ! cm -> m where (dry) znt_lnd=znt_lnd*0.01 where (wet) znt_wat=znt_wat*0.01 @@ -245,6 +261,7 @@ SUBROUTINE mynnsfc_wrapper_run( & where (dry) qsfc_lnd = qsfc_lnd_ruc/(1.+qsfc_lnd_ruc) ! spec. hum where (icy) qsfc_ice = qsfc_ice_ruc/(1.+qsfc_ice_ruc) ! spec. hum. end if +!$acc end kernels ! if (lprnt) then ! write(0,*)"CALLING SFCLAY_mynn; input:" @@ -274,6 +291,8 @@ SUBROUTINE mynnsfc_wrapper_run( & ! write(0,*)"PBLH=",pblh(1)," xland=",xland(1) ! endif +!$acc exit data delete(qsfc_lnd_ruc, qsfc_ice_ruc) +!$acc exit data delete(phii, qvsh, slmsk) CALL SFCLAY_mynn( & u3d=u,v3d=v,t3d=t3d,qv3d=qv,p3d=prsl,dz8w=dz, & @@ -318,6 +337,13 @@ SUBROUTINE mynnsfc_wrapper_run( & errmsg=errmsg, errflg=errflg ) if (errflg/=0) return +!$acc exit data delete(hfx, znt, psim, psih, chs, & +!$acc mavail, xland, GZ1OZ0, cpm, qgh, & +!$acc qfx, snowh_wat, t3d, exner) +!$acc exit data delete(dz, th, qv) +!$acc exit data copyout(rmol) +!$acc exit data copyout(qsfc_lnd, qsfc_ice) + !! POST MYNN SURFACE LAYER (INTERSTITIAL) WORK: !do i = 1, im ! !* Taken from sfc_nst.f @@ -336,10 +362,15 @@ SUBROUTINE mynnsfc_wrapper_run( & ! znt_ice(i)=znt_ice(i)*100. !enddo +!$acc kernels ! m -> cm where (dry) znt_lnd=znt_lnd*100. where (wet) znt_wat=znt_wat*100. where (icy) znt_ice=znt_ice*100. +!$acc end kernels + +!$acc exit data delete(dry, wet, icy) +!$acc exit data copyout(znt_lnd, znt_wat, znt_ice) ! if (lprnt) then ! write(0,*) From 95e9ff2268ad03f3004da886634812c5f7bdf32c Mon Sep 17 00:00:00 2001 From: "Timothy S. Sliwinski" Date: Thu, 24 Aug 2023 16:10:55 +0000 Subject: [PATCH 2/3] Reworking how errmsg is treated in device code to remove some preprocessor variable substitutions through the use of new local variables. The changes in this commit affect 3 main areas of module_sf_mynn.F90: 1.) Subroutine SFCLAY_mynn 2.) Subroutine SFCLAY1D_mynn 3.) Subroutine GFS_zt_wat Each of these areas are described in more detail below. 1.) SFCLAY_mynn In the SFCLAY_mynn subroutine, it was possible to remove all #ifdef substitutions of errmsg(len=*) for errmsg(len=200) because errmsg is not used in any code regions of this subroutine that may be run on an accelerator device. Since this is the case, errmsg(len=*) is perfectly acceptable, and can be left alone. The OpenACC data statements within the subroutine were also updated to remove references to errmsg as well since, again, it was not necessary to have errmsg on the device for this subroutine. 2.) SFCLAY1D_mynn - Creation of device_errmsg and device_errflg and proper syncing with errmsg and errflg In the SFCLAY1D_mynn subroutine, it was also possible to remove all #ifdef substitutions by instead creating a new local variable called device_errmsg that is a copy of errmsg but with a fixed size of 512 such that it is acceptable for use on the device. This is necessary because at certain points in the subroutine, loops that are good to be offloaded to the device set errmsg under certain conditions. Since these areas cannot be isolated from the parent loop without a major rework of the loop, we must preserve a way for errmsg to be set on the device. Since device_errmsg is a fixed size, we can do that. However, this complicates the code a bit for error handling purposes as we now have errmsg and device_errmsg which must be synced properly to ensure error messages are returned to CCPP as expected. Therefore, we must keep track of when device_errmsg is set so we can know to sync device_errmsg with errmsg. This is done by making a new local variable called device_errflg to be device_errmsg's complement on the device as errflg is errmsg's complement on the host. When device_errflg is set to a nonzero integer, we then know that device_errmsg must be synced with errmsg. This is simple to do at the end of the subroutine after the device_errmsg on the device is copyout-ed by OpenACC, and a new IF-block has been added for this general case. - Special case of mid-loop return (line 1417), and the creation of device_special_errflg and device_special_errmsg However, there is a special case we must handle a bit differently. In the mid-loop return statement near line 1417, we also must perform this sync to ensure the proper errmsg is returned in the event this return is needed. Therefore, a similar IF-block has been created within the corresponding #ifdef near line 2027 to ensure errmsg has the proper value before the subroutine returns. However, since this block is in the middle of the entire code and only executed on the host, we must first perform an OpenACC sync operation to make sure the device_errmsg and the device_errflg on the host matches the device_errmsg and device_errflg on the host, otherwise the incorrect values could lead to the return statement not executing as expected. This special case seems simple, but an extra trap lay exposed. If device_errmsg and device_errflg is set on the device at any point now before this IF-block, then the return statement we moved out of the loop will now be executed for *ANY* error message, whether that was the intended course or not. Therefore, we need to ensure this special case is only triggered for this specific case. Unfortunately, there appears no other way than to create two additional variables (device_special_errmsg and device_special_errflg) to isolate this case from all other error cases. With these installed in place of just device_errmsg and device_errflg, this special return case is now properly handled. - Complete Ifdef/Ifndef removal not possible Overall, due to the nature of this special case, we have no choice but to leave the #ifdef and #ifndef preprocessor statements in place as they are the only things capable of moving this return statement out of the loop without additional invasive changes to how the code operates. 3.) GFS_zt_wat In the GFS_zt_wat subroutine, since this subroutine is called on the device from within the main I-loop of SFCLAY1D_mynn, we have no choice but to change all errmsg and errflg usage to device_errmsg or device_errflg, otherwise this subroutine and the entire parent loop could not be run on the device. Therefore, all errmsg and errflg lines have been commented out and new, comparable lines using device_errmsg and device_errflg added in their place. Additionally, the subroutine call variable list was updated. --- physics/module_sf_mynn.F90 | 99 ++++++++++++++++++++++++++------------ 1 file changed, 67 insertions(+), 32 deletions(-) diff --git a/physics/module_sf_mynn.F90 b/physics/module_sf_mynn.F90 index 399b1ee83..dd181c99c 100644 --- a/physics/module_sf_mynn.F90 +++ b/physics/module_sf_mynn.F90 @@ -345,12 +345,7 @@ SUBROUTINE SFCLAY_mynn( & & qsfc_wat, qsfc_lnd, qsfc_ice ! CCPP error handling -#ifndef _OPENACC character(len=*), intent(inout) :: errmsg -#else -!Necessary since OpenACC does not support assumed-size arrays - character(len=200), intent(inout) :: errmsg -#endif integer, intent(inout) :: errflg !ADDITIONAL OUTPUT @@ -378,7 +373,7 @@ SUBROUTINE SFCLAY_mynn( & errmsg = '' !$acc enter data copyin( dz8w,U3D,V3D,QV3D,QC3D,P3D,T3D, & -!$acc pattern_spp_sfc, errmsg) +!$acc pattern_spp_sfc) !$acc enter data copyin( UST_WAT(:), UST_LND(:), UST_ICE(:), & !$acc MOL(:), QFLX(:), HFLX(:), & @@ -504,7 +499,7 @@ SUBROUTINE SFCLAY_mynn( & !$acc exit data copyout( UST_WAT(:), UST_LND(:), UST_ICE(:), & !$acc MOL(:), QFLX(:), HFLX(:), & !$acc QSFC(:), QSFC_WAT(:), QSFC_LND(:), & -!$acc QSFC_ICE(:), errmsg) +!$acc QSFC_ICE(:)) !$acc exit data delete( dz8w1d(:), dz2w1d(:), U1D(:), & !$acc V1D(:), U1D2(:), V1D2(:), & @@ -666,14 +661,25 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & !JOE-end ! CCPP error handling -#ifndef _OPENACC character(len=*), intent(inout) :: errmsg -#else -! Necessary since OpenACC does not support assumed-size arrays - character(len=200), intent(inout) :: errmsg -#endif integer, intent(inout) :: errflg +! Local fixed-size errmsg character array for error messages on accelerator +! devices distinct from the host (e.g. GPUs). Necessary since OpenACC does +! not support assumed-size (len=*) arrays like errmsg. Additional +! device_errflg integer to denote when device_errmsg needs to be synced +! with errmsg. + character(len=512) :: device_errmsg + integer :: device_errflg + +! Special versions of the fixed-size errmsg character array for error messages +! on the device and it's errflag counterpart. These are necessary to ensure +! the return statements at lines 1417 and 2030 are executed only for this +! special case, and not any and all error messages set on the device. + character(len=512) :: device_special_errmsg + integer :: device_special_errflg + + !---------------------------------------------------------------- ! LOCAL VARS !---------------------------------------------------------------- @@ -723,6 +729,10 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ! Initialize error-handling errflg = 0 errmsg = '' + device_errflg = errflg + device_errmsg = errmsg + device_special_errflg = errflg + device_special_errmsg = errmsg !------------------------------------------------------------------- !$acc update device(psim_stab, psim_unstab, psih_stab, psih_unstab) @@ -770,7 +780,9 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & !$acc PSIT_lnd, PSIT_wat, PSIT_ice, & !$acc ch_lnd, ch_wat, ch_ice, & !$acc cm_lnd, cm_wat, cm_ice, & -!$acc snowh_lnd, errmsg) +!$acc snowh_lnd, & +!$acc device_errmsg, device_errflg, & +!$acc device_special_errmsg, device_special_errflg) !$acc parallel loop present(PSFCPA, PSFC, QSFC, T1D, flag_iter, & !$acc QSFC_wat, QSFCMR_wat, wet, TSK_wat, tskin_wat, & @@ -1198,7 +1210,9 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & !-------------------------------------------------------------------- !-------------------------------------------------------------------- -!$acc parallel loop present(flag_iter, errmsg, & +!$acc parallel loop present(flag_iter, & +!$acc device_errmsg, device_errflg, & +!$acc device_special_errmsg, device_special_errflg, & !$acc wet, dry, icy, & !$acc ZT_wat, ZT_lnd, ZT_ice, & !$acc ZNT_wat, ZNT_lnd, ZNT_ice, & @@ -1330,7 +1344,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ENDIF ELSEIF ( ISFTCFLX .EQ. 4 ) THEN !GFS zt formulation - CALL GFS_zt_wat(ZT_wat(i),ZNTstoch_wat(i),restar,WSPD(i),ZA(i),sfc_z0_type,errmsg,errflg) + CALL GFS_zt_wat(ZT_wat(i),ZNTstoch_wat(i),restar,WSPD(i),ZA(i),sfc_z0_type,device_errmsg,device_errflg) ZQ_wat(i)=ZT_wat(i) ENDIF ELSE @@ -1392,10 +1406,14 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ELSEIF ( IZ0TLND .EQ. 2 ) THEN ! DH note - at this point, qstar is either not initialized ! or initialized to zero, but certainly not set correctly - errmsg = 'Logic error: qstar is not set correctly when calling Yang_2008' - errflg = 1 + device_special_errmsg = 'Logic error: qstar is not set correctly when calling Yang_2008' + device_special_errflg = 1 #ifndef _OPENACC ! Necessary since OpenACC does not support branching in parallel code +! Must sync errmsg and errflg with device_errmsg and device_errflg, respectively +! so that proper error message and error flag codes are returned. + errmsg = device_special_errmsg + errflg = device_special_errflg return #endif CALL Yang_2008(ZNTSTOCH_lnd(i),ZT_lnd(i),ZQ_lnd(i),UST_lnd(i),MOL(I),& @@ -2001,8 +2019,14 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ENDDO ! end i-loop #ifdef _OPENACC - ! Necessary since OpenACC does not support branching in parallel code - IF (errflg == 1) THEN +! Necessary since OpenACC does not support branching in parallel code. +! Must sync host errflg, errmsg to determine if return must be triggered +! and correct error message and error flag code returned. +! This code is being executed on the HOST side only, pulling data from DEVICE. +!$acc exit data copyout(device_special_errflg, device_special_errmsg) + IF (device_special_errflg /= 0) THEN + errflg = device_special_errflg + errmsg = device_special_errmsg return ENDIF #endif @@ -2506,7 +2530,13 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & !$acc PSIT_lnd, PSIT_wat, PSIT_ice, & !$acc ch_lnd, ch_wat, ch_ice, & !$acc cm_lnd, cm_wat, cm_ice, & -!$acc errmsg) +!$acc device_errmsg, device_errflg) + +! Final sync of device and host error flags and messages +IF (device_errflg /= 0) THEN + errflg = device_errflg + errmsg = device_errmsg +ENDIF !$acc exit data delete( flag_iter, dry, wet, icy, dx, & !$acc MAVAIL, PBLH, PSFCPA, z0pert, ztpert, & @@ -3036,24 +3066,27 @@ SUBROUTINE GFS_z0_wat(z0rl_wat,ustar_wat,WSPD,z1,sfc_z0_type,redrag) END SUBROUTINE GFS_z0_wat !-------------------------------------------------------------------- !>\ingroup mynn_sfc - SUBROUTINE GFS_zt_wat(ztmax,z0rl_wat,restar,WSPD,z1,sfc_z0_type,errmsg,errflg) + SUBROUTINE GFS_zt_wat(ztmax,z0rl_wat,restar,WSPD,z1,sfc_z0_type,device_errmsg,device_errflg) !$acc routine seq real(kind_phys), INTENT(OUT) :: ztmax real(kind_phys), INTENT(IN) :: wspd,z1,z0rl_wat,restar INTEGER, INTENT(IN) :: sfc_z0_type -#ifndef _OPENACC - character(len=*), intent(out) :: errmsg -#else -! Necessary since OpenACC does not support assumed-size arrays - character(len=200), intent(out) :: errmsg -#endif - integer, intent(out) :: errflg + +! Using device_errmsg and device_errflg rather than the CCPP errmsg and errflg +! so that this subroutine can be run on an accelerator device with OpenACC. +! character(len=*), intent(out) :: errmsg +! integer, intent(out) :: errflg + character(len=512), intent(out) :: device_errmsg + integer, intent(out) :: device_errflg + real(kind_phys) :: z0,z0max,wind10m,rat,ustar_wat real(kind_phys), PARAMETER :: charnock = 0.014, z0s_max=.317e-2 ! Initialize error-handling - errflg = 0 - errmsg = '' +! errflg = 0 +! errmsg = '' + device_errflg = 0 + device_errmsg = '' ! z0 = 0.01 * z0rl_wat !Already converted to meters in the wrapper @@ -3084,8 +3117,10 @@ SUBROUTINE GFS_zt_wat(ztmax,z0rl_wat,restar,WSPD,z1,sfc_z0_type,errmsg,errflg) call znot_t_v7(wind10m, ztmax) ! 10-m wind,m/s, ztmax(m) else if (sfc_z0_type > 0) then write(0,*)'no option for sfc_z0_type=',sfc_z0_type - errflg = 1 - errmsg = 'ERROR(GFS_zt_wat): sfc_z0_type not valid.' +! errflg = 1 +! errmsg = 'ERROR(GFS_zt_wat): sfc_z0_type not valid.' + device_errflg = 1 + device_errmsg = 'ERROR(GFS_zt_wat): sfc_z0_type not valid.' return endif From 36a313e91bd7089a3069a72d1326d200e4bbcde0 Mon Sep 17 00:00:00 2001 From: "Timothy S. Sliwinski" Date: Mon, 28 Aug 2023 21:11:38 +0000 Subject: [PATCH 3/3] Removing preprocessor directives to re-enable print statements on GPU for debug and other conditions. Original problem: ----------------- Following feedback that debug information was still desirable for OpenACC device- executed code where possible, this change removes all preprocessor directives which were guarding against the compilation of statements which wrote to standard output. These directives were originally used because debug statements and other standard output had the potential to greatly reduce performance because of the need to copy over certain variables from the host to the device just for debug output purposes. Additionally, when statements were located within parallel-execution regions, the output was not guaranteed to be presented in any specific order and the additional IF-branches in the code also would have reduced performance as branching is not efficient when on SIMD architectures. Resolutions: ------------ However, with a bit of extra work, a few of these issues are alleviated to allow output to work again as requested. First, on the data optimization side of the problem, the impact of pulling in variables just for debugging was minimized by ensuring the data was pulled in and resident on the GPU for the entire subroutine execution. While this increases the memory footprint on the device which may have very limited memory, it reduces the data transfer related performance hit. Next, in the cases where debug output was not within parallel regions but still needing to be executed on the GPU to show the proper values at that state of the overall program execution, OpenACC serial regions were used. These allow the data to not have to be transferred off the GPU mid-execution of the program just to be shown as debug output and also partially solve the problem of out-of-order output. Since debug regions are guarded by IF blocks, these serial regions do not significantly impact performance when debug output is turned off (debug_code=0). However, slowdown is significant for any other debug-levels which should be acceptable for debugging situations. Performance Changes: -------------------- Overall, these changes accomplish the goal of re-enabling debugging output, but not completely without a cost. Overall runtime was slightly impacted on the GPU when tested with 150k and 750k vertical columns (the value of ite used in the i-loops) and debugging turned off (debug_code=0). For 150k columns, the GPU decreased in speed from the original baseline of 22ms to 30ms. For 750k columns, the GPU decreased in speed from the original baseline of 31ms to 70ms. The impact is greater for the larger number of columns due to the impact of the number of times the mid-loop IF branches are evaluated on the GPU. While these are slight declines in performance, these are still significant speedups over the CPU-only tests (8.7x and 18.7x speedups for 150k and 750k, respectively). Compilation Time Changes: ------------------------- One additional noted observation regarding performance is compilation time. When all debug output is disabled (debug_code=0), compilation time is approximately 90 seconds with the additional serial blocks, IF-branches, and so forth as each of these require more work from the OpenACC compiler to generate code for the GPU. This problem is compounded when the debug_code option is increase to either 1 (some debug output) or 2 (full debug output). At a value of 1, compilation time jumps up to approximately 12.5 minutes on the Hera GPU nodes. At a value of 2, compilation time increases further to approximately 18.5 minutes on the same GPU nodes. The explanation for this is the need for the OpenACC compiler to enable greater amounts of serial and branching code that (again) are less optimal on the GPU and so the compiler must do more work to try to optimize them as best it can. --- physics/module_sf_mynn.F90 | 112 ++++++++++++++++--------------------- 1 file changed, 49 insertions(+), 63 deletions(-) diff --git a/physics/module_sf_mynn.F90 b/physics/module_sf_mynn.F90 index dd181c99c..eecc5493c 100644 --- a/physics/module_sf_mynn.F90 +++ b/physics/module_sf_mynn.F90 @@ -780,11 +780,11 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & !$acc PSIT_lnd, PSIT_wat, PSIT_ice, & !$acc ch_lnd, ch_wat, ch_ice, & !$acc cm_lnd, cm_wat, cm_ice, & -!$acc snowh_lnd, & +!$acc snowh_lnd, snowh_wat, snowh_ice, & !$acc device_errmsg, device_errflg, & !$acc device_special_errmsg, device_special_errflg) -!$acc parallel loop present(PSFCPA, PSFC, QSFC, T1D, flag_iter, & +!$acc parallel loop present(PSFCPA, PSFC, QSFC, T1D, flag_iter, tsurf_lnd, & !$acc QSFC_wat, QSFCMR_wat, wet, TSK_wat, tskin_wat, & !$acc QSFC_lnd, QSFCMR_lnd, dry, TSK_lnd, tskin_lnd, & !$acc QSFC_ice, QSFCMR_ice, icy, TSK_ice, tskin_ice) @@ -809,9 +809,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ENDIF QSFC_wat(I)=EP2*E1/(PSFC(I)-ep3*E1) !specific humidity QSFCMR_wat(I)=EP2*E1/(PSFC(I)-E1) !mixing ratio -#ifndef _OPENACC IF(QSFC_wat(I)>1..or.QSFC_wat(I)<0.) print *,' QSFC_wat(I)',itimestep,i,QSFC_wat(I),TSK_wat(i) -#endif ENDIF IF (dry(i)) THEN TSK_lnd(I) = tskin_lnd(i) @@ -831,9 +829,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & QSFC_lnd(I)=0.5*(QSFC_lnd(I) + QSFC(I)) QSFCMR_lnd(I)=QSFC_lnd(I)/(1.-QSFC_lnd(I)) !mixing ratio endif ! lsm -#ifndef _OPENACC IF(QSFC_lnd(I)>1..or.QSFC_lnd(I)<0.) print *,' QSFC_lnd(I)',itimestep,i,QSFC_lnd(I),Tskin_lnd(i),tsurf_lnd(i),qsfc(i) -#endif ENDIF IF (icy(i)) THEN TSK_ice(I) = tskin_ice(i) @@ -851,9 +847,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & QSFC_ice(I)=EP2*E1/(PSFC(I)-ep3*E1) !specific humidity QSFCMR_ice(I)=EP2*E1/(PSFC(I)-E1) !mixing ratio endif ! lsm -#ifndef _OPENACC IF(QSFC_ice(I)>1..or.QSFC_ice(I)<0.) print *,' QSFC_ice(I)',itimestep,i,QSFC_ice(I),TSK_ice(i) -#endif ENDIF ELSE @@ -906,7 +900,10 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & endif ! flag_iter ENDDO -#ifndef _OPENACC +!$acc serial present(pblh, PSFCPA, dz8w1d, qflx, hflx, & +!$acc dry, tskin_lnd, tsurf_lnd, qsfc_lnd, znt_lnd, ust_lnd, snowh_lnd, & +!$acc icy, tskin_ice, tsurf_ice, qsfc_ice, znt_ice, ust_ice, snowh_ice, & +!$acc wet, tskin_wat, tsurf_wat, qsfc_wat, znt_wat, ust_wat, snowh_wat) IF (debug_code >= 1) THEN write(0,*)"ITIMESTEP=",ITIMESTEP," iter=",iter DO I=its,ite @@ -931,12 +928,12 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ENDIF ENDDO ENDIF -#endif +!$acc end serial !$acc parallel loop present(PSFC, PSFCPA, QVSH, QV1D, THCON, flag_iter, & !$acc dry, tskin_lnd, TSK_lnd, tsurf_lnd, THSK_lnd, THVSK_lnd, qsfc_lnd, & !$acc icy, tskin_ice, TSK_ice, tsurf_ice, THSK_ice, THVSK_ice, qsfc_ice, & -!$acc wet, tskin_wat, TSK_wat, tsurf_wat, THSK_wat, THVSK_wat) +!$acc wet, tskin_wat, TSK_wat, tsurf_wat, THSK_wat, THVSK_wat, qsfc_wat) DO I=its,ite ! PSFC ( in cmb) is used later in saturation checks PSFC(I)=PSFCPA(I)/1000. @@ -950,10 +947,8 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ! CONVERT SKIN TEMPERATURES TO POTENTIAL TEMPERATURE: THSK_lnd(I) = TSK_lnd(I)*THCON(I) !(K) THVSK_lnd(I) = THSK_lnd(I)*(1.+EP1*qsfc_lnd(I)) -#ifndef _OPENACC if(THVSK_lnd(I) < 170. .or. THVSK_lnd(I) > 360.) & print *,'THVSK_lnd(I)',itimestep,i,THVSK_lnd(I),THSK_lnd(i),tsurf_lnd(i),tskin_lnd(i),qsfc_lnd(i) -#endif endif if(icy(i)) then TSK_ice(I) = tskin_ice(i) @@ -961,10 +956,8 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ! CONVERT SKIN TEMPERATURES TO POTENTIAL TEMPERATURE: THSK_ice(I) = TSK_ice(I)*THCON(I) !(K) THVSK_ice(I) = THSK_ice(I)*(1.+EP1*qsfc_ice(I)) !(K) -#ifndef _OPENACC if(THVSK_ice(I) < 170. .or. THVSK_ice(I) > 360.) & print *,'THVSK_ice(I)',itimestep,i,THVSK_ice(I),THSK_ice(i),tsurf_ice(i),tskin_ice(i),qsfc_ice(i) -#endif endif if(wet(i)) then TSK_wat(I) = tskin_wat(i) @@ -972,10 +965,8 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ! CONVERT SKIN TEMPERATURES TO POTENTIAL TEMPERATURE: THSK_wat(I) = TSK_wat(I)*THCON(I) !(K) THVSK_wat(I) = THSK_wat(I)*(1.+EP1*QVSH(I)) !(K) -#ifndef _OPENACC if(THVSK_wat(I) < 170. .or. THVSK_wat(I) > 360.) & print *,'THVSK_wat(I)',i,THVSK_wat(I),THSK_wat(i),tsurf_wat(i),tskin_wat(i),qsfc_wat(i) -#endif endif endif ! flag_iter ENDDO @@ -1009,7 +1000,10 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & HFX(i)=HFLX(i)*RHO1D(I)*cp ENDDO -#ifndef _OPENACC +!$acc serial present(THV1D, TV1D, RHO1D, GOVRTH, & +!$acc dry, tsk_lnd, thvsk_lnd, & +!$acc icy, tsk_ice, thvsk_ice, & +!$acc wet, tsk_wat, thvsk_wat) IF (debug_code ==2) THEN !write(*,*)"ITIMESTEP=",ITIMESTEP DO I=its,ite @@ -1022,7 +1016,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & write(*,*)"RHO1D=", RHO1D(i)," GOVRTH=",GOVRTH(i) ENDDO ENDIF -#endif +!$acc end serial !$acc parallel loop present(T1D,P1D,QGH,QV1D,CPM) DO I=its,ite @@ -1042,7 +1036,10 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & CPM(I)=CP*(1.+0.84*QV1D(I)) ENDDO -#ifndef _OPENACC +!$acc serial present(QGH, & +!$acc wet, QSFC_wat, QSFCMR_wat, & +!$acc dry, QSFC_lnd, QSFCMR_lnd, & +!$acc icy, QSFC_ice, QSFCMR_ice) IF (debug_code == 2) THEN write(*,*)"ITIMESTEP=",ITIMESTEP DO I=its,ite @@ -1060,7 +1057,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & endif ENDDO ENDIF -#endif +!$acc end serial !$acc parallel loop present(flag_iter,U1D,V1D,WSPD,wet,dry,icy, & !$acc THV1D,THVSK_wat,THVSK_lnd,THVSK_ice, & @@ -1182,7 +1179,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & WSPD(I) = MAX(WSPD_ice,WSPD_wat) WSPD(I) = MAX(WSPD_lnd,WSPD(I)) -#ifndef _OPENACC IF (debug_code == 2) THEN write(*,*)"===== After rb calc in mynn sfc layer:" write(*,*)"ITIMESTEP=",ITIMESTEP @@ -1191,7 +1187,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & IF (wet(i))write(*,*)"rb_wat=", rb_wat(I)," DTHVDZ=",DTHVDZ IF (dry(i))write(*,*)"rb_lnd=", rb_lnd(I)," DTHVDZ=",DTHVDZ ENDIF -#endif ! IF PREVIOUSLY UNSTABLE, DO NOT LET INTO REGIMES 1 AND 2 (STABLE) !if (itimestep .GT. 1) THEN @@ -1210,7 +1205,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & !-------------------------------------------------------------------- !-------------------------------------------------------------------- -!$acc parallel loop present(flag_iter, & +!$acc parallel loop present(flag_iter, PSFCPA, dz8w1d, pblh, & !$acc device_errmsg, device_errflg, & !$acc device_special_errmsg, device_special_errflg, & !$acc wet, dry, icy, & @@ -1219,8 +1214,10 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & !$acc ZNTstoch_wat, ZNTstoch_lnd, ZNTstoch_ice, & !$acc UST_wat, UST_lnd, UST_ice, & !$acc ZQ_wat, ZQ_lnd, ZQ_ice, & -!$acc snowh_lnd, & +!$acc snowh_wat, snowh_lnd, snowh_ice, & !$acc THVSK_wat, THVSK_lnd, THVSK_ice, & +!$acc tskin_wat, tskin_lnd, tskin_ice, & +!$acc tsurf_wat, tsurf_lnd, tsurf_ice, & !$acc qsfc_wat, qsfc_lnd, qsfc_ice, & !$acc GZ1OZ0_wat, GZ1OZt_wat, GZ2OZ0_wat, GZ2OZt_wat, GZ10OZ0_wat, GZ10OZt_wat, & !$acc GZ1OZ0_lnd, GZ1OZt_lnd, GZ2OZ0_lnd, GZ2OZt_lnd, GZ10OZ0_lnd, GZ10OZt_lnd, & @@ -1228,12 +1225,14 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & !$acc zratio_wat, zratio_lnd, zratio_ice, & !$acc stress_wat, stress_lnd, stress_ice, & !$acc rb_wat, rb_lnd, rb_ice, & +!$acc qflx, qflx_lnd, & +!$acc hflx, hflx_lnd, & !$acc psim, psih, psim10, psih10, psih2, & !$acc psix_wat, psix10_wat, psit_wat, psit2_wat, psiq_wat, psiq2_wat, & !$acc psix_lnd, psix10_lnd, psit_lnd, psit2_lnd, psiq_lnd, psiq2_lnd, & !$acc psix_ice, psix10_ice, psit_ice, psit2_ice, psiq_ice, psiq2_ice, & !$acc WSPD, WSPDI, U1D, V1D, TC1D, THV1D, rstoch1D, USTM, ZA, ZOL, QVSH, & -!$acc shdmax, vegtype, z0pert, ztpert, mol, rmol, qstar, sigmaf) +!$acc shdmax, vegtype, z0pert, ztpert, mol, rmol, wstar, qstar, sigmaf) DO I=its,ite if( flag_iter(i) ) then @@ -1250,12 +1249,12 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & if (sfc_z0_type >= 0) then ! Avoid calculation is using wave model ! CALCULATE z0 (znt) !-------------------------------------- -#ifndef _OPENACC + IF (debug_code == 2) THEN write(*,*)"=============Input to ZNT over water:" write(*,*)"u*:",UST_wat(i)," wspd=",WSPD(i)," visc=",visc," za=",ZA(I) ENDIF -#endif + IF ( PRESENT(ISFTCFLX) ) THEN IF ( ISFTCFLX .EQ. 0 ) THEN IF (COARE_OPT .EQ. 3.0) THEN @@ -1292,12 +1291,10 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ZNTstoch_wat(I) = ZNT_wat(I) endif -#ifndef _OPENACC IF (debug_code > 1) THEN write(*,*)"==========Output ZNT over water:" write(*,*)"ZNT:",ZNTstoch_wat(i) ENDIF -#endif !COMPUTE ROUGHNESS REYNOLDS NUMBER (restar) USING NEW ZNT ! AHW: Garrattt formula: Calculate roughness Reynolds number @@ -1308,12 +1305,10 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & !-------------------------------------- !CALCULATE z_t and z_q !-------------------------------------- -#ifndef _OPENACC IF (debug_code > 1) THEN write(*,*)"=============Input to ZT over water:" write(*,*)"u*:",UST_wat(i)," restar=",restar," visc=",visc ENDIF -#endif IF ( PRESENT(ISFTCFLX) ) THEN IF ( ISFTCFLX .EQ. 0 ) THEN @@ -1357,12 +1352,11 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & rstoch1D(i),spp_sfc) ENDIF ENDIF -#ifndef _OPENACC + IF (debug_code > 1) THEN write(*,*)"=============Output ZT & ZQ over water:" write(*,*)"ZT:",ZT_wat(i)," ZQ:",ZQ_wat(i) ENDIF -#endif GZ1OZ0_wat(I)= LOG((ZA(I)+ZNTstoch_wat(I))/ZNTstoch_wat(I)) GZ1OZt_wat(I)= LOG((ZA(I)+ZNTstoch_wat(i))/ZT_wat(i)) @@ -1433,7 +1427,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ENDIF ENDIF -#ifndef _OPENACC IF (ZNTstoch_lnd(i) < 1E-8 .OR. Zt_lnd(i) < 1E-10) THEN write(0,*)"===(land) capture bad input in mynn sfc layer, i=:",i write(0,*)" ZNT=", ZNTstoch_lnd(i)," ZT=",Zt_lnd(i) @@ -1442,7 +1435,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & " ust=", ust_lnd(i)," snowh=", snowh_lnd(i),"psfcpa=",PSFCPA(i), & " dz=",dz8w1d(i)," qflx=",qflx_lnd(i)," hflx=",hflx_lnd(i)," hpbl=",pblh(i) ENDIF -#endif GZ1OZ0_lnd(I)= LOG((ZA(I)+ZNTstoch_lnd(I))/ZNTstoch_lnd(I)) GZ1OZt_lnd(I)= LOG((ZA(I)+ZNTstoch_lnd(i))/ZT_lnd(i)) @@ -1508,7 +1500,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ZOL(I)=MAX(ZOL(I),0.0_kind_phys) ZOL(I)=MIN(ZOL(I),20._kind_phys) -#ifndef _OPENACC IF (debug_code >= 1) THEN IF (ZNTstoch_wat(i) < 1E-8 .OR. Zt_wat(i) < 1E-10) THEN write(0,*)"===(wet) capture bad input in mynn sfc layer, i=:",i @@ -1519,7 +1510,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & " dz=",dz8w1d(i)," qflx=",qflx(i)," hflx=",hflx(i)," hpbl=",pblh(i) ENDIF ENDIF -#endif !Use Pedros iterative function to find z/L !zol(I)=zolri(rb_wat(I),ZA(I),ZNTstoch_wat(I),ZT_wat(I),ZOL(I),psi_opt) @@ -1577,7 +1567,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ZOL(I)=MAX(ZOL(I),-20.0_kind_phys) ZOL(I)=MIN(ZOL(I),0.0_kind_phys) -#ifndef _OPENACC IF (debug_code >= 1) THEN IF (ZNTstoch_wat(i) < 1E-8 .OR. Zt_wat(i) < 1E-10) THEN write(0,*)"===(wet) capture bad input in mynn sfc layer, i=:",i @@ -1588,7 +1577,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & " dz=",dz8w1d(i)," qflx=",qflx(i)," hflx=",hflx(i)," hpbl=",pblh(i) ENDIF ENDIF -#endif !Use Pedros iterative function to find z/L !zol(I)=zolri(rb_wat(I),ZA(I),ZNTstoch_wat(I),ZT_wat(I),ZOL(I),psi_opt) @@ -1649,7 +1637,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ZOL(I)=MAX(ZOL(I),0.0_kind_phys) ZOL(I)=MIN(ZOL(I),20._kind_phys) -#ifndef _OPENACC IF (debug_code >= 1) THEN IF (ZNTstoch_lnd(i) < 1E-8 .OR. Zt_lnd(i) < 1E-10) THEN write(0,*)"===(land) capture bad input in mynn sfc layer, i=:",i @@ -1660,7 +1647,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & " dz=",dz8w1d(i)," qflx=",qflx(i)," hflx=",hflx(i)," hpbl=",pblh(i) ENDIF ENDIF -#endif !Use Pedros iterative function to find z/L !zol(I)=zolri(rb_lnd(I),ZA(I),ZNTstoch_lnd(I),ZT_lnd(I),ZOL(I),psi_opt) @@ -1717,7 +1703,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ZOL(I)=MAX(ZOL(I),-20.0_kind_phys) ZOL(I)=MIN(ZOL(I),0.0_kind_phys) -#ifndef _OPENACC IF (debug_code >= 1) THEN IF (ZNTstoch_lnd(i) < 1E-8 .OR. Zt_lnd(i) < 1E-10) THEN write(0,*)"===(land) capture bad input in mynn sfc layer, i=:",i @@ -1728,7 +1713,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & " dz=",dz8w1d(i)," qflx=",qflx(i)," hflx=",hflx(i)," hpbl=",pblh(i) ENDIF ENDIF -#endif !Use Pedros iterative function to find z/L !zol(I)=zolri(rb_lnd(I),ZA(I),ZNTstoch_lnd(I),ZT_lnd(I),ZOL(I),psi_opt) @@ -1788,7 +1772,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ZOL(I)=MAX(ZOL(I),0.0_kind_phys) ZOL(I)=MIN(ZOL(I),20._kind_phys) -#ifndef _OPENACC IF (debug_code >= 1) THEN IF (ZNTstoch_ice(i) < 1E-8 .OR. Zt_ice(i) < 1E-10) THEN write(0,*)"===(ice) capture bad input in mynn sfc layer, i=:",i @@ -1799,7 +1782,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & " dz=",dz8w1d(i)," qflx=",qflx(i)," hflx=",hflx(i)," hpbl=",pblh(i) ENDIF ENDIF -#endif !Use Pedros iterative function to find z/L !zol(I)=zolri(rb_ice(I),ZA(I),ZNTstoch_ice(I),ZT_ice(I),ZOL(I),psi_opt) @@ -1856,7 +1838,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ZOL(I)=MAX(ZOL(I),-20.0_kind_phys) ZOL(I)=MIN(ZOL(I),0.0_kind_phys) -#ifndef _OPENACC IF (debug_code >= 1) THEN IF (ZNTstoch_ice(i) < 1E-8 .OR. Zt_ice(i) < 1E-10) THEN write(0,*)"===(ice) capture bad input in mynn sfc layer, i=:",i @@ -1867,7 +1848,6 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & " dz=",dz8w1d(i)," qflx=",qflx(i)," hflx=",hflx(i)," hpbl=",pblh(i) ENDIF ENDIF -#endif !Use Pedros iterative function to find z/L !zol(I)=zolri(rb_ice(I),ZA(I),ZNTstoch_ice(I),ZT_ice(I),ZOL(I),psi_opt) @@ -2031,7 +2011,13 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ENDIF #endif -#ifndef _OPENACC +!$acc serial present(wet, dry, icy, & +!$acc PSIM, PSIH, CPM, RHO1D, ZOL, wspd, MOL, & +!$acc wstar, qstar, THV1D, HFX, MAVAIL, QVSH, & +!$acc THVSK_wat, THVSK_lnd, THVSK_ice, & +!$acc UST_wat, UST_lnd, UST_ice, & +!$acc ZNTstoch_wat, ZNTstoch_lnd, ZNTstoch_ice, & +!$acc zt_wat, zt_lnd, zt_ice) IF (debug_code == 2) THEN DO I=its,ite IF(wet(i))write(*,*)"==== AT END OF MAIN LOOP, i=",i, "(wet)" @@ -2052,7 +2038,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & write(*,*)"=============================================" ENDDO ! end i-loop ENDIF -#endif +!$acc end serial !---------------------------------------------------------- ! COMPUTE SURFACE HEAT AND MOISTURE FLUXES @@ -2238,14 +2224,12 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ENDIF -#ifndef _OPENACC IF (debug_code > 1) THEN write(*,*)"QFX=",QFX(I),"FLQC=",FLQC(I) if(icy(i))write(*,*)"ice, MAVAIL:",MAVAIL(I)," u*=",UST_ice(I)," psiq=",PSIQ_ice(i) if(dry(i))write(*,*)"lnd, MAVAIL:",MAVAIL(I)," u*=",UST_lnd(I)," psiq=",PSIQ_lnd(i) if(wet(i))write(*,*)"ocn, MAVAIL:",MAVAIL(I)," u*=",UST_wat(I)," psiq=",PSIQ_wat(i) ENDIF -#endif ! The exchange coefficient for cloud water is assumed to be the ! same as that for heat. CH is multiplied by WSPD. @@ -2397,17 +2381,17 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & !----------------------------------------------------- ! DEBUG - SUSPICIOUS VALUES !----------------------------------------------------- +!$acc serial present(dry, wet, icy, CPM, MAVAIL, & +!$acc HFX, LH, wstar, RHO1D, PBLH, ZOL, ZA, MOL, & +!$acc PSIM, PSIH, WSTAR, T1D, TH1D, THV1D, QVSH, & +!$acc UST_wat, UST_lnd, UST_ice, & +!$acc THSK_wat, THSK_lnd, THSK_ice, & +!$acc THVSK_wat, THVSK_lnd, THVSK_ice, & +!$acc ZNTstoch_wat, ZNTstoch_lnd, ZNTstoch_ice, & +!$acc ZT_wat, ZT_lnd, ZT_ice, & +!$acc QSFC_wat, QSFC_lnd, QSFC_ice, & +!$acc PSIX_wat, PSIX_lnd, PSIX_ice) IF ( debug_code == 2) THEN - !$acc parallel loop present(dry, wet, icy, CPM, MAVAIL, & - !$acc HFX, LH, wstar, RHO1D, PBLH, ZOL, ZA, MOL, & - !$acc PSIM, PSIH, WSTAR, T1D, TH1D, THV1D, QVSH, & - !$acc UST_wat, UST_lnd, UST_ice, & - !$acc THSK_wat, THSK_lnd, THSK_ice, & - !$acc THVSK_wat, THVSK_lnd, THVSK_ice, & - !$acc ZNTstoch_wat, ZNTstoch_lnd, ZNTstoch_ice, & - !$acc ZT_wat, ZT_lnd, ZT_ice, & - !$acc QSFC_wat, QSFC_lnd, QSFC_ice, & - !$acc PSIX_wat, PSIX_lnd, PSIX_ice) DO I=its,ite yesno = 0 IF (compute_flux) THEN @@ -2511,6 +2495,7 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & ENDIF ENDDO ! end i-loop ENDIF ! end debug option +!$acc end serial !$acc exit data copyout(CPM, FLHC, FLQC, CHS, CH, CHS2, CQS2,& !$acc USTM, wstar, qstar, ZOL, MOL, RMOL, & @@ -2542,7 +2527,8 @@ SUBROUTINE SFCLAY1D_mynn(flag_iter, & !$acc MAVAIL, PBLH, PSFCPA, z0pert, ztpert, & !$acc QV1D, U1D, V1D, U1D2, V1D2, T1D, P1D, & !$acc rstoch1D, sigmaf, shdmax, vegtype, & -!$acc dz2w1d, dz8w1d, snowh_lnd, & +!$acc dz2w1d, dz8w1d, & +!$acc snowh_wat, snowh_lnd, snowh_ice, & !$acc tskin_wat, tskin_lnd, tskin_ice, & !$acc tsurf_wat, tsurf_lnd, tsurf_ice)