Features/jet partitions (#253)
## DESCRIPTION OF CHANGES: 

Allow forecast jobs on Jet and Hera to use resources more efficiently, and allow Jet jobs to run on more partitions, which will be helpful during the HFIP Allocation Season. This involved adding Slurm tags to the Rocoto XML and changing the pre-defined grids' layouts and blocksize to be consistent with 

Other changes:

- Converted nthreads to a configurable option in model_configure
- Updated layout and blocksize for GSD_HRRR25km pre-defined grid. 
- Changed the dependency for post jobs so that completion of run_fcst can also trigger them.
- Reduced the number of tries for all tasks to 1. Anything higher wastes resources and time, especially when forecast tasks in the test suite are known to fail.
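
The first item relies on Bash default-value expansion: the run script passes `nthreads=${OMP_NUM_THREADS:-1}`, so a machine that does not set `OMP_NUM_THREADS` falls back to one thread. A minimal sketch of that behavior follows; `resolve_nthreads` is a hypothetical helper used only for illustration and is not part of the workflow.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the workflow): echo the thread count
# that ${OMP_NUM_THREADS:-1} yields in the current environment.
resolve_nthreads() {
  echo "${OMP_NUM_THREADS:-1}"
}

unset OMP_NUM_THREADS
unset_result="$(resolve_nthreads)"   # variable unset: falls back to 1

OMP_NUM_THREADS=4
set_result="$(resolve_nthreads)"     # variable set: its value is used

OMP_NUM_THREADS=""
empty_result="$(resolve_nthreads)"   # empty string: ":-" also falls back to 1

echo "unset=${unset_result} set=${set_result} empty=${empty_result}"
```

Because the expansion uses `:-` rather than `-`, an empty value is treated the same as an unset one.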


The concept adopted here is based on tests performed by Sam Trahan on the AVID real-time setup of HRRR. He found that a more efficient way to run the forecast model is by using the following settings:

    <partition>kjet,xjet,sjet,vjet</partition>
    <cores>240</cores>
    <native>--cpus-per-task 4</native>
    <native>--exclusive</native>

along with `OMP_NUM_THREADS=4`

To adopt these settings in a more general way, he also provided two rules to follow when choosing layouts and blocksize, in order to avoid crashes:

    blocksize <= grid_size_y / layout_y / 2
    blocksize >= grid_size_x / layout_x / 2
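
The two rules above can be checked before submitting a job. The sketch below is illustrative only: `check_blocksize` is a hypothetical helper, the grid dimensions in the usage lines are made-up values rather than those of any pre-defined grid, and the inequalities are cross-multiplied so that Bash integer arithmetic does not truncate the divisions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if both rules hold.
#   blocksize <= grid_size_y / layout_y / 2
#   blocksize >= grid_size_x / layout_x / 2
# Usage: check_blocksize GRID_SIZE_X GRID_SIZE_Y LAYOUT_X LAYOUT_Y BLOCKSIZE
check_blocksize() {
  local nx=$1 ny=$2 layout_x=$3 layout_y=$4 blocksize=$5
  # Multiply through by the (positive) denominators instead of dividing.
  (( 2 * blocksize * layout_y <= ny )) || return 1
  (( 2 * blocksize * layout_x >= nx )) || return 1
  return 0
}

# Made-up grid sizes, for illustration only.
if check_blocksize 200 240 10 12 10; then ok_case="pass"; else ok_case="fail"; fi
if check_blocksize 200 240 10 12 12; then big_case="pass"; else big_case="fail"; fi
echo "blocksize=10: ${ok_case}, blocksize=12: ${big_case}"
```

With these made-up numbers, a blocksize of 10 satisfies both rules, while 12 violates the first one.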

## TESTS CONDUCTED: 
I ran the test suite (except for the nco tests) on both Jet and Hera. Applying the above techniques, I fixed three tests that were already failing on develop: regional_003, regional_004, and new_JPgrid. (After merging with develop, I realized that new_JPgrid may have already been fixed.)

## CONTRIBUTORS: 
@SamuelTrahanNOAA, Dom Heinzeller, and several colleagues in GSL/ATD.
christinaholtNOAA authored Jul 29, 2020
1 parent 5d75a73 commit a3d5e1b
Showing 16 changed files with 59 additions and 35 deletions.
6 changes: 5 additions & 1 deletion scripts/exregional_run_fcst.sh
@@ -107,13 +107,15 @@ case $MACHINE in
ulimit -a
APRUN="srun"
LD_LIBRARY_PATH="${UFS_WTHR_MDL_DIR}/FV3/ccpp/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
+OMP_NUM_THREADS=4
;;
#
"JET")
ulimit -s unlimited
ulimit -a
APRUN="srun"
LD_LIBRARY_PATH="${UFS_WTHR_MDL_DIR}/FV3/ccpp/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
+OMP_NUM_THREADS=4
;;
#
"ODIN")
@@ -427,6 +429,7 @@ fi
#
create_model_config_file \
cdate="$cdate" \
+nthreads=${OMP_NUM_THREADS:-1} \
run_dir="${run_dir}" || print_err_msg_exit "\
Call to function to create a model configuration file for the current
cycle's (cdate) run directory (run_dir) failed:
@@ -456,8 +459,9 @@ fi
#-----------------------------------------------------------------------
#
export KMP_AFFINITY=scatter
-export OMP_NUM_THREADS=1 #Needs to be 1 for dynamic build of CCPP with GFDL fast physics, was 2 before.
+export OMP_NUM_THREADS=${OMP_NUM_THREADS:-1} #Needs to be 1 for dynamic build of CCPP with GFDL fast physics, was 2 before.
export OMP_STACKSIZE=1024m
+
#
#-----------------------------------------------------------------------
#
4 changes: 2 additions & 2 deletions tests/baseline_configs/config.new_JPgrid.sh
@@ -52,8 +52,8 @@ JPgrid_WIDE_HALO_WIDTH=6
DT_ATMOS="40"

LAYOUT_X="8"
-LAYOUT_Y="6"
-BLOCKSIZE="26"
+LAYOUT_Y="12"
+BLOCKSIZE="13"

if [ "$QUILTING" = "TRUE" ]; then
WRTCMP_write_groups="1"
2 changes: 2 additions & 0 deletions ush/create_model_config_file.sh
@@ -48,6 +48,7 @@ function create_model_config_file() {
local valid_args=(
cdate \
run_dir \
+nthreads \
)
process_args valid_args "$@"
#
@@ -115,6 +116,7 @@ cycle directory..."
set_file_param "${model_config_fp}" "ncores_per_node" "${NCORES_PER_NODE}"
set_file_param "${model_config_fp}" "quilting" "${dot_quilting_dot}"
set_file_param "${model_config_fp}" "print_esmf" "${dot_print_esmf_dot}"
+set_file_param "${model_config_fp}" "atmos_nthreads" "${nthreads:-1}"
#
#-----------------------------------------------------------------------
#
7 changes: 7 additions & 0 deletions ush/generate_FV3SAR_wflow.sh
@@ -113,6 +113,7 @@ settings="\
'queue_hpss_tag': ${QUEUE_HPSS_TAG}
'queue_fcst': ${QUEUE_FCST}
'queue_fcst_tag': ${QUEUE_FCST_TAG}
+'machine': ${MACHINE}
#
# Workflow task names.
#
@@ -138,6 +139,12 @@ settings="\
'nnodes_run_fcst': ${NNODES_RUN_FCST}
'nnodes_run_post': ${NNODES_RUN_POST}
#
+# Number of cores used for a task
+#
+'ncores_run_fcst': ${PE_MEMBER01}
+'native_run_fcst': --cpus-per-task 4 --exclusive
+'partition_run_fcst': sjet,vjet,kjet,xjet
+#
# Number of logical processes per node for each task. If running without
# threading, this is equal to the number of MPI processes per node.
#
16 changes: 8 additions & 8 deletions ush/set_predef_grid_params.sh
@@ -347,9 +347,9 @@ predefined domain:

DT_ATMOS="300"

-LAYOUT_X="2"
+LAYOUT_X="20"
LAYOUT_Y="2"
-BLOCKSIZE="2"
+BLOCKSIZE="10"

if [ "$QUILTING" = "TRUE" ]; then
WRTCMP_write_groups="1"
@@ -460,9 +460,9 @@ predefined domain:

DT_ATMOS="40"

-LAYOUT_X="36"
-LAYOUT_Y="24"
-BLOCKSIZE="26"
+LAYOUT_X="18"
+LAYOUT_Y="12"
+BLOCKSIZE="46"

QUILTING="TRUE"

@@ -497,9 +497,9 @@ predefined domain:

DT_ATMOS="40"

-LAYOUT_X="34"
-LAYOUT_Y="24"
-BLOCKSIZE="34"
+LAYOUT_X="18"
+LAYOUT_Y="12"
+BLOCKSIZE="46"

if [ "$QUILTING" = "TRUE" ]; then
WRTCMP_write_groups="1"
39 changes: 25 additions & 14 deletions ush/templates/FV3SAR_wflow.xml
@@ -116,7 +116,7 @@ MODULES_RUN_TASK_FP script.
************************************************************************
************************************************************************
-->
-<task name="&MAKE_GRID_TN;" cycledefs="at_start" maxtries="4">
+<task name="&MAKE_GRID_TN;" cycledefs="at_start" maxtries="1">

&RSRV_DEFAULT;
<command>&LOAD_MODULES_RUN_TASK_FP; "&MAKE_GRID_TN;" "&JOBSDIR;/JREGIONAL_MAKE_GRID"</command>
@@ -136,7 +136,7 @@ MODULES_RUN_TASK_FP script.
************************************************************************
************************************************************************
-->
-<task name="&MAKE_OROG_TN;" cycledefs="at_start" maxtries="4">
+<task name="&MAKE_OROG_TN;" cycledefs="at_start" maxtries="1">

&RSRV_DEFAULT;
<command>&LOAD_MODULES_RUN_TASK_FP; "&MAKE_OROG_TN;" "&JOBSDIR;/JREGIONAL_MAKE_OROG"</command>
@@ -164,7 +164,7 @@ MODULES_RUN_TASK_FP script.
************************************************************************
************************************************************************
-->
-<task name="&MAKE_SFC_CLIMO_TN;" cycledefs="at_start" maxtries="2">
+<task name="&MAKE_SFC_CLIMO_TN;" cycledefs="at_start" maxtries="1">

&RSRV_DEFAULT;
<command>&LOAD_MODULES_RUN_TASK_FP; "&MAKE_SFC_CLIMO_TN;" "&JOBSDIR;/JREGIONAL_MAKE_SFC_CLIMO"</command>
@@ -198,7 +198,7 @@ MODULES_RUN_TASK_FP script.
************************************************************************
************************************************************************
-->
-<task name="&GET_EXTRN_ICS_TN;" maxtries="3">
+<task name="&GET_EXTRN_ICS_TN;" maxtries="1">

&RSRV_HPSS;
<command>&LOAD_MODULES_RUN_TASK_FP; "&GET_EXTRN_ICS_TN;" "&JOBSDIR;/JREGIONAL_GET_EXTRN_MDL_FILES"</command>
@@ -220,7 +220,7 @@ MODULES_RUN_TASK_FP script.
************************************************************************
************************************************************************
-->
-<task name="&GET_EXTRN_LBCS_TN;" maxtries="3">
+<task name="&GET_EXTRN_LBCS_TN;" maxtries="1">

&RSRV_HPSS;
<command>&LOAD_MODULES_RUN_TASK_FP; "&GET_EXTRN_LBCS_TN;" "&JOBSDIR;/JREGIONAL_GET_EXTRN_MDL_FILES"</command>
@@ -252,7 +252,7 @@ MODULES_RUN_TASK_FP script.
{%- endfor %} </var>
{%- endif %}

-<task name="&MAKE_ICS_TN;{{ uscore_ensmem_name }}" maxtries="3">
+<task name="&MAKE_ICS_TN;{{ uscore_ensmem_name }}" maxtries="1">

&RSRV_DEFAULT;
<command>&LOAD_MODULES_RUN_TASK_FP; "&MAKE_ICS_TN;" "&JOBSDIR;/JREGIONAL_MAKE_ICS"</command>
@@ -294,7 +294,7 @@ MODULES_RUN_TASK_FP script.
************************************************************************
************************************************************************
-->
-<task name="&MAKE_LBCS_TN;{{ uscore_ensmem_name }}" maxtries="3">
+<task name="&MAKE_LBCS_TN;{{ uscore_ensmem_name }}" maxtries="1">

&RSRV_DEFAULT;
<command>&LOAD_MODULES_RUN_TASK_FP; "&MAKE_LBCS_TN;" "&JOBSDIR;/JREGIONAL_MAKE_LBCS"</command>
@@ -336,13 +336,21 @@ MODULES_RUN_TASK_FP script.
************************************************************************
************************************************************************
-->
-<task name="&RUN_FCST_TN;{{ uscore_ensmem_name }}" maxtries="3">
+<task name="&RUN_FCST_TN;{{ uscore_ensmem_name }}" maxtries="1">

&RSRV_FCST;
<command>&LOAD_MODULES_RUN_TASK_FP; "&RUN_FCST_TN;" "&JOBSDIR;/JREGIONAL_RUN_FCST"</command>
+{% if machine in ["JET", "HERA"] %}
+<cores>{{ ncores_run_fcst }}</cores>
+<native>{{ native_run_fcst }}</native>
+{% if machine == "JET" %}
+<partition>{{ partition_run_fcst }} </partition>
+{% endif %}
+{% else %}
<nodes>{{ nnodes_run_fcst }}:ppn={{ ppn_run_fcst }}</nodes>
-<walltime>{{ wtime_run_fcst }}</walltime>
<nodesize>&NCORES_PER_NODE;</nodesize>
+{% endif %}
+<walltime>{{ wtime_run_fcst }}</walltime>
<jobname>&RUN_FCST_TN;{{ uscore_ensmem_name }}</jobname>
<join><cyclestr>&LOGDIR;/&RUN_FCST_TN;{{ uscore_ensmem_name }}_@Y@m@d@H.log</cyclestr></join>

@@ -369,7 +377,7 @@ MODULES_RUN_TASK_FP script.

<var name="fhr"> {% for h in range(0, fcst_len_hrs+1) %}{{ " %02d" % h }}{% endfor %} </var>

-<task name="&RUN_POST_TN;{{ uscore_ensmem_name }}_f#fhr#" maxtries="2">
+<task name="&RUN_POST_TN;{{ uscore_ensmem_name }}_f#fhr#" maxtries="1">

&RSRV_DEFAULT;
<command>&LOAD_MODULES_RUN_TASK_FP; "&RUN_POST_TN;" "&JOBSDIR;/JREGIONAL_RUN_POST"</command>
@@ -388,10 +396,13 @@ MODULES_RUN_TASK_FP script.
<envar><name>fhr</name><value>#fhr#</value></envar>

<dependency>
-<and>
-<datadep age="05:00"><cyclestr>&CYCLE_BASEDIR;/@Y@m@d@H{{ slash_ensmem_subdir }}/dynf0#fhr#.nc</cyclestr></datadep>
-<datadep age="05:00"><cyclestr>&CYCLE_BASEDIR;/@Y@m@d@H{{ slash_ensmem_subdir }}/phyf0#fhr#.nc</cyclestr></datadep>
-</and>
+<or>
+<taskdep task="&RUN_FCST_TN;"/>
+<and>
+<datadep age="05:00"><cyclestr>&CYCLE_BASEDIR;/@Y@m@d@H{{ slash_ensmem_subdir }}/dynf0#fhr#.nc</cyclestr></datadep>
+<datadep age="05:00"><cyclestr>&CYCLE_BASEDIR;/@Y@m@d@H{{ slash_ensmem_subdir }}/phyf0#fhr#.nc</cyclestr></datadep>
+</and>
+</or>
</dependency>

</task>
2 changes: 1 addition & 1 deletion ush/templates/model_configure
@@ -13,7 +13,7 @@ dt_atmos: <dt_atmos>
cpl: .false.
calendar: 'julian'
memuse_verbose: .false.
-atmos_nthreads: 2
+atmos_nthreads: <threads>
use_hyper_thread: .false.
ncores_per_node: <ncores_per_node>
debug_affinity: .true.
2 changes: 1 addition & 1 deletion ush/templates/model_configure.FV3_CPT_v0
@@ -13,7 +13,7 @@ dt_atmos: <dt_atmos>
cpl: .false.
calendar: 'julian'
memuse_verbose: .false.
-atmos_nthreads: 2
+atmos_nthreads: <threads>
use_hyper_thread: .false.
ncores_per_node: <ncores_per_node>
debug_affinity: .true.
2 changes: 1 addition & 1 deletion ush/templates/model_configure.FV3_GFS_2017_gfdlmp
@@ -13,7 +13,7 @@ dt_atmos: <dt_atmos>
cpl: .false.
calendar: 'julian'
memuse_verbose: .false.
-atmos_nthreads: 2
+atmos_nthreads: <threads>
use_hyper_thread: .false.
ncores_per_node: <ncores_per_node>
debug_affinity: .true.
2 changes: 1 addition & 1 deletion ush/templates/model_configure.FV3_GFS_2017_gfdlmp_regional
@@ -13,7 +13,7 @@ dt_atmos: <dt_atmos>
cpl: .false.
calendar: 'julian'
memuse_verbose: .false.
-atmos_nthreads: 2
+atmos_nthreads: <threads>
use_hyper_thread: .false.
ncores_per_node: <ncores_per_node>
debug_affinity: .true.
2 changes: 1 addition & 1 deletion ush/templates/model_configure.FV3_GFS_v15p2
@@ -13,7 +13,7 @@ dt_atmos: <dt_atmos>
cpl: .false.
calendar: 'julian'
memuse_verbose: .false.
-atmos_nthreads: 2
+atmos_nthreads: <threads>
use_hyper_thread: .false.
ncores_per_node: <ncores_per_node>
debug_affinity: .true.
2 changes: 1 addition & 1 deletion ush/templates/model_configure.FV3_GFS_v16beta
@@ -13,7 +13,7 @@ dt_atmos: <dt_atmos>
cpl: .false.
calendar: 'julian'
memuse_verbose: .false.
-atmos_nthreads: 2
+atmos_nthreads: <threads>
use_hyper_thread: .false.
ncores_per_node: <ncores_per_node>
debug_affinity: .true.
2 changes: 1 addition & 1 deletion ush/templates/model_configure.FV3_GSD_SAR
@@ -13,7 +13,7 @@ dt_atmos: <dt_atmos>
cpl: .false.
calendar: 'julian'
memuse_verbose: .false.
-atmos_nthreads: 2
+atmos_nthreads: <threads>
use_hyper_thread: .false.
ncores_per_node: <ncores_per_node>
debug_affinity: .true.
2 changes: 1 addition & 1 deletion ush/templates/model_configure.FV3_GSD_SAR_v1
@@ -13,7 +13,7 @@ dt_atmos: <dt_atmos>
cpl: .false.
calendar: 'julian'
memuse_verbose: .false.
-atmos_nthreads: 2
+atmos_nthreads: <threads>
use_hyper_thread: .false.
ncores_per_node: <ncores_per_node>
debug_affinity: .true.
2 changes: 1 addition & 1 deletion ush/templates/model_configure.FV3_GSD_v0
@@ -13,7 +13,7 @@ dt_atmos: <dt_atmos>
cpl: .false.
calendar: 'julian'
memuse_verbose: .false.
-atmos_nthreads: 2
+atmos_nthreads: <threads>
use_hyper_thread: .false.
ncores_per_node: <ncores_per_node>
debug_affinity: .true.
2 changes: 1 addition & 1 deletion ush/templates/model_configure.FV3_RRFS_v0
@@ -13,7 +13,7 @@ dt_atmos: <dt_atmos>
cpl: .false.
calendar: 'julian'
memuse_verbose: .false.
-atmos_nthreads: 2
+atmos_nthreads: <threads>
use_hyper_thread: .false.
ncores_per_node: <ncores_per_node>
debug_affinity: .true.
