-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use non-login shell for launch_FV3LAM_wflow.sh
; remove support for WCOSS_CRAY
; fix cron capability for tcsh
users on Cheyenne
#675
Conversation
…calls since that's the proper calling method; edit comments.
* Add a wrapper (source_machine_file.sh) for sourcing the machine file that allows other commands common to all machines to be called. * Change the scalar variable MODULE_INIT_PATH in the machine files to the array variable ENV_INIT_SCRIPTS_FPS that specifies the list of system scripts that need to be sourced (e.g. to make the "module" command available in a given script). This is needed because on Cheyenne, at least two system scripts need to be sourced (to enable "module" and "qsub"). * Move the "ulimit" commands at the ends of the machine files into the new variable PRE_TASK_CMDS so that they are not called every time the machine file is sourced. They will be called only if a given script issues an "eval ${PRE_TASK_CMDS}" (which all the ex-scripts will do).
…se the wrapper source_machine_file.sh; (2) Use "eval" to evaluate the contents of PRE_TASK_CMDS.
…apper source_machine_file.sh.
… can be sourced and the valid values for a boolean can be made available to any other script.
…or tcsh users). Details: * Introduce new function get_crontab_contents() that takes as input whether or not the calling script is itself being called from a cron job and returns (1) the path to the appropriate crontab command and (2) the contents of the user's cron table. * Such a function is needed because on Cheyenne, the location of the crontab command is different depending on whether or not the script that's calling crontab is itself called from a cron job (because on Cheyenne, "crontab" is containerized, and that complicates things). * Use get_crontab_contents() in generate_FV3LAM_wflow.sh and launch_FV3LAM_wflow.sh (instead of simply calling "crontab" because the latter approach doesn't work on Cheyenne, at least not with users whose login shell is tcsh). * Add "called_from_cron" as an optional argument to launch_FV3LAM_wflow.sh [so that it can then be passed on to get_crontab_contents()]. This argument is only used in the cron job that relaunches the workflow (which is created only if USE_CRON_TO_RELAUNCH is set to "TRUE"). * Having an optional argument like this seems to be the best way to tell launch_FV3LAM_wflow.sh whether or not it is running from a cron job. * launch_FV3LAM_wflow.sh can still be called from the command line without any arguments (since the default value of "called_from_cron" is "FALSE").
…tem scripts can be sourced in a given script (currently, only "module" is initialized). Details: * Introduce the new function init_env() that initializes the envrionment of a script by sourcing necessary system scripts. The full paths to these system scripts are specified in the array ENV_INIT_SCRIPTS_FPS in the machine files. * This function is needed because (1) this sourcing needs to be done in a couple of different scripts in the SRW App and (2) on some machines (e.g. Cheyenne), more than one system script may need to be sourced. * Use the new init_env() function in launch_FV3LAM_wflow.sh and load_modules_run_task.sh. * In load_modules_run_task.sh, init_env() replaces sourcing of only the system script that defines the "module" command. That is because on Cheyenne, in addition to the "module" command, the "qsub" command needs to be defined/initialized (by sourcing a second system script named pbs.sh).
MODULE_INIT_PATH=${MODULE_INIT_PATH:-/glade/u/apps/ch/opt/lmod/8.1.7/lmod/8.1.7/init/sh} | ||
# System scripts to source to initialize various commands within workflow | ||
# scripts (e.g. "module"). | ||
if [ -z ${ENV_INIT_SCRIPTS_FPS:-""} ]; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I generalized the MODULE_INIT_PATH
variable to the array ENV_INIT_SCRIPTS_FPS
(in all the macine files) because on some machines, more than one system script may need to be sourced to enable the module
and other commands.
fi | ||
|
||
# Commands to run at the start of each workflow task. | ||
PRE_TASK_CMDS='{ ulimit -s unlimited; ulimit -a; }' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Placed the commands that were previously called at the end of the machine files in the new variable PRE_TASK_COMMANDS
so they aren't run every time the machine file is sourced but only as necessary.
scripts/exregional_make_grid.sh
Outdated
@@ -107,8 +107,8 @@ case "$MACHINE" in | |||
;; | |||
|
|||
*) | |||
source ${MACHINE_FILE} | |||
;; | |||
source $USHDIR/source_machine_file.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a wrapper around sourcing of the machine file in order to include actions we may want to do for all those files (e.g. do set -x
depending on the DEBUG
variable). See the new file source_machine_file.sh
below.
scripts/exregional_make_grid.sh
Outdated
source ${MACHINE_FILE} | ||
;; | ||
source $USHDIR/source_machine_file.sh | ||
eval ${PRE_TASK_CMDS} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved the commands (ulimit ...
) at the end of the machine files into the new variable PRE_TASK_CMDS
so that they are not automatically executed when the machine file is sourced. Execute them here though.
# Other. | ||
# | ||
#----------------------------------------------------------------------- | ||
# |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Convenience variable since this array shows up in multiple places. Will use it in valid_param_vals.sh
in a future PR.
@@ -59,6 +59,7 @@ ushdir="${scrfunc_dir}" | |||
#----------------------------------------------------------------------- | |||
# | |||
. $ushdir/source_util_funcs.sh | |||
. $ushdir/get_crontab_contents.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This new function is needed because of the way the crontab
function is set up on Cheyenne. The path to the command depends on whether or not it is being called from a cron job. Also, on Cheyenne, crontab
adds header lines to its output. This function strips out those headers so the user's cron table doesn't unnecessarily grow.
# | ||
#----------------------------------------------------------------------- | ||
# | ||
valid_args=( \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FV3LAM_wflow.sh
now needs an optional argument (called_from_cron
) in order to tell whether or not it is being called from a cron job. This is needed on Cheyenne because the path to the crontab
command (which is called in this script) depends on this distinction.
# | ||
esac | ||
env_init_scripts_fps_str="( "$(printf "\"%s\" " "${ENV_INIT_SCRIPTS_FPS[@]}")")" | ||
init_env env_init_scripts_fps="${env_init_scripts_fps_str}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced case
statement with the new init_env()
function that sources the necessary system scripts to make module
and other commands (e.g. qsub
on Cheyenne) available. I created a separate function because the block of code is also needed in launch_FV3LAM_wflow.sh
.
#----------------------------------------------------------------------- | ||
# | ||
env_init_scripts_fps_str="( "$(printf "\"%s\" " "${ENV_INIT_SCRIPTS_FPS[@]}")")" | ||
init_env env_init_scripts_fps="${env_init_scripts_fps_str}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new init_env()
function sources one or more system scripts (now specified by the array ENV_INIT_SCRIPTS_FPS
in the machine files) that enable use of certain commands like module
(and qsub
on Cheyenne). These commands are not defined by default if not using a login shell.
@@ -1,4 +1,4 @@ | |||
#!/bin/bash -l | |||
#!/bin/bash |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the -l
flag so that this script runs in a non-login shell. This means a user's specific environment doesn't get loaded (e.g. the files ~/.bash_profile
, ~/.bash_login
, and ~/.profile
aren't sourced). This is the only remaining workflow script in which the -l
is (was) still used. But removing -l
also means /etc/profile
isn't sourced, which means certain commands (like module
) will have to be initialized (which is what the init_env
function and ENV_INIT_SCRIPTS_FPS
variable now do).
…. Fix comments and informational messages.
…gfix/crontab_cheyenne
…sourcing "/etc/profile" is enough to make both the "module" and "qsub" commands (and probably all other system-supported commands) available in non-login scripts.
…" (and other commands) to work.
… for-loop so it's different than the variable i used (and unset) by the system script on Hera.
…gfix/crontab_cheyenne
launch_FV3LAM_wflow.sh
launch_FV3LAM_wflow.sh
launch_FV3LAM_wflow.sh
@gsketefian, this pr works on the wcoss dell and cray well now. However, it doesn't work on hera: ./generate_FV3LAM_wflow.sh: line 509: crontab_contents: unbound variable |
@chan-hoo, thanks for getting this working on the WCOSS Cray and Dell. I think @gsketefian was able to get this to run on Hera today without any problem, so I'll let him comment. A note on the WCOSS Cray, I spoke with Jacob today regarding the move to WCOSS2 that's coming soon. Since the WCOSS Cray will be retired this summer, we should be able to wind down our support for that platform at this point. |
@chan-hoo I'm going to include parts of our email exchange here (and my responses) so others understand the background.
It's strange that on the cray, using
doesn't work (i.e. the exit status of
works (i.e. returns a non-zero exit status if the line is not found). Since both commands use
I am reluctant to use this fix because it defeats the purpose of having the new function But -- since it is such a hassle going back and forth to debug this on the cray (because I don't have access to it) and since @JacobCarley-NOAA gave the ok to remove support for the cray from the SRW App, unless you have an objection I am going to go ahead and remove the cray in my next commit. Btw, I did test this latest version on Hera and it worked for me. I will re-test once I make the latest changes to remove the cray. |
@gsketefian, I don't have any objection for the change at all. It will make my work easier than before :) |
… whether the RUN_CMD_... variable is empty, i.e. -z "${RUN_CMD_...}". This is needed because on Cheyenne, not having the double quotes generates an error when RUN_CMD_... consists of a command that contains spaces (e.g. "mpirun -np ...").
launch_FV3LAM_wflow.sh
; fix cron capability for tcsh
users on Cheyennelaunch_FV3LAM_wflow.sh
; remove support for WCOSS_CRAY
; fix cron capability for tcsh
users on Cheyenne
@chan-hoo Ok, great! Makes life a lot easier! I just pushed a commit that removes support for WCOSS_CRAY. Do you mind testing on WCOSS_DELL_P3 one more time? @willmayfield Do you mind testing the latest (hash 930f4cc) on Cheyenne? I tested with I tested on Hera and it works. Currently testing on Jet. I can't test on Orion today since it's down for maintenance, so I will try tomorrow. Thanks all. |
|
||
nprocs=$(( NNODES_MAKE_ICS*PPN_MAKE_ICS )) | ||
|
||
if [ -z ${RUN_CMD_UTILS:-} ] ; then | ||
if [ -z "${RUN_CMD_UTILS:-}" ] ; then |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to have this in (double) quotes because on Cheyenne, if RUN_CMD_UTILS
contains spaces (e.g. mpirun -np ...
), the -z
operator will fail with a too many arguments
error. Same in all the other ex-scripts.
@@ -177,7 +177,7 @@ MODULES_RUN_TASK_FP script. | |||
|
|||
&RSRV_DEFAULT; | |||
<command>&LOAD_MODULES_RUN_TASK_FP; "&MAKE_GRID_TN;" "&JOBSDIR;/JREGIONAL_MAKE_GRID"</command> | |||
{%- if machine in ["WCOSS_DELL_P3", "WCOSS_CRAY"] %} | |||
{%- if machine in ["WCOSS_DELL_P3"] %} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing support for WCOSS_CRAY.
@gsketefian, it works well on the dell. |
Great, thanks @chan-hoo! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving based on previous assessment. We need to have a final test run on Cheyenne to confirm things work. @willmayfield or @mkavulich, do either of you have time to test (with tcsh and/or bash)?
@JeffBeck-NOAA I'll be testing it today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tested on Cheyenne with bash and tcsh.
DESCRIPTION OF CHANGES:
This PR:
launch_FV3LAM_wflow.sh
script run in a non-login shell to avoid sourcing of user-specific login files (e.g~/.bash_profile
,~/.bash_login
, and~/.profile
).module
command but others (by sourcing/etc/profile
). This is necessary if not using a login shell inlaunch_FV3LAM_wflow.sh
.launch_FV3LAM_wflow.sh
that specifies whether it is being called from acron
job. This is necessary because on Cheyenne, the version ofcrontab
to use depends on whether or not the parent script is being called from acron
job.WCOSS_CRAY
since it will be decommissioned soon and is difficult to support.case
statements from ex-scripts.One of the consequences of this use of non-login shell for
launch_FV3LAM_wflow.sh
is that the workflow'scron
capability now works fortcsh
users on Cheyenne.TESTS CONDUCTED:
Ran the two WE2E tests
deactivate_tasks
andgrid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2
on Hera, Jet, Cheyenne, and Orion. All passed.Note that I used
tcsh
as my login shell on Cheyenne for the purposes of testing this PR (and the tests succeeded).ISSUE (optional):
Resolves Issue #603.
CONTRIBUTORS:
@chan-hoo, @willmayfield, @mkavulich