Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use non-login shell for launch_FV3LAM_wflow.sh; remove support for WCOSS_CRAY; fix cron capability for tcsh users on Cheyenne #675

Merged

Conversation

gsketefian
Copy link
Collaborator

@gsketefian gsketefian commented Feb 7, 2022

DESCRIPTION OF CHANGES:

This PR:

  • Makes the launch_FV3LAM_wflow.sh script run in a non-login shell to avoid sourcing of user-specific login files (e.g ~/.bash_profile, ~/.bash_login, and ~/.profile).
  • Generalizes the loading of system scripts to initialize not only the module command but others (by sourcing /etc/profile). This is necessary if not using a login shell in launch_FV3LAM_wflow.sh.
  • Adds an optional argument to launch_FV3LAM_wflow.sh that specifies whether it is being called from a cron job. This is necessary because on Cheyenne, the version of crontab to use depends on whether or not the parent script is being called from a cron job.
  • Removes support for WCOSS_CRAY since it will be decommissioned soon and is difficult to support.
  • Removes machine-dependent case statements from ex-scripts.

One of the consequences of this use of non-login shell for launch_FV3LAM_wflow.sh is that the workflow's cron capability now works for tcsh users on Cheyenne.

TESTS CONDUCTED:

Ran the two WE2E tests deactivate_tasks and grid_RRFS_CONUS_25km_ics_FV3GFS_lbcs_FV3GFS_suite_GFS_v15p2 on Hera, Jet, Cheyenne, and Orion. All passed.

Note that I used tcsh as my login shell on Cheyenne for the purposes of testing this PR (and the tests succeeded).

ISSUE (optional):

Resolves Issue #603.

CONTRIBUTORS:

@chan-hoo, @willmayfield, @mkavulich

…calls since that's the proper calling method; edit comments.
* Add a wrapper (source_machine_file.sh) for sourcing the machine file that allows other commands common to all machines to be called.
* Change the scalar variable MODULE_INIT_PATH in the machine files to the array variable ENV_INIT_SCRIPTS_FPS that specifies the list of system scripts that need to be sourced (e.g. to make the "module" command available in a given script).  This is needed because on Cheyenne, at least two system scripts need to be sourced (to enable "module" and "qsub").
* Move the "ulimit" commands at the ends of the machine files into the new variable PRE_TASK_CMDS so that they are not called every time the machine file is sourced.  They will be called only if a given script issues an "eval ${PRE_TASK_CMDS}" (which all the ex-scripts will do).
…se the wrapper source_machine_file.sh; (2) Use "eval" to evaluate the contents of PRE_TASK_CMDS.
… can be sourced and the valid values for a boolean can be made available to any other script.
…or tcsh users). Details:

* Introduce new function get_crontab_contents() that takes as input whether or not the calling script is itself being called from a cron job and returns (1) the path to the appropriate crontab command and (2) the contents of the user's cron table.
  * Such a function is needed because on Cheyenne, the location of the crontab command is different depending on whether or not the script that's calling crontab is itself called from a cron job (because on Cheyenne, "crontab" is containerized, and that complicates things).
* Use get_crontab_contents() in generate_FV3LAM_wflow.sh and launch_FV3LAM_wflow.sh (instead of simply calling "crontab" because the latter approach doesn't work on Cheyenne, at least not with users whose login shell is tcsh).
* Add "called_from_cron" as an optional argument to launch_FV3LAM_wflow.sh [so that it can then be passed on to get_crontab_contents()].  This argument is only used in the cron job that relaunches the workflow (which is created only if USE_CRON_TO_RELAUNCH is set to "TRUE").
  * Having an optional argument like this seems to be the best way to tell launch_FV3LAM_wflow.sh whether or not it is running from a cron job.
  * launch_FV3LAM_wflow.sh can still be called from the command line without any arguments (since the default value of "called_from_cron" is "FALSE").
…tem scripts can be sourced in a given script (currently, only "module" is initialized). Details:

* Introduce the new function init_env() that initializes the envrionment of a script by sourcing necessary system scripts.  The full paths to these system scripts are specified in the array ENV_INIT_SCRIPTS_FPS in the machine files.
  * This function is needed because (1) this sourcing needs to be done in a couple of different scripts in the SRW App and (2) on some machines (e.g. Cheyenne), more than one system script may need to be sourced.
* Use the new init_env() function in launch_FV3LAM_wflow.sh and load_modules_run_task.sh.
  * In load_modules_run_task.sh, init_env() replaces sourcing of only the system script that defines the "module" command.  That is because on Cheyenne, in addition to the "module" command, the "qsub" command needs to be defined/initialized (by sourcing a second system script named pbs.sh).
MODULE_INIT_PATH=${MODULE_INIT_PATH:-/glade/u/apps/ch/opt/lmod/8.1.7/lmod/8.1.7/init/sh}
# System scripts to source to initialize various commands within workflow
# scripts (e.g. "module").
if [ -z ${ENV_INIT_SCRIPTS_FPS:-""} ]; then
Copy link
Collaborator Author

@gsketefian gsketefian Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generalized the MODULE_INIT_PATH variable to the array ENV_INIT_SCRIPTS_FPS (in all the macine files) because on some machines, more than one system script may need to be sourced to enable the module and other commands.

fi

# Commands to run at the start of each workflow task.
PRE_TASK_CMDS='{ ulimit -s unlimited; ulimit -a; }'
Copy link
Collaborator Author

@gsketefian gsketefian Feb 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Placed the commands that were previously called at the end of the machine files in the new variable PRE_TASK_COMMANDS so they aren't run every time the machine file is sourced but only as necessary.

@@ -107,8 +107,8 @@ case "$MACHINE" in
;;

*)
source ${MACHINE_FILE}
;;
source $USHDIR/source_machine_file.sh
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a wrapper around sourcing of the machine file in order to include actions we may want to do for all those files (e.g. do set -x depending on the DEBUG variable). See the new file source_machine_file.sh below.

source ${MACHINE_FILE}
;;
source $USHDIR/source_machine_file.sh
eval ${PRE_TASK_CMDS}
Copy link
Collaborator Author

@gsketefian gsketefian Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the commands (ulimit ...) at the end of the machine files into the new variable PRE_TASK_CMDS so that they are not automatically executed when the machine file is sourced. Execute them here though.

# Other.
#
#-----------------------------------------------------------------------
#
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Convenience variable since this array shows up in multiple places. Will use it in valid_param_vals.sh in a future PR.

@@ -59,6 +59,7 @@ ushdir="${scrfunc_dir}"
#-----------------------------------------------------------------------
#
. $ushdir/source_util_funcs.sh
. $ushdir/get_crontab_contents.sh
Copy link
Collaborator Author

@gsketefian gsketefian Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new function is needed because of the way the crontab function is set up on Cheyenne. The path to the command depends on whether or not it is being called from a cron job. Also, on Cheyenne, crontab adds header lines to its output. This function strips out those headers so the user's cron table doesn't unnecessarily grow.

#
#-----------------------------------------------------------------------
#
valid_args=( \
Copy link
Collaborator Author

@gsketefian gsketefian Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FV3LAM_wflow.sh now needs an optional argument (called_from_cron) in order to tell whether or not it is being called from a cron job. This is needed on Cheyenne because the path to the crontab command (which is called in this script) depends on this distinction.

#
esac
env_init_scripts_fps_str="( "$(printf "\"%s\" " "${ENV_INIT_SCRIPTS_FPS[@]}")")"
init_env env_init_scripts_fps="${env_init_scripts_fps_str}"
Copy link
Collaborator Author

@gsketefian gsketefian Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced case statement with the new init_env() function that sources the necessary system scripts to make module and other commands (e.g. qsub on Cheyenne) available. I created a separate function because the block of code is also needed in launch_FV3LAM_wflow.sh.

#-----------------------------------------------------------------------
#
env_init_scripts_fps_str="( "$(printf "\"%s\" " "${ENV_INIT_SCRIPTS_FPS[@]}")")"
init_env env_init_scripts_fps="${env_init_scripts_fps_str}"
Copy link
Collaborator Author

@gsketefian gsketefian Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new init_env() function sources one or more system scripts (now specified by the array ENV_INIT_SCRIPTS_FPS in the machine files) that enable use of certain commands like module (and qsub on Cheyenne). These commands are not defined by default if not using a login shell.

@@ -1,4 +1,4 @@
#!/bin/bash -l
#!/bin/bash
Copy link
Collaborator Author

@gsketefian gsketefian Feb 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the -l flag so that this script runs in a non-login shell. This means a user's specific environment doesn't get loaded (e.g. the files ~/.bash_profile, ~/.bash_login, and ~/.profile aren't sourced). This is the only remaining workflow script in which the -l is (was) still used. But removing -l also means /etc/profile isn't sourced, which means certain commands (like module) will have to be initialized (which is what the init_env function and ENV_INIT_SCRIPTS_FPS variable now do).

@gsketefian gsketefian changed the title Fix crontab capability for tcsh users on Cheyenne Fix cron capability for tcsh users on Cheyenne Feb 9, 2022
…sourcing "/etc/profile" is enough to make both the "module" and "qsub" commands (and probably all other system-supported commands) available in non-login scripts.
… for-loop so it's different than the variable i used (and unset) by the system script on Hera.
@gsketefian gsketefian changed the title Fix cron capability for tcsh users on Cheyenne Fix cron capability for tcsh users on Cheyenne; use non-login shell launch_FV3LAM_wflow.sh Feb 10, 2022
@gsketefian gsketefian changed the title Fix cron capability for tcsh users on Cheyenne; use non-login shell launch_FV3LAM_wflow.sh Fix cron capability for tcsh users on Cheyenne; use non-login shell for launch_FV3LAM_wflow.sh Feb 10, 2022
@chan-hoo
Copy link
Collaborator

@gsketefian, this pr works on the wcoss dell and cray well now. However, it doesn't work on hera: ./generate_FV3LAM_wflow.sh: line 509: crontab_contents: unbound variable

@JeffBeck-NOAA
Copy link
Collaborator

@gsketefian, this pr works on the wcoss dell and cray well now. However, it doesn't work on hera: ./generate_FV3LAM_wflow.sh: line 509: crontab_contents: unbound variable

@chan-hoo, thanks for getting this working on the WCOSS Cray and Dell. I think @gsketefian was able to get this to run on Hera today without any problem, so I'll let him comment.

A note on the WCOSS Cray, I spoke with Jacob today regarding the move to WCOSS2 that's coming soon. Since the WCOSS Cray will be retired this summer, we should be able to wind down our support for that platform at this point.

@gsketefian
Copy link
Collaborator Author

@chan-hoo I'm going to include parts of our email exchange here (and my responses) so others understand the background.

On Wed, Feb 23, 2022 at 7:01 AM Chan-Hoo Jeon - NOAA Affiliate <chan-hoo.jeon@noaa.gov> wrote:
Gerard,

The problem was

  if [ "$MACHINE" = "WCOSS_DELL_P3" ]; then
    grep_output=$( grep "^${crontab_line_esc_astr}$" "/u/$USER/cron/mycrontab" )
  else
    grep_output=$( echo "${crontab_contents}" | grep "^${crontab_line_esc_astr}$" )
  fi

  exit_status=$?

  if [ "${exit_status}" -eq 0 ]; then

Even in case grep_output="" (nothing), exit_status=0 (It is not supposed to be zero).
So, nothing copies to the crontab file.

Do you have an idea of how to fix this?

Thanks,
Chan-Hoo

It's strange that on the cray, using

grep_output=$( echo "${crontab_contents}" | grep "^${crontab_line_esc_astr}$" )

doesn't work (i.e. the exit status of grep is 0 even if the line is not found in the cron table) but

grep_output=$( crontab -l | grep "^${crontab_line_esc_astr}$" )

works (i.e. returns a non-zero exit status if the line is not found). Since both commands use grep, if the line is not in the cron table, then the exit status should be non-zero regardless of whether the input to grep came from echo or from crontab -l.

On Wed, Feb 23, 2022 at 5:36 AM Chan-Hoo Jeon - NOAA Affiliate chan-hoo.jeon@noaa.gov wrote:
The original expression works well.
So I will add it to the if statement.

  elif [ "$MACHINE" = "WCOSS_CRAY" ]; then
    grep_output=$( crontab -l | grep "^${crontab_line_esc_astr}$" )

Chan-Hoo

I am reluctant to use this fix because it defeats the purpose of having the new function get_crontab_contents.sh that deals with the oddities of getting the contents of the cron table on each machine all in one place instead of having to put if-statements in various places in the scripts. An alternative fix might be to look at the contents of grep_output instead of exit_status in determining what to do in the if-statement that follows these lines.

But -- since it is such a hassle going back and forth to debug this on the cray (because I don't have access to it) and since @JacobCarley-NOAA gave the ok to remove support for the cray from the SRW App, unless you have an objection I am going to go ahead and remove the cray in my next commit.

Btw, I did test this latest version on Hera and it worked for me. I will re-test once I make the latest changes to remove the cray.

@chan-hoo
Copy link
Collaborator

@gsketefian, I don't have any objection for the change at all. It will make my work easier than before :)

… whether the RUN_CMD_... variable is empty, i.e. -z "${RUN_CMD_...}". This is needed because on Cheyenne, not having the double quotes generates an error when RUN_CMD_... consists of a command that contains spaces (e.g. "mpirun -np ...").
@gsketefian gsketefian changed the title Use non-login shell for launch_FV3LAM_wflow.sh; fix cron capability for tcsh users on Cheyenne Use non-login shell for launch_FV3LAM_wflow.sh; remove support for WCOSS_CRAY; fix cron capability for tcsh users on Cheyenne Feb 24, 2022
@gsketefian
Copy link
Collaborator Author

@chan-hoo Ok, great! Makes life a lot easier! I just pushed a commit that removes support for WCOSS_CRAY. Do you mind testing on WCOSS_DELL_P3 one more time?

@willmayfield Do you mind testing the latest (hash 930f4cc) on Cheyenne? I tested with tcsh and it worked for me, but it would be great if you can as well. It would be nice to also test with bash. Can someone else test with bash on Cheyenne (since it takes a day to switch login shells)? @mkavulich, do you happen to have time?

I tested on Hera and it works. Currently testing on Jet. I can't test on Orion today since it's down for maintenance, so I will try tomorrow.

Thanks all.


nprocs=$(( NNODES_MAKE_ICS*PPN_MAKE_ICS ))

if [ -z ${RUN_CMD_UTILS:-} ] ; then
if [ -z "${RUN_CMD_UTILS:-}" ] ; then
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to have this in (double) quotes because on Cheyenne, if RUN_CMD_UTILS contains spaces (e.g. mpirun -np ...), the -z operator will fail with a too many arguments error. Same in all the other ex-scripts.

@@ -177,7 +177,7 @@ MODULES_RUN_TASK_FP script.

&RSRV_DEFAULT;
<command>&LOAD_MODULES_RUN_TASK_FP; "&MAKE_GRID_TN;" "&JOBSDIR;/JREGIONAL_MAKE_GRID"</command>
{%- if machine in ["WCOSS_DELL_P3", "WCOSS_CRAY"] %}
{%- if machine in ["WCOSS_DELL_P3"] %}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing support for WCOSS_CRAY.

@chan-hoo
Copy link
Collaborator

@gsketefian, it works well on the dell.

@gsketefian
Copy link
Collaborator Author

@gsketefian, it works well on the dell.

Great, thanks @chan-hoo!

Copy link
Collaborator

@JeffBeck-NOAA JeffBeck-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving based on previous assessment. We need to have a final test run on Cheyenne to confirm things work. @willmayfield or @mkavulich, do either of you have time to test (with tcsh and/or bash)?

@willmayfield
Copy link
Collaborator

@JeffBeck-NOAA I'll be testing it today.

Copy link
Collaborator

@willmayfield willmayfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tested on Cheyenne with bash and tcsh.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants