Error in postanl job not being handled properly in exgdas_atmos_nceppost.sh #1034

Closed
KateFriedman-NOAA opened this issue Sep 27, 2022 · 2 comments
Assignees
Labels
bug Something isn't working

Comments

@KateFriedman-NOAA
Member

Description

In cold-started half-cycle tests, the gdaspostanl job is reported as successful, but the log shows an error from exgdas_atmos_nceppost.sh that is not being handled properly:

+ exgdas_atmos_nceppost.sh[88]: '[' -f /scratch1/NCEPDEV/stmp4/Kate.Friedman/comrot/testcycatmos1014/gdas.20220101/18/atmos/gdas.t18z.atmanl.nc ']'
+ exgdas_atmos_nceppost.sh[171]: echo ' *** FATAL ERROR: No model anl file output '
 *** FATAL ERROR: No model anl file output
+ exgdas_atmos_nceppost.sh[172]: export err=9
+ exgdas_atmos_nceppost.sh[172]: err=9
+ exgdas_atmos_nceppost.sh[173]: err_chk

*************************************************************
** FATAL ERROR: Job post.232511 failed RETURN CODE 9
** ABNORMAL EXIT at Fri Sep 23 10:37:18 UTC 2022 on h22c09
*************************************************************

Currently Loaded Modules:
  1) hpss/hpss        15) g2tmpl/1.10.0    29) R/3.5.0
  2) nco/4.9.1        16) ip/3.3.3         30) hpc/1.2.0
  3) gempak/7.4.2     17) sp/2.3.3         31) intel/18.0.5.274
  4) ncl/6.6.2        18) sigio/2.3.2      32) hpc-intel/18.0.5.274
  5) prod_util/1.2.2  19) bacio/2.4.1      33) miniconda3/4.6.14
  6) grib_util/1.2.2  20) nemsio/2.5.2     34) hpc-miniconda3/4.6.14
  7) crtm/2.3.0       21) w3emc/2.7.3      35) ufswm/1.0.0
  8) jasper/2.0.25    22) w3nco/2.4.1      36) met/9.1
  9) zlib/1.2.11      23) ncdiag/1.0.0     37) metplus/3.1
 10) png/1.6.35       24) hdf5/1.10.6      38) module_base.hera
 11) pio/2.5.2        25) netcdf/4.7.4     39) impi/2018.0.4
 12) esmf/8.2.1b04    26) wgrib2/2.0.8     40) hpc-impi/2018.0.4
 13) fms/2021.03      27) grib_api/1.26.1
 14) g2/3.4.2         28) cdo/1.9.5

WARNING: jlogfile variable not defined
/scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/testcycatmos1014/2022010118/gdas/post.232511
total 1
-rw-r--r-- 1 Kate.Friedman stmp   0 Sep 23 10:37 ncepdate
-rwxr--r-- 1 Kate.Friedman stmp 328 Sep 23 10:37 PDY
cat: OUTPUT.232511: No such file or directory
+ exgdas_atmos_nceppost.sh[348]: exit 0
+ exgdas_atmos_nceppost.sh[1]: postamble exgdas_atmos_nceppost.sh 1663929438 0
+ preamble.sh[67]: set +x
End exgdas_atmos_nceppost.sh at 10:37:19 with error code 0 (time elapsed: 00:00:01)
+ JGLOBAL_ATMOS_NCEPPOST[121]: status=0
+ JGLOBAL_ATMOS_NCEPPOST[122]: [[ 0 -ne 0 ]]
+ JGLOBAL_ATMOS_NCEPPOST[131]: '[' -e OUTPUT.232511 ']'
+ JGLOBAL_ATMOS_NCEPPOST[138]: cd /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/testcycatmos1014/2022010118/gdas
+ JGLOBAL_ATMOS_NCEPPOST[139]: [[ NO = \N\O ]]
+ JGLOBAL_ATMOS_NCEPPOST[139]: rm -rf /scratch1/NCEPDEV/stmp2/Kate.Friedman/RUNDIRS/testcycatmos1014/2022010118/gdas/post.232511
+ JGLOBAL_ATMOS_NCEPPOST[142]: exit 0
+ JGLOBAL_ATMOS_NCEPPOST[1]: postamble JGLOBAL_ATMOS_NCEPPOST 1663929435 0
+ preamble.sh[67]: set +x
End JGLOBAL_ATMOS_NCEPPOST at 10:37:19 with error code 0 (time elapsed: 00:00:04)
+ post.sh[31]: status=0
+ post.sh[32]: [[ 0 -ne 0 ]]
+ post.sh[35]: exit 0
+ post.sh[1]: postamble post.sh 1663929431 0
+ preamble.sh[67]: set +x
End post.sh at 10:37:19 with error code 0 (time elapsed: 00:00:08)

Setting err=9 doesn't result in an error exit: the script carries on to exit 0 at line 348, and the job is reported as successful.

Requirements

Bug fix. The script should exit when it can't find the atmanl.nc file.

Acceptance Criteria (Definition of Done)

The script correctly exits when an error occurs.

@WalterKolczynski-NOAA
Contributor

WalterKolczynski-NOAA commented Sep 29, 2022

Three different things are contributing here.

  1. err_chk doesn't really do anything on RDHPCS machines
  2. The error code is set directly rather than taken from a command's return status ($?), and it is only passed through err_chk, so even with set -e on it isn't picked up (no bash statement returns non-zero). Combined with (1), this means the program sets err=9 but doesn't actually exit.
  3. The failure of this script doesn't seem to prevent the cycled run from continuing on normally. Should this even be a fatal error for the cold-start half-cycle?
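
To make points (1) and (2) concrete, here is a minimal reproduction sketch. It is not the actual ex-script: err_chk is stubbed as a no-op purely as an assumption, to mimic the behavior described for RDHPCS machines, and the variable names `script` and `demo_status` are made up for the demo.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction, not the real exgdas_atmos_nceppost.sh.
# The snippet runs in a child bash so that its set -e is fully in effect.
script='
set -e
err_chk() { :; }   # stub (assumption): does nothing, as described for RDHPCS
echo " *** FATAL ERROR: No model anl file output "
export err=9       # an assignment returns status 0, so set -e is not triggered
err_chk            # no-op, so execution simply continues
exit 0             # the final exit reports success
'

demo_status=0
bash -c "${script}" || demo_status=$?
echo "exit status: ${demo_status}"
```

The child shell prints the FATAL ERROR message and still returns 0, which matches the pattern in the log above: no statement ever returns non-zero, so neither set -e nor the caller's status check sees a failure.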

It's trivial to make these scripts exit with an error, which is how they seem to be written. But we need to figure out if they should, at least for that half-cycle.
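
As a rough sketch of what "exit with an error" could look like (assuming that is the desired behavior), the failing branch can exit with the stored code itself instead of relying on err_chk. The file path and the names `script`, `atmanl` and `fix_status` below are hypothetical and chosen only for the demo.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the actual fix applied to the ex-script.
script='
set -e
atmanl="/nonexistent/gdas.t18z.atmanl.nc"   # hypothetical path that does not exist
if [ ! -f "${atmanl}" ]; then
  echo " *** FATAL ERROR: No model anl file output "
  export err=9
  exit "${err}"   # exit directly with the stored code
fi
exit 0
'

fix_status=0
bash -c "${script}" || fix_status=$?
echo "exit status: ${fix_status}"
```

With this change the non-zero status reaches the caller, so checks of the form `[[ ${status} -ne 0 ]]` in the calling job script would see the failure.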

More long-term, we need to decide what we're going to do with err_chk. I'm thinking we can put it in the postamble, so scripts would just call exit # and then the trap would call err_chk on wcoss. But we can discuss that elsewhere.
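
One way the postamble idea might be sketched: scripts only call `exit <code>`, and an EXIT trap runs a postamble that calls err_chk where it exists. The function `demo_postamble` is a hypothetical stand-in and is not the postamble from preamble.sh; checking for err_chk with `command -v` is also an assumption about how availability could be detected.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of calling err_chk from an EXIT trap.
script='
demo_postamble() {
  local rc=$1
  # Call err_chk only where it is available (assumed to be the WCOSS case).
  if command -v err_chk >/dev/null 2>&1; then
    export err=${rc}
    err_chk
  fi
  echo "End of script with error code ${rc}"
  exit "${rc}"     # keep the original exit status
}
trap "demo_postamble \$?" EXIT

# The script body only needs to call exit with the right code.
exit 9
'

trap_status=0
bash -c "${script}" || trap_status=$?
echo "exit status: ${trap_status}"
```

On a machine without err_chk the trap just logs and passes the code through, so the script exits with 9 either way.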

WalterKolczynski-NOAA pushed a commit that referenced this issue Jan 9, 2023
Moves the member directory one level higher in the directory hierarchy in anticipation of additional components being run for the members in addition to atmos. This results in COM directories that are arranged as `memXXX/atmos` instead of `atmos/memXXX`.

This PR also adds a "hack" to allow the `gdaspostanl` job to succeed in the first half cycle (see #1034). The "hack" will be removed with the refactoring of the post jobs in the (near) future.

Fixes #1196
Refs #1034
@aerorahul
Contributor

Hacked in #1201.
Will open an issue to be addressed during the post refactor.
