WE2E tests using subhourly output fail at the run_fcst step with a segmentation fault #729

Closed
mkavulich opened this issue Apr 5, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@mkavulich
Collaborator

Description

On Hera, the subhourly post tests ("subhourly_post" and "subhourly_post_ensemble_2mems") fail at the run_fcst task with a segmentation fault. It is unclear whether the subhourly post settings are specifically responsible (these tests do not use inline post); however, they all fail at the same point in the weather model execution.

Steps to Reproduce

  1. Run either the subhourly_post or the subhourly_post_ensemble_2mems WE2E test on Hera
  2. Observe the segmentation fault

Additional Context

I ran this test only on Hera with the Intel compiler, but presumably the failure occurs on other platforms as well.

Output

As mentioned above, the tests all fail at the same point of execution, with a failure message similar to this:

 in wrt grid comp, dopost= F
 af wrtState reconcile, FBcount=           3
 af get wrtfb=mirror_dyn_bilinear rc=           0
 af get wrtfb=mirror_phy_bilinear rc=           0
 af get wrtfb=mirror_phy_nearest_stod rc=           0
 in fv3cap init, time wrtcrt/regrdst  0.664980888366699
 in fv3 cap init, output_startfh=  0.0000000E+00 nsout=           1 iau_offset=
           0 nfhmax_hf=          -1 nfhout_hf=          -1 nfhout=          -1
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
ufs_model          000000000406E23C  Unknown               Unknown  Unknown
libpthread-2.17.s  00002BA624D78630  Unknown               Unknown  Unknown
ufs_model          00000000041060F6  Unknown               Unknown  Unknown
ufs_model          00000000040FE06D  Unknown               Unknown  Unknown
ufs_model          00000000040C4A00  Unknown               Unknown  Unknown
ufs_model          0000000001ADC614  fv3gfs_cap_mod_mp         718  fv3_cap.F90
ufs_model          000000000115A68F  _ZNK5ESMCI13Metho         377  ESMCI_MethodTable.C
ufs_model          000000000115A612  _ZN5ESMCI11Method         563  ESMCI_MethodTable.C
ufs_model          000000000115AC70  c_esmc_methodtabl         347  ESMCI_MethodTable.C
ufs_model          000000000096EB99  esmf_attachmethod        1280  ESMF_AttachMethods.F90
ufs_model          0000000002346481  nuopc_modelbase_m         695  NUOPC_ModelBase.F90
ufs_model          000000000097CCCE  _ZN5ESMCI6FTable1        2167  ESMCI_FTable.C
ufs_model          0000000000980C86  ESMCI_FTableCallE         824  ESMCI_FTable.C
ufs_model          000000000081CABF  _ZN5ESMCI3VMK5ent        2273  ESMCI_VMKernel.C
ufs_model          000000000075AA3A  _ZN5ESMCI2VM5ente        1216  ESMCI_VM.C
ufs_model          000000000097E367  c_esmc_ftablecall         981  ESMCI_FTable.C
ufs_model          00000000007E5E31  esmf_compmod_mp_e        1222  ESMF_Comp.F90
ufs_model          0000000000B7DDD4  esmf_gridcompmod_        1407  ESMF_GridComp.F90
ufs_model          000000000078EA3F  nuopc_driver_mp_l        2565  NUOPC_Driver.F90
ufs_model          00000000007B3A8C  nuopc_driver_mp_i        1284  NUOPC_Driver.F90
ufs_model          00000000007BC405  nuopc_driver_mp_i         455  NUOPC_Driver.F90
ufs_model          000000000097CCCE  _ZN5ESMCI6FTable1        2167  ESMCI_FTable.C
ufs_model          0000000000980C86  ESMCI_FTableCallE         824  ESMCI_FTable.C
ufs_model          000000000081CABF  _ZN5ESMCI3VMK5ent        2273  ESMCI_VMKernel.C
ufs_model          000000000075AA3A  _ZN5ESMCI2VM5ente        1216  ESMCI_VM.C
ufs_model          000000000097E367  c_esmc_ftablecall         981  ESMCI_FTable.C
ufs_model          00000000007E5E31  esmf_compmod_mp_e        1222  ESMF_Comp.F90
ufs_model          0000000000B7DDD4  esmf_gridcompmod_        1407  ESMF_GridComp.F90
ufs_model          000000000041AE88  MAIN__                    381  UFS.F90
ufs_model          0000000000419C5E  Unknown               Unknown  Unknown
libc-2.17.so       00002BA624FA7555  __libc_start_main     Unknown  Unknown
ufs_model          0000000000419B69  Unknown               Unknown  Unknown
srun: error: h8c46: task 0: Exited with exit code 174
srun: launch/slurm: _step_signal: Terminating StepId=30232199.0
slurmstepd: error: *** STEP 30232199.0 ON h8c46 CANCELLED AT 2022-04-05T01:26:40 ***
forrtl: error (78): process killed (SIGTERM)

etc. etc....

A full log file can be seen on Hera at /scratch2/BMC/det/kavulich/workdir/update_app_hashes/expt_dirs/subhourly_post/log/run_fcst_2020081000.log
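
To check whether different runs really fail at the same point, it can help to pull the first frame with a resolved source file out of each traceback. The following is a minimal sketch, not part of the workflow: the function name `first_resolved_frame` and the regular expression are illustrative assumptions, and the sample text is abbreviated from the traceback above.

```python
import re

# One frame line of an Intel Fortran (forrtl) traceback:
# image, program counter (hex), routine, line, source file.
# The column layout is assumed from the excerpt above.
FRAME_RE = re.compile(
    r"^(?P<image>\S+)\s+(?P<pc>[0-9A-Fa-f]+)\s+(?P<routine>\S+)"
    r"\s+(?P<line>\S+)\s+(?P<source>\S+)$"
)


def first_resolved_frame(log_text):
    """Return (routine, line, source) of the first frame with a known source, else None."""
    for raw in log_text.splitlines():
        match = FRAME_RE.match(raw.strip())
        if match is None:
            continue  # not a traceback frame (header, model output, srun messages)
        if match["source"] != "Unknown" and match["line"].isdigit():
            return match["routine"], int(match["line"]), match["source"]
    return None


# Abbreviated from the traceback in this issue.
sample = """\
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
ufs_model          000000000406E23C  Unknown               Unknown  Unknown
libpthread-2.17.s  00002BA624D78630  Unknown               Unknown  Unknown
ufs_model          0000000001ADC614  fv3gfs_cap_mod_mp         718  fv3_cap.F90
ufs_model          000000000115A68F  _ZNK5ESMCI13Metho         377  ESMCI_MethodTable.C
"""

print(first_resolved_frame(sample))
```

For the excerpt above, this points at line 718 of fv3_cap.F90, the innermost frame for which the traceback reports a source location. The frames above it are unresolved, so the actual fault may lie in a routine called from that line.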

@mkavulich mkavulich added the bug Something isn't working label Apr 5, 2022
@mkavulich mkavulich modified the milestone: v2 release Apr 5, 2022
@mkavulich mkavulich changed the title from "WE2E tests using subhourly post fail at the run_fcst step with a segmentation fault" to "WE2E tests using subhourly output fail at the run_fcst step with a segmentation fault" Apr 20, 2022
@mkavulich
Collaborator Author

Issue migrated to ufs-srweather-app repository: ufs-community/ufs-srweather-app#358
