CI self-test with KEEPDATA=YES #2734

TerrenceMcGuinness-NOAA
Collaborator

Description

This is a CI self-test with KEEPDATA=YES, which saves off the RUNDIRS so that the disk cost of running the CI tests can be measured.

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA added CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera CI/CD Issue related to CI/CD labels Jun 27, 2024
@emcbot emcbot added CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera CI-Hera-Running **Bot use only** CI testing on Hera for this PR is in-progress and removed CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera labels Jun 28, 2024
@emcbot

emcbot commented Jun 28, 2024

Experiment C96_atmaerosnowDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/logs/2021122018/gdassfcanl.log

Follow link here to view the contents of the above file(s): (link)

@emcbot emcbot added CI-Hera-Failed **Bot use only** CI testing on Hera for this PR has failed and removed CI-Hera-Running **Bot use only** CI testing on Hera for this PR is in-progress labels Jun 28, 2024
@emcbot

emcbot commented Jun 28, 2024

Experiment C96_atmaerosnowDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C96_atmaerosnowDA_7e868a54

@emcbot

emcbot commented Jun 28, 2024

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C48mx500_3DVarAOWCDA_7e868a54

@RussTreadon-NOAA
Contributor

C48mx500_3DVarAOWCDA failure

The C48mx500_3DVarAOWCDA failure in this PR is the same as #2700. The 20210324 18Z gdasfcst aborts

21:  (abort_ice)ABORTED:
21:  (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
21: Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21

As @guillaumevernieres notes, the log file contains Tsfcn NaN for n = 1 through 5.

This PR uses an older gdas.cd hash. PR #2700 uses a newer gdas.cd hash. The C48mx500_3DVarAOWCDA test fails with both hashes in the same manner. Previous runs of C48mx500_3DVarAOWCDA using PR #2700 passed on Hera when run under role.jedipara and Russ.Treadon.
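
To confirm the NaN observation in a large log, one option is to scan it for NaN tokens. This is a minimal sketch and not part of global-workflow: the names `find_nan_lines` and `scan_log` are hypothetical, and because the exact layout of the Tsfcn lines in gdasfcst.log is not shown here, it simply flags any line that contains a standalone NaN token.

```python
import re
from pathlib import Path

# Matches "nan" as a standalone token, case-insensitively (NaN, nan, NAN),
# but not as part of a longer word or identifier.
NAN_TOKEN = re.compile(r"(?<![A-Za-z0-9_])nan(?![A-Za-z0-9_])", re.IGNORECASE)


def find_nan_lines(lines):
    """Return (line_number, text) pairs for lines containing a NaN token."""
    hits = []
    for number, text in enumerate(lines, start=1):
        if NAN_TOKEN.search(text):
            hits.append((number, text.rstrip("\n")))
    return hits


def scan_log(path):
    """Scan a log file on disk; undecodable bytes are replaced, not fatal."""
    with Path(path).open(errors="replace") as handle:
        return find_nan_lines(handle)
```

Calling `scan_log` on the gdasfcst.log of the failing cycle would list the line numbers to inspect first.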

@RussTreadon-NOAA
Contributor

C96_atmaerosnowDA failure

The C96_atmaerosnowDA failure in this PR differs from PR #2700 and #2729. The 20211220 18Z gdassfcanl fails in this PR with the error message

2:  FATAL ERROR: OPENING FILE: ./fnbgsi.003: NetCDF: Unknown file format
2:  STOP.
2: Abort(999) on node 2 (rank 2 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 999) - process 2

Local file fnbgsi.003 is a copy of 20211220.180000.sfc_data.tile3.nc from /scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/gdas.20211220/18/analysis/snow. That COMROOT file is zero length:

 /scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/gdas.20211220/18/analysis/snow:
  total used in directory 109600 available 71308833872
  drwxrwsr-x 2 Terry.McGuinness global     4096 Jun 28 01:33 .
  drwxr-sr-x 5 Terry.McGuinness global     4096 Jun 28 01:33 ..
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile1.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile2.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile3.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile4.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile5.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.150000.sfc_data.tile6.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile1.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile2.nc
  -rw-r--r-- 1 Terry.McGuinness global        0 Jun 28 01:33 20211220.180000.sfc_data.tile3.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile4.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile5.nc
  -rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile6.nc

According to gdassnowanl.log, 20211220.180000.sfc_data.tile3.nc was copied into COMROOT from the gdassnowanl run directory:

./gdassnowanl.log:2024-06-28 01:33:46,896 - INFO     - file_utils  : Copied /scratch1/NCEPDEV/global/CI/STMP/RUNDIRS/C96_atmaerosnowDA_7e868a54/gdassnowanl_18/anl/20211220.180000.sfc_data.tile3.nc to /scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C96_atmaerosnowDA_7e868a54/gdas.20211220/18//analysis/snow/20211220.180000.sfc_data.tile3.nc

In the run directory, 20211220.180000.sfc_data.tile3.nc is a non-zero length file:

Hera(hfe04):/scratch1/NCEPDEV/global/CI/STMP/RUNDIRS/C96_atmaerosnowDA_7e868a54/gdassnowanl_18/anl$ ls -l 20211220.180000.sfc_data*
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile1.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile2.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile3.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile4.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile5.nc
-rw-r--r-- 1 Terry.McGuinness global 10005271 Jun 28 01:33 20211220.180000.sfc_data.tile6.nc

I am not familiar with snow DA. Tagging @jiaruidong2017 . Jiarui, what are your thoughts on this failure?
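
The manual comparison above (run directory versus COMROOT) can be sketched as a small check. This is only an illustration under stated assumptions: `find_bad_copies` is a hypothetical helper, not something in global-workflow or its `file_utils` module, and it compares file sizes only, not contents.

```python
from pathlib import Path


def find_bad_copies(src_dir, dst_dir, pattern="*.nc"):
    """Return (name, reason) pairs for files in src_dir whose copy in
    dst_dir is missing, zero length, or a different size."""
    bad = []
    for src in sorted(Path(src_dir).glob(pattern)):
        dst = Path(dst_dir) / src.name
        if not dst.is_file():
            bad.append((src.name, "missing"))
            continue
        src_size = src.stat().st_size
        dst_size = dst.stat().st_size
        if dst_size == 0 and src_size > 0:
            bad.append((src.name, "zero length"))
        elif dst_size != src_size:
            bad.append((src.name, f"size {dst_size} != {src_size}"))
    return bad
```

Pointing `src_dir` at the `gdassnowanl_18/anl` run directory and `dst_dir` at the `analysis/snow` directory in COMROOT should report the tile3 file as zero length, matching the listings above.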

@jiaruidong2017
Contributor

Thanks @RussTreadon-NOAA for digging into this. I don't know why this happened, and I have not run into this issue in my previous tests. Rerunning this CI test may help find the cause. @CoryMartin-NOAA, do you have any thoughts on this?

@RussTreadon-NOAA
Contributor

Thank you @jiaruidong2017 for your reply. Do you routinely run C96_atmaerosnowDA as part of your development? If not, how, and how frequently, do you test JEDI snow DA in g-w?

@jiaruidong2017
Contributor

jiaruidong2017 commented Jun 28, 2024

@RussTreadon-NOAA I don't run the C96_atmaerosnowDA CI test for my development work; instead I run my own JEDI snow DA test. I have run my tests four times over the past two weeks.

@RussTreadon-NOAA
Contributor

@jiaruidong2017 , to help with debugging, when did you make these runs, on which machine, and do you still have the log files online?

@jiaruidong2017
Contributor

@RussTreadon-NOAA You can find the log files for my three tests at:

/scratch1/NCEPDEV/climate/Jiarui.Dong/ptmp/cory04/logs/ (Today)
/scratch1/NCEPDEV/climate/Jiarui.Dong/ptmp/cory03/logs/ (June 26)
/scratch1/NCEPDEV/climate/Jiarui.Dong/ptmp/cory02/logs/ (June 15)

@guillaumevernieres
Contributor

> C48mx500_3DVarAOWCDA failure
>
> The C48mx500_3DVarAOWCDA failure in this PR is the same as #2700. The 20210324 18Z gdasfcst aborts
>
> 21:  (abort_ice)ABORTED:
> 21:  (abort_ice) error = (diagnostic_abort)ERROR: negative area (ice)
> 21: Abort(128) on node 21 (rank 21 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 128) - process 21
>
> As @guillaumevernieres notes, the log file contains Tsfcn NaN for n = 1 through 5.
>
> This PR uses an older gdas.cd hash. PR #2700 uses a newer gdas.cd hash. The C48mx500_3DVarAOWCDA test fails with both hashes in the same manner. Previous runs of C48mx500_3DVarAOWCDA using PR #2700 passed on Hera when run under role.jedipara and Russ.Treadon.

@JessicaMeixner-NOAA I just checked: the ocean and sea ice increments are all NaNs.

@JessicaMeixner-NOAA
Contributor

PR #2681 was not tested on Hera. I'm not sure why (I know stmp was an issue, but that PR changes a lot for WCDA). I think it could be the cause of the WCDA failures we are seeing; perhaps it was not caught because of some logic clean-up at the end or oversights in non-CI testing. Based on some other threads, #2719 may also be causing issues for tests not related to WCDA.

@aerorahul
Contributor

> PR #2681 was not tested on Hera. I'm not sure why (I know stmp was an issue, but that PR changes a lot for WCDA). I think it could be the cause of the WCDA failures we are seeing; perhaps it was not caught because of some logic clean-up at the end or oversights in non-CI testing. Based on some other threads, #2719 may also be causing issues for tests not related to WCDA.

@JessicaMeixner-NOAA
Thanks. I think it is the aggressive clean-up from #2719 that is likely the root cause.
I have left a comment for it in #2719 and #2700 to test that.

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA added CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera and removed CI-Hera-Failed **Bot use only** CI testing on Hera for this PR has failed labels Jul 8, 2024
@emcbot emcbot added CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera CI-Hera-Running **Bot use only** CI testing on Hera for this PR is in-progress CI-Hera-Failed **Bot use only** CI testing on Hera for this PR has failed and removed CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera CI-Hera-Running **Bot use only** CI testing on Hera for this PR is in-progress labels Jul 8, 2024
@emcbot

emcbot commented Jul 8, 2024

Experiment C48_ATM FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C48_ATM_692341ed

@emcbot

emcbot commented Jul 8, 2024

Experiment C96_atm3DVar FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C96_atm3DVar_692341ed

@emcbot

emcbot commented Jul 8, 2024

Experiment C48mx500_3DVarAOWCDA FAILED on Hera with error logs:

/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/COMROOT/C48mx500_3DVarAOWCDA_692341ed/logs/2021032412/gdasfcst.log

Follow link here to view the contents of the above file(s): (link)

@emcbot

emcbot commented Jul 8, 2024

Experiment C96_atmaerosnowDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C96_atmaerosnowDA_692341ed

@emcbot

emcbot commented Jul 8, 2024

Experiment C48_S2SW FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C48_S2SW_692341ed

@emcbot

emcbot commented Jul 8, 2024

Experiment C48mx500_3DVarAOWCDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C48mx500_3DVarAOWCDA_692341ed

@emcbot

emcbot commented Jul 8, 2024

Experiment C96C48_hybatmDA FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C96C48_hybatmDA_692341ed

@emcbot

emcbot commented Jul 8, 2024

Experiment C48_S2SWA_gefs FAILED on Hera in
/scratch1/NCEPDEV/global/CI/2734/RUNTESTS/C48_S2SWA_gefs_692341ed

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA added CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera and removed CI-Hera-Failed **Bot use only** CI testing on Hera for this PR has failed labels Jul 8, 2024
@emcbot emcbot added CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera CI-Hera-Running **Bot use only** CI testing on Hera for this PR is in-progress CI-Hera-Passed **Bot use only** CI testing on Hera for this PR has completed successfully and removed CI-Hera-Ready **CM use only** PR is ready for CI testing on Hera CI-Hera-Building **Bot use only** CI testing is cloning/building on Hera CI-Hera-Running **Bot use only** CI testing on Hera for this PR is in-progress labels Jul 8, 2024
@emcbot

emcbot commented Jul 9, 2024

CI Passed Hera at
Built and ran in directory /scratch1/NCEPDEV/global/CI/2734


Experiment C48_ATM_0a498766 Completed 1 Cycles: *SUCCESS* at Mon Jul  8 22:11:41 UTC 2024
Experiment C48mx500_3DVarAOWCDA_0a498766 Completed 2 Cycles: *SUCCESS* at Mon Jul  8 22:23:49 UTC 2024
Experiment C96_atm3DVar_0a498766 Completed 3 Cycles: *SUCCESS* at Mon Jul  8 23:24:35 UTC 2024
Experiment C96C48_hybatmDA_0a498766 Completed 3 Cycles: *SUCCESS* at Mon Jul  8 23:24:39 UTC 2024
Experiment C48_S2SWA_gefs_0a498766 Completed 1 Cycles: *SUCCESS* at Mon Jul  8 23:36:50 UTC 2024
Experiment C48_S2SW_0a498766 Completed 1 Cycles: *SUCCESS* at Mon Jul  8 23:56:12 UTC 2024
Experiment C96_atmaerosnowDA_0a498766 Completed 3 Cycles: *SUCCESS* at Tue Jul  9 00:19:15 UTC 2024
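
To tally a summary like the one above, the result lines can be parsed with a regular expression. This is a sketch that assumes the line format exactly as posted in this comment; `parse_results` is a hypothetical name and the bot may format other outcomes differently.

```python
import re

# Pattern modelled on the lines posted above, e.g.
# "Experiment NAME Completed N Cycles: *STATUS* at DATE"
RESULT_LINE = re.compile(
    r"^Experiment (?P<name>\S+) Completed (?P<cycles>\d+) Cycles: "
    r"\*(?P<status>\w+)\* at (?P<when>.+)$"
)


def parse_results(text):
    """Return (experiment, cycles, status) tuples for each matching line."""
    results = []
    for line in text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match:
            results.append((match["name"], int(match["cycles"]), match["status"]))
    return results
```

Lines that do not match the pattern are skipped, so the surrounding comment text can be passed in unchanged.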

@emcbot

emcbot commented Jul 9, 2024

Disk requirements for RUNDIRS with KEEPDATA=YES: 437 G

Terry.McGuinness (hfe10) CI $ du -h --max-depth=1 "/scratch2/NCEPDEV/stmp/${USER}/global/CI/STMP/RUNDIRS"
75G	/scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIRS/C96_atm3DVar_0a498766
121G	/scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIRS/C96C48_hybatmDA_0a498766
34G	/scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIRS/C48_S2SW_0a498766
26G	/scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIRS/C48mx500_3DVarAOWCDA_0a498766
83G	/scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIRS/C96_atmaerosnowDA_0a498766
73G	/scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIRS/C48_S2SWA_gefs_0a498766
28G	/scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIRS/C48_ATM_0a498766
437G	/scratch2/NCEPDEV/stmp/Terry.McGuinness/global/CI/STMP/RUNDIR

And the disk requirements under RUNTESTS (EXPDIR plus COMROOT) in a typical CI run: 432 G

Terry.McGuinness (hfe05) CI $ du -h --max-depth=1 2581/RUNTESTS
4.0M	2581/RUNTESTS/EXPDIR
432G	2581/RUNTESTS/COMROOT
432G	2581/RUNTESTS
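
For collecting these numbers from a script rather than by hand, the per-directory totals can be approximated in Python. This is a rough sketch with hypothetical function names (`tree_size`, `usage_by_subdir`). It sums apparent file sizes, whereas `du` reports allocated disk blocks, so the results can differ from the figures above. Note also that the seven per-experiment RUNDIRS figures add up to 440G against the 437G total, which is plausibly because `du -h` rounds each line independently.

```python
import os


def tree_size(path):
    """Sum the lstat sizes (bytes) of all non-directory entries under path.
    Symlinks are not followed; unreadable entries are skipped."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # entry vanished or is unreadable; ignore it
    return total


def usage_by_subdir(top):
    """Roughly mimic `du --max-depth=1`: map each immediate subdirectory
    of top to its total size in bytes."""
    sizes = {}
    with os.scandir(top) as entries:
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                sizes[entry.name] = tree_size(entry.path)
    return sizes
```

Running `usage_by_subdir` on the RUNDIRS directory after a KEEPDATA=YES run would give one entry per experiment, which can then be summed or logged.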

@aerorahul
Contributor

@TerrenceMcGuinness-NOAA
Thanks for running this test and collecting the information.

Labels
CI/CD Issue related to CI/CD CI-Hera-Passed **Bot use only** CI testing on Hera for this PR has completed successfully