
Sparse grid simulation with PPE.n12_ctsm5.1.dev030 fails when trying to close h5 file at end of simulation. #1449

Closed
olyson opened this issue Aug 1, 2021 · 6 comments
Labels: bug (something is working incorrectly)

Comments


olyson commented Aug 1, 2021

Brief summary of bug

A sparse grid simulation that uses the PPE.n12_ctsm5.1.dev030 tag fails when trying to close the h5 file at the end of the simulation.

General bug information

CTSM version you are using: branch_tags/PPE.n12_ctsm5.1.dev030

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: --compset 2000_DATM%GSWP3v1_CLM51%SP_SICE_SOCN_SROF_SGLC_SWAV_SIAC_SESP --res f19_g17

Details of bug

The simulation dies at the end with this error:

73:MPT: #2 MPI_SGI_stacktraceback (
73:MPT: header=header@entry=0x7ffdd3c24b40 "MPT ERROR: Rank 73(g:73) received signal SIGSEGV(11).\n\tProcess ID: 35712, Host: r11i5n9, Program: /glade/scratch/oleson/ctsm51c6_PPEn12ctsm51d030_2deg_GSWP3V1_Sparse400_2000/bld/cesm.exe\n\tMPT Version:"...) at sig.c:340
73:MPT: #3 0x00002b4a1a3e7a62 in first_arriver_handler (signo=signo@entry=11,
73:MPT: stack_trace_sem=stack_trace_sem@entry=0x2b4a24a00080) at sig.c:489
73:MPT: #4 0x00002b4a1a3e7dfb in slave_sig_handler (signo=11,
73:MPT: siginfo=, extra=) at sig.c:565
73:MPT: #5
73:MPT: #6 0x0000000001853a03 in intel_avx_rep_memcpy ()
73:MPT: #7 0x000000000118c60b in PIOc_write_darray_multi ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/externals/pio2/src/clib/pio_darray.c:378
73:MPT: #8 0x0000000001193081 in flush_buffer ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/externals/pio2/src/clib/pio_darray_int.c:1846
73:MPT: #9 0x000000000114dc10 in PIOc_closefile ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/externals/pio2/src/clib/pio_file.c:420
73:MPT: #10 0x00000000010b8da3 in piolib_mod_mp_closefile ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/externals/pio2/src/flib/piolib_mod.F90:1447
73:MPT: #11 0x00000000005d6626 in histfilemod_mp_hist_htapes_wrapup ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/src/main/histFileMod.F90:3869
73:MPT: #12 0x0000000000543d99 in clm_driver_mp_clm_drv_ ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/src/main/clm_driver.F90:1398
73:MPT: #13 0x00000000004fc853 in lnd_comp_mct_mp_lnd_run_mct_ ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/src/cpl/mct/lnd_comp_mct.F90:455
73:MPT: #14 0x000000000042b214 in component_mod_mp_component_run_ ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/drivers/mct/main/component_mod.F90:737
73:MPT: #15 0x000000000040a1dd in cime_comp_mod_mp_cime_run_ ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/drivers/mct/main/cime_comp_mod.F90:2855
73:MPT: #16 0x000000000042ae55 in MAIN__ ()
73:MPT: at /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/drivers/mct/main/cime_driver.F90:153
73:MPT: #17 0x00000000004079e2 in main ()

Line 3869 of histFileMod.F90 is:

        call ncd_pio_closefile(nfid(t))

According to the lnd log, it's trying to close the h5 file.
About 45 minutes elapse between the end of the simulation and the error.
This is a 3-hourly file, and I've requested that 10 years of data be written to a single file.
The h5 configuration is:

hist_mfilt(6) = 29201
hist_dov2xy(6) = .false.
hist_nhtfrq(6) = -3
hist_type1d_pertape(6) = 'GRID'
hist_fincl6 += 'TSA','RH2M','FSH','EFLX_LH_TOT','TSKIN:I','FPSN'
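
(For context, 29201 appears to correspond to 8 three-hourly records per day × 365 days × 10 years = 29,200 records, plus one.)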

If I completely remove the h5 file from my user_nl_clm and rerun, the simulation ends normally.
Reconfiguring the h5 file with dov2xy = .true. (i.e., lat/lon) DOESN'T work.
Splitting the h5 output into yearly files DOES work. So we (the PPE) could operate in that configuration.
I'm filing this because this worked fine in a previous tag (PPE.n08_ctsm5.1.dev023).
One significant difference I see is that the new tag uses PIO2 instead of PIO1.

A full grid version of this simulation (including the h5 file as configured above) runs successfully.

Important details of your setup / configuration so we can reproduce the bug

Case directory:

/glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/scripts/ctsm51c6_PPEn12ctsm51d030_2deg_GSWP3V1_Sparse400_2000

Run directory:

/glade/scratch/oleson/ctsm51c6_PPEn12ctsm51d030_2deg_GSWP3V1_Sparse400_2000/run

@billsacks added the bug (something is working incorrectly) label Aug 5, 2021

ekluzek commented Aug 5, 2021

@olyson is going to try this with PIO1 just for kicks (change PIO_VERSION), and also try it with the latest version of CTSM main-dev. We should also check what NetCDF format is being used (check PIO_NETCDF_FORMAT). Possibly it needs to be upgraded to a newer format if it's reaching the NetCDF data limit with these 10-year 3-hourly files (the default is likely 64bit_offset, but you could try 64bit_data).
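
For reference, a minimal sketch of how those settings could be checked and changed from the case directory with CIME's xmlquery/xmlchange; the values shown are just the alternatives discussed here, not a recommendation:

./xmlquery PIO_VERSION,PIO_NETCDF_FORMAT   # check the current settings
./xmlchange PIO_VERSION=1                  # try PIO1 instead of PIO2
./xmlchange PIO_NETCDF_FORMAT=64bit_data   # try the newer large-data format in place of 64bit_offset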


olyson commented Aug 5, 2021

  1. A run with PIO1 (./xmlchange PIO_VERSION=1) fails during the build with an error in csm_share.bldlog:

/glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/share/util/mct_mod.F90(766): remark #5140: Unrecognized directive
!DIR$ CONCURRENT
----------------^
/glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/share/util/mct_mod.F90(767): remark #5140: Unrecognized directive
!DIR$ PREFERVECTOR
------------------^
/glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/share/util/mct_mod.F90(824): remark #5140: Unrecognized directive
!DIR$ CONCURRENT
----------------^
/glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/share/util/mct_mod.F90(825): remark #5140: Unrecognized directive
!DIR$ PREFERVECTOR
------------------^
/glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/share/util/shr_pio_mod.F90(147): error #6285: There is no matching specific subroutine for this generic subroutine call. [PIO_INIT]
call pio_init(iosystems, MPI_COMM_WORLD, comp_comm, io_comm, PIO_REARR_BOX)
---------------^
ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location
compilation aborted for /glade/work/oleson/PPE.n12_ctsm5.1.dev030/cime/src/share/util/shr_pio_mod.F90 (code 1)
gmake: *** [shr_pio_mod.o] Error 1

  2. The default was PIO_NETCDF_FORMAT = 64bit_offset; a run using 64bit_data fails at the end of the simulation with the same file-close error.

  3. A run using the latest master/main fails with the same error.


ekluzek commented Aug 5, 2021

OK, none of our suggestions work. Jim Edwards is going on vacation, but when he gets back we should point this out to him. Since this works for the full grid but fails for the sparse grid (which should be a smaller data size), it shouldn't be the NetCDF format. And since it worked before but fails now, that might be a hint about what's going on. Jim is back the week of the 16th...


olyson commented Oct 13, 2022

Closing this since we have a workaround.

@olyson closed this as completed Oct 13, 2022

ekluzek commented Oct 13, 2022

@olyson can you let us know what workaround you figured out?


olyson commented Oct 13, 2022

It's noted above: "Splitting the h5 output into yearly files DOES work. So we (the PPE) could operate in that configuration."
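
For anyone looking for the concrete settings, here is a minimal sketch of that workaround in user_nl_clm, assuming the only change from the configuration in the original post is the per-file record count (one year of 3-hourly records, 8 per day × 365 days = 2920); the remaining tape-6 settings stay as shown above:

hist_mfilt(6) = 2920    ! one year of 3-hourly records per file, instead of 29201 for 10 years
hist_nhtfrq(6) = -3     ! unchanged: 3-hourly output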
