DA libs do not work with GSI #9
Hi Mark - can you provide some more details, like what error message(s) you're seeing? Also, what operating system are you working on? |
Here is where the error gets thrown in the stdout of the first regression test for GSI on Hera-- ================================================================================ For this test, I built the GSI as usual, but replaced the libbufr_v11.3.0_d_64.a library with libbufr_v11.3.0_d_64_DA.a in link stage. You should be able to reproduce by cloning my GSI repo with "git clone --branch Bufr_DA_test --recurse-submodules https://github.com:mark-a-potts/GSI". After the clone completes (the fix files will take a while to download), you can build by cd'ing into GSI and running "./ush/build_all_cmake.sh 0 $PWD". Once the build completes, if you cd to GSI/build, you can run the first regression test with the command "ctest -I 1,1". |
This should work on wcoss (dells) as well. Make sure to run a "module purge; module use path-to-GSI/modulefiles; module load modulefile.ProdGSI.wcoss_d" before running "ctest -I 1,1". The output from the test will be in /gpfs/dell2/ptmp/$LOGNAME/$ptmpName/_gpfs_dell2_emc_modeling_noscrub_Mark.Potts_G2_build/tmpreg_global_T62/global_T62_loproc_updat/stdout |
For some reason, I wasn't able to clone the above repository on mars: Jeff.Ator@m71a3 [/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI (46)] % !! |
Is the repository blocked in some way, or some other permissions issue? |
Just a bad link (see the |
Now I'm getting the following feedback: Jeff.Ator@m72a1 [/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI (43)] % git clone --branch Bufr_DA_test --recurse-submodules https://github.com/mark-a-potts/GSI.git ssh: Could not resolve hostname gerrit: Name or service not known fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists. |
Ah, that means you don't have a gerrit alias set up. The fix files are
still stored in VLab. Probably the easiest thing to do is to just copy
(or link) to my copy at
/gpfs/dell2/emc/modeling/noscrub/Mark.Potts/ProdGSI/fix. After that,
run "git submodule init libsrc; git submodule sync libsrc;
git submodule update libsrc" to make sure that the libsrc submodule gets
populated correctly.
…-M
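For reference, the submodule URLs reported by git (`gerrit:GSI-fix`, `gerrit:GSI-libsrc`) rely on an SSH host alias named `gerrit`. Below is a minimal sketch for checking whether such an alias is defined. It assumes the alias would be a `Host gerrit` entry in `~/.ssh/config`; the thread does not say how the alias is actually configured, and `has_ssh_alias` is a hypothetical helper name.

```shell
# Return 0 if the given ssh config file has a "Host" line naming the alias.
# Assumption: aliases are declared in ~/.ssh/config (not stated in the thread).
has_ssh_alias() {
    alias_name=$1
    config=${2:-"$HOME/.ssh/config"}
    [ -r "$config" ] || return 1
    awk -v want="$alias_name" '
        tolower($1) == "host" {
            for (i = 2; i <= NF; i++) if ($i == want) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$config"
}

# Usage:
#   has_ssh_alias gerrit || echo "no gerrit alias; copy or link fix and libsrc instead"
```

If the check fails, a recursive clone will stop at the submodules exactly as shown in the quoted output.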
On 7/14/20 11:19 AM, Jeff Ator wrote:
Now I'm getting the following feedback:
***@***.*** [/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI
(43)] % git clone --branch Bufr_DA_test --recurse-submodules
https://github.com/mark-a-potts/GSI.git
Cloning into 'GSI'...
remote: Enumerating objects: 7, done.
remote: Counting objects: 100% (7/7), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 61109 (delta 0), reused 2 (delta 0), pack-reused 61102
Receiving objects: 100% (61109/61109), 50.71 MiB | 22.37 MiB/s, done.
Resolving deltas: 100% (40428/40428), done.
Submodule 'fix' (gerrit:GSI-fix) registered for path 'fix'
Submodule 'libsrc' (gerrit:GSI-libsrc) registered for path 'libsrc'
Cloning into
'/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/fix'...
ssh: Could not resolve hostname gerrit: Name or service not known
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'gerrit:GSI-fix' into submodule path
'/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/fix' failed
Failed to clone 'fix'. Retry scheduled
Cloning into
'/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/libsrc'...
ssh: Could not resolve hostname gerrit: Name or service not known
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'gerrit:GSI-libsrc' into submodule path
'/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/libsrc' failed
Failed to clone 'libsrc'. Retry scheduled
Cloning into
'/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/fix'...
ssh: Could not resolve hostname gerrit: Name or service not known
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'gerrit:GSI-fix' into submodule path
'/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/fix' failed
Failed to clone 'fix' a second time, aborting
***@***.*** [/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI (44)] %
|
OK, I made the symlink to your fix directory, then ran the init and sync steps, but the update step failed: git submodule update libsrc ssh: Could not resolve hostname gerrit: Name or service not known fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists. FYI, my own gerrit account is set up such that I normally need to enter a passcode whenever I clone from or push to repositories in VLab. |
Argh. I think you should be able to just copy the libsrc directory from
/gpfs/dell2/emc/modeling/noscrub/Mark.Potts/ProdGSI/libsrc. Sorry about
that.
-M
On 7/14/20 12:16 PM, Jeff Ator wrote:
OK, I made the symlink to your fix directory, then ran the init and
sync steps, but the update step failed:
git submodule update libsrc
Cloning into
'/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/libsrc'...
ssh: Could not resolve hostname gerrit: Name or service not known
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'gerrit:GSI-libsrc' into submodule path
'/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/libsrc' failed
Failed to clone 'libsrc'. Retry scheduled
Cloning into
'/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/libsrc'...
ssh: Could not resolve hostname gerrit: Name or service not known
fatal: Could not read from remote repository.
Please make sure you have the correct access rights
and the repository exists.
fatal: clone of 'gerrit:GSI-libsrc' into submodule path
'/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI/libsrc' failed
Failed to clone 'libsrc' a second time, aborting
***@***.*** [/gpfs/dell2/emc/obsproc/noscrub/Jeff.Ator/testGSI/GSI
(60)] %
FYI, my own gerrit account is set up such that I normally need to
enter a passcode whenever I clone from or push to repositories in VLab.
--
Mark A. Potts, Ph.D.
Sr. HPC Software Developer
RedLine Performance Solutions, LLC
Phone 202-744-9469
Mark.Potts@noaa.gov
mpotts@redlineperf.com
|
OK, I copied over the libsrc directory and was able to run the "./ush/build_all_cmake.sh 0 $PWD" step from the GSI directory. I then cd'ed to the build subdirectory and ran the "module purge" and "module use path-to-GSI/modulefiles" commands, but when I then try to run "module load modulefile.ProdGSI.wcoss_d" I get the following: Lmod has detected the following error: The following module(s) are unknown: "modulefile.ProdGSI.wcoss_d" Please check the spelling or version number. Also try "module spider ..." Also make sure that all modulefiles written in TCL start with the string #%Module I checked but couldn't find the module using spider, and if I do a "module avail" I don't see anything under "path-to-GSI", which leads me to believe that the prior "module use path-to-GSI/modulefiles" didn't really do anything. I also don't see anything when I do a "find . -name path-to-GSI" from my main directory. |
Never mind, I now see a modulefiles subdirectory in the main "GSI" directory, so I did a "module use" on that and now I can load the modulefile.ProdGSI.wcoss_d module file. Will now try the ctest command. Fingers crossed... |
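As the comment above found, `path-to-GSI` in the earlier instructions is a placeholder for the top-level GSI checkout, not a literal directory name. Here is a small sketch that checks the `modulefiles` directory before handing it to `module use`. `gsi_modulefiles_dir` is a hypothetical helper, and the `module` commands appear only as comments because they need an Lmod environment.

```shell
# Print <gsi_root>/modulefiles if it contains the named modulefile,
# otherwise report the problem on stderr and return non-zero.
gsi_modulefiles_dir() {
    gsi_root=$1
    modfile=${2:-modulefile.ProdGSI.wcoss_d}
    dir=$gsi_root/modulefiles
    if [ ! -f "$dir/$modfile" ]; then
        echo "no $modfile under $dir" >&2
        return 1
    fi
    printf '%s\n' "$dir"
}

# Usage on a system with Lmod, from the top-level GSI directory:
#   module purge
#   module use "$(gsi_modulefiles_dir "$PWD")"
#   module load modulefile.ProdGSI.wcoss_d
```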
OK, I can reproduce the error now, but I'm not getting very much information out of the stack trace - in my stdout everything says "Unknown" for the routine name, whereas in Mark's stdout on hera it showed routine names and line numbers. How do I get that level of detail in my runs - is there some compile setting (or DEBUG option) that I need to set? |
Yes. When you build with the build_all_cmake script, you can use this
command from the GSI directory instead -- "./ush/build_all_cmake.sh
DEBUG $PWD"
Thanks,
…-M
|
OK thanks, that seemed to work. However, I was hoping that if I reset the environment variable $BUFR_LIBd_DA it might then do the build using a version of the BUFR library that was also compiled with the debug option. However, even if I reset this variable I still see: BUFR library /gpfs/dell1/nco/ops/nwprod/lib/bufr/v11.3.0/ips/18.0.1/libbufr_v11.3.0_d_64_DA.a set via Environment variable in the build output, which leads me to believe that it didn't pick up the new value of $BUFR_LIBd_DA within the cmake/Modules/FindBUFR.cmake script. So maybe I have to try hardcoding that value in the script(?) |
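One way to confirm which library the configure step actually selected is to pull the path out of the build output and compare it with the environment variable. This is a rough sketch. It assumes the message appears on a single line in the form quoted above and that the build output was saved to a file; `build.log` and `bufr_lib_from_log` are hypothetical names.

```shell
# Print the library path from lines of the form
#   BUFR library <path> set via Environment variable
# Assumption: the whole message sits on one line of the saved build output.
bufr_lib_from_log() {
    sed -n 's/^.*BUFR library[[:space:]]\{1,\}\([^[:space:]]\{1,\}\)[[:space:]]\{1,\}set via Environment variable.*$/\1/p' "$1"
}

# Usage, from the GSI directory:
#   ./ush/build_all_cmake.sh DEBUG "$PWD" 2>&1 | tee build.log
#   [ "$(bufr_lib_from_log build.log)" = "$BUFR_LIBd_DA" ] ||
#       echo "BUFR_LIBd_DA was not picked up"
```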
Yeah, that is probably not working right. Here is how you can be sure
you get the library you want linked in. From the GSI/build directory,
open up src/gsi/CMakeFiles/gsi_DBG.x.dir/link.txt. That has the full
link line used to compile the GSI in debug mode. Search for bufr and
replace the two instances it shows up with the full path to the library
you want to use. After that, delete the GSI/build/bin/gsi_DBG.x file and
run "make" from the GSI/build directory again. It will re-link with the
new library.
…-M
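The manual edit described above could be scripted roughly as follows. This is a sketch, not part of the GSI build system. `swap_bufr_lib` is a hypothetical helper. It assumes every BUFR reference in `link.txt` is a whitespace-delimited path ending in `libbufr*.a`, and that the new path contains no `|`, `&`, or backslash characters, which would confuse `sed`.

```shell
# Replace every libbufr archive named in a CMake link.txt with a new library path.
swap_bufr_lib() {
    link_file=$1
    new_lib=$2
    tmp_file=$(mktemp) || return 1
    # Each whitespace-delimited token that contains "libbufr" and ends in ".a"
    # is swapped for the new path; everything else on the link line is kept.
    sed -E "s|[^[:space:]]*libbufr[^[:space:]]*\.a|${new_lib}|g" "$link_file" > "$tmp_file" &&
        mv "$tmp_file" "$link_file"
}

# Usage, from the GSI/build directory:
#   swap_bufr_lib src/gsi/CMakeFiles/gsi_DBG.x.dir/link.txt /full/path/to/your/libbufr.a
#   rm -f bin/gsi_DBG.x
#   make
```

It is worth opening `link.txt` afterwards to confirm that both instances were replaced and nothing else on the link line changed.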
|
OK, one more question - is there any way to run this thing using an interactive debugger such as gdb? The only executable I can find anywhere is "ctest", but that doesn't contain any internal debugging symbols (even though I used "DEBUG" as the first argument to ./ush/build_all_cmake.sh!?), so it's basically useless within gdb. What I was hoping was to be able to manually run and step through, say, the gsimain.f90 code, so I could figure out exactly where/why it's SIGSEGV faulting. The only other alternative I see is to just start putting in print statements everywhere, but of course that's a huge time sink b/c I then have to go back and recompile the entire package every time I change something. And the line numbers in closbf and status where it says it's failing are very puzzling. |
That is a tough one. This test runs on 56 cores I think, so putting it
in gdb is going to be hard to work with. Let me see if there is a test
that runs on fewer cores (and also fails) that might work.
-M
|
Thanks Mark - I'd really appreciate anything you or anyone else could do to narrow down the scope of what I need to look at here. Just some small code snippet which runs on a single processor but exhibits the same behavior would be ideal, though I'm guessing probably also wishful-thinking ;-) I'm not at all familiar with the GSI hierarchy nor even with running any sort of code across multiple cores, so this is a bit overwhelming for me trying to get a handle on this. According to the stack trace in the stdout from my DEBUG runs, it looks like the progression of F90 calls is gsimain->gsimod->gsisub->glbsoi->observer->read_obs->read_iasi, which is quite a mound (literally thousands of lines of code) to try to dig through just to get to the point where a call to the BUFRLIB subroutine closbf then seems to trigger a SIGSEGV. And closbf itself is a very straightforward subroutine which basically just closes a bunch of open FORTRAN logical units, so it's pretty innocuous, and the print statements I've added so far don't show any immediate clues as to where the real problem may lie. As with many segfault errors, the real memory violation could be far removed from where the abort actually shows up in the stack trace. So again, if there's any way for you or the GSI folks (or anyone?) to narrow down the scope of this for me, or just isolate some smaller snippet of code (maybe just all or part of the read_iasi code?) which leads to the same segfault, I'd really appreciate it! Otherwise, I'm really grasping at straws right now trying to figure out where to look next. |
@jbathegit i can setup something you can work with on a single PE. |
@mark-a-potts @jbathegit
The bufr file has not been opened yet, as far as I can tell. Could there be a runaway condition occurring here? @mark-a-potts It might be worth a shot to comment out that line (L417) and run the test. |
I think that might have been the problem. I need to do a little more testing, but the code got further that time before crashing in crtm, which it does when it is in debug mode. |
Great catch @aerorahul! |
Yes, great catch @aerorahul If the file pointed to by logical unit lnbufr hasn't been "opened" yet, then that could indeed explain it. You'll note I put "opened" in quotation marks here, because I'm talking about opening the file to the BUFRLIB via a call to subroutine openbf, as opposed to just using a FORTRAN open statement to link a filename to a logical unit number. The reason this is significant is because, when using a DA build of the BUFRLIB, the first call to subroutine openbf is also where all of the internal memory arrays for the BUFRLIB actually get dynamically allocated, based on any sizes specified during earlier calls to function isetprm, or else based on the system defaults built into the BUFRLIB. The point is that, until that first call to subroutine OPENBF is made, there's no internal memory available within the BUFRLIB, which means the library itself is basically unusable, and so any call to any other routine (such as closbf) which tries to access such space could certainly trigger a SIGSEGV violation. It's always been kind of presumed that nobody would ever try to call closbf without having first called openbf somewhere else in the code, but that wouldn't have triggered a segfault previously if you weren't using a DA build of the BUFRLIB, because the needed memory would have already been allocated at compile time. |
Well, commenting out the closbf call allowed the global_T62 case to run
to completion in the loproc configuration, but in the hiproc
configuration, it still crashes in closbf (line 61) calling status.f
(line 117). I'll see if I can get more information on that using the
debug build.
…-M
|
@jbathegit Is there a reason why one has to do the Fortran |
@aerorahul this was a conscious design decision made a long time (i.e. decades) ago when the library was first developed, in order to allow maximum portability to different systems which might have different extensions to the Fortran OPEN statement. For example, for a long time SGI-based systems had a non-standard FORM="SYSTEM" extension which allowed a Fortran read of a file as a binary stream without control words. And some systems also had implicit ways to associate logical unit numbers with files on the system outside of an actual Fortran OPEN statement, e.g. using an assign directive, or by simply just naming the file as "fort.#" where # is the logical unit number. Bottom line - a conscious decision was made way back when to keep BUFRLIB as flexible as possible by not including the OPEN statement inside of subroutine openbf. |
So, I added a check in closbf.f (and then renamed it closbf.F) that simply returns if arrays haven't yet been allocated. This seems to fix the "problem" with the GSI. I put problem in quotes because I am not sure it is really a problem with bufrlib as much as with GSI, but I think it does make bufrlib more robust to have this check in place. It looks like this--
C-----------------------------------------------------------------------
C CLOSE fortran UNIT IF NULL(LUN) = 0
|
Is this issue still active? |
I think it is resolved now. |
Yes, it's resolved. Even though this is really an issue that should be fixed in the GSI code, I agreed to pull the CLOSBF "fix" over into the code baseline. |
My understanding of how the dynamic allocation build of bufrlib is supposed to work is that if the sizing of the bufr is not specified, it defaults to the size used for the static allocation. This should make it possible to swap an SA library for a DA version. Unfortunately, this does not seem to work with the GSI using either the old makefile build system or the just completed CMake build system. Can this be fixed?