PGI Fortran: runtime error when using using mpi_f08 module on OpenPOWER #2606

Closed
hppritcha opened this issue Dec 19, 2016 · 31 comments

@hppritcha
Member

@PHHargrove has reported a runtime problem when using the PGI 16.10 Fortran compiler on OpenPOWER.

This posting on devel describes the problem in some more detail:

https://mail-archive.com/devel@lists.open-mpi.org/msg19836.html

@hppritcha hppritcha added the bug label Dec 19, 2016
@hppritcha hppritcha added this to the v2.0.3 milestone Dec 19, 2016
@jjhursey jjhursey self-assigned this Dec 20, 2016
@PHHargrove
Member

@jjhursey
Believe it or not, I see the same failure with the XLC community edition compilers on the same system!
No issues with gcc/gfortran.

@xiachsh

xiachsh commented Jan 10, 2017

@PHHargrove

Which version of the IBM compiler did you use?

@PHHargrove
Member

@xiachsh Here is the full version info for xlc and xlf:

phargrov@openpower-6:~$ xlc -qversion
IBM XL C/C++ for Linux, V13.1.4 (Community Edition)
Version: 13.01.0004.0001
/opt/ibm/xlC/13.1.4/bin/.orig/xlc: note: XL C/C++ Community Edition is a no-charge product and does not include official IBM support. You can provide feedback at the XL on POWER C/C++ Community Edition forum (http://ibm.biz/xlcpp-linux-ce). For information about a fully supported XL C/C++ compiler with advanced optimization features, visit XL C/C++ for Linux (http://ibm.biz/xlcpp-linux).
phargrov@openpower-6:~$ xlf -qversion
IBM XL Fortran for Linux, V15.1.5 (Community Edition)
Version: 15.01.0005.0001
/opt/ibm/xlf/15.1.5/bin/.orig/xlf: 1501-303 (I) XL Fortran Community Edition is a no-charge product and does not include official IBM support. You can provide feedback at the XL on POWER Fortran Community Edition forum (http://ibm.biz/xlfortran-linux-ce). For information about a fully supported XL Fortran compiler, visit XL Fortran for Linux (http://ibm.biz/xlfortran-linux).

@DanielCChen

@PHHargrove.
Are you still experiencing problems building Open MPI with IBM's Community Edition compilers?

I built Open MPI 2.0.2 with the same IBM XLC and XLF compilers you mentioned, and it compiled and ran ring_usempif08.f90 just fine for me.

I built Open MPI 2.0.2 on an LE Power8 machine that runs Ubuntu 14.04.1. The only difference compared to your config file is that I also added --enable-mpi-fortran=usempif08, which is required to build mpi_f08.mod.

Here is the run result:
mpirun -mca btl sm,self -np 2 a.out
Process 0 sending 10 to 1 tag 201 ( 2 processes in ring)
Process 0 sent to 1
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
Process 1 exiting

@PHHargrove
Member

@DanielCChen

If I recall correctly, I had omitted --enable-mpi-fortran=usempif08 because configure had determined that F08 support was present even without that flag. Regardless, I have retested today with the flag passed explicitly.

I am using:

$ xlc -qversion 2>/dev/null
IBM XL C/C++ for Linux, V13.1.4 (Community Edition)
Version: 13.01.0004.0001

on

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.2 LTS
Release:        16.04
Codename:       xenial
$ uname -p
ppc64le

The system is a KVM-based VM on Power8E h/w.

Using Open MPI 2.0.2 I still see the error:

mpirun -mca btl sm,self -np 2 examples/ring_usempif08
[openpower-6:14143] *** An error occurred in MPI_Recv
[openpower-6:14143] *** reported by process [645332993,1]
[openpower-6:14143] *** on communicator MPI_COMM_WORLD
[openpower-6:14143] *** MPI_ERR_TYPE: invalid datatype
[openpower-6:14143] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[openpower-6:14143] ***    and potentially your MPI job)
Process 0 sending 10 to  1 tag 201 ( 2 processes in ring)
[openpower-6:14137] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[openpower-6:14137] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

Meanwhile, ring_usempi and ring_mpifh ran just fine.

@DanielCChen

@PHHargrove
Hmm, for me --enable-mpi-fortran=usempif08 is mandatory when building Open MPI 2.0.2 in order to compile ring_usempif08.f90. Otherwise, the compiler complains at "use mpi_f08" because mpi_f08.mod does not exist.

My guess is that your mpi_f08.mod might have been compiled with a different version of the compiler that uses a different module format, since your compilation did not fail with a complaint about "use mpi_f08".

Can you please try the following:

  1. Run "make clean" first so that all the .mod and .o files are deleted (and will be rebuilt by xlc and xlf).
  2. Configure with the following script, which is my config file:

    #!/bin/bash

    export PREFIX=/path_to_openmpi202

    ./configure --prefix=$PREFIX \
        CC=/path_to_xlc/xlc \
        CXX=/path_to_xlC/xlC \
        FC=/path_to_xlf/xlf \
        --enable-mpi-fortran=usempif08

  3. Then run make && make install.

If you still run into the same error, can you please send me your configure line and the command line you use to compile ring_usempif08.f90?

@PHHargrove
Member

@DanielCChen

My build scripts start with downloading the release tarball and unpacking.
So there is nothing to "make clean".

There are also no MPI installs (Open MPI or otherwise) in $PATH, $LD_LIBRARY_PATH, etc.

The full configure command:

[pathto]/configure --prefix=[...] CC=xlc CXX=xlC FC=xlf --disable-oshmem-fortran --enable-mpi-fortran=usempif08

The full "make -V" is here. This should let you determine if the .mod was compiled with the proper compiler.

There is only one xlc/xlf installation on this system.

The ring_* tests are built by running "make" in the examples directory with the freshly built mpicc/mpifort as the only MPI in $PATH:

mpifort -g ring_usempif08.f90 -o ring_usempif08
** ring   === End of Compilation 1 ===
1501-510  Compilation successful for file ring_usempif08.f90.
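
As a side note, the claim that only one MPI is visible in a search path can be checked mechanically. The following is a minimal sketch, not something taken from the build scripts in this thread: the helper name find_mpi_wrappers is hypothetical, and it assumes the wrapper compiler is called mpifort, as in the command above.

```shell
# Hypothetical helper: print every directory of a colon-separated search
# path (e.g. "$PATH") that contains an executable named mpifort.
find_mpi_wrappers() {
    search_path="$1"
    old_ifs="$IFS"
    IFS=':'
    for dir in $search_path; do
        # Skip empty components such as a leading or trailing ':'
        [ -n "$dir" ] || continue
        if [ -x "$dir/mpifort" ]; then
            printf '%s\n' "$dir"
        fi
    done
    IFS="$old_ifs"
}

# Example (assumed usage): more than one line of output would indicate
# that several MPI installations are visible at once.
# find_mpi_wrappers "$PATH"
```

If this prints exactly one directory, the wrapper used to build the examples is unambiguous.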

@DanielCChen

@PHHargrove
I have another guess. In your log file, when the .F90 files were compiled by xlf, the -D options (e.g. -DHAVE_CONFIG_H) were not passed to CPP the way they are with gfortran, so it is possible the *.F90 files were not preprocessed correctly. We fixed this recently, but the fix is not in the 15.1.5 Community Edition.
If you could send me an email or let me know your email address, I can pass on more information about how to get access to the newer XL compilers.

@DanielCChen

@PHHargrove
Just to add: in order to pass -D* to CPP with xlf, you need to write "-WF,-D*" for each -D*.
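
To illustrate the rewrite described above, here is a minimal sketch of a shell helper that prefixes every -D option with -WF, and leaves other flags alone. The function name wrap_defines_for_xlf is hypothetical and is not part of Open MPI or the XL compilers; only the "-WF,-D*" form comes from this comment.

```shell
# Hypothetical helper: rewrite each -D<macro> argument as -WF,-D<macro>
# so that an affected xlf forwards it to the preprocessor; all other
# arguments are passed through unchanged.
wrap_defines_for_xlf() {
    out=""
    for flag in "$@"; do
        case "$flag" in
            -D*) flag="-WF,$flag" ;;
        esac
        # Join with single spaces, without a leading space
        out="${out:+$out }$flag"
    done
    printf '%s\n' "$out"
}

wrap_defines_for_xlf -DHAVE_CONFIG_H -O2 -DVAR=VAL
# prints: -WF,-DHAVE_CONFIG_H -O2 -WF,-DVAR=VAL
```

The sketch joins flags with spaces, so it assumes no flag contains whitespace.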

@PHHargrove
Member

@DanielCChen

I am aware that Open MPI incorrectly passes -DVAR=VAL to xlf90, and have reported it as an Open MPI bug in the past.

Are you saying that your results and mine differ because you are using a newer xlf?

My email is the same as my github user id plus AT lbl.gov.

@PHHargrove
Member

@DanielCChen

I have access to the following two non-community-edition XL compiler installations:

-bash-4.2$ xlc -qversion
IBM XL C/C++ for Linux, V13.1.5 (5725-C73, 5765-J08)
Version: 13.01.0005.0001
-bash-4.2$ xlf -qversion
IBM XL Fortran for Linux, V15.1.5 (5725-C75, 5765-J10)
Version: 15.01.0005.0001

and

-bash-4.2$ xlf -qversion 2>/dev/null
IBM XL Fortran for Linux, V16.1 (Beta 2)
Version: 16.01.0000.0000
-bash-4.2$ xlc -qversion 2>/dev/null
IBM XL C/C++ for Linux, V14.1 (Beta 2)
Version: 14.01.0000.0000

Both are running on

-bash-4.2$ lsb_release -a
LSB Version:    :core-4.1-noarch:core-4.1-ppc64le
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 7.3 (Maipo)
Release:        7.3
Codename:       Maipo

I do not see the error on that system with either pair of compilers.

@DanielCChen

@PHHargrove

  1. I see you already have access to our Beta compiler; I was going to send you an email about it.
  2. -DVAR=VAL is not passed as an option to CPP by the XLF compiler you have. XLF has been using -D as a shortcut for -qdlines, which specifies whether the compiler compiles fixed source form lines with a D in column 1 or treats them as comments. We changed this recently, so you should get the updated behavior via the Beta program.
  3. I can now reproduce the error on Ubuntu 16.04. It seems to fail only if the MPI libs are dynamically linked in.

@DanielCChen

@PHHargrove
It turns out to be an XLF compiler bug that is exposed by the linker change in Ubuntu 16.04 (compared with Ubuntu 14.04). We have fixed it in the Beta compiler that is coming soon.
Thanks for bringing the issue up!

@jsquyres
Member

@jjhursey @gpaulsen Can someone at IBM make a release note in README about this problem?

@DanielCChen

@jsquyres
Sure. I will make sure to add this to the README.

@jsquyres
Member

@DanielCChen Excellent; thank you.

@hppritcha hppritcha modified the milestones: v2.0.3, v2.0.4 Jun 1, 2017
@hppritcha
Member Author

@jsquyres I think this issue can be closed since we updated the README and this is an XLF compiler bug.

@PHHargrove
Member

@hppritcha There are actually PGI and XLF bugs in this issue.
However, all the discussion has been on XLF so far.
Please don't close unless there is evidence that the original PGI problem has been resolved as well.

@PHHargrove
Member

I have confirmed access to PGI 16.10 and 17.{1,3,4,5} for OpenPower.
I am going to see today if I can at least reproduce w/ OMPI 2.0.2 and PGI-16.10 (where the problem was originally reported).
I will then see about OMPI 2.1.1 with both PGI-16.10 and more recent pgfortran.

@PHHargrove
Member

Problem re-confirmed with Open MPI 2.0.2 and PGI-17.3.
It is still present in Open MPI 2.1.1 (tried PGI-17.3 and 17.5) and 3.0.0rc1 (tested only PGI-17.5).

@jsquyres jsquyres changed the title from "PGI Fortran: runtime error when using using mpi_f09 module on OpenPOWER" to "PGI Fortran: runtime error when using using mpi_f08 module on OpenPOWER" Jul 5, 2017
@jsquyres
Member

jsquyres commented Jul 5, 2017

@jjhursey @DanielCChen Can one of you guys follow up on the PGI issues on this issue? Thanks!

@jjhursey
Member

jjhursey commented Jul 5, 2017

I'll try to take a look this week.

@hppritcha
Member Author

@gpaulsen says to bump to 2.1.2

@hppritcha hppritcha modified the milestones: v2.1.2, v2.0.4 Jul 12, 2017
@hppritcha
Member Author

@gvallee and here's the other PGI issue we discussed before you arrived today.

@jjhursey
Member

This is an old thread and I'm trying to capture where we are so please let me know if I get this wrong.

  • RHEL 7.3 - No issues. Both PGI and XL are passing correctly.
    • @PHHargrove Is that correct?
    • My testing with the following configure options on RHEL 7.3 passed compiling/running ring_usempif08 (just XL at the moment, testing PGI next)
      • --enable-debug CC=xlc_r CXX=xlC_r FC=xlf_r
      • --enable-debug --disable-dlopen CC=xlc_r CXX=xlC_r FC=xlf_r
    • Note that IBM runs CI tests with GNU/PGI/XL on RHEL 7.3 for the community so this should be working.
  • Ubuntu 16.04.2 - Failure with both PGI and XL compilers
    • @DanielCChen You mentioned here that you tracked it to a linker change in Ubuntu that was fixed in XL's current beta. Can you confirm that and give us a little bit of language about the problem that we can include in the README?
    • I wonder if this is related to a similar linker issue that we found for PGI in issue #3075, "PGI F08 symbol Issue on POWER" (comment here)

@DanielCChen

The linker changed in Ubuntu 16.04 in comparison to 14.04 (the one I tried). It no longer searches for and links in the definition of the symbol, which caused the issue. I have fixed the XLF compiler to work around it.

@jjhursey
Member

How about this for README language:

 * Compiling Fortran programs using the mpi_f08 module on PowerPC with
   the PGI (tested 17.5) or XL (tested v15.1.5) Fortran compilers and GNU
   linker after 2.25.1 and before 2.28 will likely experience runtime failures.
   This was noticed on Ubuntu 16.04 which uses the 2.26.1 version of ld by
   default. However, this issue impacts any OS running the impacted
   version of ld. This GNU linker regression will be fixed in version 2.28.
   Below is a link to the GNU bug on this issue:
   https://sourceware.org/bugzilla/show_bug.cgi?id=21306
   The XL compiler will have a fix for this issue in their next release.
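
The version range in the proposed README text (after 2.25.1 and before 2.28) can be checked with a short shell sketch. This is only an illustration: the function name ld_version_affected is hypothetical, and it assumes GNU sort with the -V (version sort) option is available.

```shell
# Hypothetical helper: succeed (exit 0) if the given GNU ld version is
# strictly after 2.25.1 and strictly before 2.28, the range named in the
# proposed README text. Assumes GNU sort, which provides -V.
ld_version_affected() {
    v="$1"
    # Both bounds are exclusive
    [ "$v" != "2.25.1" ] && [ "$v" != "2.28" ] || return 1
    lowest=$(printf '%s\n%s\n' "2.25.1" "$v" | sort -V | head -n 1)
    highest=$(printf '%s\n%s\n' "$v" "2.28" | sort -V | tail -n 1)
    [ "$lowest" = "2.25.1" ] && [ "$highest" = "2.28" ]
}

# Example (assumed usage): the last field of the first line printed by
# "ld --version" is normally the version number.
# ld_version_affected "$(ld --version | head -n 1 | awk '{print $NF}')" \
#     && echo "this ld is in the affected range"
```

With this sketch, 2.26.1 (the Ubuntu 16.04 default mentioned above) falls inside the range, while 2.25.1 and 2.28 fall outside it.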

jjhursey added a commit to jjhursey/ompi that referenced this issue Jul 12, 2017
 * Related to Issue open-mpi#2606 and Issue open-mpi#3075
 * The core problem in those two issues is related to a regression in
   ld upstream. Add a note in the README about this issue.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
@jjhursey
Member

I created a PR for the README change (probably an easier place to discuss the language): PR #3866

@PHHargrove
Member

@jjhursey

Based on my testing of 3.0.0rc1, I can confirm:

FWIW: it looks like #3075 has the identical failure mode as #2606 - they are duplicate bug reports as far as I can tell.

@PHHargrove
Member

I wrote:

Earlier testing w/ V15.1.5 was the same to the best of my recollection.

I retested XLF V15.1.5 on RHEL 7.3 today and reconfirmed no problem w/ MPI F08 bindings.
I configured with --disable-oshmem-fortran to avoid the OSHMEM F08 bindings issue: #3612

jjhursey added a commit to jjhursey/ompi that referenced this issue Jul 13, 2017
 * Related to Issue open-mpi#2606 and Issue open-mpi#3075
 * The core problem in those two issues is related to a regression in
   ld upstream. Add a note in the README about this issue.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
(cherry picked from commit 1c6a253)
Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
@jjhursey
Member

jjhursey commented Aug 9, 2017

The README on the various branches has now been updated. I think we can close this Issue now.

@jjhursey jjhursey closed this as completed Aug 9, 2017