
ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c #4437

Closed
shinechou opened this issue Nov 2, 2017 · 25 comments


@shinechou

shinechou commented Nov 2, 2017

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Open MPI v3.0.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Following the installation guidance in the FAQ:

shell$ gunzip -c openmpi-3.0.0.tar.gz | tar xf -
shell$ cd openmpi-3.0.0 
shell$ ./configure --enable-orterun-prefix-by-default --with-cuda
shell$ make all install

Please describe the system on which you are running

  • Operating system/version: Ubuntu v16.04
  • Computer hardware: HP-Z820 (GTX 1080Ti)
  • Network type: Ethernet

Details of the problem

shell$ mpirun -np 2 -x LD_LIBRARY_PATH -hostfile machinefile python fcn_horovod.py 
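
For reference, the machinefile passed to -hostfile simply lists the participating hosts. Its actual contents are not shown in this report; a minimal sketch, with hypothetical host names and slot counts, would be:

master slots=1
client slots=1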

I got the error as below,

[brs-dualG:09057] [[48500,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 351

An internal error has occurred in ORTE:

[[48500,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(355)

This is something that should be reported to the developers.

Thanks a lot in advance.

@rhc54
Contributor

rhc54 commented Nov 2, 2017

This looks like version confusion between the two nodes. Note that the -x option only forwards that variable to the application processes, not the daemons launched by mpirun. I suspect the path on the remote node is picking up a different OMPI install. Can you check?
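
For example, a quick non-interactive way to check which install the remote node picks up (client below is a placeholder for the remote host name) is:

shell$ ssh client which orted
shell$ ssh client ompi_info | head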

@shinechou
Author

@rhc54: thanks a lot. I don't think that is the case, because I compiled/installed them using exactly the same configuration; you can check the ompi_info output below. Thanks again.

ompi_info of the local node:
Open MPI: 3.0.0
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017
MPI API: 3.1.0
Ident string: 3.0.0
Prefix: /usr/local
Configured architecture: x86_64-unknown-linux-gnu
Configure host: ryan-z820
Configured by: root
Configured on: Fri Oct 27 16:11:22 CEST 2017
Configure host: ryan-z820
Configure command line: '--enable-orterun-prefix-by-default' '--with-cuda'

the ompi_info of the remote node:
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017
MPI API: 3.1.0
Ident string: 3.0.0
Prefix: /usr/local
Configured architecture: x86_64-unknown-linux-gnu
Configure host: brs-dualG
Configured by: brs
Configured on: Thu Oct 26 09:47:52 CEST 2017
Configure host: brs-dualG
Configure command line: '--enable-orterun-prefix-by-default' '--with-cuda'

@rhc54
Contributor

rhc54 commented Nov 2, 2017

Thanks for providing that output. I gather you built/installed them on each node separately, yes? That is a rather unusual way of doing it and generally not recommended - it is much safer to install on a shared file system directory.

Try configuring with --enable-debug, and then run the following:

$ mpirun -npernode 1 -mca plm_base_verbose 5 hostname
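
A sketch of that rebuild, reusing the configure options from the original report (only --enable-debug is new here):

shell$ ./configure --enable-orterun-prefix-by-default --with-cuda --enable-debug
shell$ make all install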

@jsquyres
Member

jsquyres commented Nov 2, 2017

@shinechou You might also want to check that there's not some OS/distro-installed Open MPI on your nodes that is being found and used (e.g., earlier in the PATH than your hand-installed Open MPI installations).

@shinechou
Author

@rhc54: you are right, I installed them separately. What is the proper way to do it? Do you have any guidance for that? I ask because after installing Open MPI I have to install another library on top of it. I'll try to compile it with the --enable-debug option and run the command you mentioned.

@jsquyres: thank you for your comments. But I'm sure there is no other Open MPI installed on either of my nodes.

@jsquyres
Member

jsquyres commented Nov 2, 2017

You could install Open MPI on one node, and then tar up the installation tree on that node, and then untar it on the other node. Then you'd know for sure that you have exactly the same binary installation on both nodes. Something like this:

$ ./configure --prefix=/opt/openmpi-3.0.0
$ make -j 32 install
...
$ cd /opt
$ tar jcf ~/ompi-install-3.0.0.tar.bz2 openmpi-3.0.0
$ scp ~/ompi-install-3.0.0.tar.bz2 @othernode:

$ ssh othernode
...login to othernode...
$ cd /opt
$ rm -rf openmpi-3.0.0
$ sudo tar xf ~/ompi-install-3.0.0.tar.bz2

Usually, people install Open MPI either via package (e.g., RPM) on each node, or they install Open MPI on a network filesystem (such as NFS) so that the one, single installation is available on all nodes.
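
A sketch of the shared-filesystem approach, assuming /shared/nfs is an NFS mount visible on all nodes (the path is hypothetical):

$ ./configure --prefix=/shared/nfs/openmpi-3.0.0 --enable-orterun-prefix-by-default --with-cuda
$ make -j 32 install

...and then point PATH and LD_LIBRARY_PATH at /shared/nfs/openmpi-3.0.0/bin and /shared/nfs/openmpi-3.0.0/lib on every node.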


Note that I mentioned the multiple Open MPI installation issue because the majority of time people run into this error, it's because users are accidentally / unknowingly using multiple different versions of Open MPI (note that Open MPI currently only supports running exactly the same version of Open MPI on all nodes in a single job). This kind of error almost always indicates that version X of Open MPI is trying to read more data than was sent by Open MPI version Y.

Try this exercise:

$ ompi_info | head
$ ssh othernode ompi_info | head

Doing the 2nd line non-interactively is important (i.e., a single command -- not ssh'ing to the other node and then entering another command to run ompi_info).

Make sure that both ompi_info outputs return the same version.

If they do, then there's something configured differently between the two (but which might still be a bug, because "same version but configured differently" should still usually work).

@shinechou
Author

@jsquyres: thanks again for your guidance. I tried your exercise; both return the same version, as shown below,

$ ompi_info | head
Package: Open MPI root@ryan-z820 Distribution
Open MPI: 3.0.0
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017

$ ssh mpiuser@client ompi_info | head
mpiuser@client's password:
Package: Open MPI root@brs-dualG Distribution
Open MPI: 3.0.0
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017

@shinechou
Author

shinechou commented Nov 2, 2017

@rhc54: thanks. I've tried your suggestion to run $ mpirun -npernode 1 -mca plm_base_verbose 5 hostname, but I got an error message like this:

$ mpirun -npernode 1 -mca plm_base_verbose 5 master
[ryan-z820:21968] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[ryan-z820:21968] plm:base:set_hnp_name: initial bias 21968 nodename hash 974627533
[ryan-z820:21968] plm:base:set_hnp_name: final jobfam 52490
[ryan-z820:21968] [[52490,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[ryan-z820:21968] [[52490,0],0] plm:base:receive start comm
[ryan-z820:21968] [[52490,0],0] plm:base:setup_job
[ryan-z820:21968] [[52490,0],0] plm:base:setup_vm
[ryan-z820:21968] [[52490,0],0] plm:base:setup_vm creating map
[ryan-z820:21968] [[52490,0],0] setup:vm: working unmanaged allocation
[ryan-z820:21968] [[52490,0],0] using default hostfile /usr/local/etc/openmpi-default-hostfile
[ryan-z820:21968] [[52490,0],0] plm:base:setup_vm only HNP in allocation
[ryan-z820:21968] [[52490,0],0] plm:base:setting slots for node ryan-z820 by cores
[ryan-z820:21968] [[52490,0],0] complete_setup on job [52490,1]
[ryan-z820:21968] [[52490,0],0] plm:base:launch_apps for job [52490,1]
[ryan-z820:21968] [[52490,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       ryan-z820
Executable: master
--------------------------------------------------------------------------
[ryan-z820:21968] [[52490,0],0] plm:base:receive stop comm

or

$ mpirun -npernode 1 -mca plm_base_verbose 1 master python fcn_horovod.py 
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       ryan-z820
Executable: master

@jsquyres
Member

jsquyres commented Nov 2, 2017

GitHub pro tip: use three backticks to denote verbatim regions. See https://guides.github.com/features/mastering-markdown/.

Ok, good, so you have the same Open MPI v3.0.0 installed on both sides. But something must be different between them, or you wouldn't be getting these errors.

  • Are both machines the same hardware?
  • Are they running the same version of your Linux distro, configured generally the same way (e.g., both 64-bit)?

Also, I think @rhc54 meant for you to run the hostname executable -- not master. hostname(1) is a Linux command that tells you what host you are on. He did not mean for you to replace hostname with the actual hostname of the machine (assumedly master).

@shinechou
Author

shinechou commented Nov 2, 2017

@jsquyres: thanks a lot. Sorry, I'm a noob with Linux and Open MPI, so I didn't realize that hostname was meant literally (indeed my hostname is master). I am using the same version of Ubuntu on both nodes (Ubuntu 16.04, 64-bit desktop), but the hardware of the two nodes is indeed different: the "master" node is an HP Z820 workstation (Xeon E5-2670, 64 GB ECC RAM, ASUS GTX 1080) and the "client" node is a DIY PC (i3-7100, 32 GB DDR4 RAM, ASUS GTX 1080Ti). Could the difference in hardware configuration between the two nodes cause this error?

$ mpirun -npernode 1 -mca plm_base_verbose 5 hostname
[ryan-z820:22613] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[ryan-z820:22613] plm:base:set_hnp_name: initial bias 22613 nodename hash 974627533
[ryan-z820:22613] plm:base:set_hnp_name: final jobfam 49295
[ryan-z820:22613] [[49295,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[ryan-z820:22613] [[49295,0],0] plm:base:receive start comm
[ryan-z820:22613] [[49295,0],0] plm:base:setup_job
[ryan-z820:22613] [[49295,0],0] plm:base:setup_vm
[ryan-z820:22613] [[49295,0],0] plm:base:setup_vm creating map
[ryan-z820:22613] [[49295,0],0] setup:vm: working unmanaged allocation
[ryan-z820:22613] [[49295,0],0] using default hostfile /usr/local/etc/openmpi-default-hostfile
[ryan-z820:22613] [[49295,0],0] plm:base:setup_vm only HNP in allocation
[ryan-z820:22613] [[49295,0],0] plm:base:setting slots for node ryan-z820 by cores
[ryan-z820:22613] [[49295,0],0] complete_setup on job [49295,1]
[ryan-z820:22613] [[49295,0],0] plm:base:launch_apps for job [49295,1]
[ryan-z820:22613] [[49295,0],0] plm:base:launch wiring up iof for job [49295,1]
[ryan-z820:22613] [[49295,0],0] plm:base:launch job [49295,1] is not a dynamic spawn
ryan-z820
[ryan-z820:22613] [[49295,0],0] plm:base:orted_cmd sending orted_exit commands
[ryan-z820:22613] [[49295,0],0] plm:base:receive stop comm

@rhc54
Contributor

rhc54 commented Nov 2, 2017

Sorry for the confusion - I expected you to retain the -hostfile machinefile option.

@shinechou
Author

shinechou commented Nov 2, 2017

@rhc54: thanks. Could you please help me figure it out? Please check the output below,

$ mpirun -npernode 1 -hostfile machinefile -mca plm_base_verbose 5 hostname
[ryan-z820:15616] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[ryan-z820:15616] plm:base:set_hnp_name: initial bias 15616 nodename hash 974627533
[ryan-z820:15616] plm:base:set_hnp_name: final jobfam 42458
[ryan-z820:15616] [[42458,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[ryan-z820:15616] [[42458,0],0] plm:base:receive start comm
[ryan-z820:15616] [[42458,0],0] plm:base:setup_job
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm creating map
[ryan-z820:15616] [[42458,0],0] setup:vm: working unmanaged allocation
[ryan-z820:15616] [[42458,0],0] using hostfile machinefile
[ryan-z820:15616] [[42458,0],0] checking node ryan-z820
[ryan-z820:15616] [[42458,0],0] ignoring myself
[ryan-z820:15616] [[42458,0],0] checking node client
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm add new daemon [[42458,0],1]
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm assigning new daemon [[42458,0],1] to node client
[ryan-z820:15616] [[42458,0],0] plm:rsh: launching vm
[ryan-z820:15616] [[42458,0],0] plm:rsh: local shell: 0 (bash)
[ryan-z820:15616] [[42458,0],0] plm:rsh: assuming same remote shell as local shell
[ryan-z820:15616] [[42458,0],0] plm:rsh: remote shell: 0 (bash)
[ryan-z820:15616] [[42458,0],0] plm:rsh: final template argv:
/usr/bin/ssh PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2782527488" -mca ess_base_vpid "" -mca ess_base_num_procs "2" -mca orte_node_regex "ryan-z820,client@0(2)" -mca orte_hnp_uri "2782527488.0;tcp://192.168.1.1:50882" -mca plm_base_verbose "5" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" -mca pmix "^s1,s2,cray,isolated"
[ryan-z820:15616] [[42458,0],0] plm:rsh:launch daemon 0 not a child of mine
[ryan-z820:15616] [[42458,0],0] plm:rsh: adding node client to launch list
[ryan-z820:15616] [[42458,0],0] plm:rsh: activating launch event
[ryan-z820:15616] [[42458,0],0] plm:rsh: recording launch of daemon [[42458,0],1]
[ryan-z820:15616] [[42458,0],0] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh client PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2782527488" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex "ryan-z820,client@0(2)" -mca orte_hnp_uri "2782527488.0;tcp://192.168.1.1:50882" -mca plm_base_verbose "5" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" -mca pmix "^s1,s2,cray,isolated"]
[brs-dualG:03278] [[42458,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[brs-dualG:03278] [[42458,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[brs-dualG:03278] [[42458,0],1] plm:base:receive start comm
[ryan-z820:15616] [[42458,0],0] plm:base:orted_report_launch from daemon [[42458,0],1]
[ryan-z820:15616] [[42458,0],0] plm:base:orted_report_launch from daemon [[42458,0],1] on node brs-dualG
[ryan-z820:15616] [[42458,0],0] RECEIVED TOPOLOGY SIG 0N:1S:1L3:2L2:2L1:2C:4H:x86_64:le FROM NODE brs-dualG
[ryan-z820:15616] [[42458,0],0] NEW TOPOLOGY - ADDING
[ryan-z820:15616] [[42458,0],0] plm:base:orted_report_launch completed for daemon [[42458,0],1] at contact 2782527488.1;tcp://192.168.1.6:33197
[ryan-z820:15616] [[42458,0],0] plm:base:orted_report_launch recvd 2 of 2 reported daemons
[ryan-z820:15616] [[42458,0],0] plm:base:setting slots for node ryan-z820 by cores
[ryan-z820:15616] [[42458,0],0] plm:base:setting slots for node client by cores
[ryan-z820:15616] [[42458,0],0] complete_setup on job [42458,1]
[ryan-z820:15616] [[42458,0],0] plm:base:launch_apps for job [42458,1]
[brs-dualG:03278] [[42458,0],1] plm:rsh: remote spawn called
[brs-dualG:03278] [[42458,0],1] plm:rsh: remote spawn - have no children!
[brs-dualG:03278] [[42458,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 351

An internal error has occurred in ORTE:

[[42458,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(355)

This is something that should be reported to the developers.

[ryan-z820:15616] [[42458,0],0] plm:base:receive processing msg
[ryan-z820:15616] [[42458,0],0] plm:base:receive update proc state command from [[42458,0],1]
[ryan-z820:15616] [[42458,0],0] plm:base:receive got update_proc_state for job [42458,0]
[ryan-z820:15616] [[42458,0],0] plm:base:receive got update_proc_state for vpid 1 state CALLED ABORT exit_code -26
[ryan-z820:15616] [[42458,0],0] plm:base:receive done processing commands
[ryan-z820:15616] [[42458,0],0] plm:base:orted_cmd sending orted_exit commands
[brs-dualG:03278] [[42458,0],1] plm:base:receive stop comm
[ryan-z820:15616] [[42458,0],0] plm:base:receive stop comm


@shinechou
Author

@rhc54: I've provided the log from your debug command; could you please help me check it? Thanks a lot in advance.

@rhc54
Contributor

rhc54 commented Nov 6, 2017

I honestly am stumped - it looks like you basically received an empty buffer, and I have no idea why. I can't replicate it. Perhaps you might try with the nightly snapshot of the 3.0.x branch to see if something has been fixed that might have caused the problem?

@shinechou
Author

@rhc54: thank you. I'll try the nightly version to see what comes out.

@shinechou
Author

@rhc54: the problem has been resolved. It seems it was caused by the different hardware architectures: one node uses a Xeon and the other an i3, and after I replaced the Xeon node with an i3 node it works fine. Another possible factor is that the Xeon workstation has two network adapters, one of which is used for AMT; I'm not sure whether that affects OMPI though.
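
If the second (AMT) adapter was a factor, one way to rule it out (a sketch only, not verified in this thread; eno1 is a placeholder for the interface that actually connects the two nodes) would be to pin Open MPI's out-of-band and TCP traffic to a single interface:

shell$ mpirun -np 2 -hostfile machinefile \
    -mca oob_tcp_if_include eno1 -mca btl_tcp_if_include eno1 \
    -x LD_LIBRARY_PATH python fcn_horovod.py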

@zhanglistar

@shinechou I have the same problem, and I have checked the Open MPI versions; they are the same. Could you tell me how you tracked it down to a hardware problem? I have checked my network and its adapters; they are the same.

@shinechou
Author

@zhanglistar: in my case, one of my nodes was an HP workstation with a Xeon CPU, which had a different hardware configuration than the master node (a regular PC). So I stopped using the HP workstation and used another regular PC instead.

@zwets

zwets commented Mar 5, 2018

For the benefit of others running into this error or "ORTE_ERROR_LOG: Data unpack had inadequate space": in my case the issue was resolved by switching to the internal hwloc.

I had compiled OpenMPI 3.0.0 on two different Ubuntu releases (16.04 and 17.10), both configured identically, and with --with-hwloc=/usr, thus using the Ubuntu-provided libhwloc-dev package. The version of libhwloc was 1.11.2 on xenial (16.04), and 1.11.5 on artful (17.10).

Running mpirun -H xenial,artful from artful worked fine, but running it from xenial consistently failed at if (OPAL_SUCCESS != (rc = opal_dss.unpack(data, &topo, &idx, OPAL_HWLOC_TOPO))) in ./orte/mca/plm/base/plm_base_launch_support.c.

Removing --with-hwloc=/usr from the configure step, thus switching to OpenMPI's internal hwloc (and uninstalling libhwloc-dev on both machines, though this shouldn't be necessary) resolved the issue.
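
For anyone reproducing that fix, a sketch of the rebuild (your usual configure options are a placeholder; the only change is dropping the external hwloc):

shell$ ./configure <your usual options>   # omit --with-hwloc=/usr so the bundled hwloc is used
shell$ make all install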

@sjeaugey
Member

sjeaugey commented Jul 8, 2019

I got this same problem with Open MPI 4.0.1, when built locally on each machine (having machines with different generations of Intel CPUs).

A Sandybridge machine would not be able to communicate with Skylake nodes. Copying the Sandybridge binaries over to the Skylake nodes fixed the issue.

So we have a problem where different architectures produce different binaries (structures ?) which are not compatible protocol-wise.

Do you think that should be fixed or just documented (don't build Open MPI locally on each machine) @rhc54 ?

@rhc54
Contributor

rhc54 commented Jul 8, 2019

@sjeaugey It sounds like you built with a different hwloc version on the two types of nodes?

@sjeaugey
Member

sjeaugey commented Jul 8, 2019

That was not my impression as I could not find any trace of hwloc anywhere on the nodes (so I assume both were compiled with the internal hwloc).

Reading the whole issue, I could not determine whether the fix came from the hwloc change or the fact that the binary was propagated from one machine to the others as suggested by Jeff in #4437 (comment)

Now, I did not compile the non-working libraries myself, nor did I try to re-compile the working version on each node to confirm it would break, so I'm not 100% sure yet. I'll update the bug if I can reproduce it better.

@RahulKulhari

RahulKulhari commented Sep 26, 2019

Open MPI Version: v4.0.0

Output of ompi_info | head on the two machines:

mpiuser@s2:~$ ssh s1 ompi_info | head
                 Package: Open MPI mpiuser@s1 Distribution
                Open MPI: 4.0.0
  Open MPI repo revision: v4.0.0
   Open MPI release date: Nov 12, 2018
                Open RTE: 4.0.0
  Open RTE repo revision: v4.0.0
   Open RTE release date: Nov 12, 2018
                    OPAL: 4.0.0
      OPAL repo revision: v4.0.0
       OPAL release date: Nov 12, 2018
mpiuser@s2:~$ ompi_info | head
                 Package: Open MPI mpiuser@s2 Distribution
                Open MPI: 4.0.0
  Open MPI repo revision: v4.0.0
   Open MPI release date: Nov 12, 2018
                Open RTE: 4.0.0
  Open RTE repo revision: v4.0.0
   Open RTE release date: Nov 12, 2018
                    OPAL: 4.0.0
      OPAL repo revision: v4.0.0
       OPAL release date: Nov 12, 2018

Both are installed using a common shared network filesystem.

While running the command on s1 (master):

mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ mpirun -n 2 ./hello
Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s1 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 112)
Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s1 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 112)

While running the command separately on s2 (slave):

mpiuser@s2:~/cloud$ mpirun -n 2 ./hello
Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s2 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 113)
Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s2 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 113)

Output of the hwloc package check on s2:

mpiuser@s2:~/cloud/openmpi-4.0.0$ dpkg -l | grep hwloc
mpiuser@s2:~/cloud/openmpi-4.0.0$

Output of the hwloc package check on s1:

mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ dpkg -l | grep hwloc
mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$

Both machines are running on Ubuntu 16.04.5 LTS

But running the command distributed across both nodes gives the following error:

mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ mpirun -host s1,s2 ./hello
[s2:26283] [[40517,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[40517,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

@jsquyres
Member

@RahulKulhari Please do not add new issues to a closed issue; thanks.

@usamazf

usamazf commented Jan 7, 2020

@RahulKulhari were you able to resolve the issue? I'm facing the same problem!
