
ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c #4437

Closed
shinechou opened this issue Nov 2, 2017 · 25 comments


@shinechou

shinechou commented Nov 2, 2017

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

Open MPI v3.0.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Following the installation guidance in the FAQ:

shell$ gunzip -c openmpi-3.0.0.tar.gz | tar xf -
shell$ cd openmpi-3.0.0 
shell$ ./configure --enable-orterun-prefix-by-default --with-cuda
shell$ make all install

Please describe the system on which you are running

  • Operating system/version: Ubuntu v16.04
  • Computer hardware: HP-Z820 (GTX 1080Ti)
  • Network type: Ethernet

Details of the problem

shell$ mpirun -np 2 -x LD_LIBRARY_PATH -hostfile machinefile python fcn_horovod.py 
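
For reference, the machinefile passed to -hostfile simply lists the participating hosts. Its actual contents are not shown in this report; a minimal sketch, with hypothetical host names and slot counts, would be:

master slots=1
client slots=1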

I got the error as below,

[brs-dualG:09057] [[48500,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 351

An internal error has occurred in ORTE:

[[48500,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(355)

This is something that should be reported to the developers.

Thanks a lot in advance.

@rhc54
Contributor

rhc54 commented Nov 2, 2017

This looks like version confusion between the two nodes. Note that the -x option only forwards that variable to the application processes, not the daemons launched by mpirun. I suspect the path on the remote node is picking up a different OMPI install. Can you check?
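
For example, a quick non-interactive way to check which install the remote node picks up (client below is a placeholder for the remote host name) is:

shell$ ssh client which orted
shell$ ssh client ompi_info | head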

@shinechou
Author

@rhc54: thanks a lot. I don't think that is the case, because I compiled/installed them using exactly the same configuration; you can check the ompi_info output below. Thanks again.

ompi_info of the local node:
Open MPI: 3.0.0
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017
MPI API: 3.1.0
Ident string: 3.0.0
Prefix: /usr/local
Configured architecture: x86_64-unknown-linux-gnu
Configure host: ryan-z820
Configured by: root
Configured on: Fri Oct 27 16:11:22 CEST 2017
Configure host: ryan-z820
Configure command line: '--enable-orterun-prefix-by-default' '--with-cuda'

the ompi_info of the remote node:
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017
MPI API: 3.1.0
Ident string: 3.0.0
Prefix: /usr/local
Configured architecture: x86_64-unknown-linux-gnu
Configure host: brs-dualG
Configured by: brs
Configured on: Thu Oct 26 09:47:52 CEST 2017
Configure host: brs-dualG
Configure command line: '--enable-orterun-prefix-by-default' '--with-cuda'

@rhc54
Contributor

rhc54 commented Nov 2, 2017

Thanks for providing that output. I gather you built/installed them on each node separately, yes? That is a rather unusual way of doing it and generally not recommended - it is much safer to install on a shared file system directory.

Try configuring with --enable-debug, and then run the following:

$ mpirun -npernode 1 -mca plm_base_verbose 5 hostname
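
A sketch of that rebuild, reusing the configure options from the original report (only --enable-debug is new here):

shell$ ./configure --enable-orterun-prefix-by-default --with-cuda --enable-debug
shell$ make all install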

@jsquyres
Member

jsquyres commented Nov 2, 2017

@shinechou You might also want to check that there's not some OS/distro-installed Open MPI on your nodes that is being found and used (e.g., earlier in the PATH than your hand-installed Open MPI installations).

@shinechou
Author

@rhc54: you are right, I installed them separately. What is the proper way to do it? Do you have any guidance for that? I ask because after installing Open MPI I have to install another library on top of it. I'll try to compile it with the --enable-debug option and run the command you mentioned.

@jsquyres: thank you for your comments. But I'm sure there is no other Open MPI installed on either of my nodes.

@jsquyres
Member

jsquyres commented Nov 2, 2017

You could install Open MPI on one node, and then tar up the installation tree on that node, and then untar it on the other node. Then you'd know for sure that you have exactly the same binary installation on both nodes. Something like this:

$ ./configure --prefix=/opt/openmpi-3.0.0
$ make -j 32 install
...
$ cd /opt
$ tar jcf ~/ompi-install-3.0.0.tar.bz2 openmpi-3.0.0
$ scp ~/ompi-install-3.0.0.tar.bz2 @othernode:

$ ssh othernode
...login to othernode...
$ cd /opt
$ rm -rf openmpi-3.0.0
$ sudo tar xf ~/ompi-install-3.0.0.tar.bz2

Usually, people install Open MPI either via package (e.g., RPM) on each node, or they install Open MPI on a network filesystem (such as NFS) so that the one, single installation is available on all nodes.
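
A sketch of the shared-filesystem approach, assuming /shared/nfs is an NFS mount visible on all nodes (the path is hypothetical):

$ ./configure --prefix=/shared/nfs/openmpi-3.0.0 --enable-orterun-prefix-by-default --with-cuda
$ make -j 32 install

...and then point PATH and LD_LIBRARY_PATH at /shared/nfs/openmpi-3.0.0/bin and /shared/nfs/openmpi-3.0.0/lib on every node.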


Note that I mentioned the multiple Open MPI installation issue because the majority of time people run into this error, it's because users are accidentally / unknowingly using multiple different versions of Open MPI (note that Open MPI currently only supports running exactly the same version of Open MPI on all nodes in a single job). This kind of error almost always indicates that version X of Open MPI is trying to read more data than was sent by Open MPI version Y.

Try this exercise:

$ ompi_info | head
$ ssh othernode ompi_info | head

Doing the 2nd line non-interactively is important (i.e., a single command -- not ssh'ing to the other node and then entering another command to run ompi_info).

Make sure that both ompi_info outputs return the same version.

If they do, then there's something configured differently between the two (but which might still be a bug, because "same version but configured differently" should still usually work).

@shinechou
Author

@jsquyres: thanks again for your guidance. I tried your exercise; both return the same version, as shown below,

$ ompi_info | head
Package: Open MPI root@ryan-z820 Distribution
Open MPI: 3.0.0
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017

$ ssh mpiuser@client ompi_info | head
mpiuser@client's password:
Package: Open MPI root@brs-dualG Distribution
Open MPI: 3.0.0
Open MPI repo revision: v3.0.0
Open MPI release date: Sep 12, 2017
Open RTE: 3.0.0
Open RTE repo revision: v3.0.0
Open RTE release date: Sep 12, 2017
OPAL: 3.0.0
OPAL repo revision: v3.0.0
OPAL release date: Sep 12, 2017

@shinechou
Author

shinechou commented Nov 2, 2017

@rhc54: thanks. I've tried your suggestion to run $ mpirun -npernode 1 -mca plm_base_verbose 5 hostname, but I got an error message like this:

$ mpirun -npernode 1 -mca plm_base_verbose 5 master
[ryan-z820:21968] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[ryan-z820:21968] plm:base:set_hnp_name: initial bias 21968 nodename hash 974627533
[ryan-z820:21968] plm:base:set_hnp_name: final jobfam 52490
[ryan-z820:21968] [[52490,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[ryan-z820:21968] [[52490,0],0] plm:base:receive start comm
[ryan-z820:21968] [[52490,0],0] plm:base:setup_job
[ryan-z820:21968] [[52490,0],0] plm:base:setup_vm
[ryan-z820:21968] [[52490,0],0] plm:base:setup_vm creating map
[ryan-z820:21968] [[52490,0],0] setup:vm: working unmanaged allocation
[ryan-z820:21968] [[52490,0],0] using default hostfile /usr/local/etc/openmpi-default-hostfile
[ryan-z820:21968] [[52490,0],0] plm:base:setup_vm only HNP in allocation
[ryan-z820:21968] [[52490,0],0] plm:base:setting slots for node ryan-z820 by cores
[ryan-z820:21968] [[52490,0],0] complete_setup on job [52490,1]
[ryan-z820:21968] [[52490,0],0] plm:base:launch_apps for job [52490,1]
[ryan-z820:21968] [[52490,0],0] plm:base:orted_cmd sending orted_exit commands
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       ryan-z820
Executable: master
--------------------------------------------------------------------------
[ryan-z820:21968] [[52490,0],0] plm:base:receive stop comm

or

$ mpirun -npernode 1 -mca plm_base_verbose 1 master python fcn_horovod.py 
--------------------------------------------------------------------------
mpirun was unable to find the specified executable file, and therefore
did not launch the job.  This error was first reported for process
rank 0; it may have occurred for other processes as well.

NOTE: A common cause for this error is misspelling a mpirun command
      line parameter option (remember that mpirun interprets the first
      unrecognized command line token as the executable).

Node:       ryan-z820
Executable: master

@jsquyres
Member

jsquyres commented Nov 2, 2017

GitHub pro tip: use three backticks to denote verbatim regions. See https://guides.github.com/features/mastering-markdown/.

Ok, good, so you have the same Open MPI v3.0.0 installed on both sides. But something must be different between them, or you wouldn't be getting these errors.

  • Are both machines the same hardware?
  • Are they running the same version of your Linux distro, configured generally the same way (e.g., both 64-bit)?

Also, I think @rhc54 meant for you to run the hostname executable -- not master. hostname(1) is a Linux command that tells you what host you are on. He did not mean for you to replace hostname with the actual hostname of the machine (assumedly master).

@shinechou
Author

shinechou commented Nov 2, 2017

@jsquyres: thanks a lot. Sorry, I'm a noob with Linux and Open MPI, so I didn't realize that hostname was meant literally (indeed my hostname is master). I am using the same version of Ubuntu on both nodes (Ubuntu 16.04, 64-bit desktop), but the hardware of the two nodes is indeed different: the "master" node is an HP Z820 workstation (Xeon E5-2670, 64 GB ECC RAM, ASUS GTX 1080) and the "client" node is a DIY PC (i3-7100, 32 GB DDR4 RAM, ASUS GTX 1080Ti). Could the difference in hardware configuration between the two nodes cause this error?

$ mpirun -npernode 1 -mca plm_base_verbose 5 hostname
[ryan-z820:22613] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[ryan-z820:22613] plm:base:set_hnp_name: initial bias 22613 nodename hash 974627533
[ryan-z820:22613] plm:base:set_hnp_name: final jobfam 49295
[ryan-z820:22613] [[49295,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[ryan-z820:22613] [[49295,0],0] plm:base:receive start comm
[ryan-z820:22613] [[49295,0],0] plm:base:setup_job
[ryan-z820:22613] [[49295,0],0] plm:base:setup_vm
[ryan-z820:22613] [[49295,0],0] plm:base:setup_vm creating map
[ryan-z820:22613] [[49295,0],0] setup:vm: working unmanaged allocation
[ryan-z820:22613] [[49295,0],0] using default hostfile /usr/local/etc/openmpi-default-hostfile
[ryan-z820:22613] [[49295,0],0] plm:base:setup_vm only HNP in allocation
[ryan-z820:22613] [[49295,0],0] plm:base:setting slots for node ryan-z820 by cores
[ryan-z820:22613] [[49295,0],0] complete_setup on job [49295,1]
[ryan-z820:22613] [[49295,0],0] plm:base:launch_apps for job [49295,1]
[ryan-z820:22613] [[49295,0],0] plm:base:launch wiring up iof for job [49295,1]
[ryan-z820:22613] [[49295,0],0] plm:base:launch job [49295,1] is not a dynamic spawn
ryan-z820
[ryan-z820:22613] [[49295,0],0] plm:base:orted_cmd sending orted_exit commands
[ryan-z820:22613] [[49295,0],0] plm:base:receive stop comm

@rhc54
Contributor

rhc54 commented Nov 2, 2017

Sorry for the confusion - I expected you to retain the -hostfile machinefile option.

@shinechou
Author

shinechou commented Nov 2, 2017

@rhc54: thanks. Could you please help me figure it out? Please check the output below,

$ mpirun -npernode 1 -hostfile machinefile -mca plm_base_verbose 5 hostname
[ryan-z820:15616] [[INVALID],INVALID] plm:rsh_lookup on agent ssh : rsh path NULL
[ryan-z820:15616] plm:base:set_hnp_name: initial bias 15616 nodename hash 974627533
[ryan-z820:15616] plm:base:set_hnp_name: final jobfam 42458
[ryan-z820:15616] [[42458,0],0] plm:rsh_setup on agent ssh : rsh path NULL
[ryan-z820:15616] [[42458,0],0] plm:base:receive start comm
[ryan-z820:15616] [[42458,0],0] plm:base:setup_job
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm creating map
[ryan-z820:15616] [[42458,0],0] setup:vm: working unmanaged allocation
[ryan-z820:15616] [[42458,0],0] using hostfile machinefile
[ryan-z820:15616] [[42458,0],0] checking node ryan-z820
[ryan-z820:15616] [[42458,0],0] ignoring myself
[ryan-z820:15616] [[42458,0],0] checking node client
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm add new daemon [[42458,0],1]
[ryan-z820:15616] [[42458,0],0] plm:base:setup_vm assigning new daemon [[42458,0],1] to node client
[ryan-z820:15616] [[42458,0],0] plm:rsh: launching vm
[ryan-z820:15616] [[42458,0],0] plm:rsh: local shell: 0 (bash)
[ryan-z820:15616] [[42458,0],0] plm:rsh: assuming same remote shell as local shell
[ryan-z820:15616] [[42458,0],0] plm:rsh: remote shell: 0 (bash)
[ryan-z820:15616] [[42458,0],0] plm:rsh: final template argv:
/usr/bin/ssh PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2782527488" -mca ess_base_vpid "" -mca ess_base_num_procs "2" -mca orte_node_regex "ryan-z820,client@0(2)" -mca orte_hnp_uri "2782527488.0;tcp://192.168.1.1:50882" -mca plm_base_verbose "5" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" -mca pmix "^s1,s2,cray,isolated"
[ryan-z820:15616] [[42458,0],0] plm:rsh:launch daemon 0 not a child of mine
[ryan-z820:15616] [[42458,0],0] plm:rsh: adding node client to launch list
[ryan-z820:15616] [[42458,0],0] plm:rsh: activating launch event
[ryan-z820:15616] [[42458,0],0] plm:rsh: recording launch of daemon [[42458,0],1]
[ryan-z820:15616] [[42458,0],0] plm:rsh: executing: (/usr/bin/ssh) [/usr/bin/ssh client PATH=/usr/local/bin:$PATH ; export PATH ; LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH ; export LD_LIBRARY_PATH ; DYLD_LIBRARY_PATH=/usr/local/lib:$DYLD_LIBRARY_PATH ; export DYLD_LIBRARY_PATH ; /usr/local/bin/orted -mca ess "env" -mca ess_base_jobid "2782527488" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex "ryan-z820,client@0(2)" -mca orte_hnp_uri "2782527488.0;tcp://192.168.1.1:50882" -mca plm_base_verbose "5" -mca plm "rsh" -mca rmaps_ppr_n_pernode "1" -mca pmix "^s1,s2,cray,isolated"]
[brs-dualG:03278] [[42458,0],1] plm:rsh_lookup on agent ssh : rsh path NULL
[brs-dualG:03278] [[42458,0],1] plm:rsh_setup on agent ssh : rsh path NULL
[brs-dualG:03278] [[42458,0],1] plm:base:receive start comm
[ryan-z820:15616] [[42458,0],0] plm:base:orted_report_launch from daemon [[42458,0],1]
[ryan-z820:15616] [[42458,0],0] plm:base:orted_report_launch from daemon [[42458,0],1] on node brs-dualG
[ryan-z820:15616] [[42458,0],0] RECEIVED TOPOLOGY SIG 0N:1S:1L3:2L2:2L1:2C:4H:x86_64:le FROM NODE brs-dualG
[ryan-z820:15616] [[42458,0],0] NEW TOPOLOGY - ADDING
[ryan-z820:15616] [[42458,0],0] plm:base:orted_report_launch completed for daemon [[42458,0],1] at contact 2782527488.1;tcp://192.168.1.6:33197
[ryan-z820:15616] [[42458,0],0] plm:base:orted_report_launch recvd 2 of 2 reported daemons
[ryan-z820:15616] [[42458,0],0] plm:base:setting slots for node ryan-z820 by cores
[ryan-z820:15616] [[42458,0],0] plm:base:setting slots for node client by cores
[ryan-z820:15616] [[42458,0],0] complete_setup on job [42458,1]
[ryan-z820:15616] [[42458,0],0] plm:base:launch_apps for job [42458,1]
[brs-dualG:03278] [[42458,0],1] plm:rsh: remote spawn called
[brs-dualG:03278] [[42458,0],1] plm:rsh: remote spawn - have no children!
[brs-dualG:03278] [[42458,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 351

An internal error has occurred in ORTE:

[[42458,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(355)

This is something that should be reported to the developers.

[ryan-z820:15616] [[42458,0],0] plm:base:receive processing msg
[ryan-z820:15616] [[42458,0],0] plm:base:receive update proc state command from [[42458,0],1]
[ryan-z820:15616] [[42458,0],0] plm:base:receive got update_proc_state for job [42458,0]
[ryan-z820:15616] [[42458,0],0] plm:base:receive got update_proc_state for vpid 1 state CALLED ABORT exit_code -26
[ryan-z820:15616] [[42458,0],0] plm:base:receive done processing commands
[ryan-z820:15616] [[42458,0],0] plm:base:orted_cmd sending orted_exit commands
[brs-dualG:03278] [[42458,0],1] plm:base:receive stop comm
[ryan-z820:15616] [[42458,0],0] plm:base:receive stop comm


@shinechou
Author

@rhc54: I've provided the log from your debug command; could you please help me check it? Thanks a lot in advance.

@rhc54
Contributor

rhc54 commented Nov 6, 2017

I honestly am stumped - it looks like you basically received an empty buffer, and I have no idea why. I can't replicate it. Perhaps you might try with the nightly snapshot of the 3.0.x branch to see if something has been fixed that might have caused the problem?

@shinechou
Author

@rhc54: thank you. I'll try the nightly version to see what comes out.

@shinechou
Author

@rhc54: the problem has been resolved. It seems it was caused by the different hardware architectures: one node uses a Xeon and the other an i3, and after I replaced the Xeon node with an i3 node it works fine. Another possible factor is that the Xeon workstation has two network adapters, one of which is used for AMT; I'm not sure whether that affects OMPI though.
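
If the second (AMT) adapter was a factor, one way to rule it out (a sketch only, not verified in this thread; eno1 is a placeholder for the interface that actually connects the two nodes) would be to pin Open MPI's out-of-band and TCP traffic to a single interface:

shell$ mpirun -np 2 -hostfile machinefile \
    -mca oob_tcp_if_include eno1 -mca btl_tcp_if_include eno1 \
    -x LD_LIBRARY_PATH python fcn_horovod.py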

@zhanglistar

@shinechou I have the same problem, and I have checked the Open MPI versions; they are the same. Could you tell me how you tracked it down to a hardware problem? I have checked my network and its adapters; they are the same.

@shinechou
Author

@zhanglistar: in my case, one of my nodes was an HP workstation with a Xeon CPU, which had a different hardware configuration than the master node (a regular PC). So I stopped using the HP workstation and used another regular PC instead.

@zwets

zwets commented Mar 5, 2018

For the benefit of others running into this error or "ORTE_ERROR_LOG: Data unpack had inadequate space": in my case the issue was resolved by switching to the internal hwloc.

I had compiled OpenMPI 3.0.0 on two different Ubuntu releases (16.04 and 17.10), both configured identically, and with --with-hwloc=/usr, thus using the Ubuntu-provided libhwloc-dev package. The version of libhwloc was 1.11.2 on xenial (16.04), and 1.11.5 on artful (17.10).

Running mpirun -H xenial,artful from artful worked fine, but running it from xenial consistently failed at if (OPAL_SUCCESS != (rc = opal_dss.unpack(data, &topo, &idx, OPAL_HWLOC_TOPO))) in ./orte/mca/plm/base/plm_base_launch_support.c.

Removing --with-hwloc=/usr from the configure step, thus switching to OpenMPI's internal hwloc (and uninstalling libhwloc-dev on both machines, though this shouldn't be necessary) resolved the issue.
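
For anyone reproducing that fix, a sketch of the rebuild (your usual configure options are a placeholder; the only change is dropping the external hwloc):

shell$ ./configure <your usual options>   # omit --with-hwloc=/usr so the bundled hwloc is used
shell$ make all install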

@sjeaugey
Member

sjeaugey commented Jul 8, 2019

I got this same problem with Open MPI 4.0.1, when built locally on each machine (having machines with different generations of Intel CPUs).

A Sandybridge machine would not be able to communicate with Skylake nodes. Copying the Sandybridge binaries over to the Skylake nodes fixed the issue.

So we have a problem where different architectures produce different binaries (structures ?) which are not compatible protocol-wise.

Do you think that should be fixed or just documented (don't build Open MPI locally on each machine) @rhc54 ?

@rhc54
Contributor

rhc54 commented Jul 8, 2019

@sjeaugey It sounds like you built with a different hwloc version on the two types of nodes?

@sjeaugey
Member

sjeaugey commented Jul 8, 2019

That was not my impression as I could not find any trace of hwloc anywhere on the nodes (so I assume both were compiled with the internal hwloc).

Reading the whole issue, I could not determine whether the fix came from the hwloc change or the fact that the binary was propagated from one machine to the others as suggested by Jeff in #4437 (comment)

Now, I did not compile the non-working libraries myself, nor did I try to re-compile the working version on each node to confirm it would break, so I'm not 100% sure yet. I'll update the bug if I can reproduce it better.

@RahulKulhari

RahulKulhari commented Sep 26, 2019

Open MPI Version: v4.0.0

Output of ompi_info | head on the two machines:

mpiuser@s2:~$ ssh s1 ompi_info | head
                 Package: Open MPI mpiuser@s1 Distribution
                Open MPI: 4.0.0
  Open MPI repo revision: v4.0.0
   Open MPI release date: Nov 12, 2018
                Open RTE: 4.0.0
  Open RTE repo revision: v4.0.0
   Open RTE release date: Nov 12, 2018
                    OPAL: 4.0.0
      OPAL repo revision: v4.0.0
       OPAL release date: Nov 12, 2018
mpiuser@s2:~$ ompi_info | head
                 Package: Open MPI mpiuser@s2 Distribution
                Open MPI: 4.0.0
  Open MPI repo revision: v4.0.0
   Open MPI release date: Nov 12, 2018
                Open RTE: 4.0.0
  Open RTE repo revision: v4.0.0
   Open RTE release date: Nov 12, 2018
                    OPAL: 4.0.0
      OPAL repo revision: v4.0.0
       OPAL release date: Nov 12, 2018

Both are installed using a common shared network filesystem.

While running the command on s1 (master):

mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ mpirun -n 2 ./hello
Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s1 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 112)
Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s1 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 112)

While running the command separately on s2 (slave):

mpiuser@s2:~/cloud$ mpirun -n 2 ./hello
Hello, world, I am 0 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s2 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 113)
Hello, world, I am 1 of 2, (Open MPI v4.0.0, package: Open MPI mpiuser@s2 Distribution, ident: 4.0.0, repo rev: v4.0.0, Nov 12, 2018, 113)

Output of the hwloc package check on s2:

mpiuser@s2:~/cloud/openmpi-4.0.0$ dpkg -l | grep hwloc
mpiuser@s2:~/cloud/openmpi-4.0.0$

Output of the hwloc package check on s1:

mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ dpkg -l | grep hwloc
mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$

Both machines are running on Ubuntu 16.04.5 LTS

But running the command distributed across both nodes gives the following error:

mpiuser@s1:/disk3/cloud/openmpi-4.0.0/examples$ mpirun -host s1,s2 ./hello
[s2:26283] [[40517,0],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file grpcomm_direct.c at line 355
--------------------------------------------------------------------------
An internal error has occurred in ORTE:

[[40517,0],1] FORCE-TERMINATE AT Data unpack would read past end of buffer:-26 - error grpcomm_direct.c(359)

This is something that should be reported to the developers.
--------------------------------------------------------------------------

@jsquyres
Member

@RahulKulhari Please do not add new issues to a closed issue; thanks.

@usamazf

usamazf commented Jan 7, 2020

@RahulKulhari were you able to resolve the issue? I'm facing the same problem!
