add-hostfile not working on parallel prun commands #1773

Open · BhattaraiRajat opened this issue Jul 25, 2023 · 20 comments

BhattaraiRajat commented Jul 25, 2023

Background information

I am working on a project in which one component runs multiple prun commands in parallel from multiple processes to launch multiple tasks; some of these commands use the --add-hostfile option to extend an existing DVM.

What version of the PMIx Reference RTE (PRRTE) are you using? (e.g., v2.0, v3.0, git master @ hash, etc.)

55536ef

What version of PMIx are you using? (e.g., v4.2.0, git branch name and hash, etc.)

openpmix/openpmix@bde8038

Please describe the system on which you are running

  • Operating system/version: AlmaLinux 8.7 (Stone Smilodon)
  • Computer hardware: aarch64
  • Network type:

Details of the problem

Steps to reproduce

  • Get two nodes
  • Create a hostfile containing one node and an add_hostfile containing the other node (example contents are sketched after the script below)
  • Start the DVM using the hostfile: prte --report-uri dvm.uri --hostfile hostfile
  • Spawn two processes using Python multiprocessing and run two prun commands in parallel, one with the --add-hostfile option and one without, as in the following script.
from multiprocessing import Pool
import subprocess

def run(cmd):
    # Launch one prun command in a shell and wait for it to complete.
    print(cmd)
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, shell=True)
    output, error = process.communicate()

# One prun without --add-hostfile and one with it, run in parallel.
prun_commands = [
    "prun --display allocation --dvm-uri file:dvm.uri --map-by ppr:2:node -n 2 hostname > out0",
    "prun --display allocation --dvm-uri file:dvm.uri --add-hostfile add_hostfile --map-by ppr:2:node -n 2 hostname > out1",
]

with Pool(2) as p:
    p.map(run, prun_commands)
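
For reference, the hostfile and add_hostfile from step 2 might contain something like this (node names taken from the logs later in this thread; whether slot counts are listed explicitly is an assumption):

    hostfile:
        st26

    add_hostfile:
        st27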

Running the script outputs the following error.

[st-master:2320852] PMIx_Spawn failed (-25): UNREACHABLE
[st-master:2320851] PMIx_Spawn failed (-25): UNREACHABLE

If no add-hostfile option is given, both processes run without error.

While debugging, I can see that the daemon is launched on the added node but throws the following error during initialization and terminates the PMIx server.

A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    st27
  Remote host:   192.168.0.254
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.

@hppritcha @rhc54 Please advise.

rhc54 commented Jul 25, 2023

Looks like the remote host cannot open a TCP socket back to st27. When you say:

If no add-hostfile option is given, both processes run without error.

are you running with both hosts in your hostfile? Or just the one?

BhattaraiRajat (Author) commented:

I am running with just one host in the hostfile. I meant to say that there have been no problems running multiple prun commands without the --add-hostfile option.

rhc54 commented Jul 25, 2023

I understood that last part. My point was just that if there is only one host in the hostfile, then you won't detect that your remote host cannot communicate to you.

Try putting both hosts in that hostfile and see if you can start the DVM.
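
For example, with both nodes listed (node names as used elsewhere in this thread), the same launch command applies:

    hostfile:
        st26
        st27

    prte --report-uri dvm.uri --hostfile hostfile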

BhattaraiRajat (Author) commented:

I wanted to use the DVM expansion feature: start the DVM with one node and then add another node with the --add-hostfile option on prun.
st27 is the node added via --add-hostfile on one of the prun commands.
If I put both hosts in the hostfile, I can start the DVM fine.

rhc54 commented Jul 25, 2023

I understand what you want to do - I'm just trying to check that PRRTE itself is behaving correctly. With both hosts in the hostfile, it starts - which means that the daemon can indeed communicate back.

So the question becomes: why can't it do so when started by add-hostfile? You were able to do it before, so what has changed?

If you run the two prun commands sequentially on the cmd line, does that work?

BhattaraiRajat (Author) commented:

If you run the two prun commands sequentially on the cmd line, does that work?

Yes. Running sequentially works.
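
To be precise, "sequentially" here means running the same two commands one after the other from the shell:

    prun --display allocation --dvm-uri file:dvm.uri --map-by ppr:2:node -n 2 hostname > out0
    prun --display allocation --dvm-uri file:dvm.uri --add-hostfile add_hostfile --map-by ppr:2:node -n 2 hostname > out1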

rhc54 commented Jul 25, 2023

I'm not familiar with Python's pool module. I gather that it basically fork/exec's both prun processes? Isn't there an inherent race condition here? Whichever prun connects first to the DVM is going to execute first, so you may or may not get the other node involved in both jobs.

Let's see if the problem really is in PRRTE and not in how you are trying to do this. Add --prtemca pmix_server_verbose 5 --prtemca state_base_verbose 5 --leave-session-attached to the prte cmd line and see what it says.

rhc54 commented Jul 25, 2023

Oh yeah - also add --prtemca plm_base_verbose 5 to the prte cmd line
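
Putting the suggestions together, the DVM would be started with something like (file and hostfile names as used earlier in this thread):

    prte --report-uri dvm.uri --hostfile hostfile \
         --prtemca pmix_server_verbose 5 --prtemca state_base_verbose 5 \
         --prtemca plm_base_verbose 5 --leave-session-attached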

BhattaraiRajat (Author) commented:

Yes. The project I am working on uses the fork method to start the processes via the Python multiprocessing module.

The following is the log after it receives the prun commands:

[st-master:2558116] [prte-st-master-2558116@0,0] TOOL CONNECTION REQUEST RECVD
[st-master:2558116] [prte-st-master-2558116@0,0] PROCESSING TOOL CONNECTION
[st-master:2558116] [prte-st-master-2558116@0,0] TOOL CONNECTION REQUEST RECVD
[st-master:2558116] [prte-st-master-2558116@0,0] LAUNCHER CONNECTION FROM UID 6127 GID 6127 NSPACE prun.st-master.2558144
[st-master:2558116] [prte-st-master-2558116@0,0] PROCESSING TOOL CONNECTION
[st-master:2558116] [prte-st-master-2558116@0,0] LAUNCHER CONNECTION FROM UID 6127 GID 6127 NSPACE prun.st-master.2558143
[st-master:2558116] [prte-st-master-2558116@0,0] spawn upcalled on behalf of proc prun.st-master.2558144:0 with 5 job infos
[st-master:2558116] [prte-st-master-2558116@0,0] spawn called from proc [prun.st-master.2558144,0] with 1 apps
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive processing msg
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive job launch command from [prte-st-master-2558116@0,0]
[st-master:2558116] [prte-st-master-2558116@0,0] spawn upcalled on behalf of proc prun.st-master.2558143:0 with 5 job infos
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive adding hosts
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive calling spawn
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489563] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_ssh_module.c:910
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489586] ACTIVATING JOB [INVALID] STATE PENDING INIT PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive done processing commands
[st-master:2558116] [prte-st-master-2558116@0,0] spawn called from proc [prun.st-master.2558143,0] with 1 apps
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive processing msg
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive job launch command from [prte-st-master-2558116@0,0]
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive adding hosts
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive calling spawn
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489835] ACTIVATE JOB [INVALID] STATE PENDING INIT AT plm_ssh_module.c:910
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489847] ACTIVATING JOB [INVALID] STATE PENDING INIT PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive done processing commands
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setup_job
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489889] ACTIVATE JOB prte-st-master-2558116@1 STATE INIT_COMPLETE AT base/plm_base_launch_support.c:695
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489902] ACTIVATING JOB prte-st-master-2558116@1 STATE INIT_COMPLETE PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setup_job
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489925] ACTIVATE JOB prte-st-master-2558116@2 STATE INIT_COMPLETE AT base/plm_base_launch_support.c:695
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489936] ACTIVATING JOB prte-st-master-2558116@2 STATE INIT_COMPLETE PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489947] ACTIVATE JOB prte-st-master-2558116@1 STATE PENDING ALLOCATION AT state_dvm.c:257
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489959] ACTIVATING JOB prte-st-master-2558116@1 STATE PENDING ALLOCATION PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489971] ACTIVATE JOB prte-st-master-2558116@2 STATE PENDING ALLOCATION AT state_dvm.c:257
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489981] ACTIVATING JOB prte-st-master-2558116@2 STATE PENDING ALLOCATION PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.489993] ACTIVATE JOB prte-st-master-2558116@1 STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:745
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490003] ACTIVATING JOB prte-st-master-2558116@1 STATE ALLOCATION COMPLETE PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490014] ACTIVATE JOB prte-st-master-2558116@2 STATE ALLOCATION COMPLETE AT base/ras_base_allocate.c:745
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490024] ACTIVATING JOB prte-st-master-2558116@2 STATE ALLOCATION COMPLETE PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490035] ACTIVATE JOB prte-st-master-2558116@1 STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:201
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490048] ACTIVATING JOB prte-st-master-2558116@1 STATE PENDING DAEMON LAUNCH PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490060] ACTIVATE JOB prte-st-master-2558116@2 STATE PENDING DAEMON LAUNCH AT base/plm_base_launch_support.c:201
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490071] ACTIVATING JOB prte-st-master-2558116@2 STATE PENDING DAEMON LAUNCH PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setup_vm
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setup_vm no new daemons required
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490103] ACTIVATE JOB prte-st-master-2558116@1 STATE ALL DAEMONS REPORTED AT plm_ssh_module.c:1056
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490113] ACTIVATING JOB prte-st-master-2558116@1 STATE ALL DAEMONS REPORTED PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setup_vm
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setup_vm add new daemon [prte-st-master-2558116@0,2]
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setup_vm assigning new daemon [prte-st-master-2558116@0,2] to node st27
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: launching vm
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: local shell: 0 (bash)
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: assuming same remote shell as local shell
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: remote shell: 0 (bash)
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: final template argv:
	/usr/bin/ssh <template> PRTE_PREFIX=/home/rbhattara/dyn-wf/install/prrte;export PRTE_PREFIX;LD_LIBRARY_PATH=/home/rbhattara/dyn-wf/install/prrte/lib:/home/rbhattara/dyn-wf/install/pmix/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/home/rbhattara/dyn-wf/install/prrte/lib:/home/rbhattara/dyn-wf/install/pmix/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/home/rbhattara/dyn-wf/install/prrte/bin/prted --leave-session-attached --prtemca ess "env" --prtemca ess_base_nspace "prte-st-master-2558116@0" --prtemca ess_base_vpid "<template>" --prtemca ess_base_num_procs "3" --prtemca prte_hnp_uri "prte-st-master-2558116@0.0;tcp://10.15.3.34,172.16.0.254,192.168.0.254:51773:24,23,23" --prtemca pmix_server_verbose "5" --prtemca state_base_verbose "5" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh"
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh:launch daemon already exists on node st-master
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh:launch daemon already exists on node st26
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: adding node st27 to launch list
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: activating launch event
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:setting slots for node st27 by core
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490490] ACTIVATE JOB prte-st-master-2558116@1 STATE VM READY AT base/plm_base_launch_support.c:177
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.490504] ACTIVATING JOB prte-st-master-2558116@1 STATE VM READY PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: recording launch of daemon [prte-st-master-2558116@0,2]
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491026] ACTIVATE JOB prte-st-master-2558116@1 STATE PENDING MAPPING AT state_dvm.c:244
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491045] ACTIVATING JOB prte-st-master-2558116@1 STATE PENDING MAPPING PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:ssh: executing: (/usr/bin/ssh) [/usr/bin/ssh st27 PRTE_PREFIX=/home/rbhattara/dyn-wf/install/prrte;export PRTE_PREFIX;LD_LIBRARY_PATH=/home/rbhattara/dyn-wf/install/prrte/lib:/home/rbhattara/dyn-wf/install/pmix/lib:$LD_LIBRARY_PATH;export LD_LIBRARY_PATH;DYLD_LIBRARY_PATH=/home/rbhattara/dyn-wf/install/prrte/lib:/home/rbhattara/dyn-wf/install/pmix/lib:$DYLD_LIBRARY_PATH;export DYLD_LIBRARY_PATH;/home/rbhattara/dyn-wf/install/prrte/bin/prted --leave-session-attached --prtemca ess "env" --prtemca ess_base_nspace "prte-st-master-2558116@0" --prtemca ess_base_vpid 2 --prtemca ess_base_num_procs "3" --prtemca prte_hnp_uri "prte-st-master-2558116@0.0;tcp://10.15.3.34,172.16.0.254,192.168.0.254:51773:24,23,23" --prtemca pmix_server_verbose "5" --prtemca state_base_verbose "5" --prtemca plm_base_verbose "5" --prtemca pmix_session_server "1" --prtemca plm "ssh"]
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491335] ACTIVATE JOB prte-st-master-2558116@1 STATE MAP COMPLETE AT base/rmaps_base_map_job.c:904
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491354] ACTIVATING JOB prte-st-master-2558116@1 STATE MAP COMPLETE PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491370] ACTIVATE JOB prte-st-master-2558116@1 STATE PENDING FINAL SYSTEM PREP AT base/plm_base_launch_support.c:275
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491383] ACTIVATING JOB prte-st-master-2558116@1 STATE PENDING FINAL SYSTEM PREP PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] complete_setup on job prte-st-master-2558116@1
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491402] ACTIVATE JOB prte-st-master-2558116@1 STATE PENDING APP LAUNCH AT base/plm_base_launch_support.c:736
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491414] ACTIVATING JOB prte-st-master-2558116@1 STATE PENDING APP LAUNCH PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:launch_apps for job prte-st-master-2558116@1
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491643] ACTIVATE JOB prte-st-master-2558116@1 STATE SENDING LAUNCH MSG AT base/odls_base_default_fns.c:146
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.491667] ACTIVATING JOB prte-st-master-2558116@1 STATE SENDING LAUNCH MSG PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:send launch msg for job prte-st-master-2558116@1
[st-master:2558116] [prte-st-master-2558116@0,0] register nspace for prte-st-master-2558116@1
[st-master:2558116] UUID: ipv6://00:00:10:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:be OSNAME: ib0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2558116] UUID: ipv4://f4:03:43:fe:80:f0 OSNAME: enp22s0f0np0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2558116] UUID: ipv4://f4:03:43:fe:80:f1 OSNAME: enp22s0f1np1 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2558116] UUID: ipv6://00:00:18:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:c0 OSNAME: ib1 TYPE: NETWORK MIND: 12 MAXD: 12
[st-master:2558116] UUID: fab://248a:0703:00aa:82be::248a:0703:00aa:82be OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2558116] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2558116] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2558116] UUID: fab://248a:0703:00aa:82c0::248a:0703:00aa:82be OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st-master:2558116] UUID: ipv6://00:00:10:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:be OSNAME: ib0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2558116] UUID: ipv4://f4:03:43:fe:80:f0 OSNAME: enp22s0f0np0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2558116] UUID: ipv4://f4:03:43:fe:80:f1 OSNAME: enp22s0f1np1 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2558116] UUID: ipv6://00:00:18:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:c0 OSNAME: ib1 TYPE: NETWORK MIND: 12 MAXD: 12
[st-master:2558116] UUID: fab://248a:0703:00aa:82be::248a:0703:00aa:82be OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2558116] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2558116] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2558116] UUID: fab://248a:0703:00aa:82c0::248a:0703:00aa:82be OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.494709] ACTIVATE PROC [prte-st-master-2558116@0,2] STATE NO PATH TO TARGET AT rml/rml.c:123
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.494736] ACTIVATING PROC [prte-st-master-2558116@0,2] STATE NO PATH TO TARGET PRI 0
[st-master:2558116] UNSUPPORTED DAEMON ERROR STATE: NO PATH TO TARGET
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.494765] ACTIVATE JOB NULL STATE DAEMONS TERMINATED AT errmgr_dvm.c:342
[st-master:2558116] [prte-st-master-2558116@0,0] [1690324866.494777] ACTIVATING JOB NULL STATE DAEMONS TERMINATED PRI 4
[st-master:2558116] [prte-st-master-2558116@0,0] plm:base:receive stop comm
[st26:869495] [prte-st-master-2558116@0,1] register nspace for prte-st-master-2558116@1
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st-master:2558116] [prte-st-master-2558116@0,0] Finalizing PMIX server
[rbhattara@st-master add-host-debug]$ [st26:869495] [prte-st-master-2558116@0,1] register nspace for prte-st-master-2558116@1
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:869495] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:869495] [prte-st-master-2558116@0,1] [1690324866.576490] ACTIVATE PROC [prte-st-master-2558116@0,0] STATE LIFELINE LOST AT oob_tcp_component.c:881
[st26:869495] [prte-st-master-2558116@0,1] [1690324866.576513] ACTIVATING PROC [prte-st-master-2558116@0,0] STATE LIFELINE LOST PRI 0
[st26:869495] [prte-st-master-2558116@0,1] plm:base:receive stop comm
[st26:869495] [prte-st-master-2558116@0,1] Finalizing PMIX server
[st27:800778] [prte-st-master-2558116@0,2] plm:ssh_lookup on agent ssh : rsh path NULL
[st27:800778] [prte-st-master-2558116@0,2] plm:ssh_setup on agent ssh : rsh path NULL
[st27:800778] [prte-st-master-2558116@0,2] plm:base:receive start comm
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    st27
  Remote host:   192.168.0.254
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
[st27:800778] [prte-st-master-2558116@0,2] [1690324866.972653] ACTIVATE PROC [prte-st-master-2558116@0,0] STATE FAILED TO CONNECT AT oob_tcp_component.c:1022
[st27:800778] [prte-st-master-2558116@0,2] [1690324866.972669] ACTIVATING PROC [prte-st-master-2558116@0,0] STATE FAILED TO CONNECT PRI 0
[st27:800778] [prte-st-master-2558116@0,2] plm:base:receive stop comm
[st27:800778] [prte-st-master-2558116@0,2] Finalizing PMIX server

rhc54 commented Jul 25, 2023

Yes. The project I am working on uses the fork method to start the processes via the Python multiprocessing module.

Understood - but as implemented, your test will yield non-deterministic results. Is that what you want?

Try adding --prtemca oob_base_verbose 5 --prtemca rml_base_verbose 5 to the prte cmd line. It looks like we simply cannot create a socket connection back to prte for some reason.

BhattaraiRajat (Author) commented:

This is the log with --prtemca oob_base_verbose 5 --prtemca rml_base_verbose 5 added:

[st-master:2560223] [prte-st-master-2560223@0,0] TOOL CONNECTION REQUEST RECVD
[st-master:2560223] [prte-st-master-2560223@0,0] PROCESSING TOOL CONNECTION
[st-master:2560223] [prte-st-master-2560223@0,0] TOOL CONNECTION REQUEST RECVD
[st-master:2560223] [prte-st-master-2560223@0,0] LAUNCHER CONNECTION FROM UID 6127 GID 6127 NSPACE prun.st-master.2560245
[st-master:2560223] [prte-st-master-2560223@0,0] PROCESSING TOOL CONNECTION
[st-master:2560223] [prte-st-master-2560223@0,0] LAUNCHER CONNECTION FROM UID 6127 GID 6127 NSPACE prun.st-master.2560246
[st-master:2560223] [prte-st-master-2560223@0,0] spawn upcalled on behalf of proc prun.st-master.2560245:0 with 5 job infos
[st-master:2560223] [prte-st-master-2560223@0,0] spawn called from proc [prun.st-master.2560245,0] with 1 apps
[st-master:2560223] RML-SEND(0:5): prted/pmix/pmix_server_dyn.c:spawn:178
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer to peer 0 at tag 5
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer_to_self at tag 5
[st-master:2560223] [prte-st-master-2560223@0,0] message received from [prte-st-master-2560223@0,0] for tag 5
[st-master:2560223] [prte-st-master-2560223@0,0] spawn upcalled on behalf of proc prun.st-master.2560246:0 with 5 job infos
[st-master:2560223] [prte-st-master-2560223@0,0] message received 465 bytes from [prte-st-master-2560223@0,0] for tag 5 called callback
[st-master:2560223] [prte-st-master-2560223@0,0] message tag 5 on released
[st-master:2560223] [prte-st-master-2560223@0,0] spawn called from proc [prun.st-master.2560246,0] with 1 apps
[st-master:2560223] RML-SEND(0:5): prted/pmix/pmix_server_dyn.c:spawn:178
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer to peer 0 at tag 5
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer_to_self at tag 5
[st-master:2560223] [prte-st-master-2560223@0,0] message received from [prte-st-master-2560223@0,0] for tag 5
[st-master:2560223] [prte-st-master-2560223@0,0] message received 408 bytes from [prte-st-master-2560223@0,0] for tag 5 called callback
[st-master:2560223] [prte-st-master-2560223@0,0] message tag 5 on released
[st-master:2560223] RML-SEND(0:15): grpcomm_direct.c:xcast:99
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer to peer 0 at tag 15
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer_to_self at tag 15
[st-master:2560223] [prte-st-master-2560223@0,0] message received from [prte-st-master-2560223@0,0] for tag 15
[st-master:2560223] RML-SEND(1:15): grpcomm_direct.c:xcast_recv:681
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer to peer 1 at tag 15
[st-master:2560223] [prte-st-master-2560223@0,0] OOB_SEND: rml/rml_send.c:89
[st-master:2560223] RML-SEND(1:15): grpcomm_direct.c:xcast_recv:681
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer to peer 1 at tag 15
[st-master:2560223] [prte-st-master-2560223@0,0] OOB_SEND: rml/rml_send.c:89
[st-master:2560223] RML-SEND(2:15): grpcomm_direct.c:xcast_recv:681
[st-master:2560223] [prte-st-master-2560223@0,0] rml_send_buffer to peer 2 at tag 15
[st-master:2560223] [prte-st-master-2560223@0,0] OOB_SEND: rml/rml_send.c:89
[st-master:2560223] [prte-st-master-2560223@0,0] Message posted at grpcomm_direct.c:702 for tag 1
[st-master:2560223] [prte-st-master-2560223@0,0] message received 775 bytes from [prte-st-master-2560223@0,0] for tag 15 called callback
[st-master:2560223] [prte-st-master-2560223@0,0] message tag 15 on released
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send to target [prte-st-master-2560223@0,1] - attempt 0
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send known transport for peer [prte-st-master-2560223@0,1]
[st-master:2560223] [prte-st-master-2560223@0,0] oob:tcp:send_nb to peer [prte-st-master-2560223@0,1]:15 seq = -1
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:190] processing send to peer [prte-st-master-2560223@0,1]:15 seq_num = -1 via [prte-st-master-2560223@0,1]
[st-master:2560223] [prte-st-master-2560223@0,0] tcp:send_nb: already connected to [prte-st-master-2560223@0,1] - queueing for send
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:197] queue send to [prte-st-master-2560223@0,1]
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send to target [prte-st-master-2560223@0,1] - attempt 0
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send known transport for peer [prte-st-master-2560223@0,1]
[st-master:2560223] [prte-st-master-2560223@0,0] oob:tcp:send_nb to peer [prte-st-master-2560223@0,1]:15 seq = -1
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:190] processing send to peer [prte-st-master-2560223@0,1]:15 seq_num = -1 via [prte-st-master-2560223@0,1]
[st-master:2560223] [prte-st-master-2560223@0,0] tcp:send_nb: already connected to [prte-st-master-2560223@0,1] - queueing for send
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:197] queue send to [prte-st-master-2560223@0,1]
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send to target [prte-st-master-2560223@0,2] - attempt 0
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send unknown peer [prte-st-master-2560223@0,2]
[st-master:2560223] [prte-st-master-2560223@0,0] oob:tcp:send_nb to peer [prte-st-master-2560223@0,2]:15 seq = -1
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:180] processing send to peer [prte-st-master-2560223@0,2]:15 seq_num = -1 hop [prte-st-master-2560223@0,2] unknown
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:181] post no route to [prte-st-master-2560223@0,2]
[st-master:2560223] [prte-st-master-2560223@0,0] message received from [prte-st-master-2560223@0,0] for tag 1
[st-master:2560223] [prte-st-master-2560223@0,0] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[st-master:2560223] [prte-st-master-2560223@0,0] prted_cmd: received add_local_procs
[st-master:2560223] [prte-st-master-2560223@0,0] register nspace for prte-st-master-2560223@2
[st-master:2560223] UUID: ipv6://00:00:10:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:be OSNAME: ib0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2560223] UUID: ipv4://f4:03:43:fe:80:f0 OSNAME: enp22s0f0np0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2560223] UUID: ipv4://f4:03:43:fe:80:f1 OSNAME: enp22s0f1np1 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2560223] UUID: ipv6://00:00:18:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:c0 OSNAME: ib1 TYPE: NETWORK MIND: 12 MAXD: 12
[st-master:2560223] UUID: fab://248a:0703:00aa:82be::248a:0703:00aa:82be OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2560223] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2560223] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2560223] UUID: fab://248a:0703:00aa:82c0::248a:0703:00aa:82be OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st-master:2560223] UUID: ipv6://00:00:10:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:be OSNAME: ib0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2560223] UUID: ipv4://f4:03:43:fe:80:f0 OSNAME: enp22s0f0np0 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2560223] UUID: ipv4://f4:03:43:fe:80:f1 OSNAME: enp22s0f1np1 TYPE: NETWORK MIND: 6 MAXD: 6
[st-master:2560223] UUID: ipv6://00:00:18:87:fe:80:00:00:00:00:00:00:24:8a:07:03:00:aa:82:c0 OSNAME: ib1 TYPE: NETWORK MIND: 12 MAXD: 12
[st-master:2560223] UUID: fab://248a:0703:00aa:82be::248a:0703:00aa:82be OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2560223] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2560223] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st-master:2560223] UUID: fab://248a:0703:00aa:82c0::248a:0703:00aa:82be OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st-master:2560223] [prte-st-master-2560223@0,0] message received 724 bytes from [prte-st-master-2560223@0,0] for tag 1 called callback
[st-master:2560223] [prte-st-master-2560223@0,0] message tag 1 on released
[st-master:2560223] [prte-st-master-2560223@0,0] OOB_SEND: oob_tcp_component.c:914
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send to target [prte-st-master-2560223@0,2] - attempt 1
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send known transport for peer [prte-st-master-2560223@0,2]
[st-master:2560223] [prte-st-master-2560223@0,0] oob:tcp:send_nb to peer [prte-st-master-2560223@0,2]:15 seq = -1
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:180] processing send to peer [prte-st-master-2560223@0,2]:15 seq_num = -1 hop [prte-st-master-2560223@0,2] unknown
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:181] post no route to [prte-st-master-2560223@0,2]
[st-master:2560223] [prte-st-master-2560223@0,0] OOB_SEND: oob_tcp_component.c:914
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send to target [prte-st-master-2560223@0,2] - attempt 2
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send known transport for peer [prte-st-master-2560223@0,2]
[st-master:2560223] [prte-st-master-2560223@0,0] oob:tcp:send_nb to peer [prte-st-master-2560223@0,2]:15 seq = -1
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:180] processing send to peer [prte-st-master-2560223@0,2]:15 seq_num = -1 hop [prte-st-master-2560223@0,2] unknown
[st-master:2560223] [prte-st-master-2560223@0,0]:[oob_tcp.c:181] post no route to [prte-st-master-2560223@0,2]
[st-master:2560223] [prte-st-master-2560223@0,0] OOB_SEND: oob_tcp_component.c:914
[st-master:2560223] [prte-st-master-2560223@0,0] oob:base:send to target [prte-st-master-2560223@0,2] - attempt 3
[st-master:2560223] [prte-st-master-2560223@0,0]-[prte-st-master-2560223@0,2] Send message complete at base/oob_base_stubs.c:61
[st-master:2560223] [prte-st-master-2560223@0,0] UNABLE TO SEND MESSAGE TO [prte-st-master-2560223@0,2] TAG 15: No OOB path to target
[st-master:2560223] UNSUPPORTED DAEMON ERROR STATE: NO PATH TO TARGET
[st-master:2560223] oob:tcp:send_handler SENDING MSG
[st-master:2560223] [prte-st-master-2560223@0,0] MESSAGE SEND COMPLETE TO [prte-st-master-2560223@0,1] OF 775 BYTES ON SOCKET 21
[st-master:2560223] [prte-st-master-2560223@0,0]-[prte-st-master-2560223@0,1] Send message complete at oob_tcp_sendrecv.c:253
[st-master:2560223] oob:tcp:send_handler SENDING MSG
[st-master:2560223] [prte-st-master-2560223@0,0] MESSAGE SEND COMPLETE TO [prte-st-master-2560223@0,1] OF 775 BYTES ON SOCKET 21
[st-master:2560223] [prte-st-master-2560223@0,0]-[prte-st-master-2560223@0,1] Send message complete at oob_tcp_sendrecv.c:253
[st-master:2560223] RML-CANCEL(15): base/grpcomm_base_frame.c:prte_grpcomm_base_close:82
[st26:870061] [prte-st-master-2560223@0,1] Message posted at oob_tcp_sendrecv.c:522 for tag 15
[st26:870061] [prte-st-master-2560223@0,1] message received from [prte-st-master-2560223@0,0] for tag 15
[st26:870061] [prte-st-master-2560223@0,1] Message posted at grpcomm_direct.c:702 for tag 1
[st26:870061] [prte-st-master-2560223@0,1] message received 775 bytes from [prte-st-master-2560223@0,0] for tag 15 called callback
[st26:870061] [prte-st-master-2560223@0,1] message tag 15 on released
[st26:870061] [prte-st-master-2560223@0,1] message received from [prte-st-master-2560223@0,1] for tag 1
[st26:870061] [prte-st-master-2560223@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[st26:870061] [prte-st-master-2560223@0,1] prted_cmd: received add_local_procs
[st-master:2560223] RML-CANCEL(5): base/plm_base_receive.c:prte_plm_base_comm_stop:102
[st-master:2560223] RML-CANCEL(10): base/plm_base_receive.c:prte_plm_base_comm_stop:104
[st-master:2560223] RML-CANCEL(12): base/plm_base_receive.c:prte_plm_base_comm_stop:105
[st-master:2560223] RML-CANCEL(62): base/plm_base_receive.c:prte_plm_base_comm_stop:106
[st26:870061] [prte-st-master-2560223@0,1] register nspace for prte-st-master-2560223@2
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st-master:2560223] [prte-st-master-2560223@0,0] TCP SHUTDOWN
[st-master:2560223] [prte-st-master-2560223@0,0] TCP SHUTDOWN done
[st-master:2560223] [prte-st-master-2560223@0,0] CLOSING SOCKET 21
[st-master:2560223] [prte-st-master-2560223@0,0] Finalizing PMIX server
[rbhattara@st-master add-host-debug]$ [st26:870061] [prte-st-master-2560223@0,1] message received 724 bytes from [prte-st-master-2560223@0,1] for tag 1 called callback
[st26:870061] [prte-st-master-2560223@0,1] message tag 1 on released
[st26:870061] [prte-st-master-2560223@0,1] Message posted at oob_tcp_sendrecv.c:522 for tag 15
[st26:870061] [prte-st-master-2560223@0,1] message received from [prte-st-master-2560223@0,0] for tag 15
[st26:870061] [prte-st-master-2560223@0,1] Message posted at grpcomm_direct.c:702 for tag 1
[st26:870061] [prte-st-master-2560223@0,1] message received 775 bytes from [prte-st-master-2560223@0,0] for tag 15 called callback
[st26:870061] [prte-st-master-2560223@0,1] message tag 15 on released
[st26:870061] [prte-st-master-2560223@0,1] message received from [prte-st-master-2560223@0,1] for tag 1
[st26:870061] [prte-st-master-2560223@0,1] prted:comm:process_commands() Processing Command: PRTE_DAEMON_ADD_LOCAL_PROCS
[st26:870061] [prte-st-master-2560223@0,1] prted_cmd: received add_local_procs
[st26:870061] [prte-st-master-2560223@0,1] register nspace for prte-st-master-2560223@2
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6a OSNAME: eth0 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: ipv4://f4:03:43:fe:70:6b OSNAME: eth1 TYPE: NETWORK MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3a::ec0d:9a03:0098:aa3a OSNAME: mlx5_0 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0000::0000:0000:0000:0000 OSNAME: mlx5_1 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://0000:0000:0000:0001::0000:0000:0000:0000 OSNAME: mlx5_2 TYPE: OPENFABRICS MIND: 6 MAXD: 6
[st26:870061] UUID: fab://ec0d:9a03:0098:aa3c::ec0d:9a03:0098:aa3a OSNAME: mlx5_3 TYPE: OPENFABRICS MIND: 12 MAXD: 12
[st26:870061] [prte-st-master-2560223@0,1] message received 724 bytes from [prte-st-master-2560223@0,1] for tag 1 called callback
[st26:870061] [prte-st-master-2560223@0,1] message tag 1 on released
[st26:870061] [prte-st-master-2560223@0,1]-[prte-st-master-2560223@0,0] prte_oob_tcp_msg_recv: peer closed connection
[st26:870061] [prte-st-master-2560223@0,1]:errmgr_prted.c(365) updating exit status to 1
[st26:870061] RML-CANCEL(15): base/grpcomm_base_frame.c:prte_grpcomm_base_close:82
[st26:870061] RML-CANCEL(3): iof_prted.c:finalize:295
[st26:870061] RML-CANCEL(5): base/plm_base_receive.c:prte_plm_base_comm_stop:102
[st26:870061] [prte-st-master-2560223@0,1] TCP SHUTDOWN
[st26:870061] no hnp or not active
[st26:870061] [prte-st-master-2560223@0,1] TCP SHUTDOWN done
[st26:870061] [prte-st-master-2560223@0,1] Finalizing PMIX server
Daemon was launched on st27 - beginning to initialize
[st27:801326] mca:oob:select: checking available component tcp
[st27:801326] mca:oob:select: Querying component [tcp]
[st27:801326] oob:tcp: component_available called
[st27:801326] [prte-st-master-2560223@0,2] TCP STARTUP
[st27:801326] [prte-st-master-2560223@0,2] attempting to bind to IPv4 port 0
[st27:801326] mca:oob:select: Adding component to end
[st27:801326] mca:oob:select: Found 1 active transports
[st27:801326] RML-RECV(27): runtime/prte_data_server.c:prte_data_server_init:150
[st27:801326] RML-RECV(50): prted/pmix/pmix_server.c:pmix_server_start:886
[st27:801326] RML-RECV(51): prted/pmix/pmix_server.c:pmix_server_start:890
[st27:801326] RML-RECV(6): prted/pmix/pmix_server.c:pmix_server_start:894
[st27:801326] RML-RECV(28): prted/pmix/pmix_server.c:pmix_server_start:898
[st27:801326] RML-RECV(59): prted/pmix/pmix_server.c:pmix_server_start:902
[st27:801326] RML-RECV(24): prted/pmix/pmix_server.c:pmix_server_start:906
[st27:801326] RML-RECV(15): grpcomm_direct.c:init:74
[st27:801326] RML-RECV(33): grpcomm_direct.c:init:76
[st27:801326] RML-RECV(31): grpcomm_direct.c:init:79
[st27:801326] RML-RECV(5): base/plm_base_receive.c:prte_plm_base_comm_start:79
[st27:801326] RML-RECV(3): iof_prted.c:init:98
[st27:801326] RML-RECV(21): filem_raw_module.c:raw_init:113
[st27:801326] RML-RECV(1): prted.c:main:449
[st27:801326] RML-RECV(10): prted.c:main:504
[st27:801326] RML-SEND(0:10): prted.c:main:715
[st27:801326] [prte-st-master-2560223@0,2] rml_send_buffer to peer 0 at tag 10
[st27:801326] [prte-st-master-2560223@0,2] OOB_SEND: rml/rml_send.c:89
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 27 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 50 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 51 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 6 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 28 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 59 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 24 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 15 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 33 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 31 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 5 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 3 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 21 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 1 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] posting recv
[st27:801326] [prte-st-master-2560223@0,2] posting persistent recv on tag 10 for peer [[INVALID],WILDCARD]
[st27:801326] [prte-st-master-2560223@0,2] oob:base:send to target [prte-st-master-2560223@0,0] - attempt 0
[st27:801326] [prte-st-master-2560223@0,2] oob:base:send unknown peer [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2]:set_addr processing uri prte-st-master-2560223@0.0;tcp://10.15.3.34,172.16.0.254,192.168.0.254:35289:24,23,23
[st27:801326] [prte-st-master-2560223@0,2]:set_addr checking if peer [prte-st-master-2560223@0,0] is reachable via component tcp
[st27:801326] [prte-st-master-2560223@0,2] oob:tcp: working peer [prte-st-master-2560223@0,0] address tcp://10.15.3.34,172.16.0.254,192.168.0.254:35289:24,23,23
[st27:801326] [prte-st-master-2560223@0,2]: peer [prte-st-master-2560223@0,0] is reachable via component tcp
[st27:801326] [prte-st-master-2560223@0,2] oob:tcp:send_nb to peer [prte-st-master-2560223@0,0]:10 seq = -1
[st27:801326] [prte-st-master-2560223@0,2]:[oob_tcp.c:190] processing send to peer [prte-st-master-2560223@0,0]:10 seq_num = -1 via [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2]:[oob_tcp.c:204] queue pending to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2] tcp:send_nb: initiating connection to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2]:[oob_tcp.c:216] connect to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2] oob:tcp:peer creating socket to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2]:[oob_tcp_connection.c:1068] connect to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2] oob:tcp:peer creating socket to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2]:[oob_tcp_connection.c:1068] connect to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2] oob:tcp:peer creating socket to [prte-st-master-2560223@0,0]
[st27:801326] [prte-st-master-2560223@0,2]:[oob_tcp_connection.c:1068] connect to [prte-st-master-2560223@0,0]
------------------------------------------------------------
A process or daemon was unable to complete a TCP connection
to another process:
  Local host:    st27
  Remote host:   192.168.0.254
This is usually caused by a firewall on the remote host. Please
check that any firewall (e.g., iptables) has been disabled and
try again.
------------------------------------------------------------
[st27:801326] [prte-st-master-2560223@0,2]:errmgr_prted.c(365) updating exit status to 1
[st27:801326] RML-CANCEL(15): base/grpcomm_base_frame.c:prte_grpcomm_base_close:82

rhc54 commented Jul 26, 2023

Just for clarification - is this that Slurm environment again? If so, that could well be the problem.

BhattaraiRajat (Author) commented:

I am running under a Slurm allocation, but with prrte and pmix built with the --with-slurm=no option.

rhc54 commented Jul 26, 2023

Sounds suspicious - try running prte --display alloc --prtemca ras_base_verbose 10 --hostfile hostfile and let's see what it thinks it got.

BhattaraiRajat (Author) commented:

This is the result of prte --display alloc --prtemca ras_base_verbose 10 --hostfile hostfile after it receives the prun commands.

[st-master:2612696] [prte-st-master-2612696@0,0] ras:base:allocate
[st-master:2612696] [prte-st-master-2612696@0,0] ras:base:allocate allocation already read
[st-master:2612696] [prte-st-master-2612696@0,0] ras:base:add_hosts checking add-hostfile /home/rbhattara/add-host-debug/add_hostfile
[st-master:2612696] [prte-st-master-2612696@0,0] ras:base:node_insert inserting 1 nodes
[st-master:2612696] [prte-st-master-2612696@0,0] ras:base:node_insert node st27 slots 64
[st-master:2612696] [prte-st-master-2612696@0,0] hostfile: checking hostfile /home/rbhattara/add-host-debug/hostfile for nodes
[st-master:2612696] [prte-st-master-2612696@0,0] hostfile: node st26 is being included - keep all is FALSE
[st-master:2612696] [prte-st-master-2612696@0,0] hostfile: adding node st26 slots 64
[st-master:2612696] [prte-st-master-2612696@0,0] ras:base:allocate
[st-master:2612696] [prte-st-master-2612696@0,0] ras:base:allocate allocation already read
[st-master:2612696] [prte-st-master-2612696@0,0] hostfile: filtering nodes through hostfile /home/rbhattara/add-host-debug/add_hostfile
[st-master:2612696] [prte-st-master-2612696@0,0] hostfile: node st27 is being included - keep all is FALSE
[st-master:2612696] UNSUPPORTED DAEMON ERROR STATE: NO PATH TO TARGET
[st-master:2612696] UNSUPPORTED DAEMON ERROR STATE: NO PATH TO TARGET

rhc54 commented Jul 26, 2023

Sigh - I want to see the output when it immediately starts up, please.

rhc54 commented Jul 26, 2023

Question: is prte itself running on the node in your initial hostfile? Or is it running on a login node that isn't included in the hostfile?

BhattaraiRajat (Author) commented:

Sigh - I want to see the output when it immediately starts up, please.

I am sorry. Here is the full output.

[st-master add-host-debug]$ prte --display alloc --prtemca ras_base_verbose 10 --hostfile hostfile --report-uri dvm.uri
[st-master:2817884] mca: base: component_find: searching NULL for ras components
[st-master:2817884] mca: base: find_dyn_components: checking NULL for ras components
[st-master:2817884] pmix:mca: base: components_register: registering framework ras components
[st-master:2817884] pmix:mca: base: components_register: found loaded component simulator
[st-master:2817884] pmix:mca: base: components_register: component simulator register function successful
[st-master:2817884] pmix:mca: base: components_register: found loaded component pbs
[st-master:2817884] pmix:mca: base: components_register: component pbs register function successful
[st-master:2817884] mca: base: components_open: opening ras components
[st-master:2817884] mca: base: components_open: found loaded component simulator
[st-master:2817884] mca: base: components_open: found loaded component pbs
[st-master:2817884] mca: base: components_open: component pbs open function successful
[st-master:2817884] mca:base:select: Auto-selecting ras components
[st-master:2817884] mca:base:select:(  ras) Querying component [simulator]
[st-master:2817884] mca:base:select:(  ras) Querying component [pbs]
[st-master:2817884] mca:base:select:(  ras) No component selected!
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:allocate
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:allocate nothing found in module - proceeding to hostfile
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:allocate adding hostfile hostfile
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: checking hostfile hostfile for nodes
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: node st26 is being included - keep all is FALSE
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: adding node st26 slots 64
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:node_insert inserting 1 nodes
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:node_insert node st26 slots 64
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: checking hostfile hostfile for nodes
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: node st26 is being included - keep all is FALSE
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: adding node st26 slots 64
DVM ready
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:allocate
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:allocate allocation already read
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: checking hostfile /home/rbhattara/add-host-debug/hostfile for nodes
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: node st26 is being included - keep all is FALSE
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: adding node st26 slots 64
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:add_hosts checking add-hostfile /home/rbhattara/add-host-debug/add_hostfile
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:node_insert inserting 1 nodes
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:node_insert node st27 slots 64
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:allocate
[st-master:2817884] [prte-st-master-2817884@0,0] ras:base:allocate allocation already read
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: filtering nodes through hostfile /home/rbhattara/add-host-debug/add_hostfile
[st-master:2817884] [prte-st-master-2817884@0,0] hostfile: node st27 is being included - keep all is FALSE
[st-master:2817884] UNSUPPORTED DAEMON ERROR STATE: NO PATH TO TARGET
[st-master:2817884] UNSUPPORTED DAEMON ERROR STATE: NO PATH TO TARGET

Question: is prte itself running on the node in your initial hostfile? Or is it running on a login node that isn't included in the hostfile?

prte is running on the login node, st-master. The hostfile contains st26.

BhattaraiRajat (Author) commented:

I found another strange issue while running prun with the --add-hostfile option.
Sometimes prun seems to be running one more copy of the program than specified with the -np option.

[st-master add-host-debug]$ prun --dvm-uri file:dvm.uri --add-hostfile add_hostfile -np 2 hostname
st26
st26
[st-master add-host-debug]$ prun --dvm-uri file:dvm.uri --add-hostfile add_hostfile -np 2 hostname
st26
st26
st26

rhc54 commented Jul 26, 2023

Okay, that confirms the setup - no Slurm interactions. I'm afraid this will take a while to track down. It's some kind of race condition, though the precise nature of it remains hard to see. Unfortunately, I'm pretty occupied right now, which will further delay things.

I'd suggest you run those prun cmds sequentially for now as that seems to be working.

Sometimes prun seems to be running one more copy of the program than specified with the -np option.

No ideas - I can try to reproduce, but don't know if/when I'll be able to do so.
