ValueError: host not found: Name or service not known in _env_rendezvous_handler #1542

Closed
newwhitecheng opened this issue Apr 21, 2020 · 9 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments

@newwhitecheng

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. Go to pl_examples/basic_examples/
  2. Modify the submit script to fit your environment:
# SLURM SUBMIT SCRIPT
#SBATCH --account=gpu-s2-intelperf-0
#SBATCH --partition=gpu-s2-core-0
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --time=0-02:00:00
#SBATCH --ntasks-per-node=2
# activate conda env
source  activate $1
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
export NCCL_SOCKET_IFNAME=^ib0,lo
# ib0 is looked up from ifconfig
module load cuda10.1
# run script from above
srun python multi_node_ddp_demo.py
  3. See the error:
Traceback (most recent call last):
  File "multi_node_ddp_demo.py", line 51, in <module>
    main(hyperparams)
  File "multi_node_ddp_demo.py", line 37, in main
    trainer.fit(model)
  File "/data/gpfs/home/hsinpaic/anaconda3/envs/pose/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 684, in fit
    self.ddp_train(task, model)
  File "/data/gpfs/home/hsinpaic/anaconda3/envs/pose/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 308, in ddp_train
    model.init_ddp_connection(self.proc_rank, self.world_size)
  File "/data/gpfs/home/hsinpaic/anaconda3/envs/pose/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 951, in init_ddp_connection
    torch_distrib.init_process_group('nccl', rank=proc_rank, world_size=world_size)
  File "/data/gpfs/home/hsinpaic/anaconda3/envs/pose/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 397, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/data/gpfs/home/hsinpaic/anaconda3/envs/pose/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 168, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
ValueError: host not found: Name or service not known
srun: error: gpu-1: task 2: Exited with exit code 1
srun: error: gpu-1: task 3: Exited with exit code 1
srun: error: gpu-0: task 0: Exited with exit code 1
srun: error: gpu-0: task 1: Exited with exit code 1

Code sample

Expected behavior

I was following the README in the basic_examples folder. The single-node example runs fine, but multi-node training fails with the error above.
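
From the traceback, the ValueError comes from TCPStore failing to resolve whatever hostname ends up in MASTER_ADDR ("Name or service not known" is the usual getaddrinfo message for an unresolvable name). A minimal sketch that reproduces the same lookup failure directly, with a placeholder hostname:

import socket

hostname = "some-master-node"  # placeholder: whatever value ends up in MASTER_ADDR

try:
    print(hostname, "->", socket.gethostbyname(hostname))
except socket.gaierror as err:
    # Same condition that TCPStore surfaces as "host not found"
    print("host not found:", err)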

Environment

* CUDA:
	- GPU:
		- Tesla P100-SXM2-16GB
		- Tesla P100-SXM2-16GB
	- available:         True
	- version:           10.1
* Packages:
	- numpy:             1.18.1
	- pyTorch_debug:     False
	- pyTorch_version:   1.4.0
	- pytorch-lightning: 0.7.3
	- tensorboard:       1.14.0
	- tqdm:              4.45.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		-
	- processor:         x86_64
	- python:            3.6.10
	- version:           #1 SMP Mon Jul 29 17:46:05 UTC 2019
newwhitecheng added the bug and help wanted labels on Apr 21, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@mmiakashs

@newwhitecheng I am also facing the same issue. Do you have any update or solution?

@newwhitecheng
Author

My best guess is that NCCL is not installed properly, but I'm not certain, since I don't have root access to install it.
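
For what it's worth, the NCCL bundled with the PyTorch wheel can be checked without root access; a quick sketch (note that the rendezvous failure above happens before NCCL is ever used, so a broken NCCL probably isn't the cause):

import torch
import torch.distributed as dist

# The CUDA builds of PyTorch ship their own NCCL, so no system install is needed.
print("NCCL available:", dist.is_nccl_available())
print("NCCL version:", torch.cuda.nccl.version())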

@isekulic

isekulic commented Jun 7, 2020

@newwhitecheng @mmiakashs
I am facing the same issue. Any solution?

Single node works fine, but multi-node results in the exact same error. I've set NCCL_SOCKET_IFNAME=ipogif0 based on the ifconfig output and verified that the nodes can ping each other through that interface.

I've put
print(master_addr, master_port, world_size, start_daemon, timeout)
just before the line where the error occurs:

File "/users/isekulic/evn2/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 173, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)

and got:

nid002326 20426 2 True 0:30:00
nid002326 20426 2 False 0:30:00

I'm trying to use 2 nodes with 1 GPU each. os.environ['SLURM_NODELIST'] returns: nid0[2326-2327]

Btw, when I don't set NCCL_SOCKET_IFNAME=ipogif0 I get only one line of output from that print. I'm struggling to understand what's going on; any suggestion about where to look for an answer is appreciated!
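
A less invasive way to surface the same values, without editing the installed torch package, is to override the hook that shows up in the traceback. A rough sketch for PL 0.7.x (the class body is elided; as far as I can tell, the default implementation fills MASTER_ADDR in from SLURM_NODELIST before calling init_process_group):

import os
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    # ... the usual forward / training_step / configure_optimizers go here ...

    def init_ddp_connection(self, proc_rank, world_size):
        # Log the inputs the default rendezvous is built from, then hand
        # control back to the default implementation.
        print("rank", proc_rank,
              "SLURM_NODELIST", os.environ.get("SLURM_NODELIST"),
              "MASTER_ADDR", os.environ.get("MASTER_ADDR"),
              "MASTER_PORT", os.environ.get("MASTER_PORT"),
              flush=True)
        super().init_ddp_connection(proc_rank, world_size)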

@isekulic

isekulic commented Jun 8, 2020

Found the problem: an extra '0' gets added to the node name. It's even visible in the output I pasted above (master_addr is nid002326, while the node is nid02326). I don't know why it happens yet; I'll look into it and find the line responsible for the extra '0' in master_addr.

@isekulic

isekulic commented Jun 9, 2020

I updated to PL 0.8.0 and the extra '0' in the node name is no longer there.
pip install pytorch_lightning==0.8.0rc1
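
For anyone stuck on an older version: a sketch for checking what the master hostname should be, independent of Lightning's own parsing (assumes scontrol is available inside the job; compare the first entry against the master_addr printed by the rendezvous):

import os
import socket
import subprocess

# Expand SLURM's compressed node list, e.g. "nid0[2326-2327]" ->
# ["nid02326", "nid02327"]; the first entry is where the rendezvous
# should point.
nodelist = os.environ["SLURM_NODELIST"]
hosts = subprocess.check_output(
    ["scontrol", "show", "hostnames", nodelist],
    universal_newlines=True,
).splitlines()

master = hosts[0]
print("expected master:", master)
try:
    print("resolves to:", socket.gethostbyname(master))
except socket.gaierror as err:
    print("does not resolve:", err)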

@mmiakashs

@isekulic Is it working now?
If yes, could you share your project folder structure and how you call the training scripts? Thanks in advance.

@isekulic

isekulic commented Jun 10, 2020

@mmiakashs yes, this part seems OK now. However, I'm facing a separate issue:
Unidentified node: Error detected by libibgni.so. Subsequent operation may be unreliable. IAA did not recognize this as an MPI process
This has nothing to do with PL, though (I think). I'm getting faster performance when using multiple GPUs anyway, so I'm not sure how relevant the above message is.

My current setup is quite cluster-dependent, but it might be helpful anyway, so I'll post the SLURM script:

#!/bin/bash -l
#SBATCH --constraint=gpu
#SBATCH --partition=normal
#SBATCH --nodes=4
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --mem=0

module load daint-gpu
module load cudatoolkit
module load mpiP # not sure if needed

source ~env/bin/activate

export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1

# cluster specific
export NCCL_IB_HCA=ipogif0 
export NCCL_IB_CUDA_SUPPORT=1

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun python run.py --gpus=1 \
    --num_nodes=4 \
    --num_workers=0 \
    --distributed_backend='ddp' 

To clarify, my cluster has only 1 GPU per node.

Relevant part of Trainer:

trainer = pl.Trainer(
            gpus=hparams.gpus, # =1
            num_nodes=hparams.num_nodes, # =4
            distributed_backend=hparams.distributed_backend) # ='ddp'

I hope it helps. Make sure to check that master_addr and master_port are correct and reachable; that was the reason I got this error in _env_rendezvous_handler.
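
On the "correct and reachable" point, a small probe sketch: run it from a worker node while the rank-0 task is already sitting in init_process_group (the address and port are just the example values from earlier in this thread):

import socket

master_addr, master_port = "nid02326", 20426  # example values; replace with yours

# Rank 0 starts the TCPStore server on MASTER_PORT; if this connect fails
# from another node while rank 0 is waiting, the rendezvous cannot succeed.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(5)
    try:
        sock.connect((master_addr, master_port))
        print("reachable")
    except OSError as err:
        print("not reachable:", err)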

@williamFalcon
Contributor

sounds solved! we can reopen otherwise
