ValueError: host not found: Name or service not known in _env_rendezvous_handler #1542

Closed
newwhitecheng opened this issue Apr 21, 2020 · 9 comments
Labels: bug (Something isn't working), help wanted (Open to be worked on)

Comments

@newwhitecheng

🐛 Bug

To Reproduce

Steps to reproduce the behavior:

  1. Go to pl_examples/basic_examples/
  2. Modify the submit script to fit your environment:
# SLURM SUBMIT SCRIPT
#SBATCH --account=gpu-s2-intelperf-0
#SBATCH --partition=gpu-s2-core-0
#SBATCH --nodes=2
#SBATCH --gres=gpu:2
#SBATCH --time=0-02:00:00
#SBATCH --ntasks-per-node=2
# activate conda env
source  activate $1
export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1
export NCCL_SOCKET_IFNAME=^ib0,lo
# ib0 is looked up from ifconfig
module load cuda10.1
# run script from above
srun python multi_node_ddp_demo.py
  3. See the error:
Traceback (most recent call last):
  File "multi_node_ddp_demo.py", line 51, in <module>
    main(hyperparams)
  File "multi_node_ddp_demo.py", line 37, in main
    trainer.fit(model)
  File "/data/gpfs/home/hsinpaic/anaconda3/envs/pose/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 684, in fit
    self.ddp_train(task, model)
  File "/data/gpfs/home/hsinpaic/anaconda3/envs/pose/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 308, in ddp_train
    model.init_ddp_connection(self.proc_rank, self.world_size)
  File "/data/gpfs/home/hsinpaic/anaconda3/envs/pose/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 951, in init_ddp_connection
    torch_distrib.init_process_group('nccl', rank=proc_rank, world_size=world_size)
  File "/data/gpfs/home/hsinpaic/anaconda3/envs/pose/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 397, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/data/gpfs/home/hsinpaic/anaconda3/envs/pose/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 168, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon)
ValueError: host not found: Name or service not known
srun: error: gpu-1: task 2: Exited with exit code 1
srun: error: gpu-1: task 3: Exited with exit code 1
srun: error: gpu-0: task 0: Exited with exit code 1
srun: error: gpu-0: task 1: Exited with exit code 1

Code sample

Expected behavior

I was following the README in the basic_examples folder. The single-node example runs fine, but multi-node training fails with the error above.
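
From the traceback, the ValueError comes from TCPStore failing to resolve whatever hostname ends up in MASTER_ADDR ("Name or service not known" is the usual getaddrinfo message for an unresolvable name). A minimal sketch that reproduces the same lookup failure directly, with a placeholder hostname:

import socket

hostname = "some-master-node"  # placeholder: whatever value ends up in MASTER_ADDR

try:
    print(hostname, "->", socket.gethostbyname(hostname))
except socket.gaierror as err:
    # Same condition that TCPStore surfaces as "host not found"
    print("host not found:", err)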

Environment

* CUDA:
	- GPU:
		- Tesla P100-SXM2-16GB
		- Tesla P100-SXM2-16GB
	- available:         True
	- version:           10.1
* Packages:
	- numpy:             1.18.1
	- pyTorch_debug:     False
	- pyTorch_version:   1.4.0
	- pytorch-lightning: 0.7.3
	- tensorboard:       1.14.0
	- tqdm:              4.45.0
* System:
	- OS:                Linux
	- architecture:
		- 64bit
		-
	- processor:         x86_64
	- python:            3.6.10
	- version:           #1 SMP Mon Jul 29 17:46:05 UTC 2019
newwhitecheng added the bug and help wanted labels on Apr 21, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@mmiakashs

@newwhitecheng I am also facing the same issue. Do you have any update or solution?

@newwhitecheng
Author

My best guess is that NCCL is not installed properly, but I'm not certain, since I don't have root access to install it.
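
For what it's worth, the NCCL bundled with the PyTorch wheel can be checked without root access; a quick sketch (note that the rendezvous failure above happens before NCCL is ever used, so a broken NCCL probably isn't the cause):

import torch
import torch.distributed as dist

# The CUDA builds of PyTorch ship their own NCCL, so no system install is needed.
print("NCCL available:", dist.is_nccl_available())
print("NCCL version:", torch.cuda.nccl.version())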

@isekulic

isekulic commented Jun 7, 2020

@newwhitecheng @mmiakashs
I am facing the same issue. Any solution?

Single node works fine, but multi-node results in the exact same error. I've set NCCL_SOCKET_IFNAME=ipogif0 based on the ifconfig output and verified that the nodes can ping each other through that interface.

I've put
print(master_addr, master_port, world_size, start_daemon, timeout)
just before the line where the error occurs:

File "/users/isekulic/evn2/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 173, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)

and got:

nid002326 20426 2 True 0:30:00
nid002326 20426 2 False 0:30:00

I'm trying to use 2 nodes with 1 GPU each. os.environ['SLURM_NODELIST'] returns: nid0[2326-2327]

Btw, when I don't set NCCL_SOCKET_IFNAME=ipogif0 I get only one line of output from that print. I'm struggling to understand what's going on; any suggestion about where to look for an answer is appreciated!
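
A less invasive way to surface the same values, without editing the installed torch package, is to override the hook that shows up in the traceback. A rough sketch for PL 0.7.x (the class body is elided; as far as I can tell, the default implementation fills MASTER_ADDR in from SLURM_NODELIST before calling init_process_group):

import os
import pytorch_lightning as pl

class MyModel(pl.LightningModule):
    # ... the usual forward / training_step / configure_optimizers go here ...

    def init_ddp_connection(self, proc_rank, world_size):
        # Log the inputs the default rendezvous is built from, then hand
        # control back to the default implementation.
        print("rank", proc_rank,
              "SLURM_NODELIST", os.environ.get("SLURM_NODELIST"),
              "MASTER_ADDR", os.environ.get("MASTER_ADDR"),
              "MASTER_PORT", os.environ.get("MASTER_PORT"),
              flush=True)
        super().init_ddp_connection(proc_rank, world_size)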

@isekulic

isekulic commented Jun 8, 2020

Found the problem: an extra '0' gets added to the node name. It's even visible in the output I pasted above (master_addr is nid002326, while the node is nid02326). I don't know why it happens yet; I'll look into it and find the line responsible for the extra '0' in master_addr.

@isekulic

isekulic commented Jun 9, 2020

I updated to PL 0.8.0 and the extra '0' in the node name is no longer there.
pip install pytorch_lightning==0.8.0rc1
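
For anyone stuck on an older version: a sketch for checking what the master hostname should be, independent of Lightning's own parsing (assumes scontrol is available inside the job; compare the first entry against the master_addr printed by the rendezvous):

import os
import socket
import subprocess

# Expand SLURM's compressed node list, e.g. "nid0[2326-2327]" ->
# ["nid02326", "nid02327"]; the first entry is where the rendezvous
# should point.
nodelist = os.environ["SLURM_NODELIST"]
hosts = subprocess.check_output(
    ["scontrol", "show", "hostnames", nodelist],
    universal_newlines=True,
).splitlines()

master = hosts[0]
print("expected master:", master)
try:
    print("resolves to:", socket.gethostbyname(master))
except socket.gaierror as err:
    print("does not resolve:", err)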

@mmiakashs

@isekulic Is it working now?
If yes, could you share your project folder structure and how you call the training scripts? Thanks in advance.

@isekulic

isekulic commented Jun 10, 2020

@mmiakashs yes, this part seems OK now. However, I'm facing a separate issue:
Unidentified node: Error detected by libibgni.so. Subsequent operation may be unreliable. IAA did not recognize this as an MPI process
This has nothing to do with PL, though (I think). I'm getting faster performance when using multiple GPUs anyway, so I'm not sure how relevant the above message is.

My current setup is quite cluster-dependent, but it might be helpful anyway, so I'll post the SLURM script:

#!/bin/bash -l
#SBATCH --constraint=gpu
#SBATCH --partition=normal
#SBATCH --nodes=4
#SBATCH --ntasks-per-core=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1
#SBATCH --mem=0

module load daint-gpu
module load cudatoolkit
module load mpiP # not sure if needed

source ~env/bin/activate

export NCCL_DEBUG=INFO
export PYTHONFAULTHANDLER=1

# cluster specific
export NCCL_IB_HCA=ipogif0 
export NCCL_IB_CUDA_SUPPORT=1

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

srun python run.py --gpus=1 \
    --num_nodes=4 \
    --num_workers=0 \
    --distributed_backend='ddp' 

To clarify, my cluster has only 1 GPU per node.

Relevant part of Trainer:

trainer = pl.Trainer(
            gpus=hparams.gpus, # =1
            num_nodes=hparams.num_nodes, # =4
            distributed_backend=hparams.distributed_backend) # ='ddp'

I hope it helps. Make sure to check that master_addr and master_port are correct and reachable; that was the reason I got this error in _env_rendezvous_handler.
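
On the "correct and reachable" point, a small probe sketch: run it from a worker node while the rank-0 task is already sitting in init_process_group (the address and port are just the example values from earlier in this thread):

import socket

master_addr, master_port = "nid02326", 20426  # example values; replace with yours

# Rank 0 starts the TCPStore server on MASTER_PORT; if this connect fails
# from another node while rank 0 is waiting, the rendezvous cannot succeed.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(5)
    try:
        sock.connect((master_addr, master_port))
        print("reachable")
    except OSError as err:
        print("not reachable:", err)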

@williamFalcon
Contributor

sounds solved! we can reopen otherwise
