Error with nccl_mpi_all_reduce on multinode system #97

Open
vilmara opened this issue May 15, 2018 · 6 comments
@vilmara

vilmara commented May 15, 2018

Hi all,

What is the command line to run nccl_mpi_all_reduce on a multi-node system (2 nodes with 4 GPUs each)? I am getting the error below when typing this command:

DeepBench/code$ mpirun -np 8 bin/nccl_mpi_all_reduce


WARNING: There is at least non-excluded one OpenFabrics device found, but there are no active ports detected (or Open MPI was unable to use them). This is most certainly not what you wanted. Check your cables, subnet manager configuration, etc. The openib BTL will be ignored for this job.

Local host: C4-1

terminate called after throwing an instance of 'std::runtime_error'
what(): Failed to set cuda device

When running with only 4 ranks, I get this output:

DeepBench/code$ mpirun -np 4 bin/nccl_mpi_all_reduce


WARNING: There is at least non-excluded one OpenFabrics device found,
but there are no active ports detected (or Open MPI was unable to use
them). This is most certainly not what you wanted. Check your
cables, subnet manager configuration, etc. The openib BTL will be
ignored for this job.

Local host: C4-1

NCCL MPI AllReduce
Num Ranks: 4

# of floats    bytes transferred    Avg Time (msec)    Max Time (msec)

[C4130-1:04094] 3 more processes have sent help message help-mpi-btl-openib.txt / no active ports found
[C4130-1:04094] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
100000 400000 0.148489 0.148565
3097600 12390400 2.63694 2.63695
4194304 16777216 3.57147 3.57148
6553600 26214400 5.59742 5.59744
16777217 67108868 81.9391 81.9396
38360000 153440000 32.6457 32.6462

Thanks

@sharannarang
Contributor

@mpatwary, can you help with this?

@mpatwary
Collaborator

It looks like you are using the right command, and I think the problem is unrelated to nccl_mpi_all_reduce. Do the other MPI implementations, such as ring_all_reduce and osu_allreduce, run well? I suspect the problem could be the setup. Does any other MPI code run well on your system?
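
For the setup check, a minimal sanity test could look like the following (a sketch only; node1 and node2 are placeholder hostnames, substitute the actual hosts or their IB addresses):

$ mpirun -np 2 -H node1,node2 hostname    # should print one hostname from each node
$ mpirun -np 2 -H node1,node2 bin/osu_allreduce    # once the MPI-only benchmark builds; the path may differ

If even the hostname test fails across nodes, the problem is in the MPI launch setup rather than in nccl_mpi_all_reduce.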

@vilmara
Author

vilmara commented May 23, 2018

Hi @mpatwary, my system has 2 nodes, each with 4 P100 GPUs (8 GPUs total), connected using InfiniBand. I was wondering: how does mpirun communicate between the nodes to run the distributed benchmark?

ring_all_reduce and osu_allreduce are throwing errors when I compile the DeepBench benchmarks:

Compilation:
make CUDA_PATH=/usr/local/cuda-9.1 CUDNN_PATH=/usr/local/cuda/include/ MPI_PATH=/home/dell/.openmpi/ NCCL_PATH=/home/$USER/.openmpi/ ARCH=sm_60

Output and errors:
mkdir -p bin
make -C nvidia
make[1]: Entering directory '/home/dell/DeepBench/code/nvidia'
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc gemm_bench.cu -DUSE_TENSOR_CORES=0 -DPAD_KERNELS=1 -o bin/gemm_bench -I ../kernels/ -I /usr/local/cuda-9.1/include -L /usr/local/cuda-9.1/lib64 -lcublas -L /usr/local/cuda-9.1/lib64 -lcurand --generate-code arch=compute_60,code=sm_60 -std=c++11
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc conv_bench.cu -DUSE_TENSOR_CORES=0 -DPAD_KERNELS=1 -o bin/conv_bench -I ../kernels/ -I /usr/local/cuda-9.1/include -I /usr/local/cuda/include//include/ -L /usr/local/cuda/include//lib64/ -L /usr/local/cuda-9.1/lib64 -lcurand -lcudnn --generate-code arch=compute_60,code=sm_60 -std=c++11
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc rnn_bench.cu -DUSE_TENSOR_CORES=0 -o bin/rnn_bench -I ../kernels/ -I /usr/local/cuda-9.1/include -I /usr/local/cuda/include//include/ -L /usr/local/cuda/include//lib64/ -L /usr/local/cuda-9.1/lib64 -lcurand -lcudnn --generate-code arch=compute_60,code=sm_60 -std=c++11
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc nccl_single_all_reduce.cu -o bin/nccl_single_all_reduce -I ../kernels/ -I /home/root/.openmpi//include/ -I /usr/local/cuda/include//include/ -L /home/root/.openmpi//lib/ -L /usr/local/cuda/include//lib64 -lnccl -lcudart -lcurand --generate-code arch=compute_60,code=sm_60 -std=c++11
mkdir -p bin
/usr/local/cuda-9.1/bin/nvcc nccl_mpi_all_reduce.cu -o bin/nccl_mpi_all_reduce -I ../kernels/ -I /home/root/.openmpi//include/ -I /usr/local/cuda/include//include/ -I /home/dell/.openmpi//include -L /home/root/.openmpi//lib/ -L /usr/local/cuda/include//lib64 -L /home/dell/.openmpi//lib -lnccl -lcurand -lcudart -lmpi --generate-code arch=compute_60,code=sm_60 -std=c++11
make[1]: Leaving directory '/home/dell/DeepBench/code/nvidia'
cp nvidia/bin/* bin
rm -rf nvidia/bin
mkdir -p bin
make -C osu_allreduce
make[1]: Entering directory '/home/dell/DeepBench/code/osu_allreduce'
mkdir -p bin
gcc -o bin/osu_coll.o -c -O2 -pthread -Wall -march=native -I/usr/local/cuda-9.1/include -I/home/dell/.openmpi//include osu_coll.c
gcc -o bin/osu_allreduce.o -c -O2 -pthread -Wall -march=native -I ../kernels/ -I/usr/local/cuda-9.1/include -I/home/dell/.openmpi//include osu_allreduce.c
gcc -o bin/osu_allreduce -pthread -Wl,--enable-new-dtags -Wl,-rpath=/usr/local/cuda-9.1/lib64 -Wl,-rpath=/home/dell/.openmpi//lib bin/osu_allreduce.o bin/osu_coll.o -L/usr/local/cuda-9.1/lib64 -L/home/dell/.openmpi//lib -lstdc++ -lmpi_cxx -lmpi -lcuda
/usr/bin/ld: cannot find -lmpi_cxx
collect2: error: ld returned 1 exit status
Makefile:17: recipe for target 'build' failed
make[1]: *** [build] Error 1
make[1]: Leaving directory '/home/dell/DeepBench/code/osu_allreduce'
Makefile:6: recipe for target 'osu_allreduce' failed
make: *** [osu_allreduce] Error 2
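
A hedged aside: the "cannot find -lmpi_cxx" failure usually means this Open MPI installation does not ship the C++ bindings library; Open MPI 4.x, which matches the libmpi.so.40 mentioned below, removed them. A quick check, as a sketch against the prefix used in the build above:

$ ls /home/dell/.openmpi/lib | grep -i mpi_cxx    # no output means no C++ bindings library

If it is absent, the osu_allreduce link line would need -lmpi_cxx removed, or an Open MPI build that still provides it.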

I have recompiled and run it again with 4 and 8 GPUs, but now I get the error below:

mpirun --allow-run-as-root -np 8 bin/nccl_mpi_all_reduce
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.

bin/nccl_mpi_all_reduce: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
bin/nccl_mpi_all_reduce: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
bin/nccl_mpi_all_reduce: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
bin/nccl_mpi_all_reduce: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
bin/nccl_mpi_all_reduce: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
bin/nccl_mpi_all_reduce: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
bin/nccl_mpi_all_reduce: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
bin/nccl_mpi_all_reduce: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[41026,1],0]
Exit code: 127

@mpatwary
Collaborator

It looks like the binary is not picking up the path to the MPI lib directory. You can try exporting that.
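
A sketch of what that export could look like, assuming the Open MPI prefix from the build log above (/home/dell/.openmpi; adjust to the actual install path):

$ export LD_LIBRARY_PATH=/home/dell/.openmpi/lib:$LD_LIBRARY_PATH
$ mpirun -x LD_LIBRARY_PATH -np 8 bin/nccl_mpi_all_reduce    # -x forwards the variable to the launched ranks

Open MPI's mpirun also accepts --prefix /home/dell/.openmpi, which sets PATH and LD_LIBRARY_PATH for the remote processes.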

@vilmara
Author

vilmara commented May 29, 2018

@mpatwary, thanks for your prompt reply. I exported that and got other errors. My system has 2 nodes, each with 4 P100 GPUs (8 GPUs total), connected using InfiniBand; I am wondering how mpirun communicates between the nodes to run the distributed benchmark. It looks like the command mpirun --allow-run-as-root -np 8 bin/nccl_mpi_all_reduce only considers the host node; my understanding is that mpirun should receive the -H flag with the IB addresses of both servers (I tried this option but got errors too). Can you share the command line you have used to run DeepBench nccl_mpi_all_reduce on multi-node, multi-GPU systems?
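
For reference, a sketch of what a two-node launch could look like (node1 and node2 are placeholder hostnames or IB addresses; the binary must exist at the same path on both nodes):

$ mpirun --allow-run-as-root -np 8 -H node1:4,node2:4 -x LD_LIBRARY_PATH -x PATH bin/nccl_mpi_all_reduce

The node1:4 slot syntax needs Open MPI 3.0 or newer; with older versions, a hostfile containing "node1 slots=4" and "node2 slots=4" works instead:

$ mpirun --allow-run-as-root -np 8 --hostfile hosts -x LD_LIBRARY_PATH bin/nccl_mpi_all_reduce

Without -H or --hostfile, mpirun launches all ranks on the local node only.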

Here is the error I get when using just the 4 GPUs of the host server:
mpirun --allow-run-as-root -np 4 bin/nccl_mpi_all_reduce
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[host-P100-2:10830] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

Process name: [[36721,1],0]
Exit code: 1

@laserrapt0r

I have a problem here as well. The normal single version works fine, and all other MPI applications are working, but I get this:

NCCL MPI AllReduce
Num Ranks: 2

# of floats    bytes transferred    Avg Time (msec)    Max Time (msec)

terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL failure: unhandled cuda error in nccl_mpi_all_reduce.cu at line: 86 rank: 0

[jetson-3:29969] *** Process received signal ***
[jetson-3:29969] Signal: Aborted (6)
[jetson-3:29969] Signal code: (-6)
what(): NCCL failure: unhandled cuda error in nccl_mpi_all_reduce.cu at line: 86 rank: 1

[jetson-2:08669] *** Process received signal ***
[jetson-2:08669] Signal: Aborted (6)
[jetson-2:08669] Signal code: (-6)
[jetson-2:08669] *** End of error message ***
[jetson-3:29969] *** End of error message ***

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.


mpiexec noticed that process rank 0 with PID 0 on node jetson-3 exited on signal 6 (Aborted).
