
different gpus to train #3736

Closed
alicera opened this issue Jun 23, 2021 · 4 comments

Labels
bug Something isn't working

Comments

alicera commented Jun 23, 2021

docker: pytorch-21.03
Driver Version: 460.73.01
GPU:
CUDA:0 (GeForce GTX 1080 Ti, 11178.5MB)
CUDA:1 (GeForce GTX 1080 Ti, 11178.5MB)
CUDA:2 (GeForce GTX 1080 Ti, 11178.5MB)
CUDA:3 (GeForce GTX TITAN X, 12212.8125MB)
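
For reference, a device listing in this format can be reproduced with a short PyTorch query (a minimal sketch, assuming the same PyTorch container environment as above):

import torch

# Enumerate the visible CUDA devices and print each device's name and
# total memory in MiB, matching the format of the listing above.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} ({props.name}, {props.total_memory / 1024 ** 2}MB)")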

Command: python -m torch.distributed.launch --nproc_per_node 4 train.py --resume

Traceback (most recent call last):
File "train.py", line 541, in
train(hyp, opt, device, tb_writer)
File "train.py", line 304, in train
loss, loss_items = compute_loss(pred, targets.to(device)) # loss scaled by batch_size
RuntimeError: CUDA error: the launch timed out and was terminated
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fd165a3e5cc in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fd165a04d4e in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x987 (0x7fd165a7f6f7 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x5c (0x7fd165a244cc in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x29a (0x7fd1b2b3bd7a in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x1c4 (0x7fd1b2b31444 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x16 (0x7fd1b2b642c6 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7fd1b25ebf58 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_ptr<c10d::Logger*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x22 (0x7fd1b2b697f2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7fd1b25ebf58 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0xc700e5 (0x7fd1b2b680e5 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x6ff782 (0x7fd1b25f7782 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x700743 (0x7fd1b25f8743 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0x12b785 (0x5565291cb785 in /opt/conda/bin/python)
frame #14: <unknown function> + 0x1ca984 (0x55652926a984 in /opt/conda/bin/python)
frame #15: <unknown function> + 0x11f906 (0x5565291bf906 in /opt/conda/bin/python)
frame #16: <unknown function> + 0x12bc96 (0x5565291cbc96 in /opt/conda/bin/python)
frame #17: <unknown function> + 0x12bc4c (0x5565291cbc4c in /opt/conda/bin/python)
frame #18: <unknown function> + 0x154ec8 (0x5565291f4ec8 in /opt/conda/bin/python)
frame #19: PyDict_SetItemString + 0x87 (0x5565291f6127 in /opt/conda/bin/python)
frame #20: PyImport_Cleanup + 0x9a (0x5565292f65aa in /opt/conda/bin/python)
frame #21: Py_FinalizeEx + 0x7d (0x5565292f694d in /opt/conda/bin/python)
frame #22: Py_RunMain + 0x110 (0x5565292f77f0 in /opt/conda/bin/python)
frame #23: Py_BytesMain + 0x39 (0x5565292f7979 in /opt/conda/bin/python)
frame #24: __libc_start_main + 0xf3 (0x7fd1e12bf0b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #25: <unknown function> + 0x1e7185 (0x556529287185 in /opt/conda/bin/python)

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size

0%| | 0/1195 [00:00<?, ?it/s]Killing subprocess 20868
Killing subprocess 20869
Killing subprocess 20870
Killing subprocess 20871
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=3', '--resume', 'runs/train/exp/weights/last.pt']' died with <Signals.SIGABRT: 6>.
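
Note: CUDA kernel launches are asynchronous, so the Python line reported above may not be the one whose kernel actually failed. A standard PyTorch debugging step (not mentioned in the original thread) is to re-run with synchronous launches so the traceback points at the failing op:

Command: CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node 4 train.py --resume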

@alicera alicera added the bug Something isn't working label Jun 23, 2021
glenn-jocher (Member) commented Jun 23, 2021

@alicera for Multi-GPU training it's recommended to use an even GPU count (2, 4, 8) and identical GPU models.
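
One possible workaround (a sketch, assuming CUDA:0 and CUDA:1 are two of the identical 1080 Tis shown in the device list above): hide the mismatched TITAN X with CUDA_VISIBLE_DEVICES and launch on an even number of identical cards:

Command: CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node 2 train.py --resume

The visible devices are renumbered from 0 inside the process, so --nproc_per_node must match the number of GPUs listed in CUDA_VISIBLE_DEVICES.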

alicera (Author) commented Jun 23, 2021

But the TITAN X and the 1080 Ti should be usable at the same time: I once trained for 200 epochs on this mix and no error happened.
Do you have any other ideas?

glenn-jocher (Member) commented Jun 23, 2021

@alicera well for starters Ultralytics will never be able to reproduce this error on this hardware combination, so there's no action for us to take.

We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible that still produces the same problem
  • Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits (see the example commands after this list).
  • Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.
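
For example, a minimal update sketch (the repository URL is an assumption inferred from context, not stated in this thread):

Command: git pull        (inside an existing clone)
Command: git clone https://github.com/ultralytics/yolov5        (to fetch a fresh copy)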

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

alicera (Author) commented Jun 23, 2021

Thank you! 😃

@alicera alicera closed this as completed Jun 23, 2021