
different gpus to train #3736

Closed
alicera opened this issue Jun 23, 2021 · 4 comments

Labels
bug Something isn't working

Comments

alicera commented Jun 23, 2021

docker: pytorch-21.03
Driver Version: 460.73.01
GPU:
CUDA:0 (GeForce GTX 1080 Ti, 11178.5MB)
CUDA:1 (GeForce GTX 1080 Ti, 11178.5MB)
CUDA:2 (GeForce GTX 1080 Ti, 11178.5MB)
CUDA:3 (GeForce GTX TITAN X, 12212.8125MB)
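
For reference, a device listing in this format can be reproduced with a short PyTorch query (a minimal sketch, assuming the same PyTorch container environment as above):

import torch

# Enumerate the visible CUDA devices and print each device's name and
# total memory in MiB, matching the format of the listing above.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"CUDA:{i} ({props.name}, {props.total_memory / 1024 ** 2}MB)")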

Command: python -m torch.distributed.launch --nproc_per_node 4 train.py --resume

Traceback (most recent call last):
File "train.py", line 541, in
train(hyp, opt, device, tb_writer)
File "train.py", line 304, in train
loss, loss_items = compute_loss(pred, targets.to(device)) # loss scaled by batch_size
RuntimeError: CUDA error: the launch timed out and was terminated
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: the launch timed out and was terminated
Exception raised from create_event_internal at ../c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fd165a3e5cc in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fd165a04d4e in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x987 (0x7fd165a7f6f7 in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x5c (0x7fd165a244cc in /opt/conda/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x29a (0x7fd1b2b3bd7a in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x1c4 (0x7fd1b2b31444 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x16 (0x7fd1b2b642c6 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7fd1b25ebf58 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: std::_Sp_counted_ptr<c10d::Logger*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x22 (0x7fd1b2b697f2 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x48 (0x7fd1b25ebf58 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0xc700e5 (0x7fd1b2b680e5 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: <unknown function> + 0x6ff782 (0x7fd1b25f7782 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #12: <unknown function> + 0x700743 (0x7fd1b25f8743 in /opt/conda/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0x12b785 (0x5565291cb785 in /opt/conda/bin/python)
frame #14: <unknown function> + 0x1ca984 (0x55652926a984 in /opt/conda/bin/python)
frame #15: <unknown function> + 0x11f906 (0x5565291bf906 in /opt/conda/bin/python)
frame #16: <unknown function> + 0x12bc96 (0x5565291cbc96 in /opt/conda/bin/python)
frame #17: <unknown function> + 0x12bc4c (0x5565291cbc4c in /opt/conda/bin/python)
frame #18: <unknown function> + 0x154ec8 (0x5565291f4ec8 in /opt/conda/bin/python)
frame #19: PyDict_SetItemString + 0x87 (0x5565291f6127 in /opt/conda/bin/python)
frame #20: PyImport_Cleanup + 0x9a (0x5565292f65aa in /opt/conda/bin/python)
frame #21: Py_FinalizeEx + 0x7d (0x5565292f694d in /opt/conda/bin/python)
frame #22: Py_RunMain + 0x110 (0x5565292f77f0 in /opt/conda/bin/python)
frame #23: Py_BytesMain + 0x39 (0x5565292f7979 in /opt/conda/bin/python)
frame #24: __libc_start_main + 0xf3 (0x7fd1e12bf0b3 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #25: <unknown function> + 0x1e7185 (0x556529287185 in /opt/conda/bin/python)

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size

0%| | 0/1195 [00:00<?, ?it/s]Killing subprocess 20868
Killing subprocess 20869
Killing subprocess 20870
Killing subprocess 20871
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', 'train.py', '--local_rank=3', '--resume', 'runs/train/exp/weights/last.pt']' died with <Signals.SIGABRT: 6>.
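
Note: CUDA kernel launches are asynchronous, so the Python line reported above may not be the one whose kernel actually failed. A standard PyTorch debugging step (not mentioned in the original thread) is to re-run with synchronous launches so the traceback points at the failing op:

Command: CUDA_LAUNCH_BLOCKING=1 python -m torch.distributed.launch --nproc_per_node 4 train.py --resume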

@alicera alicera added the bug Something isn't working label Jun 23, 2021
glenn-jocher (Member) commented Jun 23, 2021

@alicera for Multi-GPU training it's recommended to use an even GPU count (2, 4, 8) and identical GPU models.
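
One possible workaround (a sketch, assuming CUDA:0 and CUDA:1 are two of the identical 1080 Tis shown in the device list above): hide the mismatched TITAN X with CUDA_VISIBLE_DEVICES and launch on an even number of identical cards:

Command: CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node 2 train.py --resume

The visible devices are renumbered from 0 inside the process, so --nproc_per_node must match the number of GPUs listed in CUDA_VISIBLE_DEVICES.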

alicera (Author) commented Jun 23, 2021

But the TITAN X and the 1080 Ti should be usable at the same time: I once trained for 200 epochs on this mix and no error happened.
Do you have any other ideas?

glenn-jocher (Member) commented Jun 23, 2021

@alicera well for starters Ultralytics will never be able to reproduce this error on this hardware combination, so there's no action for us to take.

We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible that still produces the same problem
  • Complete – Provide all parts someone else needs to reproduce your problem in the question itself
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

In addition to the above requirements, for Ultralytics to provide assistance your code should be:

  • Current – Verify that your code is up-to-date with current GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been resolved by previous commits (see the example commands after this list).
  • Unmodified – Your problem must be reproducible without any modifications to the codebase in this repository. Ultralytics does not provide support for custom code ⚠️.
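
For example, a minimal update sketch (the repository URL is an assumption inferred from context, not stated in this thread):

Command: git pull        (inside an existing clone)
Command: git clone https://github.com/ultralytics/yolov5        (to fetch a fresh copy)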

If you believe your problem meets all of the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template and providing a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

alicera (Author) commented Jun 23, 2021

Thank you! 😃

@alicera alicera closed this as completed Jun 23, 2021