
RuntimeError: CUDA error: invalid device function #982

Open

tigerccx opened this issue Oct 20, 2020 · 4 comments

I am trying to run this GitHub project and I encountered a CUDA error with apex.

```
Traceback (most recent call last):
  File "train_AEI.py", line 132, in <module>
    scaled_loss.backward()
  File "/home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/contextlib.py", line 119, in __exit__
    next(self.gen)
  File "/home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/apex/amp/handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "/home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 249, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "/home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/apex/amp/_process_optimizer.py", line 128, in post_backward_models_are_masters
    scale_override=grads_have_scale/out_scale)
  File "/home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/apex/amp/scaler.py", line 117, in unscale
    1./scale)
  File "/home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 30, in __call__
    *args)
RuntimeError: CUDA error: invalid device function (multi_tensor_apply at csrc/multi_tensor_apply.cuh:111)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f7679444193 in /home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: void multi_tensor_apply<2, ScaleFunctor<float, float>, float>(int, int, at::Tensor const&, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > > const&, ScaleFunctor<float, float>, float) + 0x1270 (0x7f7668c39ce0 in /home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #2: multi_tensor_scale_cuda(int, at::Tensor, std::vector<std::vector<at::Tensor, std::allocator<at::Tensor> >, std::allocator<std::vector<at::Tensor, std::allocator<at::Tensor> > > >, float) + 0x829 (0x7f7668c37c99 in /home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #3: <unknown function> + 0x25e5a (0x7f7668c27e5a in /home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)
frame #4: <unknown function> + 0x1f641 (0x7f7668c21641 in /home/ivdai/anaconda3/envs/ccx_test0/lib/python3.7/site-packages/amp_C.cpython-37m-x86_64-linux-gnu.so)

frame #35: __libc_start_main + 0xf0 (0x7f767da69840 in /lib/x86_64-linux-gnu/libc.so.6)

Segmentation fault (core dumped)
```

What could be the problem?

@ptrblck
Contributor

ptrblck commented Oct 28, 2020

You might be using an older apex version, which didn't have the device guards for multi_tensor_apply.
Note that we recommend using the native mixed-precision implementation as explained here.
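For reference, a minimal sketch of the native torch.cuda.amp pattern being recommended (requires PyTorch 1.6 or newer); the toy model and data below are placeholders, not taken from the project in the traceback:

```python
import torch
from torch import nn
from torch.cuda.amp import autocast, GradScaler

# Toy model and data purely for illustration.
model = nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()
data = torch.randn(8, 16, device="cuda")
target = torch.randn(8, 4, device="cuda")

scaler = GradScaler()

for _ in range(10):
    optimizer.zero_grad()
    with autocast():                   # forward pass runs in mixed precision
        loss = criterion(model(data), target)
    scaler.scale(loss).backward()      # scale the loss to avoid fp16 gradient underflow
    scaler.step(optimizer)             # unscales gradients, skips the step on inf/NaN
    scaler.update()                    # adjusts the loss scale for the next iteration
```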

@tigerccx
Author

@ptrblck Thank you for your answer. I followed the installation guide. Where can I get a newer version of apex? Or could it be that, because I am using CUDA 10.0, apex was automatically compiled into an older version?

@ptrblck
Contributor

ptrblck commented Oct 29, 2020

You should get an error if you try to compile apex with a CUDA version different from the one used to build PyTorch.
However, native mixed-precision training works out of the box in PyTorch without building a third-party package.
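As a quick sanity check (a sketch, not part of the original reply), the CUDA version PyTorch was built against can be printed and compared with the local nvcc -V output before building apex:

```python
import torch

# Versions PyTorch itself reports; torch.version.cuda should match the
# toolkit (nvcc -V) used to build apex.
print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))
```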

@tigerccx
Author

tigerccx commented Oct 30, 2020

@ptrblck My server has CUDA 10.0 installed (as shown by nvcc -V) and PyTorch 1.4.0+cu100, so the versions should match. Unfortunately, it is not possible for me to update CUDA, so I cannot access a newer version of PyTorch with the torch.cuda.amp integration.
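For context, the apex.amp pattern implied by the traceback above looks roughly like this; the toy model, data, and opt_level="O1" are assumptions for illustration, not taken from train_AEI.py:

```python
import torch
from torch import nn
from apex import amp

# Toy model and data purely for illustration; opt_level="O1" is an assumption,
# the project in question may use a different setting.
model = nn.Linear(16, 4).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

loss = nn.functional.mse_loss(model(torch.randn(8, 16, device="cuda")),
                              torch.randn(8, 4, device="cuda"))
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # the error in the traceback above is raised when this
                            # block exits and apex unscales the gradients
optimizer.step()
```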
