'RuntimeError: GET was unable to find an engine to execute this computation' #43

Open
VikasAmaraneni opened this issue Apr 12, 2024 · 3 comments

Comments

@VikasAmaraneni

Hello Everyone,
I'm using PyTorch 2.2.1 with CUDA 12.1 and Python 3.12.2, and I'm getting the following error:

'RuntimeError: RuntimeError Traceback (most recent call last)
Cell In[16], line 47
45 num_epochs = 10
46 for epoch in range(num_epochs):
---> 47 train_loss, train_time = train(model, train_loader, criterion, optimizer)
48 val_loss, val_accuracy, val_time = validate(model, val_loader, criterion)
49 print(f'Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss:.4f}, Train Time: {train_time:.2f}s, '
50 f'Val Loss: {val_loss:.4f}, Val Accuracy: {val_accuracy:.4f}, Val Time: {val_time:.2f}s')

Cell In[16], line 13, in train(model, train_loader, criterion, optimizer)
11 outputs = model(inputs)
12 loss = criterion(outputs, labels) # Calculate loss between model outputs and ground truth
---> 13 loss.backward()
14 optimizer.step()
15 running_loss += loss.item() * inputs.size(0) # Update running loss

File ~/.conda/envs/torchTest1/lib/python3.12/site-packages/torch/_tensor.py:522, in Tensor.backward(self, gradient, retain_graph, create_graph, inputs)
512 if has_torch_function_unary(self):
513 return handle_torch_function(
514 Tensor.backward,
515 (self,),
(...)
520 inputs=inputs,
521 )
--> 522 torch.autograd.backward(
523 self, gradient, retain_graph, create_graph, inputs=inputs
524 )

File ~/.conda/envs/torchTest1/lib/python3.12/site-packages/torch/autograd/__init__.py:266, in backward(tensors, grad_tensors, retain_graph, create_graph, grad_variables, inputs)
261 retain_graph = create_graph
263 # The reason we repeat the same comment below is that
264 # some Python versions print out the first line of a multi-line function
265 # calls in the traceback and some print out the last line
--> 266 Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
267 tensors,
268 grad_tensors_,
269 retain_graph,
270 create_graph,
271 inputs,
272 allow_unreachable=True,
273 accumulate_grad=True,
274 )

RuntimeError: GET was unable to find an engine to execute this computation'

Originally posted by @VikasAmaraneni in ultralytics/ultralytics#4060 (comment)

@shuyueW1991

Hi there. I fixed a similar problem by matching the versions of torch, torchvision, and torchaudio to the combinations listed on the official PyTorch release page. One combination that worked for me is:
torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0
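
To see whether an environment already matches the combination above, something like the following might help. It is only a minimal sketch: the helper names `installed_versions` and `mismatches` are made up for illustration, and the `EXPECTED` set is just the one quoted in this comment, not the only valid combination.

```python
from importlib import metadata

# Version set quoted in the comment above; other combinations may also work.
EXPECTED = {"torch": "2.1.0", "torchvision": "0.16.0", "torchaudio": "2.1.0"}


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def mismatches(installed, expected):
    """Return {name: (installed, expected)} for packages that differ."""
    result = {}
    for name, want in expected.items():
        have = installed.get(name)
        # Drop local build tags such as '+cu121' before comparing.
        base = have.split("+")[0] if have else None
        if base != want:
            result[name] = (have, want)
    return result


if __name__ == "__main__":
    # An empty dict means all three packages match the expected set.
    print(mismatches(installed_versions(EXPECTED), EXPECTED))
```

An empty result means the three packages line up with the quoted set; any entry shows the installed version next to the expected one.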

@VikasAmaraneni
Author

Thank you so much, it worked.

@shuyueW1991

I ran into the problem again. I now think the real fix is not matching the versions of torch, torchvision, and torchaudio. The solution should be:

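  1. Run echo $LD_LIBRARY_PATH to see which directories are searched for shared libraries.
  2. Go to the directory listed there that holds the cuDNN libraries.
  3. Rename the problematic libcudnn_cnn_train.so.8 (or whichever file the error message mentions), so the original is kept only as a backup copy under a different name.
  4. Now the system no longer picks up CUDA/cuDNN from the path in this env var. The underlying reason is that torch brings its own CUDA/cuDNN, and we need those copies to be the ones that get loaded.
  5. Done.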
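
Steps 1 and 2 above can be sketched in Python as a read-only check that lists any cuDNN libraries reachable through LD_LIBRARY_PATH. This is an illustration only: `find_cudnn_libs` is a hypothetical helper name, the `libcudnn*.so*` pattern is an assumption about how the files are named, and the script renames nothing.

```python
import os
from pathlib import Path


def find_cudnn_libs(ld_library_path):
    """Return cuDNN shared libraries found in the given search path string."""
    found = []
    for entry in ld_library_path.split(os.pathsep):
        if not entry:
            continue  # skip empty entries such as a trailing ':'
        directory = Path(entry)
        if directory.is_dir():
            # Assumed naming pattern, e.g. libcudnn_cnn_train.so.8
            found.extend(sorted(str(p) for p in directory.glob("libcudnn*.so*")))
    return found


if __name__ == "__main__":
    # Print every system-level cuDNN library that could shadow torch's own copy.
    for lib in find_cudnn_libs(os.environ.get("LD_LIBRARY_PATH", "")):
        print(lib)
```

If the script prints nothing, no cuDNN library is visible through LD_LIBRARY_PATH, and the cause of the error probably lies elsewhere.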
