
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm #5064

Closed
nelson-liu opened this issue Mar 19, 2021 · 17 comments

@nelson-liu
Contributor

nelson-liu commented Mar 19, 2021

Checklist

  • I have verified that the issue exists against the master branch of AllenNLP.
  • I have read the relevant section in the contribution guide on reporting bugs.
  • I have checked the issues list for similar or identical bug reports.
  • I have checked the pull requests list for existing proposed fixes.
  • I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the master branch.
  • I have included in the "Description" section below a traceback from any exceptions related to this bug.
  • I have included in the "Related issues or possible duplicates" section beloew all related issues and possible duplicate issues (If there are none, check this box anyway).
  • I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
  • I have included in the "Environment" section below the output of pip freeze.
  • I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

When I train RoBERTa (or BERT, but let's just stick with RoBERTa in this issue in the interest of simplicity) on MNLI, I get an odd CUDA error.

  File "/opt/conda/lib/python3.7/site-packages/allennlp/models/basic_classifier.py", line 116,[26/1829$
rd
    embedded_text = self._text_field_embedder(tokens)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/allennlp/modules/text_field_embedders/basic_text_field_e
mbedder.py", line 103, in forward
    token_vectors = embedder(**tensors, **forward_params_values)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/allennlp/modules/token_embedders/pretrained_transformer_
embedder.py", line 201, in forward
    transformer_output = self.transformer_model(**parameters)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 8
22, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 5
15, in forward
    output_attentions,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 4
36, in forward
    self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
  File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_utils.py", line 1819, in apply_chu
nking_to_forward
    return forward_fn(*input_tensors)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 4
47, in feed_forward_chunk
    intermediate_output = self.intermediate(attention_output)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 3
48, in forward
    hidden_states = self.dense(hidden_states)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m
, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Environment

I made a docker image that reproduces this issue at https://hub.docker.com/r/nfliu/torch1.8.0-sgemm-execution-debugging . The associated Dockerfile is https://gist.github.com/nelson-liu/f80d76f5557d48f2a52b2082b1bf86da . In short, it is based on the NVIDIA CUDA 11.1 container and installs allennlp and allennlp-models from their most recent commits, plus PyTorch 1.8.0+cu111. The Python version is 3.7.
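
If it's useful, a quick way to confirm what PyTorch actually sees inside the container (nothing specific to my setup, just standard torch introspection):

import torch

# Confirm the PyTorch build, the CUDA toolkit it was compiled against,
# and the GPU it will run on.
print("torch version:", torch.__version__)      # expect 1.8.0+cu111
print("built with CUDA:", torch.version.cuda)   # expect 11.1
print("device:", torch.cuda.get_device_name(0))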

Here's the output of nvidia-smi (for things like driver version, etc)

$ nvidia-smi
Fri Mar 19 13:26:08 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN Xp            On   | 00000000:5E:00.0 Off |                  N/A |
| 23%   26C    P8     9W / 250W |      1MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Steps to reproduce

  1. Go to a machine with a Titan Xp and a driver that supports CUDA 11.1
  2. nvidia-docker run --rm -it nfliu/torch1.8.0-sgemm-execution-debugging
  3. allennlp train https://gist.githubusercontent.com/nelson-liu/2164bb51097c5a8f9f9e8d7784f8473e/raw/ce93da75558489177556355c8d54ca4949417b8b/roberta_base_mnli.jsonnet -s output
  4. You should see the error above within the first epoch. If not, it'd be great to know that you can't reproduce the issue.

The config is at https://gist.github.com/nelson-liu/2164bb51097c5a8f9f9e8d7784f8473e ; it's exactly the same as the RoBERTa MNLI config, except that I'm using RoBERTa-base and a batch size of 8, since the Titan Xp has a bit less memory.

@nelson-liu nelson-liu added the bug label Mar 19, 2021
@nelson-liu
Contributor Author

I should also note that I've seen this when running the example on a GTX Titan X as well, so it doesn't seem to be an issue isolated to that particular GPU (or a case of that card having gone bad). I could not reproduce the issue when running on a Titan RTX.

@nelson-liu
Contributor Author

This error doesn't seem to occur if I use the Hugging Face Transformers run_glue script.
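
For anyone trying to narrow this down further, a stripped-down loop along these lines exercises the same forward/backward path outside AllenNLP (just a sketch using random token IDs, not what run_glue or my config actually do):

import torch
from transformers import RobertaModel

# Push random batches through roberta-base repeatedly to see whether the
# cublasSgemm failure shows up without AllenNLP in the picture.
model = RobertaModel.from_pretrained("roberta-base").cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for step in range(1000):
    input_ids = torch.randint(0, model.config.vocab_size, (8, 128)).cuda()
    hidden = model(input_ids=input_ids).last_hidden_state
    loss = hidden.mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()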

@AkshitaB AkshitaB self-assigned this Mar 26, 2021
@nelson-liu
Contributor Author

@AkshitaB I saw you self-assigned this issue. Did you get a chance to see if you can repro on a Titan Xp with the above Docker image?

@dirkgr
Member

dirkgr commented Apr 2, 2021

I assume this also doesn't happen when you run on CPU? Usually you get better error messages on CPU.
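
If it helps, forcing CPU should just be a matter of the trainer's cuda_device, e.g. something like (untested, off the top of my head):

allennlp train roberta_base_mnli.jsonnet -s output_cpu --overrides '{"trainer.cuda_device": -1}'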

@dirkgr
Member

dirkgr commented Apr 2, 2021

We don't have any Titans anymore outside the vision team. I'm trying to reproduce it on allennlp-server4, which has Quadro RTX.

@dirkgr
Member

dirkgr commented Apr 2, 2021

@nelson-liu, if I don't reproduce it, can you set the CUDA_LAUNCH_BLOCKING=1 environment variable and see if you get a better error?
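
For example, prefixing the training command from the repro steps:

CUDA_LAUNCH_BLOCKING=1 allennlp train https://gist.githubusercontent.com/nelson-liu/2164bb51097c5a8f9f9e8d7784f8473e/raw/ce93da75558489177556355c8d54ca4949417b8b/roberta_base_mnli.jsonnet -s output

That should make the traceback point at the kernel that actually failed, rather than at whatever op happened to synchronize afterwards.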

If this is a general problem with AllenNLP, it becomes my top priority. But I'm not keen on debugging a specific bug in CUDA that only occurs with a specific combination of CUDA and GPU. Can you run an older version of CUDA, or on a different GPU, to get around this problem?

@dirkgr dirkgr self-assigned this Apr 2, 2021
@nelson-liu
Contributor Author

nelson-liu commented Apr 2, 2021

We don't have any Titans anymore outside the vision team. I'm trying to reproduce it on allennlp-server4, which has Quadro RTX.

I think that should work... my suspicion is that it'll fail if the compute capability is 6.1 or below (https://developer.nvidia.com/cuda-gpus). It works on a Titan RTX, and Quadro RTX cards have the same compute capability of 7.5.
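
A quick way to check what any given node reports (a standard torch call, nothing AllenNLP-specific):

python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"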

If this is a general problem with AllenNLP, it becomes my top priority. But I'm not keen on debugging a specific bug in CUDA that only occurs with a specific combination of CUDA and GPU.

I managed to reproduce this issue on a Titan Xp, a K40, and a GTX Titan X. I haven't tried any other GPUs, but I'll gather some more data and also try running with CUDA_LAUNCH_BLOCKING.

Can you run an older version of CUDA, or on a different GPU, to get around this problem?

So the context here is that I'm in a heterogeneous cluster environment, where the nodes have anything from a Titan Xp (the majority of nodes) to a K40, GTX Titan X, Titan RTX, RTX 3080, Titan V, or 2080 Ti. It'd be great if I could run my jobs on any node (it would certainly speed things up). More generally, a lot of users are still on these older GPUs, and I just wanted to see if the issue was reproducible outside of my organization.

@dirkgr
Member

dirkgr commented Apr 2, 2021

It's independent of the CUDA version then? Just depends on the compute capability?

@nelson-liu
Contributor Author

nelson-liu commented Apr 2, 2021

Ah, the issue is that CUDA 11+ is the only CUDA version (with a PyTorch release) that supports all of these GPUs, which is why I'm using it in particular. I haven't tried CUDA 10.2; I'll give it a shot.

@nelson-liu
Contributor Author

Another possibly relevant hint: sometimes the error I get is RuntimeError: CUDA error: an illegal memory access was encountered instead.

The CUDA 11.1 release notes mention a fixed issue in cuBLAS: "Fixed an issue that caused an Address out of bounds error when calling cublasSgemm()." Some people have reported that using 11.2 fixed this: NVIDIA/apex#580 (comment). However, that fix might not be usable from PyTorch, since PyTorch doesn't support 11.2 (pytorch/pytorch#50232 (comment)), and it seems like they might just wait for 11.3 (pytorch/pytorch#54246 (review)).

Given that others are seeing this as well (pytorch/pytorch#53957), maybe this is just an issue with PyTorch. It'd be nice to hear whether you're able to reproduce it on a Titan Xp or an older GPU, though, just so we can verify that that's the case.

@dirkgr
Member

dirkgr commented Apr 5, 2021

I'm trying this now.

@dirkgr
Member

dirkgr commented Apr 5, 2021

Turns out I don't have a server that has a card this old but also an NVIDIA driver recent enough to run CUDA 11.

@nelson-liu
Contributor Author

Ah, no worries then. Thanks for looking into it regardless; hopefully there's more info on the PyTorch upstream side. I'll close this for now; maybe it'll be useful for wayward Google wanderers.

@dirkgr
Member

dirkgr commented Apr 7, 2021

I've asked about getting the drivers updated. If I hear more, I'll let you know. Also, if this problem pops up in other contexts, let us know. If it's something we can fix, or at least work around, we should fix it.

@nelson-liu
Contributor Author

FWIW, this seems to be fixed with the 1.9.0 nightly.
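
For anyone else hitting this, the cu111 nightly can be installed with something along these lines (this may have changed since, so check pytorch.org for the current command):

pip install --pre torch -f https://download.pytorch.org/whl/nightly/cu111/torch_nightly.html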

@dirkgr
Member

dirkgr commented May 24, 2021

Took a while, but I'm always super happy when PyTorch fixes a bug for us :-)

@jasonyoun

I had the same issue with PyTorch 1.8.0 and CUDA 11.1. For anyone else in the same boat: updating to the 1.9.0 nightly fixed the issue for me as well.
