
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasSgemm #5064

Closed
nelson-liu opened this issue Mar 19, 2021 · 17 comments

@nelson-liu
Contributor

nelson-liu commented Mar 19, 2021

Checklist

  • I have verified that the issue exists against the master branch of AllenNLP.
  • I have read the relevant section in the contribution guide on reporting bugs.
  • I have checked the issues list for similar or identical bug reports.
  • I have checked the pull requests list for existing proposed fixes.
  • I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the master branch.
  • I have included in the "Description" section below a traceback from any exceptions related to this bug.
  • I have included in the "Related issues or possible duplicates" section beloew all related issues and possible duplicate issues (If there are none, check this box anyway).
  • I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
  • I have included in the "Environment" section below the output of pip freeze.
  • I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

When I train RoBERTa (or BERT, but let's just stick with RoBERTa in this issue in the interest of simplicity) on MNLI, I get an odd CUDA error.

  File "/opt/conda/lib/python3.7/site-packages/allennlp/models/basic_classifier.py", line 116,[26/1829$
rd
    embedded_text = self._text_field_embedder(tokens)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/allennlp/modules/text_field_embedders/basic_text_field_e
mbedder.py", line 103, in forward
    token_vectors = embedder(**tensors, **forward_params_values)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/allennlp/modules/token_embedders/pretrained_transformer_
embedder.py", line 201, in forward
    transformer_output = self.transformer_model(**parameters)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 8
22, in forward
    return_dict=return_dict,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 5
15, in forward
    output_attentions,
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 4
36, in forward
    self.feed_forward_chunk, self.chunk_size_feed_forward, self.seq_len_dim, attention_output
  File "/opt/conda/lib/python3.7/site-packages/transformers/modeling_utils.py", line 1819, in apply_chu
nking_to_forward
    return forward_fn(*input_tensors)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 4
47, in feed_forward_chunk
    intermediate_output = self.intermediate(attention_output)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/models/roberta/modeling_roberta.py", line 3
48, in forward
    hidden_states = self.dense(hidden_states)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 94, in forward
    return F.linear(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m
, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Environment

I made a docker image that reproduces this issue at https://hub.docker.com/r/nfliu/torch1.8.0-sgemm-execution-debugging . The associated Dockerfile is https://gist.github.com/nelson-liu/f80d76f5557d48f2a52b2082b1bf86da . In short, it is based on the NVIDIA CUDA 11.1 container and installs allennlp and allennlp-models from their most recent commits, plus PyTorch 1.8.0+cu111. The Python version is 3.7.
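
If it's useful, a quick way to confirm what PyTorch actually sees inside the container (nothing specific to my setup, just standard torch introspection):

import torch

# Confirm the PyTorch build, the CUDA toolkit it was compiled against,
# and the GPU it will run on.
print("torch version:", torch.__version__)      # expect 1.8.0+cu111
print("built with CUDA:", torch.version.cuda)   # expect 11.1
print("device:", torch.cuda.get_device_name(0))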

Here's the output of nvidia-smi (for things like driver version, etc)

$ nvidia-smi
Fri Mar 19 13:26:08 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.56       Driver Version: 460.56       CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN Xp            On   | 00000000:5E:00.0 Off |                  N/A |
| 23%   26C    P8     9W / 250W |      1MiB / 12196MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Steps to reproduce

  1. Go to a machine with a Titan Xp and a driver that supports CUDA 11.1
  2. nvidia-docker run --rm -it nfliu/torch1.8.0-sgemm-execution-debugging
  3. allennlp train https://gist.githubusercontent.com/nelson-liu/2164bb51097c5a8f9f9e8d7784f8473e/raw/ce93da75558489177556355c8d54ca4949417b8b/roberta_base_mnli.jsonnet -s output
  4. You should see the error above within the first epoch. If not, it'd be great to know that you can't reproduce the issue.

The config is at https://gist.github.com/nelson-liu/2164bb51097c5a8f9f9e8d7784f8473e ; it's exactly the same as the RoBERTa MNLI config, except that I'm using RoBERTa-base and a batch size of 8, since the Titan Xp has a bit less memory.

@nelson-liu nelson-liu added the bug label Mar 19, 2021
@nelson-liu
Contributor Author

I should also note that I've seen this when running the example on a GTX Titan X as well, so it doesn't seem to be an issue isolated to that particular GPU (or a case of that card having gone bad). I could not reproduce the issue when running on a Titan RTX.

@nelson-liu
Contributor Author

This error doesn't seem to occur if I use the Hugging Face Transformers run_glue script.
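
For anyone trying to narrow this down further, a stripped-down loop along these lines exercises the same forward/backward path outside AllenNLP (just a sketch using random token IDs, not what run_glue or my config actually do):

import torch
from transformers import RobertaModel

# Push random batches through roberta-base repeatedly to see whether the
# cublasSgemm failure shows up without AllenNLP in the picture.
model = RobertaModel.from_pretrained("roberta-base").cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for step in range(1000):
    input_ids = torch.randint(0, model.config.vocab_size, (8, 128)).cuda()
    hidden = model(input_ids=input_ids).last_hidden_state
    loss = hidden.mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()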

@AkshitaB AkshitaB self-assigned this Mar 26, 2021
@nelson-liu
Contributor Author

@AkshitaB I saw you self-assigned this issue. Did you get a chance to see if you can repro on a Titan Xp with the above Docker image?

@dirkgr
Member

dirkgr commented Apr 2, 2021

I assume this also doesn't happen when you run on CPU? Usually you get better error messages on CPU.
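
If it helps, forcing CPU should just be a matter of the trainer's cuda_device, e.g. something like (untested, off the top of my head):

allennlp train roberta_base_mnli.jsonnet -s output_cpu --overrides '{"trainer.cuda_device": -1}'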

@dirkgr
Member

dirkgr commented Apr 2, 2021

We don't have any Titans anymore outside the vision team. I'm trying to reproduce it on allennlp-server4, which has Quadro RTX.

@dirkgr
Member

dirkgr commented Apr 2, 2021

@nelson-liu, if I don't reproduce it, can you set the CUDA_LAUNCH_BLOCKING=1 environment variable and see if you get a better error?
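
For example, prefixing the training command from the repro steps:

CUDA_LAUNCH_BLOCKING=1 allennlp train https://gist.githubusercontent.com/nelson-liu/2164bb51097c5a8f9f9e8d7784f8473e/raw/ce93da75558489177556355c8d54ca4949417b8b/roberta_base_mnli.jsonnet -s output

That should make the traceback point at the kernel that actually failed, rather than at whatever op happened to synchronize afterwards.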

If this is a general problem with AllenNLP, it becomes my top priority. But I'm not keen on debugging a specific bug in CUDA that only occurs with a specific combination of CUDA and GPU. Can you run an older version of CUDA, or on a different GPU, to get around this problem?

@dirkgr dirkgr self-assigned this Apr 2, 2021
@nelson-liu
Contributor Author

nelson-liu commented Apr 2, 2021

We don't have any Titans anymore outside the vision team. I'm trying to reproduce it on allennlp-server4, which has Quadro RTX.

I think that should work... my suspicion is that it'll fail if the compute capability is 6.1 or below (https://developer.nvidia.com/cuda-gpus). It works on a Titan RTX, and Quadro RTX cards have the same compute capability of 7.5.
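
A quick way to check what any given node reports (a standard torch call, nothing AllenNLP-specific):

python -c "import torch; print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))"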

If this is a general problem with AllenNLP, it becomes my top priority. But I'm not keen on debugging a specific bug in CUDA that only occurs with a specific combination of CUDA and GPU.

I managed to reproduce this issue on a Titan Xp, a K40, and a GTX Titan X. I haven't tried any other GPUs, but I'll gather some more data and also try running with CUDA_LAUNCH_BLOCKING.

Can you run an older version of CUDA, or on a different GPU, to get around this problem?

So the context here is that I'm in a heterogeneous cluster environment, where the nodes have anything from a Titan Xp (the majority of nodes) to a K40, GTX Titan X, Titan RTX, RTX 3080, Titan V, or 2080 Ti. It'd be great if I could run my jobs on any node (it would certainly speed things up). More generally, a lot of users are still on these older GPUs, and I just wanted to see if the issue was reproducible outside of my organization.

@dirkgr
Member

dirkgr commented Apr 2, 2021

It's independent of the CUDA version then? Just depends on the compute capability?

@nelson-liu
Contributor Author

nelson-liu commented Apr 2, 2021

Ah, the issue is that CUDA 11+ is the only CUDA version (with a PyTorch release) that supports all of these GPUs, which is why I'm using it in particular. I haven't tried CUDA 10.2; I'll give it a shot.

@nelson-liu
Contributor Author

Another possibly relevant hint: sometimes the error I get is RuntimeError: CUDA error: an illegal memory access was encountered instead.

The CUDA 11.1 release notes mention a fixed issue in cuBLAS: "Fixed an issue that caused an Address out of bounds error when calling cublasSgemm()." Some people have reported that using 11.2 fixed this: NVIDIA/apex#580 (comment). However, that fix might not be usable from PyTorch, since PyTorch doesn't support 11.2 (pytorch/pytorch#50232 (comment)), and it seems like they might just wait for 11.3 (pytorch/pytorch#54246 (review)).

Given that others are seeing this as well (pytorch/pytorch#53957), maybe this is just an issue with PyTorch. It'd be nice to hear whether you're able to reproduce it on a Titan Xp or an older GPU, though, just so we can verify that that's the case.

@dirkgr
Member

dirkgr commented Apr 5, 2021

I'm trying this now.

@dirkgr
Member

dirkgr commented Apr 5, 2021

Turns out I don't have a server that has a card this old but also an NVIDIA driver recent enough to run CUDA 11.

@nelson-liu
Contributor Author

Ah, no worries then. Thanks for looking into it regardless; hopefully there's more info on the PyTorch upstream side. I'll close this for now; maybe it'll be useful for wayward Google wanderers.

@dirkgr
Member

dirkgr commented Apr 7, 2021

I've asked about getting the drivers updated. If I hear more, I'll let you know. Also, if this problem pops up in other contexts, let us know. If it's something we can fix, or at least work around, we should fix it.

@nelson-liu
Contributor Author

FWIW, this seems to be fixed with the 1.9.0 nightly.
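
For anyone else hitting this, the cu111 nightly can be installed with something along these lines (this may have changed since, so check pytorch.org for the current command):

pip install --pre torch -f https://download.pytorch.org/whl/nightly/cu111/torch_nightly.html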

@dirkgr
Member

dirkgr commented May 24, 2021

Took a while, but I'm always super happy when PyTorch fixes a bug for us :-)

@jasonyoun

I had the same issue with PyTorch 1.8.0 and CUDA 11.1. For anyone else in the same boat: updating to the 1.9.0 nightly fixed the issue for me as well.
