
Cannot install apex on a machine with CUDA 12.2 #1761

Open
momo1986 opened this issue Dec 21, 2023 · 6 comments
Labels
bug Something isn't working

Comments

@momo1986

Describe the Bug

Minimal Steps/Code to Reproduce the Bug
running script:
"python setup.py install --cpp_ext --cuda_ext"

The reporting log:
"torch.version = 2.1.2+cu121

Compiling cuda extensions with
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
from /usr/bin

Traceback (most recent call last):
File "/home/hwq/ray/adversarial_examples/apex/setup.py", line 178, in
check_cuda_torch_binary_vs_bare_metal(CUDA_HOME)
File "/home/hwq/ray/adversarial_examples/apex/setup.py", line 40, in check_cuda_torch_binary_vs_bare_metal
raise RuntimeError(
RuntimeError: Cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries. Pytorch binaries were compiled with Cuda 12.1.
In some cases, a minor-version mismatch will not cause later errors: #323 (comment). You can try commenting out this check (at your own risk)."

The machine's CUDA version (per nvidia-smi) is 12.2.
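For reference, the log above shows that the nvcc being picked up is the 11.5 one in /usr/bin, while PyTorch 2.1.2+cu121 was built against CUDA 12.1, which is exactly what the check complains about. A quick diagnostic (a minimal sketch, assuming it is run in the same environment used for the build) to see which toolkit apex will compare against the PyTorch binary:

import os
import subprocess

import torch
from torch.utils.cpp_extension import CUDA_HOME

# CUDA version the installed PyTorch wheel was built with (e.g. "12.1" for 2.1.2+cu121)
print("torch.version.cuda:", torch.version.cuda)

# Toolkit location torch's extension builder (and hence apex's setup.py) will use;
# if this resolves to the 11.5 toolkit, point CUDA_HOME at a CUDA 12.x install instead.
print("CUDA_HOME:", CUDA_HOME)

if CUDA_HOME is not None:
    nvcc = os.path.join(CUDA_HOME, "bin", "nvcc")
    print(subprocess.run([nvcc, "-V"], capture_output=True, text=True).stdout)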

Expected Behavior
Apex installs successfully.
Environment
uname -a
Linux ps 6.2.0-36-generic #37~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Mon Oct 9 15:34:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
nvidia-smi
Fri Dec 22 00:15:43 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |

@momo1986 momo1986 added the bug Something isn't working label Dec 21, 2023
@foreverpiano

same issue

@cs-wangfeng

cs-wangfeng commented Feb 3, 2024

Similar issue:

My machine also has CUDA 12.2, and installing apex directly results in the same error as above.

Then I switched to a conda virtual environment with CUDA 11.3 and a matching PyTorch build (PyTorch 1.10 compiled against CUDA 11.3). After that, pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ installs apex successfully. However, when running my code, the following error occurs:

Traceback (most recent call last):
  File ".../VALOR/./train.py", line 88, in <module>
    main(args)
  File ".../VALOR/./train.py", line 55, in main
    model = VALOR.from_pretrained(opts,checkpoint)
  File ".../VALOR/model/modeling.py", line 109, in from_pretrained
    model = cls(opts, *inputs, **kwargs)
  File ".../VALOR/model/pretrain.py", line 67, in __init__
    super().__init__(opts)
  File ".../VALOR/model/modeling.py", line 328, in __init__
    self.load_ast_model(base_cfg,config)
  File ".../VALOR/model/modeling.py", line 609, in load_ast_model
    self.audio_encoder = TransformerEncoder(model_cfg_audio, mode='prenorm')
  File ".../VALOR/model/transformer.py", line 149, in __init__
    layer = TransformerLayer(config, mode)
  File ".../VALOR/model/transformer.py", line 62, in __init__
    self.layernorm1 = LayerNorm(config.hidden_size, eps=1e-12)
  File ".../anaconda3/envs/valor1/lib/python3.9/site-packages/apex/normalization/fused_layer_norm.py", line 268, in __init__
    fused_layer_norm_cuda = importlib.import_module("fused_layer_norm_cuda")
  File ".../anaconda3/envs/valor1/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 984, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

The README says the --cuda_ext option builds fused_layer_norm_cuda, but that doesn't seem to work here.
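As a quick way to confirm whether the CUDA extensions were actually built (a minimal sketch, assuming apex was installed into the active environment; amp_C is another extension the --cuda_ext build typically produces):

import importlib

# If these imports fail, only the Python-level apex package was installed and the
# --cuda_ext build step did not take effect (or was built in a different environment).
for ext_name in ("fused_layer_norm_cuda", "amp_C"):
    try:
        importlib.import_module(ext_name)
        print(ext_name, "is available")
    except ImportError as exc:
        print(ext_name, "is missing:", exc)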

@Tsuki0125

same issue:
File "", line 994, in _gcd_import
File "", line 971, in _find_and_load
File "", line 953, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

@Zhangwq76

I think you can comment out the version check in setup.py, then use
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

@adafok

adafok commented May 25, 2024

I've encountered the same issue. @Zhangwq76, could you tell us which part of the check code we should remove?

@Zhangwq76


Comment out the block around line 39, in check_cuda_torch_binary_vs_bare_metal:
# if (bare_metal_version != torch_binary_version):
# raise RuntimeError(
# "Cuda extensions are being compiled with a version of Cuda that does "
# "not match the version used to compile Pytorch binaries. "
# "Pytorch binaries were compiled with Cuda {}.\n".format(torch.version.cuda)
# + "In some cases, a minor-version mismatch will not cause later errors: "
# "#323 (comment). "
# "You can try commenting out this check (at your own risk)."
# )
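Rather than deleting the check entirely, a softer option (a sketch, not apex's upstream code; the function name and arguments below are illustrative) is to fail only on a CUDA major-version mismatch and just warn on a minor one, which is usually compatible. Another alternative is to point CUDA_HOME at a toolkit whose version matches torch.version.cuda before building.

import warnings
from packaging.version import Version

def relaxed_cuda_version_check(bare_metal: str, torch_binary: str) -> None:
    # bare_metal comes from "nvcc -V" (e.g. "11.5"), torch_binary from torch.version.cuda (e.g. "12.1")
    bm, tb = Version(bare_metal), Version(torch_binary)
    if bm.major != tb.major:
        raise RuntimeError(
            f"CUDA major-version mismatch: nvcc is {bm}, but PyTorch was built with CUDA {tb}."
        )
    if bm.minor != tb.minor:
        warnings.warn(
            f"Minor CUDA version mismatch ({bm} vs {tb}); this usually builds fine, "
            "but proceed at your own risk."
        )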
