RTC models do not compile for unknown future CUDA Architectures #844

Closed
ptheywood opened this issue May 3, 2022 · 2 comments · Fixed by #845

@ptheywood

RTC models are compiled for the device's compute capability, e.g. when running on a consumer Ampere GPU, NVRTC is passed --gpu-architecture=compute_86.
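
As a rough illustration of how such an option could be assembled from a device's compute capability, here is a minimal sketch. The helper name `archFlag` is hypothetical and is not taken from the project's source; only the flag format comes from the example above.

```cpp
#include <string>

// Hypothetical helper (not from the project): builds the NVRTC architecture
// option from a compute capability, e.g. (8, 6) gives
// "--gpu-architecture=compute_86".
std::string archFlag(int major, int minor) {
    return "--gpu-architecture=compute_" + std::to_string(major * 10 + minor);
}
```

In a real application the major and minor values would typically come from the CUDA runtime's device properties.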

However, if the version of NVRTC in use does not know about the GPU architecture, compilation fails, and the user can do nothing about it (other than use a more recent NVRTC).

This means that RTC models will not run on newer GPUs (even without using newer features), unlike non-RTC models, which will (via PTX embedding / JIT compilation).

To reproduce this: CUDA 11.0 knows SM_80 but not SM_86, so attempting to run a CUDA 11.0 RTC model on a consumer Ampere GPU fails during RTC compilation, with an error such as:

Compiler options: --gpu-architecture=compute_86 --generate-line-info -DNDEBUG --std=c++17 --define-macro=SEATBELTS=0 --pre-include=/usr/local/cuda-11.0/include//cuda.h 
@ptheywood ptheywood added the bug label May 3, 2022
@ptheywood

The fix for this is to use nvrtcGetNumSupportedArchs and nvrtcGetSupportedArchs (docs) to find the architectures supported by the current NVRTC, and only pass the device's specific architecture if it is in the list of supported architectures.

If it is not in the list, passing the latest arch that is supported should work (i.e. the last value returned by nvrtcGetSupportedArchs).
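
The selection rule described above might be sketched as follows. This is only an illustration: `pickArch` is a hypothetical name, and the `supported` vector stands in for the values that nvrtcGetSupportedArchs would report (assumed here to be in ascending order), so the sketch does not call NVRTC itself.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper. `supported` stands in for the architectures reported
// by nvrtcGetSupportedArchs, assumed to be in ascending order.
// Returns deviceArch if it is in the list, otherwise the newest listed
// architecture. Returns 0 for an empty list, meaning "do not set the flag".
int pickArch(const std::vector<int>& supported, int deviceArch) {
    if (supported.empty()) {
        return 0;
    }
    if (std::find(supported.begin(), supported.end(), deviceArch) != supported.end()) {
        return deviceArch;
    }
    return supported.back();
}
```

With a list ending in 80 and an SM_86 device, this would select 80, which the device can still run via JIT compilation of the generated PTX.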

@ptheywood

ptheywood commented May 6, 2022

nvrtcGetNumSupportedArchs and nvrtcGetSupportedArchs were introduced in CUDA 11.0, so are not available in (the deprecated but not yet removed) CUDA 10.x.

In this case we have no idea which CUDA architectures would work (other than the minimum configured at CMake time), so the only safe thing to do is to not set the gencode if CUDA < 11.0.

As CUDA 10.x support is deprecated and can be removed at any time now, that is an easier option than worrying about a workaround.

Edit:
nvrtcGetNumSupportedArchs and nvrtcGetSupportedArchs were actually introduced in CUDA 11.2, so they are not available in CUDA 11.1 and older.

Support for these older CUDA versions could be:

  • Don't set the gencode if we can't query if it exists
  • Try an nvrtc compilation with the current gencode, if it fails, don't set a gencode
  • Hardcode the earliest and latest supported version based on the CUDA version macro(s) and / or the nvrtc version.
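
The third option could look roughly like the sketch below. The function name `hardcodedArchRange` is hypothetical, and the version argument uses the `CUDA_VERSION` macro encoding (major * 1000 + minor * 10). Apart from the CUDA 11.0 upper bound of 80 stated earlier in this issue, the listed values are assumptions that should be checked against the CUDA release notes.

```cpp
#include <utility>

// Hypothetical fallback table: {earliest, latest} compute capability assumed
// to be accepted by NVRTC for a given CUDA version. {0, 0} means
// "not hardcoded here": either query NVRTC at runtime or omit the flag.
// The numeric values are assumptions to verify, not authoritative data.
std::pair<int, int> hardcodedArchRange(int cudaVersion) {
    if (cudaVersion >= 11020) return {0, 0};    // 11.2+: query NVRTC instead
    if (cudaVersion >= 11010) return {35, 86};  // CUDA 11.1 (assumed)
    if (cudaVersion >= 11000) return {35, 80};  // CUDA 11.0: knows SM_80, not SM_86
    if (cudaVersion >= 10000) return {30, 75};  // CUDA 10.x (assumed)
    return {0, 0};                              // older: unknown
}
```

The device's architecture would then be clamped into the returned range before being passed to NVRTC.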

ptheywood added a commit that referenced this issue May 6, 2022
…the current nvrtc + device

Closes #844

The maximum compute capability supported by the currently linked NVRTC that is less than or equal to the device's architecture is used for RTC compilation.

This fixes an issue where running an RTC model on consumer Ampere (SM_86) would fail on CUDA 11.0 and older, which are not aware of SM_86's existence.

CUDA 11.2+ includes methods to query which architectures are supported by the dynamically linked NVRTC (which may add or remove architectures in new releases, and due to a stable ABI from 11.2 for all 11.x releases the linked version can be different than the version available at compile time).
CUDA 11.1 and below (11.1, 11.0 and 10.x currently in our case) do not include these methods, and due to the absence of a stable nvrtc ABI for these versions the known values can be hardcoded at compile time (grim but simple).

A method to select the most appropriate value from an ascending order vector has also been introduced, so this gencode functionality can be programmatically tested without having to predict what values would be appropriate based on the current device and the CUDA version used, which is a moving target.
ptheywood added a commit that referenced this issue May 6, 2022
…c & device

Closes #844

The maximum compute capability supported by the currently linked NVRTC that is less than or equal to the device's architecture is used for RTC compilation.

This fixes an issue where running an RTC model on consumer Ampere (SM_86) would fail on CUDA 11.0 and older, which are not aware of SM_86's existence.

CUDA 11.2+ includes methods to query which architectures are supported by the dynamically linked NVRTC (which may add or remove architectures in new releases, and due to a stable ABI from 11.2 for all 11.x releases the linked version can be different than the version available at compile time).
CUDA 11.1 and below (11.1, 11.0 and 10.x currently in our case) do not include these methods, and due to the absence of a stable nvrtc ABI for these versions the known values can be hardcoded at compile time (grim but simple).

A method to select the most appropriate value from an ascending order vector has also been introduced, so this gencode functionality can be programmatically tested without having to predict what values would be appropriate based on the current device and the CUDA version used, which is a moving target.
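
A selection method of the kind described in this commit message might be sketched as below. This is not the committed implementation; the name `selectAppropriateArch` and the convention of returning 0 when nothing fits are assumptions made for illustration.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Hypothetical sketch: given architectures in ascending order, return the
// largest value that is less than or equal to `target`, or 0 if there is
// no such value (including when the vector is empty).
int selectAppropriateArch(const std::vector<int>& ascending, int target) {
    // upper_bound finds the first element strictly greater than target,
    // so the element just before it (if any) is the largest one <= target.
    auto it = std::upper_bound(ascending.begin(), ascending.end(), target);
    if (it == ascending.begin()) {
        return 0;
    }
    return *std::prev(it);
}
```

Because the function takes the candidate list as a plain vector, it can be unit tested with fixed inputs, independent of the installed GPU or CUDA version.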
ptheywood added a commit that referenced this issue May 9, 2022
mondus pushed a commit that referenced this issue May 11, 2022