Incompatibility with CUDA 11.3 #115

Closed
stephenswat opened this issue Jan 24, 2022 · 6 comments

@stephenswat
Member

This is a quick continuation of #113, where we find that traccc is currently not compatible with CUDA 11.3, and I would like to know why. I'll keep this as a running log of my findings.

@stephenswat stephenswat added bug Something isn't working build This relates to the build system labels Jan 24, 2022
@stephenswat stephenswat self-assigned this Jan 24, 2022
@stephenswat
Member Author

This is the compatibility matrix of CUDA versions installed on atspot01:

| CUDA toolkit version | Works |
|---|---|
| 10.1.243 | ❌ (expected) |
| 10.2.89 | ❌ (expected) |
| 11.0.3 | ✔️ |
| 11.1.1 | ✔️ |
| 11.2.2 | ✔️ |
| 11.3.1 | ❌ |
| 11.4.3 | ✔️ |
| 11.5.0 | ✔️ |
| 11.5.1 | ✔️ |
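Until the cause is understood, one way to make the failure explicit is a compile-time guard on the toolkit version. The following is a minimal sketch, assuming the nvcc-provided macros `__CUDACC_VER_MAJOR__` and `__CUDACC_VER_MINOR__`; the helper name `is_known_bad_cuda` is hypothetical and not part of traccc. It only flags 11.3, the release this issue is about.

```cpp
// Hypothetical helper (not part of traccc): flags the toolkit release that
// this issue found to be incompatible.
constexpr bool is_known_bad_cuda(int major, int minor) {
    return major == 11 && minor == 3;
}

#if defined(__CUDACC_VER_MAJOR__) && defined(__CUDACC_VER_MINOR__)
// Only evaluated when the translation unit is compiled by nvcc; a plain
// host compiler skips this check entirely.
static_assert(!is_known_bad_cuda(__CUDACC_VER_MAJOR__, __CUDACC_VER_MINOR__),
              "CUDA 11.3 is known not to build this project; "
              "11.2 and 11.4 were observed to work.");
#endif
```

A guard like this turns an obscure template error into a readable message, at the cost of also rejecting any 11.3 patch release that might not be affected.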

@stephenswat
Member Author

Okay, there is some extremely odd behaviour happening inside nvcc, and I think inside cudafe++. It seems that for CUDA toolkit 11.3.1, the cudafe1.cpp translations of counting_grid_capabilities.cu and populating_grid.cu are incorrectly referencing spacepoint_t:

zone_t(scalar v, const spacepoint_t< neighbor_t, 2>  &nhood) const {
dindex_sequence zone(scalar v, const spacepoint_t< dindex, 2U>  &nhood) const {
dindex_sequence zone(scalar v, const spacepoint_t< dindex, 2U>  &nhood) const {

Here are the corresponding lines for CUDA 11.4.3:

zone_t(scalar v, const array_type< neighbor_t, 2>  &nhood) const {
dindex_sequence zone(scalar v, const array_type< dindex, 2U>  &nhood) const {
dindex_sequence zone(scalar v, const array_type< scalar, 2U>  &nhood) const {

The corresponding lines from detray/core/include/detray/grids/axis.hpp are:

zone_t(scalar v, const array_type<neighbor_t, 2> &nhood) const {
dindex_sequence zone(scalar v, const array_type<dindex, 2> &nhood) const {
dindex_sequence zone(scalar v, const array_type<scalar, 2> &nhood) const {

These files are generated from the corresponding .cpp4.ii files by cudafe++.

@stephenswat
Member Author

I can confirm that the .cpp4.ii files have identical versions of these lines.

@stephenswat
Member Author

Invoking the two versions of cudafe++ (11.3.1 and 11.4.3) on exactly the same input (the cudafe1.stub.cpp generated by cicc 11.3.1, and the .cpp4.ii generated by the 11.3.1 preprocessor) reproduces the difference: the 11.3.1 version erroneously inserts spacepoint_t where it doesn't belong, while the 11.4.3 version does not. Running the 11.2.2 version of cudafe++ produces the same correct output as 11.4.3.
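For anyone repeating this comparison, a small helper can list the lines on which two cudafe++ outputs disagree. This is only a sketch: `split_lines` and `differing_lines` are hypothetical names, and the code assumes the two outputs are aligned line by line (which is plausible when only a type name was substituted, but not guaranteed). The intermediate files themselves can be retained with nvcc's `--keep` option.

```cpp
#include <algorithm>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Split a text buffer into its lines.
std::vector<std::string> split_lines(const std::string& text) {
    std::vector<std::string> lines;
    std::istringstream in(text);
    for (std::string line; std::getline(in, line);) {
        lines.push_back(line);
    }
    return lines;
}

// Return the (first, second) pairs of lines that differ at the same index.
// Assumption: both outputs are aligned line by line; lines beyond the end
// of the shorter output are ignored.
std::vector<std::pair<std::string, std::string>> differing_lines(
    const std::string& first, const std::string& second) {
    const std::vector<std::string> a = split_lines(first);
    const std::vector<std::string> b = split_lines(second);
    std::vector<std::pair<std::string, std::string>> result;
    const std::size_t n = std::min(a.size(), b.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (a[i] != b[i]) {
            result.emplace_back(a[i], b[i]);
        }
    }
    return result;
}
```

Feeding it the contents of the two cudafe1.cpp files should surface lines like the `zone` signatures quoted earlier in this thread, where one side names spacepoint_t and the other array_type.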

@stephenswat
Member Author

Okay, I am sufficiently convinced that this is a bug in cudafe++.

@stephenswat
Member Author

Okay, I can't really debug this any further, because cudafe++ is opaque as hell, and as far as I know there aren't really any changelogs or documentation for it. However, I have narrowed the issue down to the detray::axis::regular type. My guess is that cudafe++ can't cope with its complex kind, (* → uint → *) → (*^n → *) → *, that is, a template whose two parameters are themselves templates. I suspect the cause is either the n-ary nature of the kind of the second type parameter, or the fact that the first type parameter accepts a non-* kind.

The symptom of this is that it starts substituting (seemingly) random (incompatible) types, such as spacepoint_t, where it expects the array type or the vector type. I presume that this might be some kind of indexing error happening at template resolution time, but I don't have enough evidence to make any concrete claims.

To conclude, CUDA 11.3.1 is completely bat-shit insane. The only next steps might be to investigate CUDA 11.3.0 and CUDA 11.4.0, the directly preceding and following versions.
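To make the kind (* → uint → *) → (*^n → *) → * concrete, here is a minimal sketch of a class template with that shape. It is an illustrative stand-in, not detray's actual code: `regular_sketch`, its member aliases, and the choice of `std::array` and `std::vector` as arguments are all assumptions made for the example.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>
#include <vector>

// Illustrative stand-in for the shape of detray::axis::regular (NOT the
// real definition). The first parameter has kind (* -> uint -> *); the
// second is variadic, i.e. kind (*^n -> *).
template <template <typename, std::size_t> class array_type,
          template <typename...> class vector_type>
struct regular_sketch {
    using index_t = unsigned int;
    using scalar_t = float;

    // Uses of array_type like these are where the 11.3.1 cudafe++ output
    // showed a different template name.
    using index_range = array_type<index_t, 2>;
    using scalar_range = array_type<scalar_t, 2>;
    using index_sequence = vector_type<index_t>;
};

// Instantiate with standard containers (an assumption for this example).
using host_axis = regular_sketch<std::array, std::vector>;

static_assert(
    std::is_same<host_axis::index_range, std::array<unsigned int, 2>>::value,
    "array_type<index_t, 2> should resolve to std::array<unsigned int, 2>");
```

A correct front end has to keep `array_type` bound to the first template argument wherever it appears inside the class, which is exactly the substitution that appears to go wrong in 11.3.1.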
