Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add workaround for syevd in CUDA 12.0 #2332

Merged
merged 14 commits into from
Jul 25, 2024

Conversation

lowener
Copy link
Contributor

@lowener lowener commented May 22, 2024

Bug 4580093 of cusolver is causing an issue with cusolverDnXsyevd. This bug has been seen in rapidsai/cuml#5555 and impact PCA and Linear Regression with CUDA 12.0+. Setting the stream to a different value than cudaStreamPerThread seems to solve it as a workaround.

@lowener lowener added bug Something isn't working 3 - Ready for Review non-breaking Non-breaking change labels May 22, 2024
@lowener lowener requested a review from a team as a code owner May 22, 2024 20:37
@github-actions github-actions bot added the cpp label May 22, 2024
@lowener lowener changed the base branch from branch-24.06 to branch-24.08 May 23, 2024 00:01
Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lowener for the fix. While the fix looks good, I think we should only apply it for the affected CTX versions.

@@ -91,6 +91,15 @@ void eigDC(raft::resources const& handle,
#if CUDART_VERSION < 11010
eigDC_legacy(handle, in, n_rows, n_cols, eig_vectors, eig_vals, stream);
#else

// Use a new stream instead of `cudaStreamPerThread` to avoid cusolver bug # 4580093.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, the cusolver bug is solved in cuda toolkit 12.4.1.003. It would be great if we apply this workaround only when we use an earlier cuda version.

Copy link
Contributor

@achirkin achirkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple notes on manually creating streams / events.

cudaStream_t stream_new;
cudaEvent_t sync_event;
RAFT_CUDA_TRY(cudaStreamCreate(&stream_new));
RAFT_CUDA_TRY(cudaEventCreate(&sync_event));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, in this case it would be justified to use the resource::detail::get_cuda_stream_sync_event instead of manually managing the resource.
This event resource is normally intended for synchronization between the streams in the resource pool, but as long as you don't use the stream pool resource at the same time, it's ok to reuse the event.

You can also use the stream from the stream pool resource, but there is a small problem with it, that raft/rmm would create 16 streams by default instead of one :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please note that we shouldn't be calling detail APIs in any namdespaces outside of the immediate namespace where those detail APIs reside. Please expose this function if it's going to be used publicly.

cpp/include/raft/linalg/detail/eig.cuh Outdated Show resolved Hide resolved
@lowener lowener requested review from tfeher and achirkin July 22, 2024 14:17
Copy link
Contributor

@tfeher tfeher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @lowener for the update. LGTM.

Copy link
Contributor

@achirkin achirkin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update, LGTM!

@cjnolet
Copy link
Member

cjnolet commented Jul 24, 2024

/merge

@rapids-bot rapids-bot bot merged commit 7004a8e into rapidsai:branch-24.08 Jul 25, 2024
74 checks passed
@lowener lowener deleted the 24.06-light-syevd branch July 29, 2024 12:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review bug Something isn't working cpp non-breaking Non-breaking change
Projects
Development

Successfully merging this pull request may close these issues.

4 participants