
Disable opportunistic reuse in async mr when cuda driver < 11.5 #993

Merged: 3 commits into rapidsai:branch-22.04 on Mar 16, 2022

Conversation

@rongou (Contributor) commented on Mar 14, 2022

While working on NVIDIA/spark-rapids#4710 we found some issues with the async pool that may cause memory errors with older drivers; this was confirmed with the CUDA team. For driver versions < 11.5, we'll disable cudaMemPoolReuseAllowOpportunistic.

@abellina
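
For readers following along, here is a minimal sketch of the approach described above (hypothetical code, not the actual RMM change): query the installed driver version and, when it is older than 11.5, turn off opportunistic reuse on the device's default memory pool.

```cpp
#include <cuda_runtime_api.h>

#include <cstdio>

int main()
{
  // cudaDriverGetVersion reports the driver as major*1000 + minor*10,
  // so 11.5 corresponds to 11050.
  int driver_version = 0;
  cudaDriverGetVersion(&driver_version);

  cudaMemPool_t pool{};
  cudaDeviceGetDefaultMemPool(&pool, /*device=*/0);

  if (driver_version < 11050) {
    // Per the PR description, older drivers may cause memory errors when
    // opportunistic reuse is enabled, so disable it for them.
    int disabled = 0;
    cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowOpportunistic, &disabled);
    std::printf("driver %d < 11.5: opportunistic reuse disabled\n", driver_version);
  }
  return 0;
}
```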

@rongou added labels on Mar 14, 2022: bug (Something isn't working), 3 - Ready for review (Ready for review by team), non-breaking (Non-breaking change), cpp (Pertains to C++ code)
@rongou self-assigned this on Mar 14, 2022
@rongou requested a review from a team as a code owner on Mar 14, 2022
@leofang (Member) left a comment


This is not fully bulletproof: the old cudaDevAttrMemoryPoolsSupported check needs to be kept, as support could be hardware-dependent.

Also, #990 is a renovation of the async MR support, so I'd suggest keeping others in the loop.
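
For illustration, the device-attribute check being referred to is a per-device query, since cudaMallocAsync support depends on the GPU as well as the driver. A hypothetical helper (not code from this PR) could look like:

```cpp
#include <cuda_runtime_api.h>

// Returns true if the given device supports CUDA memory pools (cudaMallocAsync).
bool memory_pools_supported(int device)
{
  int supported = 0;
  cudaDeviceGetAttribute(&supported, cudaDevAttrMemoryPoolsSupported, device);
  return supported != 0;
}
```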

@rongou (Contributor, Author) commented on Mar 14, 2022

Added back the device attribute check.

@jrhemstad (Contributor) left a comment


There is no need to fully disable cudaMallocAsync for drivers < 11.5; we just need to disable opportunistic reuse. This is done via cudaMemPoolSetAttribute, setting cudaMemPoolReuseAllowOpportunistic to zero.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html#group__CUDART__MEMORY__POOLS_1g0229135f7ef724b4f479a435ca300af5
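
As a rough sketch (illustrative only, not taken from the PR), the call shape is:

```cpp
#include <cuda_runtime_api.h>

// Disable opportunistic reuse on a pool by setting the attribute to 0.
void disable_opportunistic_reuse(cudaMemPool_t pool)
{
  int value = 0;
  cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowOpportunistic, &value);
}
```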

@rongou (Contributor, Author) commented on Mar 14, 2022

Do we want to disable it for all CUDA driver versions?

@jrhemstad (Contributor) replied:

> Do we want to disable it for all CUDA driver versions?

No, just for less than 11.5.

@rongou (Contributor, Author) commented on Mar 14, 2022

> There is no need to fully disable cudaMallocAsync for drivers < 11.5; we just need to disable opportunistic reuse. This is done via cudaMemPoolSetAttribute, setting cudaMemPoolReuseAllowOpportunistic to zero.
>
> https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html#group__CUDART__MEMORY__POOLS_1g0229135f7ef724b4f479a435ca300af5

Done.

@harrism (Member) commented on Mar 14, 2022

@rongou can you please edit the title and description to better reflect the changes?

@rongou changed the title from "Require CUDA driver 11.5+ for the async mr" to "Disable opportunistic reuse in async mr when cuda driver < 11.5" on Mar 15, 2022
@rongou (Contributor, Author) commented on Mar 15, 2022

@harrism done.

@harrism (Member) commented on Mar 15, 2022

@leofang please re-review.

@rongou (Contributor, Author) commented on Mar 16, 2022

@leofang please take another look. Thanks!

@jrhemstad (Contributor) commented:

@robertmaynard @rongou FYI, this is going to need to be updated along with #990, so whichever merges first, the other will need to update.

@rongou (Contributor, Author) commented on Mar 16, 2022

@gpucibot merge

rapids-bot merged commit 438d312 into rapidsai:branch-22.04 on Mar 16, 2022
@rongou deleted the async-cuda-driver branch on June 10, 2024