
Disable opportunistic reuse in async mr when cuda driver < 11.5 #993

Merged: 3 commits into rapidsai:branch-22.04 on Mar 16, 2022

Conversation

@rongou (Contributor) commented on Mar 14, 2022

While working on NVIDIA/spark-rapids#4710 we found some issues with the async pool that may cause memory errors with older drivers; this was confirmed with the CUDA team. For driver versions < 11.5, we'll disable cudaMemPoolReuseAllowOpportunistic.

@abellina
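
For readers following along, here is a minimal sketch of the approach described above (hypothetical code, not the actual RMM change): query the installed driver version and, when it is older than 11.5, turn off opportunistic reuse on the device's default memory pool.

```cpp
#include <cuda_runtime_api.h>

#include <cstdio>

int main()
{
  // cudaDriverGetVersion reports the driver as major*1000 + minor*10,
  // so 11.5 corresponds to 11050.
  int driver_version = 0;
  cudaDriverGetVersion(&driver_version);

  cudaMemPool_t pool{};
  cudaDeviceGetDefaultMemPool(&pool, /*device=*/0);

  if (driver_version < 11050) {
    // Per the PR description, older drivers may cause memory errors when
    // opportunistic reuse is enabled, so disable it for them.
    int disabled = 0;
    cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowOpportunistic, &disabled);
    std::printf("driver %d < 11.5: opportunistic reuse disabled\n", driver_version);
  }
  return 0;
}
```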

@rongou added labels on Mar 14, 2022: bug (Something isn't working), 3 - Ready for review (Ready for review by team), non-breaking (Non-breaking change), cpp (Pertains to C++ code)
@rongou self-assigned this on Mar 14, 2022
@rongou requested a review from a team as a code owner on Mar 14, 2022
@leofang (Member) left a comment


This is not fully bulletproof: the old cudaDevAttrMemoryPoolsSupported check needs to be kept, as support could be hardware-dependent.

Also, #990 is a renovation of the async MR support, so I'd suggest keeping others in the loop.
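
For illustration, the device-attribute check being referred to is a per-device query, since cudaMallocAsync support depends on the GPU as well as the driver. A hypothetical helper (not code from this PR) could look like:

```cpp
#include <cuda_runtime_api.h>

// Returns true if the given device supports CUDA memory pools (cudaMallocAsync).
bool memory_pools_supported(int device)
{
  int supported = 0;
  cudaDeviceGetAttribute(&supported, cudaDevAttrMemoryPoolsSupported, device);
  return supported != 0;
}
```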

@rongou (Contributor, Author) commented on Mar 14, 2022

Added back the device attribute check.

@jrhemstad (Contributor) left a comment


There is no need to fully disable cudaMallocAsync for drivers < 11.5; we just need to disable opportunistic reuse. This is done via cudaMemPoolSetAttribute, setting cudaMemPoolReuseAllowOpportunistic to zero.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html#group__CUDART__MEMORY__POOLS_1g0229135f7ef724b4f479a435ca300af5
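
As a rough sketch (illustrative only, not taken from the PR), the call shape is:

```cpp
#include <cuda_runtime_api.h>

// Disable opportunistic reuse on a pool by setting the attribute to 0.
void disable_opportunistic_reuse(cudaMemPool_t pool)
{
  int value = 0;
  cudaMemPoolSetAttribute(pool, cudaMemPoolReuseAllowOpportunistic, &value);
}
```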

@rongou (Contributor, Author) commented on Mar 14, 2022

Do we want to disable it for all CUDA driver versions?

@jrhemstad (Contributor) replied:

> Do we want to disable it for all CUDA driver versions?

No, just for less than 11.5.

@rongou (Contributor, Author) commented on Mar 14, 2022

> There is no need to fully disable cudaMallocAsync for drivers < 11.5; we just need to disable opportunistic reuse. This is done via cudaMemPoolSetAttribute, setting cudaMemPoolReuseAllowOpportunistic to zero.
>
> https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY__POOLS.html#group__CUDART__MEMORY__POOLS_1g0229135f7ef724b4f479a435ca300af5

Done.

@harrism (Member) commented on Mar 14, 2022

@rongou can you please edit the title and description to better reflect the changes?

@rongou changed the title from "Require CUDA driver 11.5+ for the async mr" to "Disable opportunistic reuse in async mr when cuda driver < 11.5" on Mar 15, 2022
@rongou (Contributor, Author) commented on Mar 15, 2022

@harrism done.

@harrism (Member) commented on Mar 15, 2022

@leofang please re-review.

@rongou (Contributor, Author) commented on Mar 16, 2022

@leofang please take another look. Thanks!

@jrhemstad (Contributor) commented:

@robertmaynard @rongou FYI, this is going to need to be updated along with #990, so whichever merges first, the other will need to update.

@rongou (Contributor, Author) commented on Mar 16, 2022

@gpucibot merge

rapids-bot merged commit 438d312 into rapidsai:branch-22.04 on Mar 16, 2022
@rongou deleted the async-cuda-driver branch on June 10, 2024