
Skip async mr tests when cuda runtime/driver < 11.2 #986

Merged (8 commits) Feb 25, 2022

Conversation


@rongou rongou commented Feb 23, 2022

The CUDA async allocator was added in CUDA 11.2; if the runtime or driver is older than that, the tests won't work.

Fixes #985

@rongou rongou added bug Something isn't working 3 - Ready for review Ready for review by team non-breaking Non-breaking change labels Feb 23, 2022
@rongou rongou requested a review from a team as a code owner February 23, 2022 17:58
@rongou rongou self-assigned this Feb 23, 2022
@rongou rongou requested a review from vyasr February 23, 2022 17:58
@github-actions github-actions bot added the cpp Pertains to C++ code label Feb 23, 2022
@jakirkham (Member):

cc @ajschmidt8 (for awareness)


ajschmidt8 commented Feb 23, 2022

@rongou, thanks for this! Can you apply the patch below to your PR? This will correctly execute all of the tests and let us ensure that these changes are working as expected.

diff --git a/ci/gpu/build.sh b/ci/gpu/build.sh
index d2065e0d..95f95d4a 100755
--- a/ci/gpu/build.sh
+++ b/ci/gpu/build.sh
@@ -102,7 +102,7 @@ else
 
     gpuci_logger "Running googletests"
     # run gtests from librmm_tests package
-    for gt in "$CONDA_PREFIX/bin/gtests/librmm/*" ; do
+    for gt in "$CONDA_PREFIX/bin/gtests/librmm/"*; do
         ${gt} --gtest_output=xml:${TESTRESULTS_DIR}/
         exitcode=$?
         if (( ${exitcode} != 0 )); then
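
The one-character fix above matters because in the original line the `*` sits inside double quotes, so the shell treats it as a literal pattern and the loop body runs exactly once with an unexpanded string instead of once per test binary. A minimal, self-contained reproduction (the temporary files are hypothetical stand-ins for the gtest binaries):

```shell
# Demonstrate why "$dir/*" (glob inside quotes) differs from "$dir/"* (glob outside).
set -eu
dir=$(mktemp -d)
touch "$dir/test_a" "$dir/test_b"

# Quoted glob: no expansion, a single iteration over the literal pattern.
quoted=0
for f in "$dir/*"; do quoted=$((quoted + 1)); done

# Unquoted glob: one iteration per matching file.
unquoted=0
for f in "$dir/"*; do unquoted=$((unquoted + 1)); done

echo "quoted=$quoted unquoted=$unquoted"   # quoted=1 unquoted=2
rm -rf "$dir"
```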

@rongou rongou requested a review from a team as a code owner February 23, 2022 18:56
@github-actions github-actions bot added the gpuCI label Feb 23, 2022
@rongou
Copy link
Contributor Author

rongou commented Feb 23, 2022

@ajschmidt8 done.

@ajschmidt8
Copy link
Member

ajschmidt8 commented Feb 23, 2022

@rongou, ah. So these failures are where I got stumped in my previous PR. Not sure what to make of them.

[screenshot: CI test failure output]

Comment on lines 33 to 34
int driver_version{};
RMM_CUDA_TRY(cudaDriverGetVersion(&driver_version));
Contributor:

Suggested change
int driver_version{};
RMM_CUDA_TRY(cudaDriverGetVersion(&driver_version));
int driver_supports_pool;
RMM_CUDA_TRY(cudaDeviceGetAttribute(&driver_supports_pool, cudaDevAttrMemoryPoolsSupported));

Contributor:

This is also needed here:

inline auto make_arena()
{
return rmm::mr::make_owning_wrapper<rmm::mr::arena_memory_resource>(make_cuda());
}

We should wrap this logic in a function like is_cuda_pool_available.

Member:

Maybe static bool cuda_async_memory_resource::is_supported(). Then use this in the cuda_async_mr ctor as well as here in the tests.

Contributor Author:

I get a symbol not found error if I use cudaDevAttrMemoryPoolsSupported with cuda 11.0.

Member:

Well you still have to guard it just like in the ctor using the flag we already have. The problem we are solving isn't compiling with cuda 11.0. It's compiling with 11.2+ and running on <11.2.

Contributor Author:

Done.

@harrism (Member) left a comment:

Thanks for fixing. I think the checks can be simpler and all be encapsulated in a static MR method.

{
static auto runtime_version{[] {
int runtime_version{};
RMM_CUDA_TRY(cudaRuntimeGetVersion(&runtime_version));
Member:

I think this should be a static member of the cuda_async_memory_resource. And it should use the simple check that is already in the ctor, not check explicit driver and runtime versions. The check in the ctor already factors in both driver and runtime.

Contributor Author:

Done.

*/
static bool is_supported()
{
static auto runtime_version{[] {
Member:

Convince me that this isn't sufficient.

Suggested change
static auto runtime_version{[] {
static bool is_supported()
{
#ifdef RMM_CUDA_MALLOC_ASYNC_SUPPORT
// Check if cudaMallocAsync Memory pool supported
auto const device = rmm::detail::current_device();
int cuda_pool_supported{};
auto result =
cudaDeviceGetAttribute(&cuda_pool_supported, cudaDevAttrMemoryPoolsSupported, device.value());
return result == cudaSuccess and cuda_pool_supported;
#else
return false;
#endif
}

Contributor Author:

You can compile with a newer cuda toolkit but run in an older runtime, then you'd get a symbol not found error.

Member:

Which symbol would not be found? cudaDeviceGetAttribute will exist on all CUDA versions we support.

Contributor Author:

cudaDevAttrMemoryPoolsSupported

Member:

Once compiled that would be a value compiled into our binary that would get passed to the API, wouldn't it? Not a symbol.

Member:

Pretty sure this code already works in the constructor.

Contributor Author:

Hmm actually I think it's just the check doesn't work (build with cuda 11.6, run with 11.0):

./gtests/CUDA_ASYNC_MR_TEST: symbol lookup error: ./gtests/CUDA_ASYNC_MR_TEST: undefined symbol: cudaMemPoolCreate, version libcudart.so.11.0

Contributor Author:

I printed out the value of cuda_pool_supported, it's actually 1 when running under 11.0.

@@ -233,7 +232,7 @@ struct mr_factory {
struct mr_test : public ::testing::TestWithParam<mr_factory> {
void SetUp() override
{
-    if (GetParam().name == "CUDA_Async" && should_skip_async()) {
+    if (GetParam().name == "CUDA_Async" && !rmm::mr::cuda_async_memory_resource::is_supported()) {
Contributor:

Wouldn't it be more robust to put it in the make_async factory?

Contributor Author:

What would it return?

Contributor Author:

Not sure if this is better, but please take a look.

@harrism (Member) left a comment:

One very small request. Otherwise looks good. Thanks @rongou !

Comment on lines 256 to 258
return std::make_shared<rmm::mr::cuda_async_memory_resource>();
}
return std::shared_ptr<rmm::mr::cuda_async_memory_resource>{};
Member:

This is a bit subtle since the only difference is () vs. {}. Maybe use an explicit nullptr?

Contributor Author:

Done.

@@ -234,6 +234,10 @@ struct mr_test : public ::testing::TestWithParam<mr_factory> {
{
auto factory = GetParam().factory;
mr = factory();
+    if (mr == nullptr) {
+      GTEST_SKIP() << "Skipping tests since the memory resource is not supported with this CUDA "
Contributor:

Putting the GTEST_SKIP directly in the make_cuda_async doesn't work? I'm not sure what GTEST_SKIP actually does or what scope it needs to be used in.

Contributor Author:

It's a macro that only works in a test method or SetUp.

Contributor:

Ah, I thought it might throw an exception that gtest knows to catch or something.

@ajschmidt8 (Member):

rerun tests

(1 similar "rerun tests" comment from @ajschmidt8)

@ajschmidt8 (Member) left a comment:

Awesome! LGTM 🔥


rongou commented Feb 25, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ace0964 into rapidsai:branch-22.04 Feb 25, 2022

harrism commented Feb 27, 2022

@rongou and @ajschmidt8 in the future please remember that we require two C++ codeowner approvals before merging C++ PRs. There was only one approval.

@rongou rongou deleted the skip-async branch June 10, 2024 20:33
Successfully merging this pull request may close these issues.

[BUG] Configure CUDA Malloc Async tests to be skipped in unsupported environments