
Throw rmm::out_of_memory when we know for sure #894

Merged: 5 commits into rapidsai:branch-21.12 on Oct 26, 2021

Conversation

@rongou rongou (Contributor) commented Oct 16, 2021

When RMM fails to allocate a buffer, it currently throws an rmm::bad_alloc exception, which a user might want to catch in order to spill some GPU buffers and try again. But that exception covers all error conditions, so catching it blindly may hide other, more serious CUDA errors and make the code hard to debug. This PR adds a more specific rmm::out_of_memory exception and throws it only when we are certain we are running out of memory, so that it can be caught to trigger spilling.
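As an illustration of the intended usage, here is a minimal sketch of the spill-and-retry pattern described above. spill_some_buffers() is a hypothetical application-side helper, and the header paths may differ between RMM versions.

// Minimal sketch of catching the new exception to trigger spilling.
// spill_some_buffers() is hypothetical; header paths may vary by RMM version.
#include <rmm/cuda_stream_view.hpp>
#include <rmm/detail/error.hpp>  // rmm::bad_alloc, rmm::out_of_memory
#include <rmm/device_buffer.hpp>

#include <cstddef>

void spill_some_buffers(std::size_t bytes);  // hypothetical: frees at least `bytes` of GPU memory

rmm::device_buffer allocate_with_spill(std::size_t bytes, rmm::cuda_stream_view stream)
{
  try {
    return rmm::device_buffer{bytes, stream};
  } catch (rmm::out_of_memory const&) {
    // Only a confirmed OOM lands here; other CUDA failures still surface as
    // rmm::bad_alloc / rmm::cuda_error and are not silently retried.
    spill_some_buffers(bytes);
    return rmm::device_buffer{bytes, stream};
  }
}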

@rongou rongou added the following labels on Oct 16, 2021: feature request (New feature or request), 3 - Ready for review (Ready for review by team), non-breaking (Non-breaking change), improvement (Improvement / enhancement to an existing function), cpp (Pertains to C++ code)
@rongou rongou self-assigned this Oct 16, 2021
@rongou rongou requested a review from a team as a code owner October 16, 2021 02:13
@rongou rongou removed the feature request (New feature or request) label on Oct 16, 2021
Comment on lines 167 to 188
 #define RMM_CUDA_TRY(...)                                                                          \
+  GET_RMM_CUDA_TRY_MACRO(__VA_ARGS__, RMM_CUDA_TRY_4, INVALID, RMM_CUDA_TRY_2, RMM_CUDA_TRY_1)     \
   (__VA_ARGS__)
-#define GET_RMM_CUDA_TRY_MACRO(_1, _2, NAME, ...) NAME
-#define RMM_CUDA_TRY_2(_call, _exception_type)                                               \
-  do {                                                                                       \
-    cudaError_t const error = (_call);                                                       \
-    if (cudaSuccess != error) {                                                              \
-      cudaGetLastError();                                                                    \
-      /*NOLINTNEXTLINE(bugprone-macro-parentheses)*/                                         \
-      throw _exception_type{std::string{"CUDA error at: "} + __FILE__ + ":" +                \
-                            RMM_STRINGIFY(__LINE__) + ": " + cudaGetErrorName(error) + " " + \
-                            cudaGetErrorString(error)};                                      \
-    }                                                                                        \
-  } while (0)
+#define GET_RMM_CUDA_TRY_MACRO(_1, _2, _3, _4, NAME, ...) NAME
+#define RMM_CUDA_TRY_4(_call, _exception_type, _custom_error, _custom_exception_type)              \
+  do {                                                                                             \
+    cudaError_t const error = (_call);                                                             \
+    if (cudaSuccess != error) {                                                                    \
+      cudaGetLastError();                                                                          \
+      auto const msg = std::string{"CUDA error at: "} + __FILE__ + ":" + RMM_STRINGIFY(__LINE__) + \
+                       ": " + cudaGetErrorName(error) + " " + cudaGetErrorString(error);           \
+      if ((_custom_error) == error) {                                                              \
+        /*NOLINTNEXTLINE(bugprone-macro-parentheses)*/                                             \
+        throw _custom_exception_type{msg};                                                         \
+      } else {                                                                                     \
+        /*NOLINTNEXTLINE(bugprone-macro-parentheses)*/                                             \
+        throw _exception_type{msg};                                                                \
+      }                                                                                            \
+    }                                                                                              \
+  } while (0)
+#define RMM_CUDA_TRY_2(_call, _exception_type) \
+  RMM_CUDA_TRY_4(_call, _exception_type, cudaSuccess, rmm::cuda_error)
Contributor:

Hm, this macro is smelly now.

First, it's non-intuitive at the call site what the different arguments mean.

Second, if we allow customizing the exception for one error, why not allow customizing the exception for an arbitrary number of errors?

If we're going to go this route, I think I'd like to see something more generic. Maybe a macro that accepts some kind of trait that maps a cudaError_t to a particular exception that defaults to rmm::bad_alloc.
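For concreteness, one rough, function-based shape of the mapping the reviewer describes (an illustration only, not what the PR ended up doing, and not a compile-time trait; the error.hpp path may vary by RMM version):

// Illustrative sketch only (not RMM code): map specific cudaError_t values to
// specific exceptions, falling back to a generic one for everything else.
#include <cuda_runtime_api.h>

#include <rmm/detail/error.hpp>  // rmm::bad_alloc, rmm::out_of_memory

#include <string>

[[noreturn]] inline void throw_cuda_alloc_error(cudaError_t error, std::string const& msg)
{
  if (error == cudaErrorMemoryAllocation) {
    throw rmm::out_of_memory{msg};  // known for sure to be an OOM
  }
  throw rmm::bad_alloc{msg};  // any other allocation failure
}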

Contributor Author:

Added a new macro for allocation calls. Seems cleaner, let me know what you think.
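For context, here is a sketch of what an allocation-specific try macro along these lines might look like. The name RMM_CUDA_TRY_ALLOC and the exact body are assumptions, since the final code is not shown in this thread, and the sketch relies on the same header context (std::string, RMM_STRINGIFY, the rmm exception types) as the macros above.

// Assumed shape of an allocation-specific macro: any failure throws
// rmm::bad_alloc, except a confirmed OOM, which throws rmm::out_of_memory.
#define RMM_CUDA_TRY_ALLOC(_call)                                                       \
  do {                                                                                  \
    cudaError_t const error = (_call);                                                  \
    if (cudaSuccess != error) {                                                         \
      cudaGetLastError();                                                               \
      auto const msg = std::string{"CUDA error at: "} + __FILE__ + ":" +                \
                       RMM_STRINGIFY(__LINE__) + ": " + cudaGetErrorName(error) + " " + \
                       cudaGetErrorString(error);                                       \
      if (cudaErrorMemoryAllocation == error) { throw rmm::out_of_memory{msg}; }        \
      throw rmm::bad_alloc{msg};                                                        \
    }                                                                                   \
  } while (0)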

Contributor:

I guess it solves the first issue of the callsite being confusing, but it doesn't solve the second issue of customizing exceptions for one or more cudaError_t values.

Comment on lines 69 to 70
out_of_memory(const char* msg) : bad_alloc{msg} {}
out_of_memory(std::string const& msg) : bad_alloc{msg} {}
Contributor:

Suggested change:
-  out_of_memory(const char* msg) : bad_alloc{msg} {}
-  out_of_memory(std::string const& msg) : bad_alloc{msg} {}
+  using bad_alloc::bad_alloc;

Contributor Author:

Done.

Comment on lines 63 to 67
/**
* @brief Exception thrown when RMM runs out of memory
*
*/
class out_of_memory : public bad_alloc {
Contributor:

We should be very precise about defining what "out of memory" means and what the (non)guarantees are about the behavior of this exception vs. a more generic bad_alloc. i.e., what exactly is it that this exception conveys that a normal bad_alloc does not?

Contributor Author:

Done.

/**
* @brief Exception thrown when RMM runs out of memory
*
* This is thrown under the following conditions:
Member:

I don't like having a list in a comment that we have to maintain. I think instead we should make it very clear that this error should only be thrown when we know for sure a resource is out of memory.

I don't know for sure that cudaErrorMemoryAllocation always means OOM, BTW. Is this documented somewhere?

Contributor Author:

Done.

According to the CUDA Runtime API doc:

cudaErrorMemoryAllocation = 2
The API call failed because it was unable to allocate enough memory to perform the requested operation.
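So a call that fails with exactly this error code is a confirmed out-of-memory condition, which is the case the new exception is meant to cover. A small stand-alone sketch of that distinction (not RMM code):

// Distinguishing a confirmed OOM from other CUDA allocation failures.
#include <cuda_runtime_api.h>

#include <cstddef>

bool is_confirmed_oom(std::size_t bytes)
{
  void* ptr = nullptr;
  cudaError_t const err = cudaMalloc(&ptr, bytes);
  if (err == cudaSuccess) {
    cudaFree(ptr);
    return false;                           // allocation succeeded
  }
  cudaGetLastError();                       // clear the sticky error state
  return err == cudaErrorMemoryAllocation;  // true only for a definite OOM
}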

@rongou rongou (Contributor Author) commented Oct 26, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit fcf92c5 into rapidsai:branch-21.12 Oct 26, 2021
@rongou rongou deleted the oom-exception branch November 23, 2021 17:32