Add CUDAMonitoringService #129
I think it may also get confusing with a single cmsRun job as soon as we go multi-stream. Of course, if we introduced our own memory allocator for CUDA, we could include such functionality. On the other hand, if we go with unified memory, I think we lose the possibility of tracing the memory usage, as the CUDA runtime will be swapping memory in and out of the GPU.
void dumpUsedMemory(T& log, int num) {
  for(int i = 0; i < num; ++i) {
    size_t freeMemory, totalMemory;
    cudaSetDevice(i);
Shouldn't it set the current device back to the original one afterwards?
Yeah, it would probably be polite. On the other hand it shouldn't affect anything we do, as we should set the current device explicitly everywhere. I'll add the setting-back on Monday.
> shouldn't it set the current device back to the original one afterwards?

Done.
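For reference, the set-back discussed above could be sketched roughly as follows. This is not the code from this PR, only a minimal illustration under stated assumptions: FakeRuntime, DeviceGuard and exampleReport are hypothetical names, and FakeRuntime stands in for the CUDA runtime so that the sketch builds without a GPU. In real code its three methods would forward to cudaGetDevice, cudaSetDevice and cudaMemGetInfo. The output format is also made up.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the CUDA runtime, so the sketch builds without a GPU.
// In real code these would forward to cudaGetDevice, cudaSetDevice and cudaMemGetInfo.
struct FakeRuntime {
  struct Device {
    std::size_t freeBytes;
    std::size_t totalBytes;
  };
  std::vector<Device> devices;
  int current = 0;

  int getDevice() const { return current; }
  void setDevice(int i) { current = i; }
  void memGetInfo(std::size_t& freeBytes, std::size_t& totalBytes) const {
    freeBytes = devices[current].freeBytes;
    totalBytes = devices[current].totalBytes;
  }
};

// RAII guard (hypothetical): remembers the current device on construction and
// restores it when the scope exits, including on early return or exception.
template <typename Runtime>
class DeviceGuard {
public:
  explicit DeviceGuard(Runtime& rt) : rt_(rt), original_(rt.getDevice()) {}
  ~DeviceGuard() { rt_.setDevice(original_); }
  DeviceGuard(const DeviceGuard&) = delete;
  DeviceGuard& operator=(const DeviceGuard&) = delete;

private:
  Runtime& rt_;
  int original_;
};

// Loops over the devices like the snippet under review, but leaves the
// current device as it found it.
template <typename Runtime, typename Log>
void dumpUsedMemory(Runtime& rt, Log& log, int num) {
  DeviceGuard<Runtime> guard(rt);
  constexpr std::size_t mbytes = 1 << 20;
  for (int i = 0; i < num; ++i) {
    std::size_t freeMemory = 0, totalMemory = 0;
    rt.setDevice(i);
    rt.memGetInfo(freeMemory, totalMemory);
    log << "device " << i << ": " << (totalMemory - freeMemory) / mbytes
        << " MB used / " << totalMemory / mbytes << " MB total\n";
  }
}

// Usage sketch: two fake devices, with device 1 current before the call.
std::string exampleReport() {
  FakeRuntime rt;
  rt.devices.push_back({std::size_t{512} << 20, std::size_t{1024} << 20});
  rt.devices.push_back({std::size_t{256} << 20, std::size_t{2048} << 20});
  rt.setDevice(1);
  std::ostringstream log;
  dumpUsedMemory(rt, log, 2);
  assert(rt.getDevice() == 1);  // the guard restored the original device
  return log.str();
}
```

A guard object is used here instead of a manual set-back at the end of the function so that the restore also happens on early exits. Whether that is worth the extra class for a quick monitoring helper is a judgment call.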
Yeah, that would be even nicer to have. Here I wanted something quick and dirty to hint at which module could be allocating lots of memory.
Add a new Service for simple CUDA monitoring. Currently included is the used/total memory for beginStream() of each module (default on).

The approach to obtaining the memory information should be rethought at some point. Currently it reports the global state of the device, which gets confusing if multiple processes are using the same device.