
Add CUDAMonitoringService #129

Merged

Conversation

makortel

Add a new Service for simple CUDA monitoring. Currently it reports the used/total device memory

  • after the constructor of each module (default off)
  • after the beginStream() of each module (default on)
  • after each event (default on)

The approach to obtaining the memory information should be rethought at some point. Currently it reports the global state of the device, which gets confusing if there are multiple processes using the same device.
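
For reference, a minimal sketch of the kind of query involved, assuming only the CUDA runtime API; the function name and the logging are illustrative, not the service's actual code:

#include <cuda_runtime.h>
#include <iostream>

// Illustrative sketch: query the free/total memory of the current device.
// The "used" figure is total minus free for the whole device, i.e. it
// includes allocations made by every process sharing the GPU.
void reportCurrentDeviceMemory() {
  size_t freeMemory = 0, totalMemory = 0;
  cudaMemGetInfo(&freeMemory, &totalMemory);
  std::cout << (totalMemory - freeMemory) << " bytes used / "
            << totalMemory << " bytes total" << std::endl;
}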


fwyzard commented Aug 10, 2018

I think it may get confusing also with a single cmsRun job, as soon as we go multi-stream.
However, I am not aware of any API to monitor the CUDA memory on a per-thread, per-stream or even per-context basis.

Of course, if we introduced our own memory allocator for CUDA, we could include such functionality.

On the other hand, if we go with unified memory, I think we lose the possibility of tracing the memory usage, as the CUDA runtime will be swapping memory in and out of the GPU.
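
As an aside, a hypothetical sketch of that idea: a thin wrapper around cudaMalloc/cudaFree that keeps its own byte counter, so the usage attributable to this process (or, with one counter per stream, to each stream) could be reported independently of the device-wide numbers. The class name and interface are made up for illustration; a real allocator would also have to remember the size of each allocation instead of asking the caller for it:

#include <cuda_runtime.h>
#include <atomic>
#include <cstddef>

// Hypothetical sketch only, not an existing CMSSW or CUDA interface.
class TrackingAllocator {
public:
  cudaError_t allocate(void** ptr, size_t bytes) {
    cudaError_t status = cudaMalloc(ptr, bytes);
    if (status == cudaSuccess)
      allocated_ += bytes;  // account for this allocation
    return status;
  }
  // The caller passes the size back because cudaFree() does not report it;
  // a real allocator would keep a pointer-to-size map instead.
  cudaError_t deallocate(void* ptr, size_t bytes) {
    cudaError_t status = cudaFree(ptr);
    if (status == cudaSuccess)
      allocated_ -= bytes;
    return status;
  }
  size_t allocatedBytes() const { return allocated_.load(); }

private:
  std::atomic<size_t> allocated_{0};  // bytes currently held via this allocator
};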

template <typename T>
void dumpUsedMemory(T& log, int num) {
  for (int i = 0; i < num; ++i) {
    size_t freeMemory, totalMemory;
    cudaSetDevice(i);  // switches the current device for the calling thread
fwyzard

shouldn't it set the current device back to the original one afterwards?

makortel (Author)

Yeah, it would probably be polite. On the other hand it shouldn't affect anything we do, as we should set the current device explicitly everywhere. I'll add the setting-back on Monday.

makortel (Author)

shouldn't it set the current device back to the original one afterwards?

Done.
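
For illustration, a sketch of what the setting-back could look like, following the shape of the dumpUsedMemory snippet above but not necessarily the exact committed code: the caller's device is remembered with cudaGetDevice() before the loop and restored with cudaSetDevice() afterwards.

#include <cuda_runtime.h>

template <typename T>
void dumpUsedMemory(T& log, int num) {
  int originalDevice = 0;
  cudaGetDevice(&originalDevice);  // remember the caller's current device
  for (int i = 0; i < num; ++i) {
    size_t freeMemory = 0, totalMemory = 0;
    cudaSetDevice(i);
    cudaMemGetInfo(&freeMemory, &totalMemory);
    log << "\n" << i << ": " << (totalMemory - freeMemory)
        << " bytes used / " << totalMemory << " bytes total";
  }
  cudaSetDevice(originalDevice);   // set the current device back afterwards
}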

@makortel (Author)

I think it may get confusing also with a single cmsRun job, as soon as we go multi-stream.
However, I am not aware of any API to monitor the CUDA memory on a per-thread, per-stream or even per-context basis.

Yeah, that would be even nicer to have. Here I wanted something quick & dirty to give hints about which module could be allocating lots of memory.

@fwyzard fwyzard merged commit fcecd3c into cms-patatrack:CMSSW_10_2_X_Patatrack Aug 13, 2018
@fwyzard fwyzard added this to the CMSSW_10_2_2_Patatrack milestone Aug 14, 2018
@fwyzard fwyzard modified the milestone: CMSSW_10_2_2_Patatrack Sep 2, 2018