Add CUDAMonitoringService #129

makortel · 2018-08-10T11:17:27Z

Add a new Service for simple CUDA monitoring. Currently it reports the used and total device memory:

- after the constructor of each module (default off)
- after the beginStream() of each module (default on)
- after each event (default on)

The approach to obtaining the memory information should be rethought at some point. Currently it reports the global state of the device, which gets confusing if multiple processes are using the same device.
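A minimal sketch of the per-device report, assuming the service loops over all visible devices and derives "used" as total minus free. `MemoryQuery`, `dumpUsedMemory` and `usedMemoryReport` are illustrative names, not necessarily the code in this PR. To keep the sketch runnable without a GPU, the CUDA query is injected as a callable; in real code it could call `cudaSetDevice(i)` followed by `cudaMemGetInfo(&freeBytes, &totalBytes)`. Because the free count is device-wide, "used" also includes memory held by other processes on the same GPU, which is exactly the confusion described above.

```cpp
#include <cstddef>
#include <functional>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical query signature: fills the free/total bytes of one device and
// returns false on error. In real code it could wrap
//   cudaSetDevice(device); cudaMemGetInfo(&freeBytes, &totalBytes);
using MemoryQuery =
    std::function<bool(int device, std::size_t& freeBytes, std::size_t& totalBytes)>;

// Write one line per device with used/total memory in MB. "Used" is
// total - free, i.e. the device-wide view of the driver, so it also counts
// memory allocated by other processes on the same GPU.
void dumpUsedMemory(std::ostream& log, int numDevices, const MemoryQuery& query) {
  constexpr std::size_t MB = 1024 * 1024;
  for (int i = 0; i < numDevices; ++i) {
    std::size_t freeBytes = 0;
    std::size_t totalBytes = 0;
    if (!query(i, freeBytes, totalBytes)) {
      log << "device " << i << ": query failed\n";
      continue;
    }
    const std::size_t usedBytes = totalBytes - freeBytes;
    log << "device " << i << ": " << usedBytes / MB << " / " << totalBytes / MB << " MB used\n";
  }
}

// Convenience wrapper returning the report as a string, e.g. for a logger.
std::string usedMemoryReport(int numDevices, const MemoryQuery& query) {
  std::ostringstream out;
  dumpUsedMemory(out, numDevices, query);
  return out.str();
}
```

With a fake query describing 4096 MB devices, each report line has the form `device 0: 512 / 4096 MB used`.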

fwyzard · 2018-08-10T16:34:31Z

I think it may get confusing also with a single cmsRun job, as soon as we go multi-stream.
However, I am not aware of any API to monitor the CUDA memory on a per-thread, per-stream or even per-context basis.

Of course, if we introduced our own memory allocator for CUDA, we could include such functionality.
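With a custom allocator, per-stream or per-module accounting reduces to bookkeeping at allocation and deallocation time. A minimal sketch, assuming every allocation is tagged with the ID of whoever requested it (a stream or module ID); `TrackingAllocator` is hypothetical, and `std::malloc`/`std::free` stand in for `cudaMalloc`/`cudaFree` so the sketch runs without a GPU.

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>
#include <mutex>
#include <new>
#include <unordered_map>

// Hypothetical allocator wrapper that tracks how many bytes each tag
// (e.g. a stream or module ID) currently holds.
// std::malloc/std::free stand in for cudaMalloc/cudaFree in this sketch.
class TrackingAllocator {
public:
  void* allocate(std::size_t bytes, int tag) {
    void* ptr = std::malloc(bytes);
    if (ptr == nullptr) {
      throw std::bad_alloc();
    }
    std::lock_guard<std::mutex> lock(mutex_);
    blocks_[ptr] = Block{bytes, tag};
    bytesPerTag_[tag] += bytes;
    return ptr;
  }

  void deallocate(void* ptr) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      auto it = blocks_.find(ptr);
      if (it == blocks_.end()) {
        return;  // not allocated through this wrapper: ignored in this sketch
      }
      bytesPerTag_[it->second.tag] -= it->second.bytes;
      blocks_.erase(it);
    }
    std::free(ptr);
  }

  // Bytes currently held by one tag (0 if it never allocated).
  std::size_t bytesInUse(int tag) const {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = bytesPerTag_.find(tag);
    return it == bytesPerTag_.end() ? 0 : it->second;
  }

private:
  struct Block {
    std::size_t bytes;
    int tag;
  };
  mutable std::mutex mutex_;  // streams may allocate concurrently
  std::unordered_map<void*, Block> blocks_;
  std::map<int, std::size_t> bytesPerTag_;
};
```

Unlike the device-wide numbers from the driver, these counters only see allocations made through the wrapper, so they are unaffected by other processes but blind to anything allocated directly.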

On the other hand, if we go with unified memory, I think we lose the possibility of tracing the memory usage, as the CUDA runtime will be migrating memory in and out of the GPU.

fwyzard · 2018-08-10T16:36:40Z

HeterogeneousCore/CUDAServices/plugins/CUDAMonitoringService.cc

+  void dumpUsedMemory(T& log, int num) {
+    for(int i = 0; i < num; ++i) {
+      size_t freeMemory, totalMemory;
+      cudaSetDevice(i);


Shouldn't it set the current device back to the original one afterwards?

makortel · 2018-08-10

Yeah, it would probably be polite. On the other hand, it shouldn't affect anything we do, as we should set the current device explicitly everywhere. I'll add the setting-back on Monday.

makortel · 2018-08-13

> shouldn't it set the current device back to the original one afterwards?

Done.
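The save-and-restore requested in this thread fits a scope guard naturally. A minimal sketch, assuming a hypothetical `ScopedDeviceRestore` class; the get/set functions are injected so the sketch runs without a GPU, and in real code they would wrap `cudaGetDevice(&dev)` and `cudaSetDevice(dev)`. Restoring in the destructor also covers early returns and exceptions from the loop body.

```cpp
#include <functional>
#include <utility>

// Hypothetical RAII guard: remembers the current device on construction and
// sets it back on destruction. In real code getDevice/setDevice would wrap
// cudaGetDevice(&dev) and cudaSetDevice(dev). setDevice must not throw,
// since it runs in the destructor.
class ScopedDeviceRestore {
public:
  ScopedDeviceRestore(std::function<int()> getDevice, std::function<void(int)> setDevice)
      : setDevice_(std::move(setDevice)), original_(getDevice()) {}

  ~ScopedDeviceRestore() { setDevice_(original_); }

  ScopedDeviceRestore(const ScopedDeviceRestore&) = delete;
  ScopedDeviceRestore& operator=(const ScopedDeviceRestore&) = delete;

private:
  std::function<void(int)> setDevice_;
  int original_;
};
```

In the monitoring loop, the guard would be constructed just before the `for` loop that calls `cudaSetDevice(i)`; when the function returns, the device that was current before the loop is current again.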

makortel · 2018-08-10T19:13:40Z

> I think it may get confusing also with a single cmsRun job, as soon as we go multi-stream.
> However, I am not aware of any API to monitor the CUDA memory on a per-thread, per-stream or even per-context basis.

Yeah, that would be even nicer to have. Here I wanted something quick and dirty to give hints about which module could be allocating lots of memory.
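With one report after each module's beginStream(), the difference between consecutive reports hints at which module allocated the memory. A minimal sketch of that post-processing, assuming the reports have been collected as (module label, used bytes) pairs in call order; `Snapshot` and `largestIncrease` are hypothetical names. Since the numbers are device-wide, a large delta is only a hint: another stream or process may have allocated in the meantime.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One report: the module that just ran, and device-wide used bytes afterwards.
struct Snapshot {
  std::string module;
  std::size_t usedBytes;
};

// Return the label of the module whose report shows the largest increase
// over the previous report, or an empty string if no report increased.
// The first snapshot only serves as the baseline.
std::string largestIncrease(const std::vector<Snapshot>& reports) {
  std::string best;
  std::size_t bestDelta = 0;
  for (std::size_t i = 1; i < reports.size(); ++i) {
    if (reports[i].usedBytes > reports[i - 1].usedBytes) {
      const std::size_t delta = reports[i].usedBytes - reports[i - 1].usedBytes;
      if (delta > bestDelta) {
        bestDelta = delta;
        best = reports[i].module;
      }
    }
  }
  return best;
}
```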

Add CUDAMonitoringService

de574c8

makortel mentioned this pull request on Aug 10, 2018, in Speed up CPU side of GPU rechits #125 (merged)

fwyzard reviewed Aug 10, 2018

Set the original device back, decorate with cudaCheck

a5522a0
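Decorating each runtime call with an error check turns silently ignored failures into exceptions that point at the failing call. A minimal sketch of such a wrapper, assuming an integer status where 0 means success (standing in for `cudaError_t` and `cudaSuccess`); `checkStatus` and `CHECK_STATUS` are hypothetical and are not the CMSSW `cudaCheck` helper, whose exact signature is not shown in this PR. In real code the message would typically include `cudaGetErrorString(status)`.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical checker: throws if a call returned a non-zero status.
// 0 stands in for cudaSuccess; a real version would add the text from
// cudaGetErrorString(status) to the message.
inline void checkStatus(int status, const char* expr, const char* file, int line) {
  if (status != 0) {
    std::ostringstream msg;
    msg << file << ":" << line << ": " << expr << " failed with status " << status;
    throw std::runtime_error(msg.str());
  }
}

// Wraps a call so the failing expression and its location end up in the error.
#define CHECK_STATUS(call) checkStatus((call), #call, __FILE__, __LINE__)
```

Usage would look like `CHECK_STATUS(query(i, freeBytes, totalBytes));`, with the real CUDA call in place of the hypothetical `query`.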

fwyzard merged commit fcecd3c into cms-patatrack:CMSSW_10_2_X_Patatrack Aug 13, 2018

fwyzard added this to the CMSSW_10_2_2_Patatrack milestone Aug 14, 2018

fwyzard removed the comparison-pending label Sep 2, 2018

fwyzard modified the milestone: CMSSW_10_2_2_Patatrack Sep 2, 2018

fwyzard removed alca-pending labels Sep 13, 2018
