Speed up CPU side of GPU rechits #125
Conversation
Successfully tested at #127 (comment).
I think reusing some per-module, per-stream buffers does make sense.
Still, we should eventually coalesce all the allocations into fewer ones (here and everywhere else).
Validation summary
Reference release CMSSW_10_2_1 at d00b7b4
Running the tests again to see if we can get a fair comparison of the performance...
I started to test something I've been thinking about for a while; could you hold off merging for a few hours?
… use cudaMemcpy2DAsync to transfer
Trying to reduce memory copy overheads
Ok, the last commit has an experiment replacing 9 32-bit buffers with one buffer, and 5 16-bit buffers with one buffer. The memory copy reductions are 5->1 and 3->1, respectively. The memory allocation is done with cudaMallocPitch and the transfer with cudaMemcpy2DAsync.

Performance comparison (on a single event): ...

So yes, it does improve, even if the numbers are rather small in the absolute scale. I'd really like to not spread the pattern as-is, but build some abstraction (e.g. along ...).
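For reference, a minimal standalone sketch of the coalescing pattern described above (not the actual CMSSW code: the buffer names, the local cudaCheck macro and the MAX_HITS value are placeholders). Nine 32-bit arrays become rows of one pitched device allocation, and the first five rows are copied back to the host with a single cudaMemcpy2DAsync instead of five separate copies:

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

// stand-in for the cudaCheck helper used in CMSSW
#define cudaCheck(call) do { cudaError_t e = (call); \
  if (e != cudaSuccess) fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); } while (0)

int main() {
  constexpr size_t MAX_HITS = 2048;  // placeholder capacity
  constexpr size_t num32 = 9;        // 32-bit buffers coalesced into one allocation
  constexpr size_t copy32 = 5;       // rows that need to be transferred to the CPU

  cudaStream_t stream;
  cudaCheck(cudaStreamCreate(&stream));

  // One pitched allocation replaces nine separate cudaMalloc calls;
  // pitch is the padded row width in bytes (>= MAX_HITS * sizeof(uint32_t)).
  uint32_t* d_owner32 = nullptr;
  size_t pitch = 0;
  cudaCheck(cudaMallocPitch((void**)&d_owner32, &pitch, MAX_HITS * sizeof(uint32_t), num32));

  // Pinned host buffer large enough for the rows that come back to the CPU.
  uint32_t* h_owner32 = nullptr;
  cudaCheck(cudaMallocHost((void**)&h_owner32, copy32 * MAX_HITS * sizeof(uint32_t)));

  // ... kernels fill the device rows here ...

  // One asynchronous 2D copy instead of five 1D copies.
  cudaCheck(cudaMemcpy2DAsync(h_owner32, MAX_HITS * sizeof(uint32_t),  // dst, dst row width
                              d_owner32, pitch,                        // src, src row width
                              MAX_HITS * sizeof(uint32_t), copy32,     // width in bytes, rows
                              cudaMemcpyDeviceToHost, stream));
  cudaCheck(cudaStreamSynchronize(stream));

  cudaCheck(cudaFreeHost(h_owner32));
  cudaCheck(cudaFree(d_owner32));
  cudaCheck(cudaStreamDestroy(stream));
  return 0;
}
```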
On the other hand, it seems to me (with #129) that ...

Ok, looking at the code:
cmssw/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorGPU.cu, lines 121 to 127 at 907e17c
cmssw/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorGPU.h, lines 152 to 158 at 907e17c
cmssw/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h, lines 228 to 243 at 907e17c
the allocated amounts of memory (per EDM stream) are (if I got them right): ...
I'd be interested to know how much of this memory we actually use...
The number of doublets and the number of doublets per hit depend on the layer. I set this to be safe for bpix1 and it is the same for all the layers, but one could think of setting a different maximum per layer.
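Purely as an illustration of that last idea (the layer count and all the numbers below are made up, not taken from the actual configuration), the single conservative maximum could be replaced by a per-layer table:

```cpp
#include <cstdint>

// Hypothetical per-layer doublet limits; real values would have to come from
// measured doublet multiplicities in each layer.
constexpr int numberOfLayers = 10;  // illustrative
constexpr uint32_t maxDoubletsPerLayer[numberOfLayers] = {
    60000, 40000, 30000, 25000,                 // barrel layers (made-up numbers)
    20000, 20000, 15000, 15000, 12000, 12000};  // forward disks (made-up numbers)

// The buffer capacity then becomes a sum over layers instead of
// numberOfLayers * (the bpix1 maximum), saving memory on the outer layers.
constexpr uint32_t totalDoubletCapacity() {
  uint32_t total = 0;
  for (int i = 0; i < numberOfLayers; ++i)
    total += maxDoubletsPerLayer[i];
  return total;
}
```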
Validation summary
Reference release CMSSW_10_2_1 at d00b7b4
@makortel would you rather merge it as it is now, or see if we can build a more general approach first?
I would merge it now, and leave the better abstraction to a later exercise.
I haven't looked closely at the validations of recent PRs; are the fluctuations shown for 10824.8 above of the expected magnitude?
Yes, I think we regularly see fluctuations at the percent level in the summary. On the other hand, ...
032a572 to c04392d (Compare)
That should do the trick.
Thanks. More generally, I am wondering what gets flagged by ... As there were no ...
```diff
@@ -48,6 +48,7 @@ namespace pixelgpudetails {
   // Order such that the first ones are the ones transferred to CPU
   static_assert(sizeof(uint32_t) == sizeof(float));   // just stating the obvious
   cudaCheck(cudaMallocPitch(&gpu_.owner_32bit_, &gpu_.owner_32bit_pitch_, MAX_HITS*sizeof(uint32_t), 9));
+  cudaCheck(cudaMemsetAsync(gpu_.owner_32bit_, 0x0, gpu_.owner_32bit_pitch_*9, cudaStream.id()));
```
why pitch * 9? shouldn't it be pitch * 9 * MAX_HITS*sizeof(uint32_t)? or am I just being confused?
ah, ok, I was being confused: pitch is the rounded-up value of MAX_HITS*sizeof(uint32_t), so pitch * 9 it is.
pitch is the allocated width of a row in bytes: https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1g32bd7a39135594788a542ae72217775c
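To make the same point concrete, a tiny standalone sketch (MAX_HITS is an arbitrary placeholder): cudaMallocPitch may round the requested row width up, the padding is part of the allocation, so covering the whole buffer takes pitch * rows bytes, not width * rows.

```cpp
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

int main() {
  constexpr size_t MAX_HITS = 1000;  // placeholder row length
  constexpr size_t width = MAX_HITS * sizeof(uint32_t);
  constexpr size_t rows = 9;

  uint32_t* buf = nullptr;
  size_t pitch = 0;
  cudaMallocPitch((void**)&buf, &pitch, width, rows);

  // pitch >= width; the row padding belongs to the allocation,
  // so zeroing everything means pitch * rows bytes.
  printf("requested row width %zu B, pitch %zu B, total %zu B\n", width, pitch, pitch * rows);
  cudaMemset(buf, 0, pitch * rows);

  cudaFree(buf);
  return 0;
}
```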
Since the ...
So am I just hiding something under the carpet with the ...?
I have no idea...
Taking a closer look, the first complaint is from a 32-byte area to which the ... (and to my earlier worry about the padding, there is no padding in this case in ...).
Thanks for looking into it. I am still confused: does it mean that we were reading from that memory before having written anything into it? P.S. I would run the tests again, but EOS seems to be down, so I won't be able to upload the test results anyway...
Good questions, I don't know. It could be some weird interplay between ... I made a little test program ...
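The test program itself is not shown in the thread; below is only a guess at what such a reproducer might look like (all names and sizes are invented): allocate a pitched device buffer, deliberately skip the initialisation, copy it to the host, and run the binary under cuda-memcheck --tool initcheck to see whether the copy of never-written device memory is what gets flagged.

```cpp
// Build: nvcc -o pitch_initcheck pitch_initcheck.cu
// Run:   cuda-memcheck --tool initcheck ./pitch_initcheck
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

int main() {
  constexpr size_t width = 100 * sizeof(uint32_t);  // arbitrary sizes
  constexpr size_t rows = 9;

  uint32_t* d_buf = nullptr;
  size_t pitch = 0;
  cudaMallocPitch((void**)&d_buf, &pitch, width, rows);
  // Deliberately NOT initialising d_buf; adding
  //   cudaMemset(d_buf, 0, pitch * rows);
  // here mirrors what the PR does to keep the tool quiet.

  uint32_t* h_buf = nullptr;
  cudaMallocHost((void**)&h_buf, width * rows);

  // Device-to-host 2D copy of memory that was never written on the device.
  cudaMemcpy2D(h_buf, width, d_buf, pitch, width, rows, cudaMemcpyDeviceToHost);
  cudaDeviceSynchronize();
  printf("h_buf[0] = %u\n", h_buf[0]);

  cudaFreeHost(h_buf);
  cudaFree(d_buf);
  return 0;
}
```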
- allocate HitsOnCPU buffers once per job per EDM stream
- coalesce multiple 32-bit and multiple 16-bit rechit buffers into two larger buffers; the allocation is done with cudaMallocPitch, the transfer with cudaMemcpy2DAsync
- initialise the full memory buffer to keep cuda-memcheck happy
HitsOnCPU allocates the (pinned) host memory on each event. The profiler shows that this is rather costly (more than half of the wall clock time spent in the kernels...). The first commit moves the allocations to be the first action in the function, because I first saw a peculiar profile where one of the cudaMallocHost calls took a very long time and delayed the queueing of the asynchronous memory copies, thus making acquire() occupy the CPU.

Of course this didn't significantly improve the situation, so the second commit changes the pattern to the same as in raw2cluster, i.e. allocate (also host-side) buffers once per module (i.e. per job per EDM stream), and have the event product just hold pointers to them. Not ideal, but faster (in a single event the wall clock time of acquire() is reduced from ~240 us to ~170 us).

We can of course discuss whether this approach is what we really want (probably not), but at the moment it seems to be the fastest.
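As a rough sketch of the pattern the description contrasts (the class, struct and member names below are placeholders, not the real HitsOnCPU or producer interfaces): the pinned host buffers are allocated once when the module is constructed and reused for every event, and the per-event product only carries pointers into them.

```cpp
#include <cuda_runtime.h>
#include <cstdint>

// Placeholder for the per-event product: it does not own the memory,
// it just points into buffers owned by the module.
struct HitsView {
  float*    xl = nullptr;
  float*    yl = nullptr;
  uint32_t* charge = nullptr;
  uint32_t  nHits = 0;
};

class RecHitProducerLike {  // hypothetical module skeleton
public:
  explicit RecHitProducerLike(size_t maxHits) : maxHits_(maxHits) {
    // Allocate pinned host memory once per job per EDM stream,
    // instead of once per event inside acquire().
    cudaMallocHost((void**)&h_xl_, maxHits_ * sizeof(float));
    cudaMallocHost((void**)&h_yl_, maxHits_ * sizeof(float));
    cudaMallocHost((void**)&h_charge_, maxHits_ * sizeof(uint32_t));
  }
  ~RecHitProducerLike() {
    cudaFreeHost(h_xl_);
    cudaFreeHost(h_yl_);
    cudaFreeHost(h_charge_);
  }

  HitsView acquire(const float* d_xl, const float* d_yl, const uint32_t* d_charge,
                   uint32_t nHits, cudaStream_t stream) {
    // No allocations here: just queue the asynchronous copies into the
    // pre-allocated buffers (assuming nHits <= maxHits_) and hand out pointers.
    cudaMemcpyAsync(h_xl_, d_xl, nHits * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaMemcpyAsync(h_yl_, d_yl, nHits * sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaMemcpyAsync(h_charge_, d_charge, nHits * sizeof(uint32_t), cudaMemcpyDeviceToHost, stream);
    return HitsView{h_xl_, h_yl_, h_charge_, nHits};
  }

private:
  size_t    maxHits_;
  float*    h_xl_ = nullptr;
  float*    h_yl_ = nullptr;
  uint32_t* h_charge_ = nullptr;
};
```

The obvious trade-off, as noted above, is that the product does not own its memory, so its contents are only valid until the module processes the next event on that EDM stream.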
@fwyzard @VinInn @felicepantaleo