-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move to View #354
Move to View #354
Conversation
uint32_t const* __restrict__ hitsModuleStart, | ||
TrackingRecHit2DSOAView* phits) { | ||
auto& hits = *phits; | ||
|
||
auto const digis = *pdigis; // the copy is intentional! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto const & digis
makes this version 10% slower.
if all digis.clus(i) etc
were not locally copied it was 20% slower...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add more elaborate comments along the first one becoming slow if copied, and the latter one becoming slow if not copied?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would I know.... (I have not looked at ptx_
It is only an observation based on nvprof. I suspect TrackingRecHit2DSOAView is large and is really copied "somewhere" (it is also declared non const...)
while SiPixelDigisCUDA::DeviceConstView most probably just goes in registers. if not copied I suspect that the call to __lcg are for some reason repeated each time...
I noticed the problem of indirection already in the previous version (see around line 161),
it really seems that with indirection it keeps loading from memory each time: we need to check in ptx
The point is: more investigative work is needed to understand the origin of performance differences: do not exclude that changing compiler (say clang) changes performance diff as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fine, but I think it would be useful to include these notes also in the code as comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
uint32_t const* __restrict__ digiModuleStart, | ||
uint32_t const* __restrict__ clusInModule, | ||
uint32_t const* __restrict__ moduleId, | ||
int32_t const* __restrict__ clus, | ||
int numElements, | ||
uint32_t const* __restrict__ hitsModuleStart, | ||
TrackingRecHit2DSOAView* phits) { | ||
auto& hits = *phits; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here if I remove "&" it becomes twice (factor 2) slower!!!!
now gpuPixelRecHits::getHits is 5% slower than original |
reducing the number of threads to 128 makes it only 2% slower than the original (with 128 as well) |
Validation summaryReference release CMSSW_10_6_0 at b45186e
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks good to me. Interesting to see the cost of additional level of pointer indirection(?)...
uint32_t const* __restrict__ hitsModuleStart, | ||
TrackingRecHit2DSOAView* phits) { | ||
auto& hits = *phits; | ||
|
||
auto const digis = *pdigis; // the copy is intentional! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add more elaborate comments along the first one becoming slow if copied, and the latter one becoming slow if not copied?
the fact that loads from a view (both from global or nc) are not optimized if repeated is pretty obvious even with a trivial test case (see https://godbolt.org/z/cPoP1z ) We need to report to cuda developers: |
There is a lot of jitter for the overall performance on the V100's, and the T4 is still off. That said, what I got, running 10 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 V100, is
So the ~1% drop seems real ? |
On the T4, running 16 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs:
So the effect here seems much smaller. |
|
Thanks for spotting it, I'll fix it by hand. |
…ltiple pointers (#354) Other changes and optimisations: - take into account the case where `nclus > blockDim.x` - use a smaller block size - document why why we copy or not to local variables
…ltiple pointers (#354) Other changes and optimisations: - take into account the case where `nclus > blockDim.x` - use a smaller block size - document why why we copy or not to local variables
…ltiple pointers (#354) Other changes and optimisations: - take into account the case where `nclus > blockDim.x` - use a smaller block size - document why why we copy or not to local variables
…ltiple pointers (#354) Other changes and optimisations: - take into account the case where `nclus > blockDim.x` - use a smaller block size - document why why we copy or not to local variables
…ltiple pointers (#354) Other changes and optimisations: - take into account the case where `nclus > blockDim.x` - use a smaller block size - document why why we copy or not to local variables
…ltiple pointers (#354) Other changes and optimisations: - take into account the case where `nclus > blockDim.x` - use a smaller block size - document why why we copy or not to local variables
…ltiple pointers (#354) Other changes and optimisations: - take into account the case where `nclus > blockDim.x` - use a smaller block size - document why why we copy or not to local variables
…ltiple pointers (#354) Other changes and optimisations: - take into account the case where `nclus > blockDim.x` - use a smaller block size - document why why we copy or not to local variables
…ltiple pointers (#354) Other changes and optimisations: - take into account the case where `nclus > blockDim.x` - use a smaller block size - document why why we copy or not to local variables
…ltiple pointers (#354) Other changes and optimisations: - take into account the case where `nclus > blockDim.x` - use a smaller block size - document why why we copy or not to local variables
…ltiple pointers (#354) Other changes and optimisations: - take into account the case where `nclus > blockDim.x` - use a smaller block size - document why why we copy or not to local variables
Finish to move to view...
Moving Raw2digi and clusterizer to view most probably not worth (reading and writing from the same data-structures)
PR ready for review!
Just to make everybody aware that deferencing device views in global memory is utterly slow.
The first attempt of this PR resulted in having gpuPixelRecHits::getHits 20% slower.
now is only 2% slower (see inline comments)