Move to View #354

VinInn · 2019-06-12T14:30:07Z

Finished the move to views.
Moving Raw2digi and the clusterizer to views is most probably not worth it (they read from and write to the same data structures).

PR ready for review!

Just to make everybody aware that dereferencing device views in global memory is utterly slow.
The first attempt at this PR made gpuPixelRecHits::getHits 20% slower.
Now it is only 2% slower (see inline comments).

VinInn · 2019-06-12T14:32:22Z

RecoLocalTracker/SiPixelRecHits/plugins/gpuPixelRecHits.h

                          uint32_t const* __restrict__ hitsModuleStart,
                          TrackingRecHit2DSOAView* phits) {
    auto& hits = *phits;

+    auto const digis = *pdigis; // the copy is intentional!


Using auto const& digis makes this version 10% slower.
If all the digis.clus(i) etc. were not copied locally, it was 20% slower...

Could you add more elaborate comments along the first one becoming slow if copied, and the latter one becoming slow if not copied?

I wish I knew... (I have not looked at the PTX.)
It is only an observation based on nvprof. I suspect TrackingRecHit2DSOAView is large and is really copied "somewhere" (it is also declared non-const...),
while SiPixelDigisCUDA::DeviceConstView most probably just goes into registers. If it is not copied, I suspect that the calls to __ldg are for some reason repeated each time...

I noticed the problem of indirection already in the previous version (see around line 161);
it really seems that with indirection it keeps loading from memory each time. We need to check in the PTX.

The point is that more investigative work is needed to understand the origin of the performance differences. I do not exclude that changing the compiler (say, to clang) would change the performance difference as well.

Fine, but I think it would be useful to include these notes also in the code as comments.

ok, will do
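To make the two access patterns in this thread concrete, here is a minimal host-side C++ sketch. The struct names, members and functions are hypothetical stand-ins (they are not the real `SiPixelDigisCUDA::DeviceConstView` or `TrackingRecHit2DSOAView`), and the comments only restate what was observed with nvprof, not anything verified in the PTX.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical small read-only view: just a couple of pointers.
struct DigiConstView {
  uint16_t const* adc;
  int32_t const* clus;
};

// Hypothetical mutable output view; the real one is assumed to be much larger.
struct HitView {
  int32_t* charge;
};

// Variant A: every access goes through the pointers to the views.
// The thread suspects that the device compiler reloads the view members
// from memory on each access in this form.
void accumulateThroughPointer(DigiConstView const* pdigis, HitView* phits, int numElements) {
  assert(pdigis != nullptr && phits != nullptr);
  for (int i = 0; i < numElements; ++i)
    phits->charge[pdigis->clus[i]] += pdigis->adc[i];
}

// Variant B: the pattern adopted in this PR.
void accumulateWithLocalCopy(DigiConstView const* pdigis, HitView* phits, int numElements) {
  assert(pdigis != nullptr && phits != nullptr);
  auto& hits = *phits;         // large, mutable view: bind by reference, do not copy
  auto const digis = *pdigis;  // small, read-only view: the copy is intentional
  for (int i = 0; i < numElements; ++i)
    hits.charge[digis.clus[i]] += digis.adc[i];
}
```

Both variants compute the same result; only the way the views are reached differs. On a host compiler the two will usually perform the same, so any timing has to be done on the device kernel itself.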

VinInn · 2019-06-12T14:38:18Z

RecoLocalTracker/SiPixelRecHits/plugins/gpuPixelRecHits.h

                          uint32_t const* __restrict__ digiModuleStart,
                          uint32_t const* __restrict__ clusInModule,
                          uint32_t const* __restrict__ moduleId,
-                          int32_t const* __restrict__ clus,
-                          int numElements,
                          uint32_t const* __restrict__ hitsModuleStart,
                          TrackingRecHit2DSOAView* phits) {
    auto& hits = *phits;


Here, if I remove the "&", it becomes a factor of 2 slower!

VinInn · 2019-06-13T13:12:27Z

Now gpuPixelRecHits::getHits is 5% slower than the original.

VinInn · 2019-06-13T13:53:45Z

Reducing the number of threads to 128 makes it only 2% slower than the original (also run with 128).
With 64 it will crash without the next commit (and with that commit, it is not faster than with 128).
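The crash with 64 threads is consistent with a kernel that assumes at most one cluster per thread. One common way to lift that assumption is a block-stride loop; the sketch below emulates it on the host in plain C++. It illustrates the general technique only and is not taken from the follow-up commit; the function name and parameters are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Host-side emulation of a single block: each "thread" starts at its own
// index and advances by the block size, so every cluster is visited exactly
// once even when nclus is larger than the block size.
std::vector<int> visitsPerCluster(int nclus, int blockSize) {
  assert(nclus >= 0 && blockSize > 0);
  std::vector<int> visits(nclus, 0);
  for (int tid = 0; tid < blockSize; ++tid)           // emulates the threads of one block
    for (int ic = tid; ic < nclus; ic += blockSize)   // block-stride loop over clusters
      ++visits[ic];
  return visits;
}
```

In a device kernel the outer loop disappears (each thread runs the inner loop with its own thread index), and the block size stops being a correctness constraint and becomes purely a tuning parameter.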

fwyzard · 2019-06-14T13:17:16Z

Validation summary

Reference release CMSSW_10_6_0 at b45186e
Development branch CMSSW_10_6_X_Patatrack at ec3c3e6
Testing PRs:

Move to View #354 at 0ee4099

`makeTrackValidationPlots.py` plots

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

tracking validation plots and summary for workflow 10824.5
tracking validation plots and summary for workflow 10824.51
tracking validation plots and summary for workflow 10824.52

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

tracking validation plots and summary for workflow 10824.5
tracking validation plots and summary for workflow 10824.51
tracking validation plots and summary for workflow 10824.52

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

tracking validation plots and summary for workflow 10824.5
tracking validation plots and summary for workflow 10824.51
tracking validation plots and summary for workflow 10824.52

logs and `nvprof`/`nvvp` profiles

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

reference release, workflow 10824.5
- ✔️ step3.py: log, visual profile and summary
development release, workflow 10824.5
- ✔️ step3.py: log, visual profile and summary
development release, workflow 10824.51
- ✔️ step3.py: log, visual profile and summary
development release, workflow 10824.52
- ✔️ step3.py: log, visual profile and summary
- ✔️ profile.py: log, visual profile and summary
- ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
- ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
- ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
testing release, workflow 10824.5
- ✔️ step3.py: log, visual profile and summary
testing release, workflow 10824.51
- ✔️ step3.py: log, visual profile and summary
testing release, workflow 10824.52
- ✔️ step3.py: log, visual profile and summary
- ✔️ profile.py: log, visual profile and summary
- ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
- ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
- ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

reference release, workflow 10824.5
- ✔️ step3.py: log, visual profile and summary
development release, workflow 10824.5
- ✔️ step3.py: log, visual profile and summary
development release, workflow 10824.51
- ✔️ step3.py: log, visual profile and summary
development release, workflow 10824.52
- ✔️ step3.py: log, visual profile and summary
- ✔️ profile.py: log, visual profile and summary
- ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
- ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
- ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
testing release, workflow 10824.5
- ✔️ step3.py: log, visual profile and summary
testing release, workflow 10824.51
- ✔️ step3.py: log, visual profile and summary
testing release, workflow 10824.52
- ✔️ step3.py: log, visual profile and summary
- ✔️ profile.py: log, visual profile and summary
- ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
- ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
- ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

reference release, workflow 10824.5
- ✔️ step3.py: log, visual profile and summary
development release, workflow 10824.5
- ✔️ step3.py: log, visual profile and summary
development release, workflow 10824.51
- ✔️ step3.py: log, visual profile and summary
development release, workflow 10824.52
- ✔️ step3.py: log, visual profile and summary
- ✔️ profile.py: log, visual profile and summary
- ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
- ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
- ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors
testing release, workflow 10824.5
- ✔️ step3.py: log, visual profile and summary
testing release, workflow 10824.51
- ✔️ step3.py: log, visual profile and summary
testing release, workflow 10824.52
- ✔️ step3.py: log, visual profile and summary
- ✔️ profile.py: log, visual profile and summary
- ✔️ cuda-memcheck --tool initcheck (report, log) did not find any errors
- ✔️ cuda-memcheck --tool memcheck --leak-check full --report-api-errors all (report, log) did not find any errors
- ✔️ cuda-memcheck --tool synccheck (report, log) did not find any errors

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/70111c0b92efcd209e60db21a2fc9817f30ea5d2/log .

makortel

Overall looks good to me. Interesting to see the cost of an additional level of pointer indirection(?)...

makortel · 2019-06-14T14:01:54Z

RecoLocalTracker/SiPixelRecHits/plugins/gpuPixelRecHits.h

                          uint32_t const* __restrict__ hitsModuleStart,
                          TrackingRecHit2DSOAView* phits) {
    auto& hits = *phits;

+    auto const digis = *pdigis; // the copy is intentional!


Could you add more elaborate comments along the first one becoming slow if copied, and the latter one becoming slow if not copied?

VinInn · 2019-06-16T09:44:57Z

The fact that repeated loads from a view (both global and nc loads) are not optimized away is pretty obvious even with a trivial test case (see https://godbolt.org/z/cPoP1z ).
As for the effect of the local copy of the view, it does not show up in the test; I do not exclude that in a longer code it may force a reload as well.

We need to report this to the CUDA developers.
The test code is also available at https://github.com/VinInn/ctest/blob/master/cuda/view.cu
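The workaround used in the PR for repeated element loads is to read each value once into a local variable. A minimal host-side C++ sketch of that idea follows; `ClusView` and both functions are hypothetical and are not the code behind the godbolt link.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical view holding one pointer to per-digi cluster ids.
struct ClusView {
  int32_t const* clus;
};

// The same element is read twice through the view in each iteration.
// The observation above is that the device compiler did not merge such loads.
int sumValidRepeatedLoad(ClusView const* view, int n) {
  assert(view != nullptr);
  int sum = 0;
  for (int i = 0; i < n; ++i)
    if (view->clus[i] >= 0)
      sum += view->clus[i];
  return sum;
}

// The element is loaded once into a local and reused.
int sumValidCachedLoad(ClusView const* view, int n) {
  assert(view != nullptr);
  int sum = 0;
  for (int i = 0; i < n; ++i) {
    auto const cl = view->clus[i];  // single load, kept in a local
    if (cl >= 0)
      sum += cl;
  }
  return sum;
}
```

A host compiler will typically merge the two reads by itself, so this only documents the source-level pattern; whether it matters has to be checked in the generated device code.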

fwyzard · 2019-06-18T14:20:56Z

There is a lot of jitter in the overall performance on the V100s, and the T4 is still off.

That said, what I got, running 10 times over 4200 events with 1 job, with 8 threads, 8 streams and 1 V100, is

original, with 256 threads per block: 1717.3 ± 5.6 ev/s
updated, with 256 threads per block: 1704.3 ± 12.9 ev/s
updated, with 128 threads per block: 1701.1 ± 11.1 ev/s
original, with 256 threads per block: 1712.4 ± 8.3 ev/s

So the ~1% drop seems real?

fwyzard · 2019-06-20T13:46:24Z

On the T4, running 16 times over 4200 events with 1 job, with 8 threads, 8 streams and 1 GPU:

original, with 256 threads per block: 982.6 ± 7.3 ev/s
updated, with 256 threads per block: 979.2 ± 6.0 ev/s
updated, with 128 threads per block: 981.2 ± 6.1 ev/s
original, with 256 threads per block: 982.1 ± 7.8 ev/s

So the effect here seems much smaller.
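To put numbers on these comparisons, the relative change and its uncertainty can be computed as in the C++ sketch below. It assumes that the quoted ± values are uncorrelated uncertainties on the mean throughput, which the thread does not state; the struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Mean throughput in ev/s and its quoted uncertainty.
struct Throughput {
  double mean;
  double err;
};

// Relative change of `updated` with respect to `reference`, in percent.
// The uncertainty uses simple propagation for a ratio of two uncorrelated values.
Throughput relativeChangePercent(Throughput reference, Throughput updated) {
  assert(reference.mean > 0. && updated.mean > 0.);
  double const ratio = updated.mean / reference.mean;
  double const relErr = std::sqrt(std::pow(updated.err / updated.mean, 2) +
                                  std::pow(reference.err / reference.mean, 2));
  return {100. * (ratio - 1.), 100. * ratio * relErr};
}
```

For example, `relativeChangePercent({1717.3, 5.6}, {1704.3, 12.9})` gives about -0.76% with an uncertainty of about 0.8 for the V100 at 256 threads per block, and `relativeChangePercent({982.6, 7.3}, {979.2, 6.0})` gives about -0.35% for the T4, both taken against the first "original" measurement in each list.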

makortel · 2019-06-20T14:02:09Z

master, not CMSSW_10_6_X_Patatrack?

fwyzard · 2019-06-20T14:05:01Z

Thanks for spotting it, I'll fix it by hand.

…ltiple pointers (#354)

Other changes and optimisations:
- take into account the case where `nclus > blockDim.x`
- use a smaller block size
- document why we copy or not to local variables

move to view

4525db0

finish migration to view in PixelRecHits

981a3d2

reduce number of thread (micro-optimization)

0af9ede

account for nclus > blockDim.x

0ee4099

VinInn requested review from makortel and fwyzard June 14, 2019 08:41

makortel approved these changes Jun 14, 2019

cms-patatrack deleted a comment from felicepantaleo Jun 14, 2019

explain why we copy or not to local variables

3cf3cea

fwyzard added the Pixels Pixels-related developments label Jun 17, 2019

makortel mentioned this pull request Jun 19, 2019

Running code-format for added files in existing packages #358

Closed

fwyzard approved these changes Jun 20, 2019

fwyzard merged commit 23c7e35 into cms-patatrack:master Jun 20, 2019

fwyzard mentioned this pull request Oct 8, 2020

Patatrack integration - Pixel local reconstruction (9/N) cms-sw/cmssw#31721

Merged
