Speed up Patatrack CA #34250

VinInn · 2021-06-25T13:48:15Z

Release

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   14.58%  642.29ms      5000  128.46us  6.1120us  340.77us  kernel_find_ntuplets(TrackingRecHit2DSOAView const *, GPUCACell*, unsigned int const *, cms::cuda::SimpleVector<cms$
                   13.03%  573.68ms      5000  114.74us  16.704us  618.75us  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const $
                    9.25%  407.24ms     10000  40.724us  2.5600us  204.64us  void kernel_BLFit<int=4>(cms::cuda::OneToManyAssoc<unsigned short, int=8, int=24576> const *, double, TrackSoAHeter$
                    8.82%  388.30ms      5000  77.659us  6.5590us  245.57us  gpuVertexFinder::vertexFinderOneKernel(ZVertexSoA*, gpuVertexFinder::WorkSpace*, int, float, float, float)
                    7.89%  347.50ms      5000  69.499us  9.1200us  316.80us  gpuPixelDoublets::getDoubletsFromHisto(GPUCACell*, unsigned int*, cms::cuda::SimpleVector<cms::cuda::VecArray<unsig$
                    5.09%  223.98ms      5000  44.795us  3.3280us  209.54us  gpuPixelDoublets::fishbone(TrackingRecHit2DSOAView const *, GPUCACell*, unsigned int const *, cms::cuda::VecArray<u$
                    5.03%  221.56ms      5000  44.311us  20.544us  174.02us  kernel_connect(cms::cuda::AtomicPairCounter*, cms::cuda::AtomicPairCounter*, TrackingRecHit2DSOAView const *, GPUCA$
                    3.99%  175.88ms      5000  35.176us  9.8230us  161.92us  gpuPixelRecHits::getHits(pixelCPEforGPU::ParamsOnGPU const *, BeamSpotPOD const *, SiPixelDigisCUDA::DeviceConstVie$
                    3.96%  174.45ms     35010  4.9820us  1.0240us  635.20us  [CUDA memcpy HtoD]

This PR

          Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   13.63%  558.73ms      5000  111.75us  15.072us  663.13us  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const $
                    9.98%  409.10ms      5000  81.820us  6.0480us  193.79us  kernel_find_ntuplets(TrackingRecHit2DSOAView const *, GPUCACell*, unsigned int const *, cms::cuda::SimpleVector<cms$
                    9.86%  404.24ms     10000  40.424us  2.5600us  111.68us  void kernel_BLFit<int=4>(cms::cuda::OneToManyAssoc<unsigned short, int=8, int=24576> const *, double, TrackSoAHeter$
                    9.41%  385.77ms      5000  77.153us  6.7520us  146.88us  gpuVertexFinder::vertexFinderOneKernel(ZVertexSoA*, gpuVertexFinder::WorkSpace*, int, float, float, float)
                    8.32%  341.18ms      5000  68.235us  8.8960us  211.33us  gpuPixelDoublets::getDoubletsFromHisto(GPUCACell*, unsigned int*, cms::cuda::SimpleVector<cms::cuda::VecArray<unsig$
                    5.43%  222.45ms      5000  44.490us  3.3280us  119.14us  gpuPixelDoublets::fishbone(TrackingRecHit2DSOAView const *, GPUCACell*, unsigned int const *, cms::cuda::VecArray<u$
                    5.36%  219.69ms      5000  43.937us  20.480us  168.96us  kernel_connect(cms::cuda::AtomicPairCounter*, cms::cuda::AtomicPairCounter*, TrackingRecHit2DSOAView const *, GPUCA$
                    4.16%  170.61ms      5000  34.121us  10.367us  97.888us  gpuPixelRecHits::getHits(pixelCPEforGPU::ParamsOnGPU const *, BeamSpotPOD const *, SiPixelDigisCUDA::DeviceConstVie$
                    4.16%  170.50ms     35010  4.8700us  1.0240us  595.10us  [CUDA memcpy HtoD]
                    3.32%  135.98ms     15000  9.0650us  4.5760us  54.048us  void cms::cuda::multiBlockPrefixScan<unsigned int>(unsigned int const *, cms::cuda::multiBlockPrefixScan<unsigned i$

Throughput (triplets on T4)

release
702.0 ± 1.6 ev/s
this PR
752.4 ± 1.4 ev/s
only the first commit
750.5 ± 1.2 ev/s

Purely technical. no regression expected. no regression observed

VinInn · 2021-06-25T13:48:57Z

@cmsbuild, please test

cmsbuild · 2021-06-25T13:54:48Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34250/23527

This PR adds an extra 28KB to repository
There are other open Pull requests which might conflict with changes you have proposed:
- File RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernelsImpl.h modified in PR(s): Code cleanliness improvements in follow-up to #31722 #33865

cmsbuild · 2021-06-25T13:55:09Z

A new Pull Request was created by @VinInn (Vincenzo Innocente) for master.

It involves the following packages:

HeterogeneousCore/CUDAUtilities
RecoLocalTracker/SiPixelClusterizer
RecoPixelVertexing/PixelTriplets
RecoPixelVertexing/PixelVertexFinding

@perrotta, @makortel, @fwyzard, @jpata, @slava77 can you please review it and eventually sign? Thanks.
@mtosi, @makortel, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @OzAmram, @ferencek, @dkotlins, @gpetruc, @threus, @dgulhan, @tvami this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

VinInn · 2021-06-25T13:56:38Z

@mmusich @vmariani FYI

mmusich · 2021-06-25T14:00:00Z

@VinInn I am watching the whole cmssw, no need to tag me explicitly.

VinInn · 2021-06-25T15:39:41Z

@cmsbuild , enable GPU

slava77 · 2021-06-25T15:54:55Z

enable gpu

slava77 · 2021-06-25T15:57:14Z

enable gpu

@smuzaffar
should I expect a "+1" from the bot to indicate that this hook was accepted and properly formatted, similar to what we get for the "test parameters"?

smuzaffar · 2021-06-25T16:33:08Z

No @slava77 , bot does not react to these enable test comments

cmsbuild · 2021-06-25T17:12:51Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-01daf9/16241/summary.html
COMMIT: 48c6319
CMSSW: CMSSW_12_0_X_2021-06-24-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/34250/16241/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 8 differences found in the comparisons
DQMHistoTests: Total files compared: 38
DQMHistoTests: Total histograms compared: 2785631
DQMHistoTests: Total failures: 13
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 2785596
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 37 files compared)
Checked 160 log files, 37 edm output root files, 38 DQM output files
TriggerResults: no differences found

smuzaffar · 2021-06-25T17:24:25Z

please test
re-start with GPU

cmsbuild · 2021-06-25T20:39:14Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-01daf9/16245/summary.html
COMMIT: 48c6319
CMSSW: CMSSW_12_0_X_2021-06-25-1100/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/34250/16245/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 4
DQMHistoTests: Total histograms compared: 19723
DQMHistoTests: Total failures: 6
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 19717
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
Checked 12 log files, 9 edm output root files, 4 DQM output files
TriggerResults: no differences found

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 7 differences found in the comparisons
DQMHistoTests: Total files compared: 38
DQMHistoTests: Total histograms compared: 2785711
DQMHistoTests: Total failures: 12
DQMHistoTests: Total nulls: 1
DQMHistoTests: Total successes: 2785676
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.004 KiB( 37 files compared)
DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
Checked 160 log files, 37 edm output root files, 38 DQM output files
TriggerResults: no differences found

slava77 · 2021-06-28T21:52:22Z

HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h

+    template <typename T1, typename T2>
+    T1 atomicCAS_block(T1* address, T1 compare, T2 val) {
+      return atomicCAS(address, compare, val);


for my education,
what does this and the other _block methods do in this PR, especially in the context of speeding things up.

fwyzard · 2021-06-28T21:57:50Z

They restrict the scope of the atomic operations to a single CUDA block. For "shared" memory this is always the case. For "global" memory this may speed up the operation, for example by not requiring to consider the cache of other blocks See https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions for more details.

slava77 · 2021-06-28T22:31:10Z

Purely technical. no regression expected. no regression observed

there are quite a few plots in pixel tracking DQM which show small differences.
E.g. in 10824.502

I guess either the order of entries is not exactly the same or there are some differences in results at numerical precision level.
I suppose, these are expected.
Or is it worth to investigate in more detail?

VinInn · 2021-06-29T06:23:53Z

GPU tracking is known NOT to be fully reproducible for a variety of reasons that have been reported earlier in various meetings. (CPU tracking would also not be reproducible if the "order" of seeds/iterations whould change (say because of high granularity threading)

slava77 · 2021-06-29T12:12:56Z

GPU tracking is known NOT to be fully reproducible for a variety of reasons that have been reported earlier in various meetings. (CPU tracking would also not be reproducible if the "order" of seeds/iterations whould change (say because of high granularity threading)

OK, thanks for clarifying. It was not particularly obvious that "no regression expected. no regression observed" was not a literal statement.

slava77 · 2021-06-29T12:14:19Z

+reconstruction

for #34250 48c6319

code changes look reasonable
- some summary of the functional changes in the PR description would be useful in addition to just providing evidence that this is better as measured
jenkins tests pass and comparisons with the baseline show no (relevant) differences in the tested CPU workflows; the GPU workfow 10824.502 shows some differences in pixel track DQM distributions, mainly at numerical precision level, attributed to some changes in the order of the produced data.

makortel · 2021-06-29T12:36:19Z

+heterogeneous

cmsbuild · 2021-06-29T12:36:41Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

VinInn · 2021-06-29T12:44:22Z

RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h

@@ -218,12 +218,13 @@ namespace gpuClustering {
              auto l = nn[k][kk];
              auto m = l + firstPixel;
              assert(m != i);
-              auto old = atomicMin(&clusterId[m], clusterId[i]);
+              auto old = atomicMin_block(&clusterId[m], clusterId[i]);


clusterId is in global memory, but "m" (and "i") below refers to digis on the module that is reconstructed in this block. (one module per block, one block per module)

VinInn · 2021-06-29T12:45:41Z

RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernelsImpl.h

@@ -350,7 +350,9 @@ __global__ void kernel_find_ntuplets(GPUCACell::Hits const *__restrict__ hhp,
    auto const &thisCell = cells[idx];
    if (thisCell.isKilled())
      continue;  // cut by earlyFishbone
-
+    // we require at least three hits...
+    if (thisCell.outerNeighbors().empty())


simply avoids the whole lot below (in particular the function call that being recursive cannot be inlined)

VinInn · 2021-06-29T12:46:55Z

RecoPixelVertexing/PixelVertexFinding/plugins/gpuFitVertices.h

@@ -63,8 +63,8 @@ namespace gpuVertexFinder {
      assert(iv[i] >= 0);
      assert(iv[i] < int(foundClusters));
      auto w = 1.f / ezt2[i];
-      atomicAdd(&zv[iv[i]], zt[i] * w);
-      atomicAdd(&wv[iv[i]], w);
+      atomicAdd_block(&zv[iv[i]], zt[i] * w);


vertex reconstruction is performed in a single block.

VinInn · 2021-06-29T12:47:17Z

RecoPixelVertexing/PixelVertexFinding/plugins/gpuSortByPt2.h

@@ -46,7 +46,7 @@ namespace gpuVertexFinder {
    for (auto i = threadIdx.x; i < nt; i += blockDim.x) {
      if (iv[i] > 9990)
        continue;
-      atomicAdd(&ptv2[iv[i]], ptt2[i]);
+      atomicAdd_block(&ptv2[iv[i]], ptt2[i]);


vertex sorting is performed in a single block

qliphy · 2021-06-30T00:24:30Z

+1

VinInn added 3 commits June 25, 2021 11:29

bail out asap if no triplet possible

c946026

introduce atomic for block

ca8617f

code format

48c6319

cmsbuild added this to the CMSSW_12_0_X milestone Jun 25, 2021

cmsbuild added code-checks-pending heterogeneous-pending orp-pending pending-signatures reconstruction-pending tests-pending labels Jun 25, 2021

cmsbuild added tests-started and removed tests-pending labels Jun 25, 2021

cmsbuild added code-checks-approved and removed code-checks-pending labels Jun 25, 2021

cmsbuild added tests-approved and removed tests-started labels Jun 25, 2021

cmsbuild added tests-started and removed tests-approved labels Jun 25, 2021

cmsbuild added tests-approved and removed tests-started labels Jun 25, 2021

slava77 reviewed Jun 28, 2021

View reviewed changes

ericcano mentioned this pull request Jun 29, 2021

Code cleanliness improvements in follow-up to #31722 #33865

Closed

cmsbuild added reconstruction-approved and removed reconstruction-pending labels Jun 29, 2021

cmsbuild added fully-signed heterogeneous-approved and removed heterogeneous-pending pending-signatures labels Jun 29, 2021

VinInn commented Jun 29, 2021

View reviewed changes

cmsbuild added orp-approved and removed orp-pending labels Jun 30, 2021

cmsbuild merged commit 56b5460 into cms-sw:master Jun 30, 2021

This was referenced Jun 30, 2021

add gil acquire when modifying python object #34288

Merged

make CPEFast to better reproduce Generic (w/o track angle) #34286

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up Patatrack CA #34250

Speed up Patatrack CA #34250

VinInn commented Jun 25, 2021

VinInn commented Jun 25, 2021

cmsbuild commented Jun 25, 2021

cmsbuild commented Jun 25, 2021

VinInn commented Jun 25, 2021

mmusich commented Jun 25, 2021 •

edited

Loading

VinInn commented Jun 25, 2021

slava77 commented Jun 25, 2021

slava77 commented Jun 25, 2021

smuzaffar commented Jun 25, 2021

cmsbuild commented Jun 25, 2021

smuzaffar commented Jun 25, 2021

cmsbuild commented Jun 25, 2021

slava77 Jun 28, 2021

fwyzard commented Jun 28, 2021 via email

slava77 commented Jun 28, 2021

VinInn commented Jun 29, 2021

slava77 commented Jun 29, 2021

slava77 commented Jun 29, 2021

makortel commented Jun 29, 2021

cmsbuild commented Jun 29, 2021

VinInn Jun 29, 2021

VinInn Jun 29, 2021

VinInn Jun 29, 2021

VinInn Jun 29, 2021

qliphy commented Jun 30, 2021

Speed up Patatrack CA #34250

Speed up Patatrack CA #34250

Conversation

VinInn commented Jun 25, 2021

VinInn commented Jun 25, 2021

cmsbuild commented Jun 25, 2021

cmsbuild commented Jun 25, 2021

VinInn commented Jun 25, 2021

mmusich commented Jun 25, 2021 • edited Loading

VinInn commented Jun 25, 2021

slava77 commented Jun 25, 2021

slava77 commented Jun 25, 2021

smuzaffar commented Jun 25, 2021

cmsbuild commented Jun 25, 2021

Comparison Summary

smuzaffar commented Jun 25, 2021

cmsbuild commented Jun 25, 2021

GPU Comparison Summary

Comparison Summary

slava77 Jun 28, 2021

Choose a reason for hiding this comment

fwyzard commented Jun 28, 2021 via email

slava77 commented Jun 28, 2021

VinInn commented Jun 29, 2021

slava77 commented Jun 29, 2021

slava77 commented Jun 29, 2021

makortel commented Jun 29, 2021

cmsbuild commented Jun 29, 2021

VinInn Jun 29, 2021

Choose a reason for hiding this comment

VinInn Jun 29, 2021

Choose a reason for hiding this comment

VinInn Jun 29, 2021

Choose a reason for hiding this comment

VinInn Jun 29, 2021

Choose a reason for hiding this comment

qliphy commented Jun 30, 2021

mmusich commented Jun 25, 2021 •

edited

Loading