Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up Patatrack CA #34250

Merged
merged 3 commits into from
Jun 30, 2021
Merged

Speed up Patatrack CA #34250

merged 3 commits into from
Jun 30, 2021

Conversation

VinInn
Copy link
Contributor

@VinInn VinInn commented Jun 25, 2021

Release

           Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   14.58%  642.29ms      5000  128.46us  6.1120us  340.77us  kernel_find_ntuplets(TrackingRecHit2DSOAView const *, GPUCACell*, unsigned int const *, cms::cuda::SimpleVector<cms$
                   13.03%  573.68ms      5000  114.74us  16.704us  618.75us  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const $
                    9.25%  407.24ms     10000  40.724us  2.5600us  204.64us  void kernel_BLFit<int=4>(cms::cuda::OneToManyAssoc<unsigned short, int=8, int=24576> const *, double, TrackSoAHeter$
                    8.82%  388.30ms      5000  77.659us  6.5590us  245.57us  gpuVertexFinder::vertexFinderOneKernel(ZVertexSoA*, gpuVertexFinder::WorkSpace*, int, float, float, float)
                    7.89%  347.50ms      5000  69.499us  9.1200us  316.80us  gpuPixelDoublets::getDoubletsFromHisto(GPUCACell*, unsigned int*, cms::cuda::SimpleVector<cms::cuda::VecArray<unsig$
                    5.09%  223.98ms      5000  44.795us  3.3280us  209.54us  gpuPixelDoublets::fishbone(TrackingRecHit2DSOAView const *, GPUCACell*, unsigned int const *, cms::cuda::VecArray<u$
                    5.03%  221.56ms      5000  44.311us  20.544us  174.02us  kernel_connect(cms::cuda::AtomicPairCounter*, cms::cuda::AtomicPairCounter*, TrackingRecHit2DSOAView const *, GPUCA$
                    3.99%  175.88ms      5000  35.176us  9.8230us  161.92us  gpuPixelRecHits::getHits(pixelCPEforGPU::ParamsOnGPU const *, BeamSpotPOD const *, SiPixelDigisCUDA::DeviceConstVie$
                    3.96%  174.45ms     35010  4.9820us  1.0240us  635.20us  [CUDA memcpy HtoD]

This PR

          Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   13.63%  558.73ms      5000  111.75us  15.072us  663.13us  gpuClustering::findClus(unsigned short const *, unsigned short const *, unsigned short const *, unsigned int const $
                    9.98%  409.10ms      5000  81.820us  6.0480us  193.79us  kernel_find_ntuplets(TrackingRecHit2DSOAView const *, GPUCACell*, unsigned int const *, cms::cuda::SimpleVector<cms$
                    9.86%  404.24ms     10000  40.424us  2.5600us  111.68us  void kernel_BLFit<int=4>(cms::cuda::OneToManyAssoc<unsigned short, int=8, int=24576> const *, double, TrackSoAHeter$
                    9.41%  385.77ms      5000  77.153us  6.7520us  146.88us  gpuVertexFinder::vertexFinderOneKernel(ZVertexSoA*, gpuVertexFinder::WorkSpace*, int, float, float, float)
                    8.32%  341.18ms      5000  68.235us  8.8960us  211.33us  gpuPixelDoublets::getDoubletsFromHisto(GPUCACell*, unsigned int*, cms::cuda::SimpleVector<cms::cuda::VecArray<unsig$
                    5.43%  222.45ms      5000  44.490us  3.3280us  119.14us  gpuPixelDoublets::fishbone(TrackingRecHit2DSOAView const *, GPUCACell*, unsigned int const *, cms::cuda::VecArray<u$
                    5.36%  219.69ms      5000  43.937us  20.480us  168.96us  kernel_connect(cms::cuda::AtomicPairCounter*, cms::cuda::AtomicPairCounter*, TrackingRecHit2DSOAView const *, GPUCA$
                    4.16%  170.61ms      5000  34.121us  10.367us  97.888us  gpuPixelRecHits::getHits(pixelCPEforGPU::ParamsOnGPU const *, BeamSpotPOD const *, SiPixelDigisCUDA::DeviceConstVie$
                    4.16%  170.50ms     35010  4.8700us  1.0240us  595.10us  [CUDA memcpy HtoD]
                    3.32%  135.98ms     15000  9.0650us  4.5760us  54.048us  void cms::cuda::multiBlockPrefixScan<unsigned int>(unsigned int const *, cms::cuda::multiBlockPrefixScan<unsigned i$

Throughput (triplets on T4)

release
702.0 ± 1.6 ev/s
this PR
752.4 ± 1.4 ev/s
only the first commit
750.5 ± 1.2 ev/s

Purely technical. no regression expected. no regression observed

@VinInn
Copy link
Contributor Author

VinInn commented Jun 25, 2021

@cmsbuild, please test

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-34250/23527

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @VinInn (Vincenzo Innocente) for master.

It involves the following packages:

HeterogeneousCore/CUDAUtilities
RecoLocalTracker/SiPixelClusterizer
RecoPixelVertexing/PixelTriplets
RecoPixelVertexing/PixelVertexFinding

@perrotta, @makortel, @fwyzard, @jpata, @slava77 can you please review it and eventually sign? Thanks.
@mtosi, @makortel, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @OzAmram, @ferencek, @dkotlins, @gpetruc, @threus, @dgulhan, @tvami this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@VinInn
Copy link
Contributor Author

VinInn commented Jun 25, 2021

@mmusich @vmariani FYI

@mmusich
Copy link
Contributor

mmusich commented Jun 25, 2021

@VinInn I am watching the whole cmssw, no need to tag me explicitly.

@VinInn
Copy link
Contributor Author

VinInn commented Jun 25, 2021

@cmsbuild , enable GPU

@slava77
Copy link
Contributor

slava77 commented Jun 25, 2021

enable gpu

@slava77
Copy link
Contributor

slava77 commented Jun 25, 2021

enable gpu

@smuzaffar
should I expect a "+1" from the bot to indicate that this hook was accepted and properly formatted, similar to what we get for the "test parameters"?

@smuzaffar
Copy link
Contributor

No @slava77 , bot does not react to these enable test comments

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-01daf9/16241/summary.html
COMMIT: 48c6319
CMSSW: CMSSW_12_0_X_2021-06-24-2300/slc7_amd64_gcc900
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/34250/16241/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 38
  • DQMHistoTests: Total histograms compared: 2785631
  • DQMHistoTests: Total failures: 13
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2785596
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 37 files compared)
  • Checked 160 log files, 37 edm output root files, 38 DQM output files
  • TriggerResults: no differences found

@smuzaffar
Copy link
Contributor

please test
re-start with GPU

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-01daf9/16245/summary.html
COMMIT: 48c6319
CMSSW: CMSSW_12_0_X_2021-06-25-1100/slc7_amd64_gcc900
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/34250/16245/install.sh to create a dev area with all the needed externals and cmssw changes.

GPU Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 0 differences found in the comparisons
  • DQMHistoTests: Total files compared: 4
  • DQMHistoTests: Total histograms compared: 19723
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 19717
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 3 files compared)
  • Checked 12 log files, 9 edm output root files, 4 DQM output files
  • TriggerResults: no differences found

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 7 differences found in the comparisons
  • DQMHistoTests: Total files compared: 38
  • DQMHistoTests: Total histograms compared: 2785711
  • DQMHistoTests: Total failures: 12
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 2785676
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 37 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 160 log files, 37 edm output root files, 38 DQM output files
  • TriggerResults: no differences found

Comment on lines +42 to +44
template <typename T1, typename T2>
T1 atomicCAS_block(T1* address, T1 compare, T2 val) {
return atomicCAS(address, compare, val);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my education,
what does this and the other _block methods do in this PR, especially in the context of speeding things up.

@fwyzard
Copy link
Contributor

fwyzard commented Jun 28, 2021 via email

@slava77
Copy link
Contributor

slava77 commented Jun 28, 2021

Purely technical. no regression expected. no regression observed

there are quite a few plots in pixel tracking DQM which show small differences.
E.g. in 10824.502
wf10824 502_pv_dz

I guess either the order of entries is not exactly the same or there are some differences in results at numerical precision level.
I suppose, these are expected.
Or is it worth to investigate in more detail?

@VinInn
Copy link
Contributor Author

VinInn commented Jun 29, 2021

GPU tracking is known NOT to be fully reproducible for a variety of reasons that have been reported earlier in various meetings. (CPU tracking would also not be reproducible if the "order" of seeds/iterations whould change (say because of high granularity threading)

@slava77
Copy link
Contributor

slava77 commented Jun 29, 2021

GPU tracking is known NOT to be fully reproducible for a variety of reasons that have been reported earlier in various meetings. (CPU tracking would also not be reproducible if the "order" of seeds/iterations whould change (say because of high granularity threading)

OK, thanks for clarifying. It was not particularly obvious that "no regression expected. no regression observed" was not a literal statement.

@slava77
Copy link
Contributor

slava77 commented Jun 29, 2021

+reconstruction

for #34250 48c6319

  • code changes look reasonable
    • some summary of the functional changes in the PR description would be useful in addition to just providing evidence that this is better as measured
  • jenkins tests pass and comparisons with the baseline show no (relevant) differences in the tested CPU workflows; the GPU workfow 10824.502 shows some differences in pixel track DQM distributions, mainly at numerical precision level, attributed to some changes in the order of the produced data.

@makortel
Copy link
Contributor

+heterogeneous

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@@ -218,12 +218,13 @@ namespace gpuClustering {
auto l = nn[k][kk];
auto m = l + firstPixel;
assert(m != i);
auto old = atomicMin(&clusterId[m], clusterId[i]);
auto old = atomicMin_block(&clusterId[m], clusterId[i]);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clusterId is in global memory, but "m" (and "i") below refers to digis on the module that is reconstructed in this block. (one module per block, one block per module)

@@ -350,7 +350,9 @@ __global__ void kernel_find_ntuplets(GPUCACell::Hits const *__restrict__ hhp,
auto const &thisCell = cells[idx];
if (thisCell.isKilled())
continue; // cut by earlyFishbone

// we require at least three hits...
if (thisCell.outerNeighbors().empty())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

simply avoids the whole lot below (in particular the function call that being recursive cannot be inlined)

@@ -63,8 +63,8 @@ namespace gpuVertexFinder {
assert(iv[i] >= 0);
assert(iv[i] < int(foundClusters));
auto w = 1.f / ezt2[i];
atomicAdd(&zv[iv[i]], zt[i] * w);
atomicAdd(&wv[iv[i]], w);
atomicAdd_block(&zv[iv[i]], zt[i] * w);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vertex reconstruction is performed in a single block.

@@ -46,7 +46,7 @@ namespace gpuVertexFinder {
for (auto i = threadIdx.x; i < nt; i += blockDim.x) {
if (iv[i] > 9990)
continue;
atomicAdd(&ptv2[iv[i]], ptt2[i]);
atomicAdd_block(&ptv2[iv[i]], ptt2[i]);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

vertex sorting is performed in a single block

@qliphy
Copy link
Contributor

qliphy commented Jun 30, 2021

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants