Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move to View #354

Merged
merged 5 commits into from
Jun 20, 2019
Merged

Move to View #354

merged 5 commits into from
Jun 20, 2019

Conversation

VinInn
Copy link

@VinInn VinInn commented Jun 12, 2019

Finish to move to view...
Moving Raw2digi and clusterizer to view most probably not worth (reading and writing from the same data-structures)

PR ready for review!

Just to make everybody aware that deferencing device views in global memory is utterly slow.
The first attempt of this PR resulted in having gpuPixelRecHits::getHits 20% slower.
now is only 2% slower (see inline comments)

uint32_t const* __restrict__ hitsModuleStart,
TrackingRecHit2DSOAView* phits) {
auto& hits = *phits;

auto const digis = *pdigis; // the copy is intentional!
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto const & digis makes this version 10% slower.
if all digis.clus(i) etc were not locally copied it was 20% slower...

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add more elaborate comments along the first one becoming slow if copied, and the latter one becoming slow if not copied?

Copy link
Author

@VinInn VinInn Jun 14, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would I know.... (I have not looked at ptx_
It is only an observation based on nvprof. I suspect TrackingRecHit2DSOAView is large and is really copied "somewhere" (it is also declared non const...)
while SiPixelDigisCUDA::DeviceConstView most probably just goes in registers. if not copied I suspect that the call to __lcg are for some reason repeated each time...

I noticed the problem of indirection already in the previous version (see around line 161),
it really seems that with indirection it keeps loading from memory each time: we need to check in ptx

The point is: more investigative work is needed to understand the origin of performance differences: do not exclude that changing compiler (say clang) changes performance diff as well

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine, but I think it would be useful to include these notes also in the code as comments.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will do

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

uint32_t const* __restrict__ digiModuleStart,
uint32_t const* __restrict__ clusInModule,
uint32_t const* __restrict__ moduleId,
int32_t const* __restrict__ clus,
int numElements,
uint32_t const* __restrict__ hitsModuleStart,
TrackingRecHit2DSOAView* phits) {
auto& hits = *phits;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here if I remove "&" it becomes twice (factor 2) slower!!!!

@VinInn
Copy link
Author

VinInn commented Jun 13, 2019

now gpuPixelRecHits::getHits is 5% slower than original

@VinInn
Copy link
Author

VinInn commented Jun 13, 2019

reducing the number of threads to 128 makes it only 2% slower than the original (with 128 as well)
with 64 will crash w/o next commit (and with it, it is not faster then 128)

@VinInn VinInn requested review from makortel and fwyzard June 14, 2019 08:41
@fwyzard
Copy link

fwyzard commented Jun 14, 2019

Validation summary

Reference release CMSSW_10_6_0 at b45186e
Development branch CMSSW_10_6_X_Patatrack at ec3c3e6
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.52

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW

/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/70111c0b92efcd209e60db21a2fc9817f30ea5d2/log .

Copy link

@makortel makortel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me. Interesting to see the cost of additional level of pointer indirection(?)...

uint32_t const* __restrict__ hitsModuleStart,
TrackingRecHit2DSOAView* phits) {
auto& hits = *phits;

auto const digis = *pdigis; // the copy is intentional!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add more elaborate comments along the first one becoming slow if copied, and the latter one becoming slow if not copied?

@cms-patatrack cms-patatrack deleted a comment from felicepantaleo Jun 14, 2019
@VinInn
Copy link
Author

VinInn commented Jun 16, 2019

the fact that loads from a view (both from global or nc) are not optimized if repeated is pretty obvious even with a trivial test case (see https://godbolt.org/z/cPoP1z )
for what concern the effect of the local copy of the view, it does not show up in the test, I do not exclude that in a longer code it may force reload as well.

We need to report to cuda developers:
test code in https://github.com/VinInn/ctest/blob/master/cuda/view.cu as well

@fwyzard fwyzard added the Pixels Pixels-related developments label Jun 17, 2019
@fwyzard
Copy link

fwyzard commented Jun 18, 2019

There is a lot of jitter for the overall performance on the V100's, and the T4 is still off.

That said, what I got, running 10 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 V100, is

  • original, with 256 threads per block: 1717.3 ± 5.6 ev/s
  • updated, with 256 threads per block: 1704.3 ± 12.9 ev/s
  • updated, with 128 threads per block: 1701.1 ± 11.1 ev/s
  • original, with 256 threads per block: 1712.4 ± 8.3 ev/s

So the ~1% drop seems real ?

@fwyzard
Copy link

fwyzard commented Jun 20, 2019

On the T4, running 16 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs:

  • original, with 256 threads per block: 982.6 ± 7.3 ev/s
  • updated, with 256 threads per block: 979.2 ± 6.0 ev/s
  • updated, with 128 threads per block: 981.2 ± 6.1 ev/s
  • original, with 256 threads per block: 982.1 ± 7.8 ev/s

So the effect here seems much smaller.

@fwyzard fwyzard merged commit 23c7e35 into cms-patatrack:master Jun 20, 2019
@makortel
Copy link

master, not CMSSW_10_6_X_Patatrack?

@fwyzard
Copy link

fwyzard commented Jun 20, 2019

Thanks for spotting it, I'll fix it by hand.

fwyzard pushed a commit that referenced this pull request Jun 20, 2019
…ltiple pointers (#354)

Other changes and optimisations:
  - take into account the case where `nclus > blockDim.x`
  - use a smaller block size
  - document why why we copy or not to local variables
fwyzard pushed a commit that referenced this pull request Oct 8, 2020
…ltiple pointers (#354)

Other changes and optimisations:
  - take into account the case where `nclus > blockDim.x`
  - use a smaller block size
  - document why why we copy or not to local variables
fwyzard pushed a commit that referenced this pull request Oct 19, 2020
…ltiple pointers (#354)

Other changes and optimisations:
  - take into account the case where `nclus > blockDim.x`
  - use a smaller block size
  - document why why we copy or not to local variables
fwyzard pushed a commit that referenced this pull request Oct 20, 2020
…ltiple pointers (#354)

Other changes and optimisations:
  - take into account the case where `nclus > blockDim.x`
  - use a smaller block size
  - document why why we copy or not to local variables
fwyzard pushed a commit that referenced this pull request Oct 23, 2020
…ltiple pointers (#354)

Other changes and optimisations:
  - take into account the case where `nclus > blockDim.x`
  - use a smaller block size
  - document why why we copy or not to local variables
fwyzard pushed a commit that referenced this pull request Nov 6, 2020
…ltiple pointers (#354)

Other changes and optimisations:
  - take into account the case where `nclus > blockDim.x`
  - use a smaller block size
  - document why why we copy or not to local variables
fwyzard pushed a commit that referenced this pull request Nov 16, 2020
…ltiple pointers (#354)

Other changes and optimisations:
  - take into account the case where `nclus > blockDim.x`
  - use a smaller block size
  - document why why we copy or not to local variables
fwyzard added a commit that referenced this pull request Nov 27, 2020
…ltiple pointers (#354)

Other changes and optimisations:
  - take into account the case where `nclus > blockDim.x`
  - use a smaller block size
  - document why why we copy or not to local variables
fwyzard pushed a commit that referenced this pull request Dec 25, 2020
…ltiple pointers (#354)

Other changes and optimisations:
  - take into account the case where `nclus > blockDim.x`
  - use a smaller block size
  - document why why we copy or not to local variables
fwyzard pushed a commit that referenced this pull request Dec 29, 2020
…ltiple pointers (#354)

Other changes and optimisations:
  - take into account the case where `nclus > blockDim.x`
  - use a smaller block size
  - document why why we copy or not to local variables
fwyzard pushed a commit that referenced this pull request Dec 29, 2020
…ltiple pointers (#354)

Other changes and optimisations:
  - take into account the case where `nclus > blockDim.x`
  - use a smaller block size
  - document why why we copy or not to local variables
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Pixels Pixels-related developments
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants