
Move BeamSpot transfer to GPU to its own producer #318

Merged
6 commits merged into cms-patatrack:CMSSW_10_6_X_Patatrack on Apr 23, 2019

Conversation

makortel

PR description:

This PR is a follow-up to #245 and makes a first attempt to transfer the BeamSpot data to the GPU in its own producer instead of in the rechit producer. I left the covariance matrix for subsequent work, as it is not strictly needed at the moment and Eigen apparently does not currently support minimal storage for symmetric matrices.

In addition, a perfect-forwarding overload is added for the CUDAProduct constructor, enabling the use of CUDAScopedContext::emplace() in BeamSpotToCUDA::produce().
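As a rough illustration of what such an overload looks like (a minimal standalone sketch, not the actual CUDAProduct or CUDAScopedContext code):

#include <utility>

// Minimal sketch: a wrapper whose constructor perfectly forwards its arguments
// to the wrapped data, so an emplace()-style helper can construct the product
// in place without an extra move or copy. Names are illustrative only.
template <typename T>
class ProductWrapper {
public:
  template <typename... Args>
  explicit ProductWrapper(Args&&... args) : data_(std::forward<Args>(args)...) {}

  T const& data() const { return data_; }

private:
  T data_;
};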

As in #245, a mechanism is added to create non-cached pinned host memory unique_ptrs, with the possibility to pass custom flags to cudaHostAlloc(). There is one commit for moving the BeamSpot transfer to use a once-per-stream-per-job allocated write-combined buffer, and another commit for doing the same for the raw data. A difference w.r.t. #245 is that the empty GPU::SimpleVector<PixelErrorCompact> is still transferred via pinned host memory from the caching allocator (with the current SiPixelDigiErrorsCUDA, providing the transfer buffer from outside the class would look ugly, but it could be done if really wanted).
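For concreteness, something along these lines (a simplified sketch only; the actual implementation differs in details such as error checking and array support):

#include <memory>
#include <cuda_runtime.h>

// Deleter that releases pinned host memory allocated with cudaHostAlloc().
struct HostFreeDeleter {
  void operator()(void* ptr) const { cudaFreeHost(ptr); }
};

// Allocate pinned (page-locked) host memory for a single trivially-constructible T,
// forwarding the given flags (e.g. cudaHostAllocWriteCombined) to cudaHostAlloc().
// No CUDA stream and no caching allocator are involved.
template <typename T>
std::unique_ptr<T, HostFreeDeleter> make_host_noncached_unique(unsigned int flags = cudaHostAllocDefault) {
  void* mem = nullptr;
  cudaHostAlloc(&mem, sizeof(T), flags);  // error checking omitted in this sketch
  return std::unique_ptr<T, HostFreeDeleter>(static_cast<T*>(mem));
}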

PR validation:

Tested that a profile configuration runs, and confirmed with nvprof that the BeamSpot transfer can occur in parallel with e.g. the clustering kernels.

@makortel
Author

Rebased on top of 10_6_X_Patatrack (HEAD corresponding to #315), and fixed the compilation errors (fixes squashed into the original commits).

@makortel
Author

The fix of #320 is now also included here. I'll rebase if #320 gets merged before this PR.

@fwyzard

fwyzard commented Apr 16, 2019

Validation summary

Reference release CMSSW_10_6_0_pre2 at 1313262
Development branch CMSSW_10_6_X_Patatrack at 49f2c7f
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_5_0-PU25ns_105X_upgrade2018_realistic_v4_HS-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.52
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.53

/RelValZMM_13/CMSSW_10_5_0-105X_upgrade2018_realistic_v4_HS-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.52
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.53

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_5_0-PU25ns_105X_upgrade2018_realistic_v4_HS-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_5_0-105X_upgrade2018_realistic_v4_HS-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/7178c28abfafbe57bd10e98cebcf24fe4dbd71b1/log .

@fwyzard

fwyzard commented Apr 16, 2019

No impact on physics.

@fwyzard

fwyzard commented Apr 16, 2019

Looks like GitHub is smart enough, no need to rebase this.

@makortel
Author

Looks like GitHub is smart enough, no need to rebase this.

I'm amazed.

@fwyzard

fwyzard commented Apr 16, 2019

No impact on timing, measured on a T4 over TTbar MC.

Before:

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
   729.2 ±   0.3 ev/s (4000 events)
   727.9 ±   0.4 ev/s (4000 events)
   728.1 ±   0.3 ev/s (4000 events)
   726.2 ±   0.4 ev/s (4000 events)
 --------------------
   727.9 ±   1.2 ev/s

After:

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
   727.1 ±   0.4 ev/s (4000 events)
   727.8 ±   0.4 ev/s (4000 events)
   728.8 ±   0.3 ev/s (4000 events)
   728.2 ±   0.4 ev/s (4000 events)
 --------------------
   728.0 ±   0.7 ev/s

@fwyzard

fwyzard commented Apr 16, 2019

@makortel can you remind me of the use case for the non-cached pinned host memory?

@makortel
Author

can you remind me of the use case for the non-cached pinned host memory?

@fwyzard Your suggestion in #245 (comment)

@fwyzard

fwyzard commented Apr 17, 2019

Ah, I see, it is a different set of functions because these do not rely on any CUDA stream, correct?

@makortel
Author

Right, no CUDA stream and no caching by our allocator. The difference w.r.t. the API wrappers' cuda::memory::host::make_unique() is that the added cudautils::make_host_noncached_unique() accepts flags to be passed to cudaHostAlloc().
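In other words, roughly this (an approximate sketch; the exact signatures in the api wrappers and in cudautils may differ slightly):

// api wrappers: pinned host memory, but no control over the cudaHostAlloc() flags
auto defaultPinned = cuda::memory::host::make_unique<float>();
// added helper: same idea, but the flags are forwarded to cudaHostAlloc()
auto wcPinned = cudautils::make_host_noncached_unique<float>(cudaHostAllocWriteCombined);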

@fwyzard

fwyzard commented Apr 17, 2019

We seem to have a proliferation of memory allocation functions ...

@makortel
Author

Yeah, I'm not too happy about that either. A challenge for supporting flags in the caching allocator is that, AFAICT, the flags would add another dimension to the binning (in addition to the device index and the allocation size).
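To make the binning point concrete, a hypothetical cache key would grow roughly like this (purely illustrative; not the actual caching allocator code):

#include <tuple>

// Hypothetical key for cached pinned-host blocks: supporting per-allocation
// flags means the free-list must be partitioned by flags as well, on top of
// the existing device-index and size-bin dimensions.
struct CachedBlockKey {
  int device;                   // existing dimension
  unsigned int sizeBin;         // existing dimension (power-of-two size bin)
  unsigned int hostAllocFlags;  // the extra dimension discussed above

  bool operator<(CachedBlockKey const& other) const {
    return std::tie(device, sizeBin, hostAllocFlags) <
           std::tie(other.device, other.sizeBin, other.hostAllocFlags);
  }
};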

@VinInn

VinInn commented Apr 20, 2019

Problem with a multi-threaded job on data. Setup:

cd /home/vin/BeamSpot
source /data/cmssw/cmsset_default.csh
cmsrel CMSSW_10_6_0_pre2_Patatrack
cd CMSSW_10_6_0_pre2_Patatrack/src
git cms-init -x cms-patatrack
git cms-merge-topic makortel:beamspotToCUDA
cmsenv
echo $CMSSW_BASE
/home/vin/BeamSpot/CMSSW_10_6_0_pre2_Patatrack
[innocent@vinzen0 src]$ git diff
diff --git a/RecoLocalTracker/SiPixelRecHits/plugins/gpuPixelRecHits.h b/RecoLocalTracker/SiPixelRecHits/plugins/gpuPixelRecHits.h
index cbd354e7114..de0194a32ea 100644
--- a/RecoLocalTracker/SiPixelRecHits/plugins/gpuPixelRecHits.h
+++ b/RecoLocalTracker/SiPixelRecHits/plugins/gpuPixelRecHits.h
@@ -144,6 +144,7 @@ namespace gpuPixelRecHits {
     // to global and compute phi...
     cpeParams->detParams(me).frame.toGlobal(xl[h],yl[h], xg[h],yg[h],zg[h]);
     // here correct for the beamspot...
+    if (std::abs(bs->x)>1.f) printf ("beam spot problem %f,%f,%f\n", bs->x,bs->y,bs->z);
     xg[h]-=bs->x;
     yg[h]-=bs->y;
     zg[h]-=bs->z;
diff --git a/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorKernels.cc b/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorKernels.cc
index d00a5db1861..70e467040ff 100644
--- a/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorKernels.cc
+++ b/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorKernels.cc
@@ -19,8 +19,7 @@ void CAHitQuadrupletGeneratorKernels::allocateOnGPU()
   // ALLOCATIONS FOR THE INTERMEDIATE RESULTS (STAYS ON WORKER)
   //////////////////////////////////////////////////////////

-  cudaCheck(cudaMalloc(&device_theCells_,
-             CAConstants::maxNumberOfLayerPairs() * CAConstants::maxNumberOfDoublets() * sizeof(GPUCACell)));
+  cudaCheck(cudaMalloc(&device_theCells_, CAConstants::maxNumberOfDoublets() * sizeof(GPUCACell)));
   cudaCheck(cudaMalloc(&device_nCells_, sizeof(uint32_t)));
   cudaCheck(cudaMemset(device_nCells_, 0, sizeof(uint32_t)));

MC: ok.
Data, single thread: ok (log ~/data/beamspotproblemONEthread.log).

Data, multi-thread:

Begin processing the 1st record. Run 321177, Event 187588887, LumiSection 142 on stream 6 at 20-Apr-2019 15:34:35.845 CEST
Tracks overflow 30331 in 4
beam spot problem -62306304.000000,0.000000,-62437376.000000
beam spot problem -62306304.000000,0.000000,-62437376.000000
beam spot problem -62306304.000000,0.000000,-62437376.000000

Full log in ~/data/beamspotproblem.log:
-rw-r--r--. 1 innocent zh 6688743 Apr 20 15:35 /afs/cern.ch/user/i/innocent/data/beamspotproblem.log
and in beamspotproblem2.log.

not necessarily fully reproducible

I can try to run with valgrind (ahhhh)

[1]  + Segmentation fault            valgrind --tool=memcheck --smc-check=all-non-file --suppressions=/data/cmssw/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_0_pre2_Patatrack/src/Utilities/ReleaseScripts/data/cms-valgrind-memcheck.supp --num-callers=20 --xml=yes --xml-file=valgrind-%p.xml cmsRun gpuOld.py >& val.log (core dumped)
1.555u 0.182s 0:02.57 67.3%	0+0k 8768+136io 31pf+0w
[innocent@vinzen0 pixelraw]$
[innocent@vinzen0 pixelraw]$ cat val.log

 *** Break *** segmentation violation
#0  0x00000000580ff3bf in ?? ()
#1  0x00000000580998d0 in ?? ()
#2  0x000000005809641b in ?? ()
#3  0x000000005809799f in ?? ()
#4  0x00000000580a6ff7 in ?? ()
#5  0x0000000000000000 in ?? ()
==5553==  If you believe this happened as a result of a stack
==5553==  overflow in your program's main thread (unlikely but
==5553==  possible), you can try to increase the size of the
==5553==  main thread stack using the --main-stacksize= flag.
==5553==  The main thread stack size used in this run was 8388608.

@VinInn

VinInn commented Apr 20, 2019

Btw:
assert in a kernel: serializes execution
cudaFree: synchronizes
cudaGetLastError: reports the error at a random place
cudaDeviceSynchronize: serializes execution
so any of these is like running single-threaded.

I had to add

#undef assert
#define assert(EX) if(!(EX)) printf("failed : %s \n",#EX);

in cuda_assert.h and make sure it is the last include in each file....

@VinInn

VinInn commented Apr 22, 2019

The symptoms seem to be cured by

diff --git a/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc b/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc
index 8168954c4b9..715cc3e4c91 100644
--- a/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc
+++ b/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc
@@ -64,9 +64,11 @@ CUDAScopedContext::~CUDAScopedContext() {
 }

 void CUDAScopedContext::createEventIfStreamBusy() {
+  /*
   if(event_ or stream_->is_clear()) {
     return;
   }
+  */
   edm::Service<CUDAService> cs;
   event_ = cs->getCUDAEvent();
 }

BUT it is still crashing in multi-job runs.
Logs for instance in

/afs/cern.ch/user/i/innocent/data/crashMultiJobAfterFixInContext_1.log
/afs/cern.ch/user/i/innocent/data/crashMultiJobAfterFixInContext_2.log

@makortel
Author

I added the printout of #318 (comment), but so far I have been unable to reproduce on felk40.

I can try to run with valgrind (ahhhh)

Can you try with cmsRunGlibC? jemalloc has not supported valgrind since version 5.

@makortel
Author

diff --git a/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc b/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc
index 8168954c4b9..715cc3e4c91 100644
--- a/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc
+++ b/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc
@@ -64,9 +64,11 @@ CUDAScopedContext::~CUDAScopedContext() {
 }

 void CUDAScopedContext::createEventIfStreamBusy() {
+  /*
   if(event_ or stream_->is_clear()) {
     return;
   }
+  */
   edm::Service<CUDAService> cs;
   event_ = cs->getCUDAEvent();
 }

This will lead to starvation if an EDProducer produces multiple CUDA products that get consumed downstream. Would just removing the stream_->is_clear() check from the if be sufficient?
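For clarity, the variant being asked about would look roughly like this (a sketch based on the diff quoted above, keeping only the event_ early return):

void CUDAScopedContext::createEventIfStreamBusy() {
  // keep the early return when an event has already been created,
  // but drop the stream_->is_clear() shortcut questioned above
  if (event_) {
    return;
  }
  edm::Service<CUDAService> cs;
  event_ = cs->getCUDAEvent();
}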

@VinInn

VinInn commented Apr 22, 2019 via email

@makortel
Author

On 22 Apr 2019, at 5:23 PM, Matti Kortelainen wrote: This will lead to starvation if an EDProducer produces multiple CUDA products that get consumed downstream. Would just removing the stream_->is_clear() check from the if be sufficient?

Apparently yes, as far as the meaningless beam spot is concerned.

Thanks. In the meantime I poked around more and realized that there indeed is a synchronization mistake that comes up now with the BeamSpotToCUDA producer (the requirements are basically: no ExternalWork, use of ctx.emplace(), and queueing asynchronous work in the constructor of the CUDA product). I'll submit a fix later today.

Still crashing in multi-job runs (usually at the very beginning, as in ~/data/crashMultiJobAfterFixInContext_3.log). v.

The multi-job crash must be something different, as it was first reported 3.5 weeks ago in #306.

@VinInn

VinInn commented Apr 22, 2019 via email

@makortel
Author

On 22 Apr 2019, at 6:18 PM, Matti Kortelainen wrote: Thanks. In the meantime I poked around more and realized that there indeed is a synchronization mistake that comes up now with the BeamSpotToCUDA producer (requirements are basically: no ExternalWork, use of ctx.emplace(), and queueing asynchronous work in the constructor of the CUDA product). I'll submit a fix later today.

OK, let me know, as I used the same pattern for TrackingRecHit in #322.

The pattern itself is fine (and I want to keep it), so I'll make the fix under the hood.

@makortel
Author

@VinInn the fix is in #327.

@VinInn

VinInn commented Apr 23, 2019

#327 and #328 conflict with this PR and with each other...
Could you please make a new PR that merges the three together?

@fwyzard fwyzard merged commit fcbb820 into cms-patatrack:CMSSW_10_6_X_Patatrack Apr 23, 2019
fwyzard pushed a commit that referenced this pull request Oct 8, 2020
Implement a non-caching host allocator, useful for host-to-device copy buffers:
  - not bound to any CUDA stream to allow use in EDM beginStream();
  - with the possibility to pass flags to cudaHostAlloc(), e.g. cudaHostAllocWriteCombined.

Add perfect forwarding overload for CUDAProduct constructor, enabling the use of CUDAScopedContext::emplace() in BeamSpotToCUDA::produce().

Move the BeamSpot host-to-device transfer to its own EDProducer, making use of beginStream()-allocated write-combined memory for the transfer.
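A hedged sketch of the beginStream()-allocation pattern the commit message describes (a self-contained illustration: the real module derives from edm::stream::EDProducer<>, the framework types are omitted here, and the payload fields are placeholders taken from the diffs above):

#include <memory>
#include <cuda_runtime.h>

struct BeamSpotPOD {  // placeholder payload: only the fields used in the diffs above
  float x, y, z;
};

struct HostFreeDeleter {
  void operator()(void* p) const { cudaFreeHost(p); }
};

class BeamSpotTransferSketch {
public:
  // called once per EDM stream at job start (mirrors EDProducer::beginStream);
  // no CUDA stream is needed at this point
  void beginStream() {
    void* mem = nullptr;
    cudaHostAlloc(&mem, sizeof(BeamSpotPOD), cudaHostAllocWriteCombined);
    bsHost_.reset(static_cast<BeamSpotPOD*>(mem));
  }

  // called per event (mirrors EDProducer::produce): reuse the same pinned,
  // write-combined buffer for every host-to-device copy
  void produce(const BeamSpotPOD& bs, BeamSpotPOD* d_bs, cudaStream_t stream) {
    *bsHost_ = bs;
    cudaMemcpyAsync(d_bs, bsHost_.get(), sizeof(BeamSpotPOD), cudaMemcpyHostToDevice, stream);
  }

private:
  std::unique_ptr<BeamSpotPOD, HostFreeDeleter> bsHost_;
};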