
Move BeamSpot transfer to GPU to its own producer #318

Merged
6 commits merged into cms-patatrack:CMSSW_10_6_X_Patatrack on Apr 23, 2019

Conversation

makortel

PR description:

This PR is a follow-up to #245 and makes a first attempt to transfer the BeamSpot data to the GPU in its own producer instead of in the rechit producer. I left the covariance matrix for subsequent work, as it is not strictly needed at the moment and Eigen apparently does not currently support minimal storage for symmetric matrices.

In addition, a perfect-forwarding overload is added for the CUDAProduct constructor, enabling the use of CUDAScopedContext::emplace() in BeamSpotToCUDA::produce().
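As a rough illustration of what such an overload looks like (a minimal standalone sketch, not the actual CUDAProduct or CUDAScopedContext code):

#include <utility>

// Minimal sketch: a wrapper whose constructor perfectly forwards its arguments
// to the wrapped data, so an emplace()-style helper can construct the product
// in place without an extra move or copy. Names are illustrative only.
template <typename T>
class ProductWrapper {
public:
  template <typename... Args>
  explicit ProductWrapper(Args&&... args) : data_(std::forward<Args>(args)...) {}

  T const& data() const { return data_; }

private:
  T data_;
};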

As in #245, a mechanism is added to create non-cached pinned host memory unique_ptrs, with the possibility to pass custom flags to cudaHostAlloc(). There is one commit for moving the BeamSpot transfer to use a once-per-stream-per-job allocated write-combined buffer, and another commit for doing the same for the raw data. A difference w.r.t. #245 is that the empty GPU::SimpleVector<PixelErrorCompact> is still transferred via pinned host memory from the caching allocator (with the current SiPixelDigiErrorsCUDA, providing the transfer buffer from outside the class would look ugly, but it could be done if really wanted).
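For concreteness, something along these lines (a simplified sketch only; the actual implementation differs in details such as error checking and array support):

#include <memory>
#include <cuda_runtime.h>

// Deleter that releases pinned host memory allocated with cudaHostAlloc().
struct HostFreeDeleter {
  void operator()(void* ptr) const { cudaFreeHost(ptr); }
};

// Allocate pinned (page-locked) host memory for a single trivially-constructible T,
// forwarding the given flags (e.g. cudaHostAllocWriteCombined) to cudaHostAlloc().
// No CUDA stream and no caching allocator are involved.
template <typename T>
std::unique_ptr<T, HostFreeDeleter> make_host_noncached_unique(unsigned int flags = cudaHostAllocDefault) {
  void* mem = nullptr;
  cudaHostAlloc(&mem, sizeof(T), flags);  // error checking omitted in this sketch
  return std::unique_ptr<T, HostFreeDeleter>(static_cast<T*>(mem));
}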

PR validation:

Tested that a profile configuration runs, and confirmed with nvprof that the BeamSpot transfer can occur in parallel with e.g. the clustering kernels.

@makortel
Author

Rebased on top of 10_6_X_Patatrack (HEAD corresponding to #315), and fixed the compilation errors (fixes squashed into the original commits).

@makortel
Author

The fix of #320 is now also included here. I'll rebase if #320 gets merged before this PR.

@fwyzard

fwyzard commented Apr 16, 2019

Validation summary

Reference release CMSSW_10_6_0_pre2 at 1313262
Development branch CMSSW_10_6_X_Patatrack at 49f2c7f
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_5_0-PU25ns_105X_upgrade2018_realistic_v4_HS-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.52
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.53

/RelValZMM_13/CMSSW_10_5_0-105X_upgrade2018_realistic_v4_HS-v1/GEN-SIM-DIGI-RAW

  • tracking validation plots and summary for workflow 10824.5
  • tracking validation plots and summary for workflow 10824.52
  • tracking validation plots and summary for workflow 10824.51
  • tracking validation plots and summary for workflow 10824.53

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_5_0-PU25ns_105X_upgrade2018_realistic_v4_HS-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_5_0-105X_upgrade2018_realistic_v4_HS-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/7178c28abfafbe57bd10e98cebcf24fe4dbd71b1/log .

@fwyzard

fwyzard commented Apr 16, 2019

No impact on physics.

@fwyzard

fwyzard commented Apr 16, 2019

Looks like GitHub is smart enough, no need to rebase this.

@makortel
Author

Looks like GitHub is smart enough, no need to rebase this.

I'm amazed.

@fwyzard

fwyzard commented Apr 16, 2019

No impact on timing, measured on a T4 over TTbar MC.

Before:

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
   729.2 ±   0.3 ev/s (4000 events)
   727.9 ±   0.4 ev/s (4000 events)
   728.1 ±   0.3 ev/s (4000 events)
   726.2 ±   0.4 ev/s (4000 events)
 --------------------
   727.9 ±   1.2 ev/s

After:

Running 4 times over 4200 events with 1 jobs, each with 8 threads, 8 streams and 1 GPUs
   727.1 ±   0.4 ev/s (4000 events)
   727.8 ±   0.4 ev/s (4000 events)
   728.8 ±   0.3 ev/s (4000 events)
   728.2 ±   0.4 ev/s (4000 events)
 --------------------
   728.0 ±   0.7 ev/s

@fwyzard

fwyzard commented Apr 16, 2019

@makortel can you remind me of the use case for the non-cached pinned host memory?

@makortel
Author

can you remind me of the use case for the non-cached pinned host memory?

@fwyzard Your suggestion in #245 (comment)

@fwyzard

fwyzard commented Apr 17, 2019

Ah, I see, it is a different set of functions because these do not rely on any CUDA stream, correct?

@makortel
Author

Right, no CUDA stream and no caching by our allocator. The difference w.r.t. the API wrappers' cuda::memory::host::make_unique() is that the added cudautils::make_host_noncached_unique() accepts flags to be passed to cudaHostAlloc().
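In other words, roughly this (an approximate sketch; the exact signatures in the api wrappers and in cudautils may differ slightly):

// api wrappers: pinned host memory, but no control over the cudaHostAlloc() flags
auto defaultPinned = cuda::memory::host::make_unique<float>();
// added helper: same idea, but the flags are forwarded to cudaHostAlloc()
auto wcPinned = cudautils::make_host_noncached_unique<float>(cudaHostAllocWriteCombined);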

@fwyzard

fwyzard commented Apr 17, 2019

We seem to have a proliferation of memory allocation functions ...

@makortel
Author

Yeah, I'm not too happy about that either. A challenge for supporting flags in the caching allocator is that, AFAICT, the flags would add another dimension to the binning (in addition to the device index and the allocation size).
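To make the binning point concrete, a hypothetical cache key would grow roughly like this (purely illustrative; not the actual caching allocator code):

#include <tuple>

// Hypothetical key for cached pinned-host blocks: supporting per-allocation
// flags means the free-list must be partitioned by flags as well, on top of
// the existing device-index and size-bin dimensions.
struct CachedBlockKey {
  int device;                   // existing dimension
  unsigned int sizeBin;         // existing dimension (power-of-two size bin)
  unsigned int hostAllocFlags;  // the extra dimension discussed above

  bool operator<(CachedBlockKey const& other) const {
    return std::tie(device, sizeBin, hostAllocFlags) <
           std::tie(other.device, other.sizeBin, other.hostAllocFlags);
  }
};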

@VinInn

VinInn commented Apr 20, 2019

Problem with a multi-threaded job on data. Setup:

cd /home/vin/BeamSpot
source /data/cmssw/cmsset_default.csh
cmsrel CMSSW_10_6_0_pre2_Patatrack
cd CMSSW_10_6_0_pre2_Patatrack/src
git cms-init -x cms-patatrack
git cms-merge-topic makortel:beamspotToCUDA
cmsenv
echo $CMSSW_BASE
/home/vin/BeamSpot/CMSSW_10_6_0_pre2_Patatrack
[innocent@vinzen0 src]$ git diff
diff --git a/RecoLocalTracker/SiPixelRecHits/plugins/gpuPixelRecHits.h b/RecoLocalTracker/SiPixelRecHits/plugins/gpuPixelRecHits.h
index cbd354e7114..de0194a32ea 100644
--- a/RecoLocalTracker/SiPixelRecHits/plugins/gpuPixelRecHits.h
+++ b/RecoLocalTracker/SiPixelRecHits/plugins/gpuPixelRecHits.h
@@ -144,6 +144,7 @@ namespace gpuPixelRecHits {
     // to global and compute phi...
     cpeParams->detParams(me).frame.toGlobal(xl[h],yl[h], xg[h],yg[h],zg[h]);
     // here correct for the beamspot...
+    if (std::abs(bs->x)>1.f) printf ("beam spot problem %f,%f,%f\n", bs->x,bs->y,bs->z);
     xg[h]-=bs->x;
     yg[h]-=bs->y;
     zg[h]-=bs->z;
diff --git a/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorKernels.cc b/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorKernels.cc
index d00a5db1861..70e467040ff 100644
--- a/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorKernels.cc
+++ b/RecoPixelVertexing/PixelTriplets/plugins/CAHitQuadrupletGeneratorKernels.cc
@@ -19,8 +19,7 @@ void CAHitQuadrupletGeneratorKernels::allocateOnGPU()
   // ALLOCATIONS FOR THE INTERMEDIATE RESULTS (STAYS ON WORKER)
   //////////////////////////////////////////////////////////

-  cudaCheck(cudaMalloc(&device_theCells_,
-             CAConstants::maxNumberOfLayerPairs() * CAConstants::maxNumberOfDoublets() * sizeof(GPUCACell)));
+  cudaCheck(cudaMalloc(&device_theCells_, CAConstants::maxNumberOfDoublets() * sizeof(GPUCACell)));
   cudaCheck(cudaMalloc(&device_nCells_, sizeof(uint32_t)));
   cudaCheck(cudaMemset(device_nCells_, 0, sizeof(uint32_t)));

MC: ok.
Data, single thread: ok (log ~/data/beamspotproblemONEthread.log).

Data, multi-thread:

Begin processing the 1st record. Run 321177, Event 187588887, LumiSection 142 on stream 6 at 20-Apr-2019 15:34:35.845 CEST
Tracks overflow 30331 in 4
beam spot problem -62306304.000000,0.000000,-62437376.000000
beam spot problem -62306304.000000,0.000000,-62437376.000000
beam spot problem -62306304.000000,0.000000,-62437376.000000

Full log in ~/data/beamspotproblem.log:
-rw-r--r--. 1 innocent zh 6688743 Apr 20 15:35 /afs/cern.ch/user/i/innocent/data/beamspotproblem.log
and in beamspotproblem2.log.

not necessarily fully reproducible

I can try to run with valgrind (ahhhh)

[1]  + Segmentation fault            valgrind --tool=memcheck --smc-check=all-non-file --suppressions=/data/cmssw/slc7_amd64_gcc700/cms/cmssw/CMSSW_10_6_0_pre2_Patatrack/src/Utilities/ReleaseScripts/data/cms-valgrind-memcheck.supp --num-callers=20 --xml=yes --xml-file=valgrind-%p.xml cmsRun gpuOld.py >& val.log (core dumped)
1.555u 0.182s 0:02.57 67.3%	0+0k 8768+136io 31pf+0w
[innocent@vinzen0 pixelraw]$
[innocent@vinzen0 pixelraw]$ cat val.log

 *** Break *** segmentation violation
#0  0x00000000580ff3bf in ?? ()
#1  0x00000000580998d0 in ?? ()
#2  0x000000005809641b in ?? ()
#3  0x000000005809799f in ?? ()
#4  0x00000000580a6ff7 in ?? ()
#5  0x0000000000000000 in ?? ()
==5553==  If you believe this happened as a result of a stack
==5553==  overflow in your program's main thread (unlikely but
==5553==  possible), you can try to increase the size of the
==5553==  main thread stack using the --main-stacksize= flag.
==5553==  The main thread stack size used in this run was 8388608.

@VinInn

VinInn commented Apr 20, 2019

Btw:
assert in a kernel: serializes execution
cudaFree: synchronizes
cudaGetLastError: reports the error at a random place
cudaDeviceSynchronize: serializes execution
so any of these is like running single-threaded.

I had to add

#undef assert
#define assert(EX) if(!(EX)) printf("failed : %s \n",#EX);

in cuda_assert.h and make sure it is the last include in each file....

@VinInn

VinInn commented Apr 22, 2019

The symptoms seem to be cured by

diff --git a/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc b/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc
index 8168954c4b9..715cc3e4c91 100644
--- a/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc
+++ b/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc
@@ -64,9 +64,11 @@ CUDAScopedContext::~CUDAScopedContext() {
 }

 void CUDAScopedContext::createEventIfStreamBusy() {
+  /*
   if(event_ or stream_->is_clear()) {
     return;
   }
+  */
   edm::Service<CUDAService> cs;
   event_ = cs->getCUDAEvent();
 }

BUT it is still crashing in multi-job runs.
Logs for instance in

/afs/cern.ch/user/i/innocent/data/crashMultiJobAfterFixInContext_1.log
/afs/cern.ch/user/i/innocent/data/crashMultiJobAfterFixInContext_2.log

@makortel
Author

I added the printout of #318 (comment), but so far I have been unable to reproduce on felk40.

I can try to run with valgrind (ahhhh)

Can you try with cmsRunGlibC? jemalloc has not supported valgrind since version 5.

@makortel
Author

diff --git a/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc b/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc
index 8168954c4b9..715cc3e4c91 100644
--- a/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc
+++ b/HeterogeneousCore/CUDACore/src/CUDAScopedContext.cc
@@ -64,9 +64,11 @@ CUDAScopedContext::~CUDAScopedContext() {
 }

 void CUDAScopedContext::createEventIfStreamBusy() {
+  /*
   if(event_ or stream_->is_clear()) {
     return;
   }
+  */
   edm::Service<CUDAService> cs;
   event_ = cs->getCUDAEvent();
 }

This will lead to starvation if an EDProducer produces multiple CUDA products that get consumed downstream. Would just removing the stream_->is_clear() check from the if be sufficient?
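For clarity, the variant being asked about would look roughly like this (a sketch based on the diff quoted above, keeping only the event_ early return):

void CUDAScopedContext::createEventIfStreamBusy() {
  // keep the early return when an event has already been created,
  // but drop the stream_->is_clear() shortcut questioned above
  if (event_) {
    return;
  }
  edm::Service<CUDAService> cs;
  event_ = cs->getCUDAEvent();
}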

@VinInn

VinInn commented Apr 22, 2019 via email

@makortel
Author

On 22 Apr 2019, at 5:23 PM, Matti Kortelainen wrote: This will lead to starvation if an EDProducer produces multiple CUDA products that get consumed downstream. Would just removing the stream_->is_clear() check from the if be sufficient?

Apparently yes, as far as the meaningless beam spot is concerned.

Thanks. In the meantime I poked around more and realized that there indeed is a synchronization mistake that comes up now with the BeamSpotToCUDA producer (the requirements are basically: no ExternalWork, use of ctx.emplace(), and queueing asynchronous work in the constructor of the CUDA product). I'll submit a fix later today.

Still crashing in multi-job runs (usually at the very beginning, as in ~/data/crashMultiJobAfterFixInContext_3.log). v.

The multi-job crash must be something different, as it was first reported 3.5 weeks ago in #306.

@VinInn

VinInn commented Apr 22, 2019 via email

@makortel
Author

On 22 Apr 2019, at 6:18 PM, Matti Kortelainen wrote: Thanks. In the meantime I poked around more and realized that there indeed is a synchronization mistake that comes up now with the BeamSpotToCUDA producer (requirements are basically: no ExternalWork, use of ctx.emplace(), and queueing asynchronous work in the constructor of the CUDA product). I'll submit a fix later today.

OK, let me know, as I used the same pattern for TrackingRecHit in #322.

The pattern itself is fine (and I want to keep it), so I'll make the fix under the hood.

@makortel
Author

@VinInn the fix is in #327.

@VinInn

VinInn commented Apr 23, 2019

#327 and #328 conflict with this PR and with each other...
Could you please make a new PR that merges the three together?

@fwyzard fwyzard merged commit fcbb820 into cms-patatrack:CMSSW_10_6_X_Patatrack Apr 23, 2019
fwyzard pushed a commit that referenced this pull request Oct 8, 2020
Implement a non-caching host allocator, useful for host-to-device copy buffers:
  - not bound to any CUDA stream to allow use in EDM beginStream();
  - with the possibility to pass flags to cudaHostAlloc(), e.g. cudaHostAllocWriteCombined.

Add perfect forwarding overload for CUDAProduct constructor, enabling the use of CUDAScopedContext::emplace() in BeamSpotToCUDA::produce().

Move the BeamSpot host-to-device transfer to its own EDProducer, making use of beginStream()-allocated write-combined memory for the transfer.
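A hedged sketch of the beginStream()-allocation pattern the commit message describes (a self-contained illustration: the real module derives from edm::stream::EDProducer<>, the framework types are omitted here, and the payload fields are placeholders taken from the diffs above):

#include <memory>
#include <cuda_runtime.h>

struct BeamSpotPOD {  // placeholder payload: only the fields used in the diffs above
  float x, y, z;
};

struct HostFreeDeleter {
  void operator()(void* p) const { cudaFreeHost(p); }
};

class BeamSpotTransferSketch {
public:
  // called once per EDM stream at job start (mirrors EDProducer::beginStream);
  // no CUDA stream is needed at this point
  void beginStream() {
    void* mem = nullptr;
    cudaHostAlloc(&mem, sizeof(BeamSpotPOD), cudaHostAllocWriteCombined);
    bsHost_.reset(static_cast<BeamSpotPOD*>(mem));
  }

  // called per event (mirrors EDProducer::produce): reuse the same pinned,
  // write-combined buffer for every host-to-device copy
  void produce(const BeamSpotPOD& bs, BeamSpotPOD* d_bs, cudaStream_t stream) {
    *bsHost_ = bs;
    cudaMemcpyAsync(d_bs, bsHost_.get(), sizeof(BeamSpotPOD), cudaMemcpyHostToDevice, stream);
  }

private:
  std::unique_ptr<BeamSpotPOD, HostFreeDeleter> bsHost_;
};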