
Speed up CPU side of GPU rechits #125

Merged

Conversation

@makortel commented Aug 9, 2018

HitsOnCPU allocates the (pinned) host memory on each event. The profiler shows that this is rather costly (more than half of the wall-clock time spent in the kernels...). The first commit moves the allocations to be the first action in the function, because I first saw a peculiar profile where one of the cudaMallocHost calls took a very long time and delayed the queueing of the asynchronous memory copies, thus making acquire() occupy the CPU.

Of course this alone didn't significantly improve the situation, so the second commit changes the pattern to the same one as in raw2cluster, i.e. allocate the buffers (also on the host side) once per module (i.e. per job per EDM stream), and have the event product just hold pointers to them. Not ideal, but faster (in a single event the wall-clock time of acquire() drops from ~240 us to ~170 us).

We can of course discuss whether this approach is what we really want (probably not), but at the moment it seems to be the fastest.
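
For reference, a minimal sketch of the reuse pattern described above (the class and member names are illustrative, not the actual CMSSW code): the pinned host buffers are allocated once in the module constructor, and acquire() only queues the asynchronous copies into them.

```cpp
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical, simplified event product: it only holds non-owning
// pointers into the module-owned pinned buffers.
struct HitsOnCPU {
  float* xl = nullptr;
  float* yl = nullptr;
  uint32_t nhits = 0;
};

// Hypothetical module skeleton; buffer names and contents are illustrative.
class RecHitProducer {
public:
  explicit RecHitProducer(std::size_t maxHits) {
    // one-time pinned allocations, amortised over all events of the job
    cudaMallocHost(reinterpret_cast<void**>(&host_xl_), maxHits * sizeof(float));
    cudaMallocHost(reinterpret_cast<void**>(&host_yl_), maxHits * sizeof(float));
  }
  ~RecHitProducer() {
    cudaFreeHost(host_xl_);
    cudaFreeHost(host_yl_);
  }

  HitsOnCPU acquire(const float* d_xl, const float* d_yl,
                    uint32_t nhits, cudaStream_t stream) {
    // the per-event work is now just queueing the asynchronous copies
    cudaMemcpyAsync(host_xl_, d_xl, nhits * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaMemcpyAsync(host_yl_, d_yl, nhits * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    return {host_xl_, host_yl_, nhits};  // product holds pointers only
  }

private:
  float* host_xl_ = nullptr;
  float* host_yl_ = nullptr;
};
```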

@fwyzard @VinInn @felicepantaleo

@fwyzard commented Aug 9, 2018

Successfully tested at #127 (comment).

@fwyzard commented Aug 9, 2018

> ... the second commit changes the pattern to the same one as in raw2cluster, i.e. allocate the buffers (also on the host side) once per module (i.e. per job per EDM stream), and have the event product just hold pointers to them. Not ideal, but faster (in a single event the wall-clock time of acquire() drops from ~240 us to ~170 us).

I think reusing some per-module, per-stream buffers does make sense.
Eventually we can look into wrapping it in some construct that guarantees the correctness of the memory reuse, like some kind of vector or stack with an element per stream - but we already know the present usage pattern is fine, by construction.
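
For illustration, such a guard construct could look roughly like the pool below (purely a sketch of the idea; no such class exists in the PR): acquire() pops a pinned host buffer, or allocates a new one on demand, and release() returns it to the pool, so correct reuse no longer relies on the usage pattern alone. Deallocation is omitted for brevity.

```cpp
#include <cstddef>
#include <mutex>
#include <stack>
#include <cuda_runtime.h>

class PinnedBufferPool {
public:
  explicit PinnedBufferPool(std::size_t bytes) : bytes_(bytes) {}

  void* acquire() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (free_.empty()) {
      void* p = nullptr;
      cudaMallocHost(&p, bytes_);  // grow the pool on demand
      return p;
    }
    void* p = free_.top();
    free_.pop();
    return p;
  }

  void release(void* p) {
    std::lock_guard<std::mutex> lock(mutex_);
    free_.push(p);
  }

private:
  std::size_t bytes_;
  std::mutex mutex_;
  std::stack<void*> free_;
};
```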

@fwyzard commented Aug 9, 2018

Still, we should eventually coalesce all the allocations into fewer ones (here and everywhere else).

@fwyzard commented Aug 10, 2018

Validation summary

Reference release CMSSW_10_2_1 at d00b7b4
Development branch CMSSW_10_2_X_Patatrack at 907e17c
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_2_1-PU25ns_102X_upgrade2018_realistic_v9_gcc7-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_1-102X_upgrade2018_realistic_v9_gcc7-v1/GEN-SIM-DIGI-RAW

DQM GUI plots

/RelValTTbar_13/CMSSW_10_2_1-PU25ns_102X_upgrade2018_realistic_v9_gcc7-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_1-102X_upgrade2018_realistic_v9_gcc7-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_2_1-PU25ns_102X_upgrade2018_realistic_v9_gcc7-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_1-102X_upgrade2018_realistic_v9_gcc7-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/f8f28fa55a265ae3fcedaec343749ea5f33b1129/log .

@fwyzard commented Aug 10, 2018

Running the tests again to see if we can get a fair comparison of the performance...

@makortel (Author)

> Still, we should eventually coalesce all the allocations into fewer ones (here and everywhere else).

I started to test something I've been thinking about for a while; could you hold off merging for a few hours?

… use cudaMemcpy2DAsync to transfer

Trying to reduce memory copy overheads
@makortel (Author)

Ok, the last commit contains an experiment replacing 9 32-bit buffers with one buffer, and 5 16-bit buffers with one buffer. The number of memory copies is reduced 5->1 and 3->1, respectively. The allocation is done with cudaMallocPitch and the copy with cudaMemcpy2DAsync, to keep the ability to transfer only nhits elements (a sketch of the pattern follows the numbers below).

Performance comparison (on a single event)

  • rechit kernel time stays the same (~180 us, as expected)
  • rechit acquire time decreases further from ~170 us to ~110 us
  • device->host transfer time decreases from ~39 us to ~27 us

So yes, it does improve, even if the numbers are rather small in the absolute scale.
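
For reference, a minimal sketch of the coalescing pattern (buffer names and sizes are illustrative, not the PR code): the nine 32-bit arrays become rows of a single pitched allocation, and one cudaMemcpy2DAsync transfers just the first nhits elements of every row.

```cpp
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

constexpr int kRows32 = 9;               // nine 32-bit buffers become nine rows
constexpr std::size_t kMaxHits = 65536;  // illustrative capacity

void* d_owner32 = nullptr;      // device-side coalesced buffer
std::size_t d_pitch = 0;        // row pitch (bytes) chosen by the runtime
uint32_t* h_owner32 = nullptr;  // host-side buffer, rows packed contiguously

void allocateOnce() {
  // each row holds kMaxHits 32-bit words; the runtime may pad each row
  cudaMallocPitch(&d_owner32, &d_pitch, kMaxHits * sizeof(uint32_t), kRows32);
  cudaMallocHost(reinterpret_cast<void**>(&h_owner32),
                 kRows32 * kMaxHits * sizeof(uint32_t));
}

void transferToHost(uint32_t nhits, cudaStream_t stream) {
  // one 2D copy replaces several 1D copies, and still moves only the first
  // nhits elements of each row (the 16-bit buffers work the same way)
  cudaMemcpy2DAsync(h_owner32, kMaxHits * sizeof(uint32_t),  // dst, dst pitch
                    d_owner32, d_pitch,                      // src, src pitch
                    nhits * sizeof(uint32_t), kRows32,       // width, height
                    cudaMemcpyDeviceToHost, stream);
}
```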

I'd really like not to spread this pattern as-is, but to build some abstraction on top of it (e.g. along the lines of FWCore/SOA or FWCore/Utilities/interface/SoATuple.h). Also, other places do not seem to suffer as much from "many small transfers" relative to the shortness of their kernel times.

@makortel (Author) commented Aug 10, 2018

> Eventually we can look into wrapping it in some construct that guarantees the correctness of the memory reuse, like some kind of vector or stack with an element per stream - but we already know the present usage pattern is fine, by construction.

On the other hand, it seems to me (with #129) that CAHitNtupletHeterogeneousEDProducer currently allocates ~2 GB in its beginStream(), which starts to severely limit the number of EDM streams we can run in parallel (even on a V100).

Ok, looking at the code

```cpp
cudaCheck(cudaMalloc(&device_theCells_,
                     maxNumberOfLayerPairs_ * maxNumberOfDoublets_ * sizeof(GPUCACell)));
cudaCheck(cudaMalloc(&device_nCells_, sizeof(uint32_t)));
cudaCheck(cudaMemset(device_nCells_, 0, sizeof(uint32_t)));
cudaCheck(cudaMalloc(&device_isOuterHitOfCell_,
                     maxNumberOfLayers_ * maxNumberOfHits_ * sizeof(GPU::VecArray<unsigned int, maxCellsPerHit_>)));
```

with the constants

```cpp
static constexpr int maxNumberOfQuadruplets_ = 10000;
static constexpr int maxCellsPerHit_ = 2048; // 512;
static constexpr int maxNumberOfLayerPairs_ = 13;
static constexpr int maxNumberOfLayers_ = 10;
static constexpr int maxNumberOfDoublets_ = 262144;
static constexpr int maxNumberOfHits_ = 20000;
static constexpr int maxNumberOfRegions_ = 2;
```

and the GPUCACell data members

```cpp
GPU::VecArray< unsigned int, 40> theOuterNeighbors;
int theDoubletId;
int theLayerPairId;
private:
unsigned int theInnerHitId;
unsigned int theOuterHitId;
float theInnerX;
float theOuterX;
float theInnerY;
float theOuterY;
float theInnerZ;
float theOuterZ;
float theInnerR;
float theOuterR;
```

the allocated memory (per EDM stream) comes out as (if I got the numbers right)

  • device_theCells_: 689 MB
  • device_isOuterHitOfCell_: 1563 MB
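
A back-of-the-envelope check of those figures, assuming GPU::VecArray<T, N> holds its N elements plus a 32-bit size member (a layout assumption on my part); this reproduces both numbers:

```cpp
#include <cstdio>

int main() {
  // assumption: GPU::VecArray<T, N> stores N elements plus a 32-bit size
  constexpr long vecArray40 = 40L * 4 + 4;                        // theOuterNeighbors: 164 B
  constexpr long gpuCACell = vecArray40 + 2 * 4 + 2 * 4 + 8 * 4;  // + 2 ints, 2 uints, 8 floats = 212 B
  constexpr long vecArray2048 = 2048L * 4 + 4;                    // per-hit cell list: 8196 B

  constexpr long cells = 13L * 262144 * gpuCACell;    // layer pairs x doublets
  constexpr long outer = 10L * 20000 * vecArray2048;  // layers x hits

  std::printf("device_theCells_:         %ld MB\n", cells / (1024 * 1024));  // 689
  std::printf("device_isOuterHitOfCell_: %ld MB\n", outer / (1024 * 1024));  // 1563
  return 0;
}
```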

@felicepantaleo

@fwyzard commented Aug 10, 2018

I'd be interested to know how much of this memory we actually use...

@felicepantaleo
The number of doublets and the number of doublets per hit depend on the layer. I set this value to be safe for bpix1 and it's the same for all layers, but one could think of setting a different maximum per layer.
I also thought we would have an arena allocator to solve this problem, but the developer hired to build it defected.

@fwyzard commented Aug 10, 2018

Validation summary

Reference release CMSSW_10_2_1 at d00b7b4
Development branch CMSSW_10_2_X_Patatrack at 0301b0d
Testing PRs:

makeTrackValidationPlots.py plots

/RelValTTbar_13/CMSSW_10_2_1-PU25ns_102X_upgrade2018_realistic_v9_gcc7-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_1-102X_upgrade2018_realistic_v9_gcc7-v1/GEN-SIM-DIGI-RAW

DQM GUI plots

/RelValTTbar_13/CMSSW_10_2_1-PU25ns_102X_upgrade2018_realistic_v9_gcc7-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_1-102X_upgrade2018_realistic_v9_gcc7-v1/GEN-SIM-DIGI-RAW

logs and nvprof/nvvp profiles

/RelValTTbar_13/CMSSW_10_2_1-PU25ns_102X_upgrade2018_realistic_v9_gcc7-v1/GEN-SIM-DIGI-RAW

/RelValZMM_13/CMSSW_10_2_1-102X_upgrade2018_realistic_v9_gcc7-v1/GEN-SIM-DIGI-RAW

Logs

The full log is available at https://fwyzard.web.cern.ch/fwyzard/patatrack/pulls/d977ae4b9f60ddd1a243d974977bc257732033de/log .

@fwyzard commented Aug 10, 2018

@makortel would you rather merge it as it is now, or see if we can build a more general approach first?

@makortel (Author)

I would merge it now, and leave the better abstraction to a later exercise.

@makortel (Author)

I haven't looked closely at the validations of recent PRs; are the fluctuations shown above for 10824.8 of the expected magnitude?

@fwyzard commented Aug 10, 2018

Yes, I think we regularly see fluctuations at the percent level in the summary.

On the other hand, initcheck now complains about a few thousand places.
Can you have a look?

@makortel (Author)

> On the other hand, initcheck now complains about a few thousand places.
> Can you have a look?

That should do the trick.

@fwyzard commented Aug 13, 2018

Thanks.

More generally, I am wondering what gets flagged by initcheck: I assumed it would flag a memory area if it is read before having been written to.

As there were no cudaMemset calls in the original code, does it mean we were using uninitialised memory?
Or are the messages from initcheck false positives?

```diff
@@ -48,6 +48,7 @@ namespace pixelgpudetails {
   // Order such that the first ones are the ones transferred to CPU
   static_assert(sizeof(uint32_t) == sizeof(float)); // just stating the obvious
   cudaCheck(cudaMallocPitch(&gpu_.owner_32bit_, &gpu_.owner_32bit_pitch_, MAX_HITS*sizeof(uint32_t), 9));
+  cudaCheck(cudaMemsetAsync(gpu_.owner_32bit_, 0x0, gpu_.owner_32bit_pitch_*9, cudaStream.id()));
```
Why pitch * 9?
Shouldn't it be pitch * 9 * MAX_HITS * sizeof(uint32_t)?

Or am I just being confused?


Ah, ok, I was being confused: pitch is the rounded-up value of MAX_HITS*sizeof(uint32_t), so pitch * 9 it is.
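
In other words (paraphrasing with illustrative code, not the PR's): cudaMallocPitch rounds the requested row width up to an aligned pitch and allocates pitch x height bytes in total, so pitch * 9 is exactly the size of the whole allocation, padding included.

```cpp
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

constexpr std::size_t MAX_HITS = 65536;  // stand-in for the real constant

void allocateAndClear(cudaStream_t stream) {
  void* buf = nullptr;
  std::size_t pitch = 0;  // MAX_HITS*sizeof(uint32_t) rounded up by the runtime
  cudaMallocPitch(&buf, &pitch, MAX_HITS * sizeof(uint32_t), 9);
  // the allocation spans pitch * 9 bytes in total, padding included,
  // so pitch * 9 (not width * 9) is the full size to initialise
  cudaMemsetAsync(buf, 0x0, pitch * 9, stream);
}
```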


@makortel (Author)

Since initcheck (presumably) started to complain after the switch to cudaMallocPitch+cudaMemcpy2DAsync, I suppose it flags the padding. I don't know why anything would be accessing that, though. Maybe cudaMemcpy2DAsync? But even then I'd find the access a bit strange, as the target host memory does not have any padding.

@makortel (Author)

So am I just sweeping something under the carpet with the cudaMemsetAsync?

@fwyzard commented Aug 13, 2018

I have no idea...

@makortel (Author)

Taking a closer look, the first complaint is about a 32-byte area into which nhits_*4 points (in the first row), i.e. it is the first 32-byte area containing uninitialized memory. (I don't know why 32 bytes, but initcheck seems to complain at each 32-byte interval.) So doing the cudaMemsetAsync is probably the correct fix.

(And to my earlier worry about the padding: there is no padding in this case, at least on felk40.)

@fwyzard commented Aug 13, 2018

Thanks for looking into it.

I am still confused: does it mean that we were reading from that memory before having written anything into it?
I understand that with this fix the memory is now initialised to zero - but is it OK to read those zeros back? Are we missing some boundary checks? Or are we writing/reading to/from the wrong place?

P.S. I would run the tests again, but EOS seems to be down, so I wouldn't be able to upload the test results anyway...

@makortel (Author)

Good questions; I don't know. It could be some weird interplay between cudaMemcpy2D() and initcheck (like the latter not knowing exactly what the former does, or the former doing something different from what one would naively expect).

I made a little test program (makortel@b5f47af) which works fine by default. When one changes WIDTH to e.g. 257 (to mimic the case where we allocate more than we use in the memcpy), initcheck starts to complain.
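
The test presumably looks roughly like the sketch below (a reconstruction from the description, not the actual code at makortel@b5f47af): only the first USED columns of each row are written and copied back, yet with WIDTH > USED initcheck flags the copy.

```cpp
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// With WIDTH == USED initcheck is quiet; with WIDTH = 257 (allocating more
// columns than the copy uses) it starts to complain, even though everything
// that is read back was written first.
constexpr std::size_t WIDTH  = 257;  // elements allocated per row
constexpr std::size_t USED   = 256;  // elements written and transferred per row
constexpr std::size_t HEIGHT = 4;

int main() {
  float* d_buf = nullptr;
  std::size_t pitch = 0;
  cudaMallocPitch(reinterpret_cast<void**>(&d_buf), &pitch,
                  WIDTH * sizeof(float), HEIGHT);

  // initialise exactly the elements that will be copied back
  cudaMemset2D(d_buf, pitch, 0x0, USED * sizeof(float), HEIGHT);

  float* h_buf = nullptr;
  cudaMallocHost(reinterpret_cast<void**>(&h_buf),
                 USED * sizeof(float) * HEIGHT);

  // run under: cuda-memcheck --tool initcheck ./a.out
  cudaMemcpy2D(h_buf, USED * sizeof(float), d_buf, pitch,
               USED * sizeof(float), HEIGHT, cudaMemcpyDeviceToHost);
  cudaDeviceSynchronize();
  std::printf("done\n");

  cudaFreeHost(h_buf);
  cudaFree(d_buf);
  return 0;
}
```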

@fwyzard fwyzard merged commit df4d7cd into cms-patatrack:CMSSW_10_2_X_Patatrack Aug 13, 2018
@fwyzard fwyzard added this to the CMSSW_10_2_2_Patatrack milestone Aug 14, 2018
@fwyzard fwyzard modified the milestone: CMSSW_10_2_2_Patatrack Sep 2, 2018
fwyzard pushed a commit that referenced this pull request Oct 8, 2020
  - allocate HitsOnCPU buffers once per job per edm stream
  - coalesce multiple 32 bit and multiple 16 bit rechit buffers to two larger buffers; the allocation is done with cudaMallocPitch, the transfer with cudaMemcpy2DAsync
  - initialise the full memory buffer to keep cuda-memcheck happy