Patatrack integration - Pixel track reconstruction (10/N) #31722

) When running the GPU algorithms, the pixel unpacker is responsible for providing both the digis and the clusters. These changes make use of the unpacker label to access the clusters, conditionally on the presence of the `gpu` process modifier.

Matrix operations are based on Eigen. A first GPU version, running Eigen together with CUDA, is available in the test directory but currently disabled.

- reorganize `SiPixelRawToDigi` as `SiPixelRawToDigiHeterogeneous` using `HeterogeneousEDProducer`
- output a `HeterogeneousEvent`
- use `PixelThresholdClusterizer`
- add `SiPixelDigiHeterogeneousConverter`
- make cabling and gain transfers asynchronous
- reorganize `SiPixelRecHits` as `SiPixelRecHitHeterogeneous`
- move `PixelThresholdClusterizer` (back?) to interface+src in order to use it outside of RecoLocalTracker/SiPixelClusterizer
- replace `__host__ __device__` with `constexpr` to avoid weird compilation failures
- split clusters to their own converter

Port the Cellular Automaton (back) to GPUs and CUDA, using the `HeterogeneousEDProducer` approach:
- do memory allocations in the framework begin stream
- run the memory copies and kernels asynchronously, in a dedicated CUDA stream per framework stream

Use the new `GPU::VecArray` for holding repeated data structures.

By default, run on the GPU in all gpu-enabled workflows (e.g. 10824.8).

@makortel

Apply some clean up to the code and formatting of `CAHitNtupletHeterogeneousEDProducer` and `CAHitQuadrupletGeneratorGPU`, as suggested by @makortel during the review of #48:
- clean up the `BuildFile.xml`;
- remove unused data members and arguments from function calls;
- percolate the CUDA stream instead of storing it as a data member.

Also:
- add `cudaCheck` calls around memory allocations and copies;
- reduce the number of memory allocations used to set up the GPU state.

- the CPU Riemann fit works using all combinations of the 2 booleans `useErrors` and `useMultipleScattering`;
- the standalone version of the GPU Riemann fit has been updated in order to explore all possibilities among the 2 booleans above: all of them work and produce identical results up to 1e-5 precision (the default one, 1e-6, fails when enabling multiScattering, most likely due to matrix inversions);
- the GPU version of the Riemann fit within CMSSW works, with 1 fit assigned to each thread, with 32 threads/warps, all dynamically computed.

Things that need a "hack":
- limit the "dynamic" size of Eigen matrices to at most 4x4, which is just fine for quadruplets. Using anything wider will cause errors which I *believe* are related to the stack size of threads on the GPU;
- cast matrices to be inverted to 4x4 (was done before the previous point: will revert it back and see if that's still needed or not, but I believe it is); this was done in order to "specialize" the `invert()` call to something that is "natively" supported by Eigen on GPU (that also brought in a few `__host__` `__device__` here and there in Eigen);
- fix the alignment of the `struct` holding the results of the fit, since its size was different on GPU and CPU, causing an annoying off-by-one effect.

…106)

Can be included with the following snippet in the configuration:

    from RecoPixelVertexing.Configuration.customizePixelTracksForProfiling import customizePixelTracksForProfiling
    process = customizePixelTracksForProfiling(process)

Removes validation, DQM, and output modules. As suggested in #70 (comment), an `AsciiOutputModule` is used to require the `pixelTracks`.

Implement a heterogeneous Cluster-to-TrackingParticle associator running on the GPU.

…low (#111)

Pixel doublets (actually CACells) are created on GPU and fed to CA. The whole workflow up to quadruplets candidates is now fully on GPU.

Do not `#ifdef` on `__NVCC__`: to protect CUDA-aware code sections, check if the `__CUDACC__` symbol is defined. The symbol `__NVCC__` is defined when building with nvcc, but not when building CUDA code with clang.

Move header files referenced from outside their directory to the interface/ directory, and update the include guards accordingly.

Include `<cuda_runtime.h>` instead of `<cuda.h>` to handle the CUDA attributes in non-CUDA compilations.

Rename PixelTrackReconstructionGPU_impl.cu to PixelTrackReconstructionGPU.cu.

Other cleanup: `#define`s, debug messages, change `__inline__` to `inline`, fix include guards, whitespaces, etc.
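
A minimal sketch of the guard described above, assuming a header that must compile both with a CUDA compiler and as plain C++. The macro `SKETCH_HOST_DEVICE` and both functions are hypothetical names used only for illustration; they are not taken from the Patatrack code.

```cpp
#include <cassert>

// Hypothetical macro, for illustration only.
#ifdef __CUDACC__
// A CUDA compiler (nvcc or clang in CUDA mode) is in use: keep the attributes.
#define SKETCH_HOST_DEVICE __host__ __device__
#else
// Plain C++ compilation: the attributes expand to nothing.
#define SKETCH_HOST_DEVICE
#endif

// Callable from host code in any build, and from device code in CUDA builds.
SKETCH_HOST_DEVICE constexpr int squared(int x) { return x * x; }

// Reports which branch of the guard was taken at compile time.
constexpr bool compiledAsCuda() {
#ifdef __CUDACC__
  return true;
#else
  return false;
#endif
}
```

Testing `__CUDACC__` rather than `__NVCC__` keeps the same header usable when the CUDA code is built with clang.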

Keep RiemannFit.h in the interface, as it is include-only.

Clean up unnecessary changes, whitespaces, defines and include directives.

…nsfer (#132) Always produce the CPU cluster and rechit collections, since they are needed anyway. Add transfer and conversion flags to clusterizer, rechits and CA. Add a skeleton for the future pixel track producer. Add customize functions to disable conversions to legacy formats, and to disable unnecessary GPU->CPU transfers.

Apply clang-format reformatting to RiemannFit.h

Fix for uninitialised variables. Always assume multiple scattering treatment and remove unused parameters. Remove test that has diverged from the actual implementation.

Add separate plots for tracks associated to the primary vertex.

Tune and speed up the pixel doublet algorithm, and take advantage of GPU read-only memory for a further speedup. Includes a python notebook to tune the cuts for doublets and triplets.

Pre-compute few constants that could not be declared constexpr. Reduce temporary buffer size. Reduce the block size of the calls to gpuPixelDoublets::getDoubletsFromHisto() from 256 to 64, to make better usage of the GPU processors.
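
As a rough illustration of what the block size change means for a launch, the sketch below computes how many blocks are needed to cover a given number of elements. It assumes the common "round up" launch arithmetic; the helper `blocksFor` is hypothetical, and the actual launch code for `gpuPixelDoublets::getDoubletsFromHisto()` may compute its grid differently.

```cpp
#include <cassert>
#include <cstdint>

// Number of blocks needed to cover nElements with threadsPerBlock threads each,
// rounded up so that the last block may be only partially used.
inline uint32_t blocksFor(uint32_t nElements, uint32_t threadsPerBlock) {
  assert(threadsPerBlock > 0);
  return (nElements + threadsPerBlock - 1) / threadsPerBlock;
}
```

For example, 1000 elements need 4 blocks of 256 threads but 16 blocks of 64 threads, so the smaller block size yields more, smaller units of work.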

Also, add back the stand-alone GPU fit test.

Reduce the number of blocks used to launch the Riemann fit kernels within the CA. Rename the kernels to avoid the ambiguity with the standalone Riemann fit. Work around spurious warnings in the Eigen test.

Implement the multiple scattering treatments in the Riemann Fit. In particular:
- modify the previous implementation of the multiple scattering in the circle fit to correctly cover both the barrel and the forward case;
- implement the multiple scattering in the line fit in the S-Z plane both for the barrel and the forward case.

The effective radiation length is still an approximate value since the phi angle is not taken into account (it is not known on a layer-by-layer case). An ad-hoc correction based on the inverse of the pt has been added, with a cut-off of 1 GeV. The pulls are ok-ish, the material could be further tuned. The Chi2 is flat on all eta range.

The Riemann Fit has been reworked so that both barrel and forward cases are naturally supported without branching. The underlying assumption is the uniform material distribution within the Pixel Tracker. The line fit has been reworked and is now using an ordinary least square fit in the S-Z plane. See the motivations and explanations inside the comments in the code.

Additional changes:
- code clean up
- remove unused functions
- fix standalone test of RiemannFit on GPU
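
For readers unfamiliar with the technique, here is a minimal sketch of an ordinary least squares straight-line fit of z against s. It is unweighted and ignores hit errors and multiple scattering, so it is only an illustration of the closed-form solution, not the fit implemented in `RiemannFit.h`. The `LineFit` struct and `fitLine` function are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct LineFit {
  double slope;
  double intercept;
};

// Unweighted ordinary least squares fit of z = slope * s + intercept.
inline LineFit fitLine(const std::vector<double>& s, const std::vector<double>& z) {
  assert(s.size() == z.size() && s.size() >= 2);
  const double n = static_cast<double>(s.size());
  double sumS = 0., sumZ = 0., sumSS = 0., sumSZ = 0.;
  for (std::size_t i = 0; i < s.size(); ++i) {
    sumS += s[i];
    sumZ += z[i];
    sumSS += s[i] * s[i];
    sumSZ += s[i] * z[i];
  }
  // The denominator vanishes only if all s values are identical.
  const double denom = n * sumSS - sumS * sumS;
  assert(denom != 0.);
  const double slope = (n * sumSZ - sumS * sumZ) / denom;
  const double intercept = (sumZ - slope * sumS) / n;
  return {slope, intercept};
}
```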

…kport cms-sw#25163) (#202)

Backport "Migrate tracker local reconstruction and pixel tracking to Tasks" (cms-sw#25163) to the Patatrack branch:
- migrate RecoLocalTracker_cff to Tasks;
- migrate RecoPixelVertexing_cff to Tasks;
- keeping sequences to avoid massive migration (for now).

… MTV variation to pixel track validation sequence (#199)

- add B-hadron MTV variation to pixel track validation sequence
- fix MTV validation of initialStepPreSplitting tracks

Cleaned up by clang-tidy 7.0.0.

Enabled checks:
- boost-use-to-string
- misc-uniqueptr-reset-release
- modernize-deprecated-headers
- modernize-make-shared
- modernize-use-bool-literals
- modernize-use-equals-delete
- modernize-use-nullptr
- modernize-use-override
- performance-unnecessary-copy-initialization
- readability-container-size-empty
- readability-redundant-string-cstr
- readability-static-definition-in-anonymous-namespace
- readability-uniqueptr-delete-release

See http://releases.llvm.org/7.0.0/tools/clang/tools/extra/docs/clang-tidy/index.html for details.

Port and optimise the full workflow from pixel raw data to pixel tracks and vertices to GPUs. Clean the pixel n-tuplets with the "fishbone" algorithm (only on GPUs).

Other changes:
- recover the Riemann fit updates lost during the merge with CMSSW 10.4.x;
- speed up clustering and track fitting;
- minor bug fix to avoid trivial regression with the optimized fit.

`#pragma unroll` is not supported by GCC, leading to compilation warnings in host code. GCC 8 supports `#pragma GCC unroll N` which could be used instead. However, benchmarking on a V100 with and without the `#pragma unroll` there is no observable difference, so it is simpler to remove them.
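
For reference, the alternative mentioned above could look like the following sketch for host code. It assumes GCC 8 or later for `#pragma GCC unroll`, and expands to nothing on any other compiler; the macro `SKETCH_UNROLL_4` and the function `sumOfFour` are hypothetical and only illustrate the idea that was considered and then dropped in favour of removing the pragmas.

```cpp
#include <cassert>

// Hypothetical macro, for illustration only: request unrolling where the
// host compiler is known to understand the GCC-specific pragma.
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ >= 8
#define SKETCH_UNROLL_4 _Pragma("GCC unroll 4")
#else
#define SKETCH_UNROLL_4
#endif

// Sums a fixed-size array; the pragma is only a hint and does not change the result.
inline int sumOfFour(const int (&values)[4]) {
  int sum = 0;
  SKETCH_UNROLL_4
  for (int i = 0; i < 4; ++i) {
    sum += values[i];
  }
  return sum;
}
```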

Make unit tests that require a CUDA device skip the test and exit successfully if the CUDA runtime is not available, or no CUDA devices are available.

Introduce the inner loop parallelization in the doublet finder using the stride pattern already used in the "fishbone", and make use of a 2D grid instead of a hand-made stride.
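
The stride pattern can be illustrated with a host-side emulation of a 2D launch, sketched below. It assumes that one grid dimension strides over the outer items and the other over the inner items; which dimension the actual doublet finder assigns to which loop is not stated here, and `visitStrided` and `visitAll` are hypothetical helpers.

```cpp
#include <cassert>
#include <vector>

// Emulates one thread with index (tx, ty) in a launch of nx * ny threads:
// it handles outer items ty, ty + ny, ... and, for each of them,
// inner items tx, tx + nx, ...
inline void visitStrided(int tx, int ty, int nx, int ny, int nOuter, int nInner, std::vector<int>& visits) {
  for (int i = ty; i < nOuter; i += ny) {
    for (int j = tx; j < nInner; j += nx) {
      ++visits[i * nInner + j];
    }
  }
}

// Runs all emulated threads serially and returns how often each
// (outer, inner) pair was visited.
inline std::vector<int> visitAll(int nx, int ny, int nOuter, int nInner) {
  assert(nx > 0 && ny > 0 && nOuter >= 0 && nInner >= 0);
  std::vector<int> visits(nOuter * nInner, 0);
  for (int ty = 0; ty < ny; ++ty) {
    for (int tx = 0; tx < nx; ++tx) {
      visitStrided(tx, ty, nx, ny, nOuter, nInner, visits);
    }
  }
  return visits;
}
```

Every pair is visited exactly once regardless of how the problem size compares to the number of threads, which is the property that makes the strided loops safe for any launch configuration.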

Provide a mechanism for a chain of modules to share a resource, that can be e.g. CUDA device memory or a CUDA stream. Minimize data movements between the CPU and the device, and support multiple devices. Allow the same job configuration to be used on all hardware combinations. See HeterogeneousCore/CUDACore/README.md for a more detailed description and examples.

* Add DQM for pixel vertices
* Add pT>0.9GeV pixel track collections to MTV
* Add dzPV0p1, Pt0to1, Pt1 variants of pixel track DQM

Create modifiers for enabling the broken line fit on the CPU and on the GPU. Use dynamically-sized matrices: the advantage over statically-sized ones is that the code would also work with n>4; the switch can be easily done at the start of the file. Update Eigen tests with the features used by the broken line fit.

Merge the Riemann and broken line fits into a single configurable pixel n-tuplet fitter, and extend it to work with up to 5 hits. Make the broken line fit the default algorithm. Try both triplets and quadruplets in the pixel "hole". Limit the pT used to compute the multiple scattering. Use the inline Cholesky decomposition. Generic clean up and improvements.
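
As background on the decomposition mentioned above, the sketch below factorises a small symmetric positive-definite matrix as A = L * L^T using fixed-size arrays. It is a plain loop-based textbook version, not the inline implementation used by the fitter; `Matrix` and `cholesky` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

template <std::size_t N>
using Matrix = std::array<std::array<double, N>, N>;

// Returns the lower-triangular L such that A = L * L^T,
// for a symmetric positive-definite matrix A.
template <std::size_t N>
Matrix<N> cholesky(const Matrix<N>& a) {
  Matrix<N> l{};  // zero-initialised, so the upper triangle stays zero
  for (std::size_t i = 0; i < N; ++i) {
    for (std::size_t j = 0; j <= i; ++j) {
      double sum = a[i][j];
      for (std::size_t k = 0; k < j; ++k) {
        sum -= l[i][k] * l[j][k];
      }
      if (i == j) {
        assert(sum > 0.);  // fails if A is not positive definite
        l[i][j] = std::sqrt(sum);
      } else {
        l[i][j] = sum / l[j][j];
      }
    }
  }
  return l;
}
```

With a fixed, small N known at compile time the loops have constant bounds, which is what makes this kind of routine a candidate for full unrolling.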

Improve pixel doublets and CA:
- add pixel cluster size and shape cuts in doublets;
- add triplet cleaner;
- improved cluster size studies;
- implement layer-dependent cuts in the CA.

Add counters in GPU code and possibility to test full doublet combinatorics. Update python notebook and include z0 resolution.

…amework (#338)

Use cleaned hits. Use pixel layer and ladders geometry, and use pixel triplets in the gaps.

Optimise GPU memory usage:
- reduce the number of memory allocations
- fix the size of the cub workspace
- allocate memory per event via the caching allocator
- use constant memory for geometry and parameters
- use shared memory where the content is the same for every thread

Optimise kernel launches, and add a protection for empty events and overflows.

Enable pixel triplets with:

    process.pixelTracksHitQuadruplets.minHitsPerNtuplet = 3
    process.pixelTracksHitQuadruplets.includeJumpingForwardDoublets = True

Changes:
- adjust for the average pixel geometry and the beam spot position;
- allow "jumping doublets" in the forward region (FPIX1-FPIX3) for triplets.

- port the whole pixel workflow to new heterogeneous framework
- implement a legacy cluster to SoA converter for the pixel RecHits
- update the vertex producer to run on CPU as well as GPU

…vice (#364)

To reduce dependencies on edm::Service, and to make CUDAService less of a collection of everything, split off from it:
- the CUDAEventCache
- the CUDAStreamCache
- the caching allocators

Other changes:
- clean up unnecessary use of CUDAService
- fix maxCachedFraction, add debug printouts
- add make_*_unique_uninitialized that avoid the static_assert

…389)

- Replace `cuda::stream_t<>` with `cudaStream_t` in client code
- Replace `cuda::event_t` with `cudaEvent_t` in the client code
- Clean up BuildFiles

Otherwise there are possibilities for weird races, e.g. combination of non-ExternalWork producers, consumed-but-not-read CUDAProducts, CUDA streams executing work later than expected (= on the next event).

Reorder cuts and factorize some code to speed up doublets. Increase the size of various buffers so that they do not overflow in case of very relaxed cuts. Rename some parameters to better reflect their actual action in code.

Migrate ClusterTPAssociationHeterogeneous using the deprecated HeterogeneousEDProducer to ClusterTPAssociationProducerCUDA, and implement a simple analyzer to consume its product.

To test it, add a dummy analyzer to an MC workflow:

    process.load("SimTracker.TrackerHitAssociation.clusterTPCUDAdump_cfi")
    process.validation_step = cms.EndPath(process.globalValidationPixelTrackingOnly + process.clusterTPCUDAdump)
    process.tpClusterProducerCUDAPreSplitting.dumpCSV = True

Rename the cudautils namespace to cms::cuda or cms::cudatest, and drop the CUDA prefix from the symbols defined there.

Always record and query the CUDA event, to minimize need for error checking in CUDAScopedContextProduce destructor.

Add comments to highlight the pieces in CachingDeviceAllocator that have been changed wrt. cub.

Various other updates and clean up:
- enable CUDA for compute capability 3.5.
- clean up CUDAService, CUDA tests and plugins.
- add CUDA existence protections to BuildFiles.
- mark thread-safe static variables with CMS_THREAD_SAFE.

Major changes:
- restructure the RecoPixelVertexing/PixelVertexFinding package;
- update the interface of PixelCPEFast.

Fix include guard in CUDADataFormats/Track/src/classes.h . Remove unused variables in DataFormats/Math/test/CholeskyInvert_t.cpp .

Clean up the Patatrack code base following the comments received during the integration into the upstream release.

Currently tracks the changes introduced due to
- cms-sw#29109: Patatrack integration - trivial changes (1/N)
- cms-sw#29110: Patatrack integration - common tools (2/N)

List of changes:
* Remove unused files
* Fix compilation warnings
* Fix AtomicPairCounter unit test
* Rename the cudaCompat namespace to cms::cudacompat
* Remove extra semicolon
* Move SimpleVector and VecArray to the cms::cuda namespace
* Add missing dependency
* Move HistoContainer, AtomicPairCounter, prefixScan and radixSort to the cms::cuda namespace
* Remove rule exception for HeterogeneousCore
* Fix code rule violations:
  - replace using namespace cms::cuda in test/OneToManyAssoc_t.h .
  - add an exception for cudaCompat.h: cudaCompat relies on defining equivalent symbols to the CUDA intrinsics in the cms::cudacompat namespace, and pulling them in the global namespace when compiling device code without CUDA.
* Protect the headers to compile only with a CUDA compiler
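
The cudaCompat approach described in the exception above can be sketched as follows, for the non-CUDA build only. The namespace `sketch_cudacompat` stands in for `cms::cudacompat`, and the serial `atomicAdd` stand-in and the `countPositive` function are assumptions written for this illustration; the real header covers many more symbols.

```cpp
#include <cassert>

namespace sketch_cudacompat {  // hypothetical, stands in for cms::cudacompat
  // Serial stand-in for the CUDA atomicAdd intrinsic: adds value to *address
  // and returns the previous content, as the intrinsic does.
  template <typename T>
  T atomicAdd(T* address, T value) {
    const T old = *address;
    *address += value;
    return old;
  }
}  // namespace sketch_cudacompat

#ifndef __CUDACC__
// Without a CUDA compiler, make the stand-ins visible under their usual names.
using namespace sketch_cudacompat;
#endif

// Code written against the CUDA name, here compiled as plain serial C++.
inline int countPositive(const int* data, int n) {
  int counter = 0;
  for (int i = 0; i < n; ++i) {
    if (data[i] > 0) {
      atomicAdd(&counter, 1);
    }
  }
  return counter;
}
```

This is why the header needs an exception to the rule against `using namespace` in headers: the using-directive is the mechanism that lets the same source resolve the intrinsic names without CUDA.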

Clean up instances of using namespace ... from header files, following the comments from the upstream integration.

Replace the use of the prefix scan from CUB with a home-brewed implementation, using dynamic instead of static shared memory. No changes to physics or timing performance.
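
To illustrate the structure of a block-wise prefix scan, here is a serial host-side sketch: each block is scanned independently, then the total of all previous blocks is added. It does not model shared memory, warps or synchronisation, and `blockPrefixScan` is a hypothetical name, not the function in the Patatrack code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum computed block by block: the elements of each block are
// scanned locally, and the running total of the previous blocks is added on top.
inline std::vector<int> blockPrefixScan(const std::vector<int>& in, std::size_t blockSize) {
  assert(blockSize > 0);
  std::vector<int> out(in.size());
  int carry = 0;  // sum of all elements in the previous blocks
  for (std::size_t first = 0; first < in.size(); first += blockSize) {
    const std::size_t last = std::min(first + blockSize, in.size());
    int running = 0;  // local scan within the current block
    for (std::size_t i = first; i < last; ++i) {
      running += in[i];
      out[i] = running + carry;
    }
    carry += running;
  }
  return out;
}
```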

Adjust the growth factor in the caching allocators to use more granular bins, reducing the memory wasted by the allocations. Use a dynamic buffer for CA cells components. Fix a possible data race in the prefix scan.
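
The effect of the growth factor can be seen in a small sketch, assuming geometric bins of size growth^k as in the cub caching allocator. The helpers `binSize` and `wastedBytes` are hypothetical, and the growth factors 8 and 2 used in the example are illustrative values, not the ones chosen in this change.

```cpp
#include <cassert>
#include <cstddef>

// Smallest bin of size growth^k (k >= 0) that can hold the requested bytes.
inline std::size_t binSize(std::size_t bytes, std::size_t growth) {
  assert(growth >= 2);
  std::size_t bin = 1;
  while (bin < bytes) {
    bin *= growth;
  }
  return bin;
}

// Bytes reserved by the bin but not requested by the caller.
inline std::size_t wastedBytes(std::size_t bytes, std::size_t growth) {
  return binSize(bytes, growth) - bytes;
}
```

For a 600 byte request, a growth factor of 8 selects a 4096 byte bin (3496 bytes unused), while a growth factor of 2 selects a 1024 byte bin (424 bytes unused).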

customizePixelOnlyForProfilingGPUOnly: Customise the Pixel-only reconstruction to run on GPU
Run the unpacker, clustering, ntuplets, track fit and vertex reconstruction on GPU.

customizePixelOnlyForProfilingGPUWithHostCopy: Customise the Pixel-only reconstruction to run on GPU, and copy the data to the host
Run the unpacker, clustering, ntuplets, track fit and vertex reconstruction on GPU, and copy all the products to the host in SoA format. The same customisation can be also used on the SoA CPU workflow, running up to the tracks and vertices on the CPU in SoA format, without conversion to legacy format.

customizePixelOnlyForProfiling: Customise the Pixel-only reconstruction to run on GPU, copy the data to the host, and convert to legacy format
Run the unpacker, clustering, ntuplets, track fit and vertex reconstruction on GPU; copy all the products to the host in SoA format; and convert them to legacy format. The same customisation can be also used on the CPU workflow, running up to the tracks and vertices on the CPU.

Minor bug fixes:
- fix a typo in EventFilter/EcalRawToDigi/plugins/BuildFile.xml .

Clean up:
- remove obsolete ArrayShadow class;
- remove obsolete profiling functions.

Update the RelVal workflows and the CPU customisation:
- change the .501 workflow to run the full Patatrack pixel track reconstruction on CPU
- add a customisation to run the Patatrack reconstruction with triplets, on CPU and GPU
- add the .505 and .506 workflows to reconstruct triplets, on CPU and GPU

Co-authored-by: Andrea Bocci <andrea.bocci@cern.ch>

…560)

Add a counter for forlorn doublets.

Address the pixel local reconstruction review comments.

General clean up of the pixel local reconstruction code:
- remove commented out and obsolete code and data members
- use named constants more consistently
- update variable names to follow the coding rules and for better consistency
- use member initializer lists in the constructors
- allow `if constexpr` in CUDA code
- use `std::size` instead of hardcoding the array size
- convert iterator-based loops to range-based loops
- replace `cout` and `printf` with `LogDebug` or `LogWarning`
- use put tokens
- reorganise the auto-generated cfi files and use them more consistently
- adjust code after rearranging an `#ifdef GPU_DEBUG` block
- apply code formatting
- other minor changes

Improve comments:
- improve comments and remove obsolete ones
- clarify comments and types regarding `HostProduct`
- update comments about `GPU_SMALL_EVENTS` being kept for testing purposes
- add notes about the original cpu code

Reuse some more common code:
- move common pixel cluster code to `PixelClusterizerBase`
- extend the `SiPixelCluster` constructor

Rename classes and modules for better consistency:
- remove the `TrackingRecHit2DCUDA.h` and `gpuClusteringConstants.h` forwarding headers
- rename `PixelRecHits` to `PixelRecHitGPUKernel`
- rename `SiPixelRecHitFromSOA` to `SiPixelRecHitFromCUDA`
- rename `siPixelClustersCUDAPreSplitting` to `siPixelClustersPreSplittingCUDA`
- rename `siPixelRecHitsCUDAPreSplitting` to `siPixelRecHitsPreSplittingCUDA`
- rename `siPixelRecHitsLegacyPreSplitting` to `siPixelRecHitsPreSplittingLegacy`
- rename `siPixelRecHitHostSoA` to `siPixelRecHitSoAFromLegacy`

Re-apply changes from cms-sw#29805 that were lost in the Patatrack branch.
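
Two of the clean-up items above, `std::size` and range-based loops, can be illustrated with a short sketch. It assumes C++17; the array `kSketchOffsets` and its values are made up for the example and do not correspond to any constant in the pixel code.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>  // std::size (C++17)

// Made-up table, for illustration only.
constexpr int kSketchOffsets[] = {0, 96, 320, 672};

// The element count follows the array definition instead of being hardcoded.
constexpr std::size_t kSketchNumOffsets = std::size(kSketchOffsets);

// Range-based loop: no explicit index or iterator bookkeeping.
inline int sumOffsets() {
  int sum = 0;
  for (int offset : kSketchOffsets) {
    sum += offset;
  }
  return sum;
}
```

If an entry is later added to the table, both the count and the loop pick it up without further changes.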

Address the pixel local reconstruction review comments:
- remove obsolete comments;
- consistently use named constants;
- rename data members and methods to be more descriptive;
- rename local variables according to the coding rules and for consistency with cms-sw#32591;
- update transient dictionaries to match data types.

Update EDM access:
- switch to consumes() scheme for event setup;
- simplify some event data access.

Style fixes:
- make class member private and fix problematic cast;
- format of comments for clang-tidy;
- change to enum class to avoid creating a namespace (usage becomes: `pixelTrack::Quality::loose`);
- add article reference in comment (it was already further down in the file);
- fix member functions and classes capitalization;
- fix one letter or upper case variable names in formulas (trying to keep the naming from the reference article).

Avoid some code repetitions.
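
A minimal sketch of the enum class usage mentioned above. The namespace `pixelTrackSketch` stands in for `pixelTrack`; only the enumerator `loose` appears in the text, while `bad`, `strict`, their ordering and the helper `atLeast` are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

namespace pixelTrackSketch {  // hypothetical, stands in for pixelTrack
  // Scoped enumeration: the enumerators live inside Quality, so no extra
  // namespace is needed to keep them out of the enclosing scope.
  enum class Quality : uint8_t { bad = 0, loose, strict };

  // Scoped enums of the same type can be compared directly.
  constexpr bool atLeast(Quality quality, Quality threshold) { return quality >= threshold; }
}  // namespace pixelTrackSketch
```

Usage then reads `pixelTrackSketch::Quality::loose`, and an implicit conversion to an integer no longer compiles, which catches accidental arithmetic on quality flags.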

Fix RecoPixelVertexing/PixelTrackFitting/test/BuildFile.xml following file renames. Remove unnecessary customisation from RecoPixelVertexing/Configuration/python/customizePixelTracksSoAonCPU.py .


Commits on Jan 15, 2021

Commits on Apr 1, 2021