CTD19 developments for review and merging (fixed) #329
Conversation
…atrices of static dimensions to run on the GPUs
…o use matrices of static dimensions in order to run on the GPUs.
- deleted the forgotten prints and time measurements;
- created a new modifier for the broken line fit;
- switched back from tipMax=1 to tipMax=0.1 (the change may be made in another PR);
- restored the original order of the cuts on chi2 and tip;
- deleted the default label of pixelFitterByBrokenLine;
- switched from CUDA_HOSTDEV to __host__ __device__;
- BrokenLine.h now uses dynamically-sized matrices (their advantage over statically-sized ones is that the code also works with n > 4) and, as before, the switch can easily be made at the start of the file;
- hence, the test on GPUs now needs an increase in the stack size (at least 1761 bytes);
- some doxygen comments in BrokenLine.h have been updated.
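For context, a minimal sketch (not the actual BrokenLine.h code; the type and function names are made up for illustration) of the two kinds of Eigen matrices being discussed: the statically-sized variant has its dimensions fixed at compile time and needs no dynamic allocation, which suits `__host__ __device__` code but ties the fit to a fixed number of hits, while the dynamically-sized variant works for any n at the cost of run-time allocation, which on the GPU translates into the larger stack/heap budget mentioned above.

```cpp
#include <Eigen/Dense>

// Statically sized: dimensions fixed at compile time, no dynamic allocation.
// Well suited to device code, but limited to a fixed number of hits (here 4).
using Matrix4dFixed = Eigen::Matrix<double, 4, 4>;

// Dynamically sized: dimensions chosen at run time, allocates storage.
// Works for any number of hits n (also n > 4), but needs a larger stack/heap
// budget when used in device code.
using MatrixXdDyn = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic>;

void buildCovariance(int n) {
  Matrix4dFixed fixedCov = Matrix4dFixed::Zero();    // always 4x4
  MatrixXdDyn dynCov = MatrixXdDyn::Zero(n, n);      // n x n, n known at run time
  // ... the fit would fill and invert one of these, depending on which
  //     alias is selected at the top of the file ...
  (void)fixedCov;
  (void)dynCov;
}
```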
…called Otherwise, if the stream was idle before, and the constructor queues work to it, the event is not created and downstream will assume that the product is always there (even if it isn't yet).
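For readers outside the CMSSW internals, a minimal sketch (hypothetical function names, not the framework code) of the pattern being discussed: the event has to be recorded on the producing stream after the work is queued, because an event that was never recorded is reported as already complete, so a consumer waiting on it would assume the product is ready even when it is not.

```cuda
#include <cuda_runtime.h>

__global__ void produce(float* out) { /* fill the product */ }

// Producer: queue the work, then record the "ready" event on the same stream.
// If the event were never recorded, cudaStreamWaitEvent / cudaEventQuery would
// treat it as complete and a consumer could read the product too early.
void producer(float* d_out, cudaStream_t stream, cudaEvent_t ready) {
  produce<<<1, 32, 0, stream>>>(d_out);
  cudaEventRecord(ready, stream);
}

// Consumer on another stream: wait for the product before using it.
void consumer(cudaStream_t stream, cudaEvent_t ready) {
  cudaStreamWaitEvent(stream, ready, 0);
  // ... launch kernels that read d_out on `stream` ...
}
```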
It continues to crash if many 8-thread jobs are run concurrently; see for instance:
When successful, the results seem consistent.
And it crashes on a V100 as well.
What GPU is there on vinzen0?
OK, so the multi-job crash happens on Pascal-based cards as well.
So far it has been difficult to reproduce it on a GTX 1080; now I'm stressing a P100.
I'm running 4 × 8-thread jobs on an 8-CPU node against a small GTX 1060.
On a V100 even just one 8-thread job crashes (but that is on flatiron, and I'm not sure if there is other activity: the system is pretty overloaded these days).
Oops, at the moment I'm sharing a machine with you: workergpu16, so no real surprise.
But we should not be able to share the GPU, so it should not be an issue...
OK, it's reproducible also on a P100. By the way, in my tests (without this PR) the error always points to this line:

```cpp
if (CubDebug(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream))) return error;
```

And if I add something like

```cpp
if (CubDebug(error = cudaGetLastError())) return error;
```

right before it, that never triggers, and the error stays with the cudaEventRecord call.
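For clarity, the check described above looks roughly like this (a sketch with a hypothetical helper name, not the actual cub source): querying the sticky error state right before the call. Since the extra check never fires, the illegal-address error is first surfaced by the event record itself rather than being left over from an earlier, already-reported failure.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Sketch of the debugging check described above.
bool recordReadyEvent(cudaEvent_t ev, cudaStream_t stream) {
  // If an asynchronous error were already pending, it would show up here.
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    std::fprintf(stderr, "error pending before record: %s\n", cudaGetErrorString(err));
    return false;
  }
  // Otherwise the illegal address is reported by the event record itself.
  err = cudaEventRecord(ev, stream);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaEventRecord failed: %s\n", cudaGetErrorString(err));
    return false;
  }
  return true;
}
```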
@fwyzard, did you get the crashes with this PR or with the current HEAD?
[innocent@workergpu21 src]$ git diff

```diff
diff --git a/RecoPixelVertexing/Configuration/python/customizePixelTracksForProfiling.py b/RecoPixelVertexing/Configuration/python/customizePixelTracksForProfiling.py
index 1021918c0ce..f50ab120671 100644
--- a/RecoPixelVertexing/Configuration/python/customizePixelTracksForProfiling.py
+++ b/RecoPixelVertexing/Configuration/python/customizePixelTracksForProfiling.py
@@ -32,6 +32,6 @@ def customizePixelTracksForProfilingDisableTransfer(process):
     # Disable "unnecessary" transfers to CPU
     process.pixelTracksHitQuadruplets.gpuEnableTransfer = False
-    process.pixelVertices.gpuEnableTransfer = False
+    process.pixelVertices.gpuEnableTransfer = True
     return process
```
Same behaviour.
It crashes eventually (less often than before; I've even seen a multi-GPU job complete).
On 23 Apr 2019, at 4:06 PM, Andrea Bocci wrote:
> Oops, at the moment I'm sharing a machine with you: workergpu16, so no real surprise.
> But we should not be able to share the GPU, so it should not be an issue...
But apparently it is an issue... (by the way, are your jobs still running OK?)
It crashes occasionally on an empty machine as well ...
So now I asked for 4 GPUs to get a machine on my own (flatiron is now pretty empty),
and for the first time in a week I managed to finish the benchmark.
Still, it even crashed single-threaded while destructing things:
```
dropped waiting message count 0
>> processed 13608 events
||Counters | nEvents | nHits | nCells | nTuples | nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells ||
/mnt/home/innocent/CMSSW_10_6_0_pre2_Patatrack/src/HeterogeneousCore/CUDAServices/src/CUDAService.cc, line 358: cudaErrorIllegalAddress: an illegal memory access was encountered
```
OK, apparently it is using more than one device...
```
Exception Message:
Callback of CUDA stream 0x2aab109132b0 in device 3 error cudaErrorIllegalAddress: an illegal memory access was encountered
```
How do I tell it to use only one device?
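(For reference, and independent of any CMSSW-specific configuration: a process can be restricted to one GPU either by setting the CUDA_VISIBLE_DEVICES environment variable before launching it, or by selecting the device explicitly at startup. A minimal sketch:)

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // Option 1 (outside the program): export CUDA_VISIBLE_DEVICES=0
  // so only one device is visible to the process.
  // Option 2 (inside the program): pick the device explicitly.
  int count = 0;
  cudaGetDeviceCount(&count);
  std::printf("visible devices: %d\n", count);
  cudaSetDevice(0);  // subsequent CUDA calls in this thread use device 0
  return 0;
}
```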
> did you get the crashes with this PR or with the current HEAD?
With the current HEAD, running 5 jobs with 3 streams each, all on a single GPU.
Instead, running with one job per GPU does not crash even after 500+ tests.
Without the smart cache, a single job does not crash on a V100.
But it does not crash with multiple jobs on a V100....
Superseded by #338.
This PR is on top of #324.
3a) more than 256 hits in a module
3b) duplicate pixels