CTD19 developments for review and merging (fixed) #329
Conversation
…atrices of static dimensions to run on the GPUs
…o use matrices of static dimensions in order to run on the GPUs.
- deleted the forgotten prints and time measurements;
- created a new modifier for the broken line fit;
- switched back from tipMax=1 to tipMax=0.1 (the change may be made in another PR);
- restored the original order of the cuts on chi2 and tip;
- deleted the default label of pixelFitterByBrokenLine;
- switched from CUDA_HOSTDEV to __host__ __device__;
- BrokenLine.h now uses dynamically-sized matrices (their advantage over statically-sized ones is that the code also works with n > 4) and, as before, the switch can easily be made at the start of the file;
- hence, the test on GPUs now needs an increase in the stack size (at least 1761 bytes);
- some doxygen comments in BrokenLine.h have been updated.
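For context, a minimal sketch (not the actual BrokenLine.h code; the type and function names are made up for illustration) of the two kinds of Eigen matrices being discussed: the statically-sized variant has its dimensions fixed at compile time and needs no dynamic allocation, which suits `__host__ __device__` code but ties the fit to a fixed number of hits, while the dynamically-sized variant works for any n at the cost of run-time allocation, which on the GPU translates into the larger stack/heap budget mentioned above.

```cpp
#include <Eigen/Dense>

// Statically sized: dimensions fixed at compile time, no dynamic allocation.
// Well suited to device code, but limited to a fixed number of hits (here 4).
using Matrix4dFixed = Eigen::Matrix<double, 4, 4>;

// Dynamically sized: dimensions chosen at run time, allocates storage.
// Works for any number of hits n (also n > 4), but needs a larger stack/heap
// budget when used in device code.
using MatrixXdDyn = Eigen::Matrix<double, Eigen::Dynamic, Eigen::Dynamic>;

void buildCovariance(int n) {
  Matrix4dFixed fixedCov = Matrix4dFixed::Zero();    // always 4x4
  MatrixXdDyn dynCov = MatrixXdDyn::Zero(n, n);      // n x n, n known at run time
  // ... the fit would fill and invert one of these, depending on which
  //     alias is selected at the top of the file ...
  (void)fixedCov;
  (void)dynCov;
}
```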
…called Otherwise, if the stream was idle before, and the constructor queues work to it, the event is not created and downstream will assume that the product is always there (even if it isn't yet).
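For readers outside the CMSSW internals, a minimal sketch (hypothetical function names, not the framework code) of the pattern being discussed: the event has to be recorded on the producing stream after the work is queued, because an event that was never recorded is reported as already complete, so a consumer waiting on it would assume the product is ready even when it is not.

```cuda
#include <cuda_runtime.h>

__global__ void produce(float* out) { /* fill the product */ }

// Producer: queue the work, then record the "ready" event on the same stream.
// If the event were never recorded, cudaStreamWaitEvent / cudaEventQuery would
// treat it as complete and a consumer could read the product too early.
void producer(float* d_out, cudaStream_t stream, cudaEvent_t ready) {
  produce<<<1, 32, 0, stream>>>(d_out);
  cudaEventRecord(ready, stream);
}

// Consumer on another stream: wait for the product before using it.
void consumer(cudaStream_t stream, cudaEvent_t ready) {
  cudaStreamWaitEvent(stream, ready, 0);
  // ... launch kernels that read d_out on `stream` ...
}
```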
It continues to crash if many 8-thread jobs are run concurrently; see for instance:
When successful, the results seem consistent.
And it crashes on a V100 as well.
What GPU is there on vinzen0?
OK, so the multi-job crash happens on Pascal-based cards as well.
So far it has been difficult to reproduce it on a GTX 1080; now I'm stressing a P100.
I'm running 4 × 8-thread jobs on an 8-CPU node against a small GTX 1060.
On a V100 even just one 8-thread job crashes (but that is on flatiron, and I'm not sure if there is other activity: the system is pretty overloaded these days).
Oops, at the moment I'm sharing a machine with you: workergpu16, so no real surprise.
But we should not be able to share the GPU, so it should not be an issue...
OK, it's reproducible also on a P100. By the way, in my tests (without this PR) the error always points to this line:

```cpp
if (CubDebug(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream))) return error;
```

And if I add something like

```cpp
if (CubDebug(error = cudaGetLastError())) return error;
```

right before it, that never triggers, and the error stays with the cudaEventRecord call.
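For clarity, the check described above looks roughly like this (a sketch with a hypothetical helper name, not the actual cub source): querying the sticky error state right before the call. Since the extra check never fires, the illegal-address error is first surfaced by the event record itself rather than being left over from an earlier, already-reported failure.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Sketch of the debugging check described above.
bool recordReadyEvent(cudaEvent_t ev, cudaStream_t stream) {
  // If an asynchronous error were already pending, it would show up here.
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    std::fprintf(stderr, "error pending before record: %s\n", cudaGetErrorString(err));
    return false;
  }
  // Otherwise the illegal address is reported by the event record itself.
  err = cudaEventRecord(ev, stream);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaEventRecord failed: %s\n", cudaGetErrorString(err));
    return false;
  }
  return true;
}
```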
@fwyzard, did you get the crashes with this PR or with the current HEAD?
[innocent@workergpu21 src]$ git diff

```diff
diff --git a/RecoPixelVertexing/Configuration/python/customizePixelTracksForProfiling.py b/RecoPixelVertexing/Configuration/python/customizePixelTracksForProfiling.py
index 1021918c0ce..f50ab120671 100644
--- a/RecoPixelVertexing/Configuration/python/customizePixelTracksForProfiling.py
+++ b/RecoPixelVertexing/Configuration/python/customizePixelTracksForProfiling.py
@@ -32,6 +32,6 @@ def customizePixelTracksForProfilingDisableTransfer(process):
     # Disable "unnecessary" transfers to CPU
     process.pixelTracksHitQuadruplets.gpuEnableTransfer = False
-    process.pixelVertices.gpuEnableTransfer = False
+    process.pixelVertices.gpuEnableTransfer = True
     return process
```
Same behaviour.
It crashes eventually (less often than before; I've even seen a multi-GPU job complete).
On 23 Apr 2019, at 4:06 PM, Andrea Bocci wrote:
> Oops, at the moment I'm sharing a machine with you: workergpu16, so no real surprise.
> But we should not be able to share the GPU, so it should not be an issue...
But apparently it is an issue... (by the way, are your jobs still running OK?)
It crashes occasionally on an empty machine as well ...
So now I asked for 4 GPUs to get a machine on my own (flatiron is now pretty empty),
and for the first time in a week I managed to finish the benchmark.
Still, it even crashed single-threaded while destructing things:
```
dropped waiting message count 0
>> processed 13608 events
||Counters | nEvents | nHits | nCells | nTuples | nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells ||
/mnt/home/innocent/CMSSW_10_6_0_pre2_Patatrack/src/HeterogeneousCore/CUDAServices/src/CUDAService.cc, line 358: cudaErrorIllegalAddress: an illegal memory access was encountered
```
OK, apparently it is using more than one device...
```
Exception Message:
Callback of CUDA stream 0x2aab109132b0 in device 3 error cudaErrorIllegalAddress: an illegal memory access was encountered
```
How do I tell it to use only one device?
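(For reference, and independent of any CMSSW-specific configuration: a process can be restricted to one GPU either by setting the CUDA_VISIBLE_DEVICES environment variable before launching it, or by selecting the device explicitly at startup. A minimal sketch:)

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
  // Option 1 (outside the program): export CUDA_VISIBLE_DEVICES=0
  // so only one device is visible to the process.
  // Option 2 (inside the program): pick the device explicitly.
  int count = 0;
  cudaGetDeviceCount(&count);
  std::printf("visible devices: %d\n", count);
  cudaSetDevice(0);  // subsequent CUDA calls in this thread use device 0
  return 0;
}
```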
> did you get the crashes with this PR or with the current HEAD?
With the current HEAD, running 5 jobs with 3 streams each, all on a single GPU.
Instead, running with one job per GPU does not crash even after 500+ tests.
Without the smart cache, a single job does not crash on a V100.
But it does not crash with multiple jobs on a V100....
Superseded by #338.
This PR is on top of #324.
3a) more than 256 hits in a module
3b) duplicate pixels