
CTD19 developments for review and merging (fixed) #329

Closed
wants to merge 231 commits

Conversation

@VinInn commented Apr 23, 2019

This PR, on top of #324:

  1. merges Fix synchronization mistake in CUDAScopedContext #327 and Remove redundant CUDA event occurred checks #328
  2. fixes a major bug introduced when using the caching allocator for the cub workspace (see the sketch below)
  3. fixes two long-standing issues that occur only when running on data:
    3a) more than 256 hits in a module
    3b) duplicate pixels
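
For orientation on item 2, the usual pattern when a caching allocator provides the CUB workspace looks roughly like the following standalone sketch; the function name exclusiveScan and its arguments are made up for illustration and are not taken from the PR:

    // Standalone sketch, not the PR's code: query the required temporary
    // storage, obtain it from the caching allocator, run the algorithm on the
    // same stream, then return the block to the cache.
    #include <cub/cub.cuh>

    cudaError_t exclusiveScan(const int* d_in, int* d_out, int n,
                              cudaStream_t stream,
                              cub::CachingDeviceAllocator& allocator) {
      void* d_temp = nullptr;
      size_t temp_bytes = 0;

      // A first call with a null workspace only computes temp_bytes.
      cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_in, d_out, n, stream);

      // The workspace must stay valid (and untouched) until the scan has run.
      allocator.DeviceAllocate(&d_temp, temp_bytes, stream);
      cudaError_t status =
          cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, d_in, d_out, n, stream);

      // DeviceFree only returns the block to the cache; handing it out again
      // before the queued work has completed is the kind of hazard a bug in
      // this area can introduce.
      allocator.DeviceFree(d_temp);
      return status;
    }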

rovere and others added 30 commits September 3, 2018 12:26
…atrices of static dimensions to run on the GPUs
…o use matrices of static dimensions in order to run on the GPUs.
- deleted the forgotten prints and time measurements;
- created a new modifier for the broken line fit;
- switched back from tipMax=1 to tipMax=0.1 (the change will maybe be done in another PR);
- restored the original order of the cuts on chi2 and tip;
- deleted the default label to pixelFitterByBrokenLine;
- switched from CUDA_HOSTDEV to __host__ __device__;
- BrokenLine.h now uses dynamically-sized matrices (the advantage of using them over the statically-sized ones is that the code also works with n>4) and, as before, the switch can easily be made at the start of the file;
- hence, the test on GPUs now needs an increase in the stack size (at least 1761 bytes; see the sketch below);
- some doxygen comments in BrokenLine.h have been updated.
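
For reference, the stack-size remark refers to the CUDA per-thread stack limit, which can be raised at device setup; a minimal standalone sketch follows (the 2048-byte value is just the quoted minimum rounded up, not the project's actual choice):

    // Standalone sketch: raise the per-thread stack limit before launching
    // kernels that need more local storage, as the dynamically-sized-matrix
    // version of BrokenLine.h is reported to require.
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
      size_t value = 0;
      cudaDeviceGetLimit(&value, cudaLimitStackSize);
      std::printf("default stack size per thread: %zu bytes\n", value);

      // The commit quotes a minimum of 1761 bytes; round up generously.
      cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 2048);
      if (err != cudaSuccess)
        std::printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
      return 0;
    }
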
makortel and others added 7 commits April 22, 2019 19:51
…called

Otherwise, if the stream was idle before, and the constructor queues
work to it, the event is not created and downstream will assume that
the product is always there (even if it isn't yet).
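
The point is the ordering: the ready event must be recorded on the stream only after the work has been queued, so a consumer checking the event cannot conclude the product is available too early. A minimal sketch with made-up producer/consumer names (not the actual CUDAScopedContext code):

    // Standalone sketch, not the framework code: record the "data ready" event
    // after queuing the work, and make consumers wait on that event.
    #include <cuda_runtime.h>

    __global__ void produce(float* out) { out[threadIdx.x] = threadIdx.x; }

    void producer(float* d_out, cudaStream_t stream, cudaEvent_t readyEvent) {
      produce<<<1, 32, 0, stream>>>(d_out);  // queue the work first
      cudaEventRecord(readyEvent, stream);   // then mark the product as ready
    }

    void consumer(cudaStream_t stream, cudaEvent_t readyEvent) {
      // The consumer's stream waits for the producer's event instead of
      // assuming the product is already there.
      cudaStreamWaitEvent(stream, readyEvent, 0);
      // ... kernels that use d_out can now be queued on `stream` ...
    }
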
@VinInn commented Apr 23, 2019

It continues to crash if many 8-thread jobs are run concurrently; see for instance:

ls -ls ~/data/crash*
56 -rw-r--r--. 1 innocent zh 56416 Apr 23 10:27 /afs/cern.ch/user/i/innocent/data/crash1@VZ0.log
48 -rw-r--r--. 1 innocent zh 48416 Apr 23 10:28 /afs/cern.ch/user/i/innocent/data/crash2@VZ0.log
55 -rw-r--r--. 1 innocent zh 55462 Apr 23 10:32 /afs/cern.ch/user/i/innocent/data/crash3@VZ0.log

When the jobs succeed, the results seem consistent.

@VinInn commented Apr 23, 2019

And it crashes on a V100 as well.

@fwyzard commented Apr 23, 2019

What GPU is there on vinzen0?

@VinInn commented Apr 23, 2019

what gpu is there on vinzen0 ?

A GTX 1060.

@fwyzard commented Apr 23, 2019

OK, so the multi-job crash happens on Pascal-based cards as well.
So far it has been difficult to reproduce it on a GTX 1080; now I'm stressing a P100.

@VinInn commented Apr 23, 2019 via email

@fwyzard commented Apr 23, 2019

Oops, at the moment I'm sharing a machine with you (workergpu16), so no real surprise.

But we should not be able to share the GPU, so it should not be an issue...

@fwyzard added the Pixels (Pixels-related developments) label on Apr 23, 2019
@fwyzard commented Apr 23, 2019

OK, it's reproducible also on a P100.

By the way, in my tests (without this PR) the error always points to this line:

if (CubDebug(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream))) return error;

And, if I add something like

if (CubDebug(error = cudaGetLastError())) return error;

right before it, that never triggers, and the error stays with the cudaEventRecord() call.
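
For reference, the check described amounts to something like the sketch below (a standalone example with an invented helper name, not the cub source); a cudaGetLastError() that stays clean right before the call separates a pre-existing asynchronous error from one produced by cudaEventRecord() itself:

    // Standalone debugging sketch: check for a pending error immediately
    // before the suspect call, so an error reported afterwards can be
    // attributed to cudaEventRecord() rather than to earlier device work.
    #include <cuda_runtime.h>
    #include <cstdio>

    cudaError_t recordReadyEvent(cudaEvent_t event, cudaStream_t stream) {
      cudaError_t error = cudaGetLastError();
      if (error != cudaSuccess) {
        std::fprintf(stderr, "pre-existing error: %s\n", cudaGetErrorString(error));
        return error;
      }

      error = cudaEventRecord(event, stream);
      if (error != cudaSuccess)
        std::fprintf(stderr, "cudaEventRecord failed: %s\n", cudaGetErrorString(error));
      return error;
    }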

@VinInn commented Apr 23, 2019

@fwyzard, did you get the crashes with this PR or with the current HEAD?

@VinInn commented Apr 23, 2019 via email

@makortel

how to tell to use only one device?

With the CUDA_VISIBLE_DEVICES environment variable (e.g., in bash: CUDA_VISIBLE_DEVICES=0 cmsRun ...).

@fwyzard commented Apr 23, 2019 via email

@fwyzard changed the title from "CDOTS19 Fixed" to "CTD19 developments for review and merging (fixed)" on Apr 24, 2019
@VinInn commented Apr 25, 2019

Without the smart cache it does not crash as a single job on the V100.
It continues to happily crash in multi-job runs on the GTX 1060.

@VinInn commented Apr 25, 2019

But it does not crash in multi-job runs on the V100...

@fwyzard commented Apr 30, 2019

Superseded by #338.

@fwyzard closed this on Apr 30, 2019
fwyzard pushed a commit that referenced this pull request May 15, 2019
Fix HLT label for UL data reprocessing
Labels
Pixels (Pixels-related developments)

9 participants