Synchronize event in the CUDAProductBase destructor #391
Conversation
Otherwise there are possibilities for weird races (e.g. a combination of non-ExternalWork producers, consumed-but-not-read CUDAProducts, and CUDA streams executing work later than expected (= on the next event)).
After realizing that in our current "profiling workflow" the last module in the Schedule is a normal EDProducer launching GPU work, I think this fix is necessary.
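In code, the fix amounts to something like the following minimal sketch (assumption: `ProductBase`, `event_`, and `hasEvent_` are hypothetical stand-ins, not the actual CUDAProductBase implementation): the destructor blocks on the CUDA event recorded at the point where the product's data became ready.

```cpp
// Minimal sketch of the fix; ProductBase, event_, and hasEvent_ are
// hypothetical stand-ins for the actual CUDAProductBase members.
#include <cuda_runtime.h>

class ProductBase {
public:
  explicit ProductBase(cudaStream_t stream) {
    cudaEventCreateWithFlags(&event_, cudaEventDisableTiming);
    hasEvent_ = true;
    // ... the work producing the data is queued on `stream` here ...
    cudaEventRecord(event_, stream);  // mark the point where the data is ready
  }

  ~ProductBase() {
    if (hasEvent_) {
      // Block until the work recorded before the event has completed, so the
      // memory owned by this product cannot be freed (and possibly reused)
      // while a kernel is still writing to it.
      cudaEventSynchronize(event_);
      cudaEventDestroy(event_);
    }
  }

private:
  cudaEvent_t event_;
  bool hasEvent_ = false;
};
```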
The throughput scan extracted from the log seems to indicate that introducing the extra synchronisation does have an impact on the overall throughput, and requires more threads or streams to recover the current performance:
[throughput scan plot]
@fwyzard Which variant is this? The one without output on CPU, or the SoA/legacy on CPU?
The benchmarks are done with the …
Thanks. Then the performance degradation is expected (although I'm a bit surprised by how large the drop is at <= 6 streams/threads). The variants with transfers at the end should have a close-to-zero effect.
No way around it?
OK, I can try to benchmark that as well.
I'd need something similar to …
Mhm, I don't want to delay this or ask for an alternative solution based on an unrealistic benchmark.
If that's the case, do you have any guess as to what causes the drop in throughput then?
I think what happens now is that the CUDA stream synchronizes wrt. the EDM stream in raw-to-cluster (the only ExternalWork module). The program is therefore able to overlap the last kernels of event N with the processing of the next event.
So yes, something could be gained wrt. the …
I see - thanks for the explanation.
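The effect discussed above can be illustrated with a toy program (assumption: a simplified single-host-thread, two-stream model, not the CMSSW framework): a blocking wait at the end of event N delays launching the work of event N+1, which is exactly the overlap that fewer streams/threads can no longer hide.

```cpp
// Toy illustration; busyKernel and the stream/event names are invented for
// this sketch. One host thread processes two "events" on two CUDA streams.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int iters) {
  float x = data[threadIdx.x];
  for (int i = 0; i < iters; ++i)
    x = x * 1.000001f + 1e-7f;
  data[threadIdx.x] = x;
}

int main() {
  cudaStream_t streamN, streamN1;
  cudaStreamCreate(&streamN);
  cudaStreamCreate(&streamN1);
  float *dataN, *dataN1;
  cudaMalloc(&dataN, 256 * sizeof(float));
  cudaMalloc(&dataN1, 256 * sizeof(float));

  cudaEvent_t ready;
  cudaEventCreateWithFlags(&ready, cudaEventDisableTiming);

  // "Event N": its last kernel is queued and an event marks its completion.
  busyKernel<<<1, 256, 0, streamN>>>(dataN, 1 << 20);
  cudaEventRecord(ready, streamN);

  // With this PR, destroying event N's product blocks the host thread here;
  // commenting this line out lets the next launch overlap with the kernel
  // still running on streamN.
  cudaEventSynchronize(ready);

  // "Event N+1": with the blocking wait above, this kernel is launched only
  // after event N's tail kernel has finished, so the overlap is lost.
  busyKernel<<<1, 256, 0, streamN1>>>(dataN1, 1 << 20);

  cudaDeviceSynchronize();
  cudaEventDestroy(ready);
  cudaFree(dataN);
  cudaFree(dataN1);
  cudaStreamDestroy(streamN);
  cudaStreamDestroy(streamN1);
  printf("done\n");
  return 0;
}
```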
Validation summary
Reference release CMSSW_11_0_0_pre7 at 411b633
Validation plots:
/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW
/RelValZMM_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_realistic_v4-v1/GEN-SIM-DIGI-RAW
/RelValTTbar_13/CMSSW_10_6_0-PU25ns_106X_upgrade2018_design_v3-v1/GEN-SIM-DIGI-RAW
Throughput plots:
/EphemeralHLTPhysics1/Run2018D-v1/RAW run=323775 lumi=53
logs and …
PR description:
This PR is an updated version of #334:
The difference is that this PR synchronizes wrt. the CUDA event (if there is one) instead of the stream (hence the new PR); this way the destruction of a product A does not have to wait for the completion of work queued on the CUDA stream after product A was made.
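As a sketch of that difference (assumption: the kernels and names below are invented for illustration, not CMSSW code), the event synchronization waits only for the work recorded before the event, while a stream synchronization would also wait for anything queued afterwards:

```cpp
// kernelA produces "product A"; kernelB is unrelated work queued later on the
// same CUDA stream. All names here are invented for this illustration.
#include <cuda_runtime.h>

__global__ void kernelA(float *p) { p[threadIdx.x] += 1.f; }
__global__ void kernelB(float *p) { p[threadIdx.x] *= 2.f; }

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);
  float *a, *b;
  cudaMalloc(&a, 32 * sizeof(float));
  cudaMalloc(&b, 32 * sizeof(float));

  cudaEvent_t eventA;
  cudaEventCreateWithFlags(&eventA, cudaEventDisableTiming);

  kernelA<<<1, 32, 0, stream>>>(a);  // the work that makes product A
  cudaEventRecord(eventA, stream);   // "product A is ready" point
  kernelB<<<1, 32, 0, stream>>>(b);  // later, unrelated work on the same stream

  // This PR: wait for kernelA only; kernelB may still be running.
  cudaEventSynchronize(eventA);
  // #334 instead synchronized on the stream, which also waits for kernelB:
  //   cudaStreamSynchronize(stream);

  cudaDeviceSynchronize();
  cudaEventDestroy(eventA);
  cudaFree(a);
  cudaFree(b);
  cudaStreamDestroy(stream);
  return 0;
}
```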
PR validation:
The profiling workflow runs, and the unit tests run.
On felk40 with 8 EDM streams and threads I see a ~1 % throughput decrease.