-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AlcaBeamMonitor has thread safety problem #21769
Comments
A new Issue was created by @Dr15Jones Chris Jones. @davidlange6, @Dr15Jones, @smuzaffar, @fabiocos can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
The traceback for 136.767 is
|
The traceback for wf 136.822
|
assign alca, reconstruction |
Another instance has occurred:
|
@smuzaffar @davidlange6 @fabiocos These failures have all been on the slc6_amd64_gcc700 builds. So is there a difference in ROOT for gcc700? |
@pcanal Here is where we are tracking the fitting problems. |
I have a theory as to why this is happening. The If the compiler optimizer decided to change the code to be more like fgMinuit = new TMinuit(dim);
fMinuit = fgMinuit; Then we'd have a race condition which would cause such crashes. The optimizer is allowed to make such a change since
The optimizer could not make such a change. Alternatively, there appears to be no reason to even set |
This one, and the traceback in #21796 (which I opened before noticing this one) looks like a shared pointer problem.
|
wrt Chris's theory,
gets compiled to
In English,
I've cross-checked this with the disassembly of the actual library just to make sure there wasn't some difference in compilation. So there isn't an actual race condition there. I'm don't believe Chris's proposed reordering would even be legal...the standard says "Every value computation and side effect associated with a full-expression is sequenced before every value computation and side effect associated with the next full-expression to be evaluated", meaning those expressions are determinately sequenced. Race conditions would get really hellish if those lines were indeterminately sequenced. |
Another variation. Most of these seem to be pointing towards problems with FitResult/TFitResultPtr, but I'm not sure this one fits that pattern.
|
@Dr15Jones , root in both gcc630 and gcc700 is same (6.10.08 commit 4584aec) with following changes in gcc700
cms-sw/cmsdist@IB/CMSSW_10_0_X/gcc630...IB/CMSSW_10_0_X/gcc700#diff-81217efc115a8de6b75eeef2a6324e47 cms-sw/cmsdist@IB/CMSSW_10_0_X/gcc630...IB/CMSSW_10_0_X/gcc700#diff-425f96424ae9934b30df0e0e5a33d090 |
There's a "std::bad_alloc exception" message just before the seg fault, and this one points pretty directly at something with the TFitResultPtr smart pointers.
|
Another variation
|
The interesting thing about this one is the messages printed just before the crash, which come from here:
|
It looks like there's a thread safety issue with the static It gets set to true in and it gets used various places, e.g. with this pattern
which means the used state is getting mixed between different instances.
to respect the value of fUsed if fgUseStaticMinuit is false:
That still leaves the more subtle systematics issue with the global random number seed getting reinitialized. Maybe the TMinuit
should be made thread local? Or possibly not call mnrn15() if fgUseStaticMinuit is false? However, I doubt that problem is the cause of all the crashes. I have a gdb backtrace for a crash in ROOT::Fit::FitResult::FillResult() with debug symbols, which again points at shared pointer problems, but helgrind and memcheck haven't been much help. I'll try patching TMinuitMinimizer and see if I still get this:
|
After patching TMinuitMinimizer I haven't been able to reproduce the crashes, so fgUsed may be the main problem. I've submitted a ROOT PR that references this issue. |
In this context, are things setup to use TMinuit2? (but might not due to a thread related problem in the PluginManager we are solving at the moment). |
The BeamSpotProducer uses Minuit2 explicitly, but it also uses TH1::Fit() with the default TMinuitMinimizer. So far I haven't been able to reproduce the crashes in Minuit2 at all, I don't see any obvious connection between the TMinuitMinimizer bug and the Minuit2 crashes, and it is still a mystery why we are only seeing crashes with gcc700. |
This one is significant only because it's with gcc630, so apparently the problem isn't exclusive to gcc700, just more likely, and I can stop staring at assembler trying to identify the difference. Also, the ROOT pr has been merged.
|
I think I've found the cause of the pure virtual calls. With IMT enabled, ROOT::Fit::ExecutionPolicy is getting set to kMultithread. So here: we get TThreadExecutor::MapReduce() called on the EvaluatePoissonLogL() function leading to stack traces like
This obviously breaks the TMinuitMinimizer::Fcn() use of thread local storage. Immediate workaround is to change the fit ExecutionPolicy to |
@dan131riley Nice! Setting this globally is fine with me since every call to |
What do you mean? |
@dan131riley I've looked in more detail at the code in the traceback you gave. I do not believe there is an issue with thread_local storage since the thread_local storage is only accessed in |
@Dr15Jones @pcanal after tracing through some confusion about various versions of |
@dan131riley Another reason it seems unlikely that I'd expect the crash to happen as we are going down the stack (i.e. in the call to |
@dan131riley I just thought of how TBB and thread_local could cause the problem. Say we are in a call to |
This exact problem is outlined in where the relavant ROOT code is here: https://github.com/root-project/root/blob/6147b6dd18552083c4a21a0607bae25bb63cde5e/core/imt/src/TThreadExecutor.cxx#L91 |
@pcanal the ROOT code needs to be protected with either a |
@dan131riley could you create a pull request to CMSSW to switch the system to use |
There's no option to set this globally...for wrt isolating the fit evaluation, using a separate |
Related question. Do you know/remember why this code is not using TMinuit2? |
@pcanal how do you tell |
It looks like you need to call ROOT::Math::MinimizerOptions::SetDefaultMinimizer :) |
Setting serial seems to have a bad interaction with the GSL integrator?
|
As a first check, GSL is thread-friendly: |
I was able to catch the GLS problem in the debugger (after many, many tries). The stack trace for the threads are
|
I think I understand what is happening. When using non |
Note that this is from the 10_0_X IB, so not all patches are present. It's still a weird crash, with two effectively simultaneous segfaults, one in the message logger handling what should be a fixed string (all the messages from PVFitter::runFitter() are fixed strings). The other crashing thread I think is here: https://github.com/cms-sw/root/blob/master/math/mathcore/src/FitResult.cxx#L176
|
@Dr15Jones @dan131riley the problem has been very little visible in the IBs in recent weeks, but it has appeared in the RelVals as reported by @fabozzi and appears also in the T0 replay in a few runs |
@fabiocos this morning I just happen to have been working on the change to downgrade the following exception to be an INFO message. The other ROOT error message triggering an exception was already downgraded to an INFO message in CMSSW_10_1_0. |
#22859 covers the case for the new ROOT error. |
@Dr15Jones thanks, looking into it |
@Dr15Jones @dan131riley @smuzaffar |
@slava77 , this has been fixed. |
+1 based on #21769 (comment) |
In CMSSW_10_0_X_2017-12-21-1100 for slc6_amd64_gcc700 architecture the workflows 136.767 and 136.822 crashed at end of luminosity processing where all threads were in the AlcaZBeamMonitor. Specifically they were in function calls for
ROOT::Minuit2
.The text was updated successfully, but these errors were encountered: