reproducibility problems in hiSelectedPixelVertex RECO and in DQM fits in 102X #23105
A new Issue was created by @slava77 Slava Krutelyov. @davidlange6, @Dr15Jones, @smuzaffar, @fabiocos can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign reconstruction,core |
New categories assigned: core,reconstruction @Dr15Jones,@smuzaffar,@slava77,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@smuzaffar @slava77 by Jenkins tests, do you mean the RelVals or tests done for pull request? I had thought the pull request tests were always run single threaded. |
On 5/1/18 2:14 PM, Chris Jones wrote:
> @smuzaffar @slava77 by Jenkins tests, do you mean the RelVals or tests done for pull request? I had thought the pull request tests were always run single threaded.
This is for tests of pull requests, in tests of unrelated PRs.
I think that these are all single threaded.
|
Since "minuit" and "fitting" were mentioned, let me note that e.g. in #23065 (some of?) the Just thinking if the root cause for these irreproducibilities could be the same as with |
@makortel |
The only substantive minuit change that I see is cms-sw/root@5efb8d6#diff-3e4e3008064c935331e2a484cd9157a1 but that shouldn't make any difference for single-threaded jobs, and looking at the logs confirms that these jobs were single-threaded. Expanding scope to all of root/math, I still don't see any changes that I would expect to affect single-threaded jobs. |
So the DQMGenericClient histograms appear to come from here
|
So it appears the other fit @slava77 was thinking about is
|
140.53 step2 is the place where it happens for the reco values. DQM diffs in jenkins show up in many places (in 25 out of 29 running workflows in one of the cases that I checked) |
@Dr15Jones for the DQM e.g. workflow 10024.0 HARVESTING step. |
@slava77 in what sense are the changes "somewhat random"? Are there different changes from run to run, or are they consistent but with no obvious pattern? Comparing, for example https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-04-30-1100+23065/26340/10024.0_TTbar_13+TTbar_13TeV_TuneCUETP8M1_2017_GenSimFull+DigiFull_2017+RecoFull_2017+HARVESTFull_2017+ALCAFull_2017/Tracking_TrackFromPV_generalFromPVPt09_trackingParticleRecoAsssociation.html the differences in the failing plots look identical to me, which has implications for what kind of change we are hunting. |
@dan131riley Earlier these histograms had exactly the same values, unless any of the input histograms to these fits changed. In that case there have been numerical changes also in the fit results of unrelated histograms (because of some internal state of Minuit, as far as I've understood). |
@pcanal FYI |
One thing I noticed is that in both places that seem to be showing differences, neither of them checks whether the fit failed. |
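For illustration, a minimal sketch (invented names, not the actual CMSSW code) of how a caller could detect a failed ROOT fit via the fit-result status instead of silently using whatever parameters the failed fit left behind:

```cpp
// Illustrative sketch only, with invented names -- not the CMSSW code
// in question.  Shows how a caller can detect a failed ROOT fit
// instead of silently using whatever parameters were left behind.
#include "TF1.h"
#include "TFitResult.h"
#include "TFitResultPtr.h"
#include "TH1D.h"
#include <iostream>

bool fitAndCheck(TH1D& h) {
  TF1 model("model", "gaus", -20., 20.);   // hypothetical fit model
  TFitResultPtr r = h.Fit(&model, "QNS");  // "S" makes the fit result object available
  if (r.Get() == nullptr || !r->IsValid() || r->Status() != 0) {
    std::cerr << "fit failed, ignoring fit parameters\n";
    return false;
  }
  std::cout << "mean " << r->Parameter(1) << " +- " << r->ParError(1) << "\n";
  return true;
}
```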
Between the versions there are substantial changes to the mathcore Fitter and hist TF1 classes, mostly for IMT and vectorization. I can certainly imagine those could change what gets left in the results of failed fits. |
That theory isn't looking good...I modified HIPixelMedianVtxProducer with
and ran 140.53 step2 in CMSSW_10_2_0_pre1 and CMSSW_10_2_0_pre1_ROOT612. I got two fit failures (on the same event) in each, and they all printed the same values (to the resolution of the printout) in both releases:
Next step, print the result on every event and see if they ever differ, or if this is just the wrong tree. |
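For reference, a standalone sketch of that kind of check, with an invented histogram and a plain gaussian fit rather than the actual HIPixelMedianVtxProducer code; the point is only the high-precision printout used to compare releases:

```cpp
// Standalone sketch with an invented histogram and a plain gaussian
// fit: print the fitted mean and its error with enough digits to
// expose release-to-release differences at the 1e-12 level.
#include "TF1.h"
#include "TH1D.h"
#include "TRandom3.h"
#include <iomanip>
#include <iostream>

int main() {
  TH1D h("zvtx", "z vertex", 100, -20., 20.);
  TRandom3 rng(12345);
  for (int i = 0; i < 1000; ++i)
    h.Fill(rng.Gaus(6.0, 0.5));

  TF1 f("f", "gaus", -20., 20.);
  h.Fit(&f, "QN");

  std::cout << std::setprecision(19)
            << "Fit result " << f.GetParameter(1)
            << " +- " << f.GetParError(1) << " cm\n";
  return 0;
}
```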
Looking at all fits and boosting the precision I get one difference:
I have no idea if that's large enough to produce the differences @slava77 saw. |
On 5/2/18 11:12 AM, Dan Riley wrote:
> Looking at all fits and boosting the precision I get one difference:
> -Fit result 6.041964245417202228 +- 0.001913964442836382784 cm
> +Fit result 6.041964245416591162 +- 0.001913964442221971485 cm
> I have no idea if that's large enough to produce the differences @slava77 saw.
The comparison is rather strict.
So, I think that this difference can trigger the report of a difference for the DQM plots.
The change in hiSelectedPixelVertex is much more significant in the end.
I don't know if this is just a butterfly effect or if in this case the result changes a lot from the start.
|
The DQMGenericClient fit is actually from here:
|
Curiously, it looks like the origin of the change is not completely random; it is likely a feature of the CMSSW (local) build, with perhaps library memory re-mapping making a consistent difference or no difference. |
On 5/3/18 10:02 AM, Dan Riley wrote:
> On May 3, 2018, at 12:16, Slava Krutelyov wrote:
>> It looks like the origin of the change is not completely random and is likely a feature of the CMSSW (local) build with perhaps library memory re-mapping making a consistent difference or no difference.
> It's not entirely consistent, even...compare
> https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-05-02-1100+23065/26360/25202.0_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25/
> https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-05-02-1100+23120/26370/25202.0_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25/
> One has 158 failures, the other 181. The difference is 23 additional failures in JetMET/Jet/CleanedslimmedJetsPuppi, and these failures, in
> https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-05-02-1100+23120/26370/25202.0_TTbar_13+TTbar_13+DIGIUP15_PU25+RECOUP15_PU25+HARVESTUP15_PU25/JetMET_Jet_CleanedslimmedJetsPuppi.html
> … are not mean & sigma like all the others. Guess valgrind is in order.
This is a part of the PR changes in #23120; the "Puppi" plots are expected to change here.
|
We can run the comparison for the same IB twice to see if it generates random results. |
Running valgrind on step4 gives thousands of uses of uninitialized values, with tracebacks that point back to cmssw/DQMServices/Core/interface/QTest.h, line 209 in e6ba533.
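The actual member is whatever sits at that line of QTest.h; as a purely hypothetical sketch of the pattern valgrind is flagging, and the usual fix, with invented names:

```cpp
// Purely hypothetical sketch (invented names) of the pattern valgrind
// flags here: a data member that is read before any code assigns it.
struct QualityTestSketch {
  double threshold_;  // never initialized: reading it is undefined behaviour
  // double threshold_ = 0.;  // conventional fix: an in-class initializer

  bool passes(double value) const { return value > threshold_; }
};
```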
Using ROOT 6.12 on step4 (always with the same step3 input) I reproduce the comparison failures with a 6.10 baseline, but I don't get the same differences as the PR tests. I do get consistent differences running on a single system/architecture, but I get different changes running on different systems/architectures. For example, comparing a run with 6.12 on a Skylake machine vs. a Haswell system I get comparison failures in the same places but they are numerically different and not always in the same bins. Don't know yet if it's different on every system or tied to the architecture. I don't see that kind of variation on 6.10. The changes are all too small to discern on the comparison graphs, but still disturbing. |
Adding
to
and there are lots of differences in individual bins, some pretty significant:
Running twice on the same machine I get no differences. Next step is to package up srcME and see if I can make a small test case. |
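A sketch of the kind of bin-by-bin comparison meant here, assuming two outputs containing an identically binned histogram; the file and histogram paths are placeholders:

```cpp
// Sketch: compare one identically binned histogram between two files
// bin by bin and report any content differences.  File and histogram
// names are placeholders, not the actual DQM layout.
#include "TFile.h"
#include "TH1.h"
#include <cmath>
#include <cstdio>

void compareBins(const char* fileA, const char* fileB, const char* path) {
  TFile a(fileA);
  TFile b(fileB);
  TH1* ha = dynamic_cast<TH1*>(a.Get(path));
  TH1* hb = dynamic_cast<TH1*>(b.Get(path));
  if (!ha || !hb) {
    std::printf("histogram %s missing in one of the files\n", path);
    return;
  }
  for (int i = 0; i <= ha->GetNbinsX() + 1; ++i) {  // include under/overflow
    const double ca = ha->GetBinContent(i);
    const double cb = hb->GetBinContent(i);
    if (std::abs(ca - cb) > 0.)
      std::printf("bin %d: %.17g vs %.17g\n", i, ca, cb);
  }
}
```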
The inconsistencies between different platforms go away if I either change FitSlicesYTool to do
instead of
or if I add to the config
to turn off IMT. This is curious, since these jobs are single-threaded (InitRootHandlers does call the IMT setup). My test program does not, so far, reproduce this behavior. |
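The configuration fragment itself is lost above; as a ROOT-level illustration of what turning off implicit multi-threading amounts to (a CMSSW job would presumably do this through its configuration rather than direct calls like these):

```cpp
// ROOT-level illustration only: what "turning off IMT" amounts to.
// A CMSSW job would presumably do this through its configuration
// rather than by calling these functions directly.
#include "TROOT.h"
#include <iostream>

int main() {
  ROOT::EnableImplicitMT();   // what an IMT-enabled job effectively does
  std::cout << "IMT enabled: " << ROOT::IsImplicitMTEnabled() << "\n";

  ROOT::DisableImplicitMT();  // force chi2 evaluation etc. to run serially
  std::cout << "IMT enabled: " << ROOT::IsImplicitMTEnabled() << "\n";
  return 0;
}
```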
For non-threaded jobs, cmsRun does set the number of TBB threads to 1, so there aren't additional threads created. IMT does get used in ROOT::Fit::FitUtil::EvaluateChi2 to create a bunch of tasks evaluating the chi2, and this is new in 6.12. It's at least plausible that, even with only one thread, the order of task execution could still vary from machine to machine based on some internal TBB data structure--perhaps a table that gets sized according to the number of processors. I don't see any random numbers being used in the part that gets broken into TBB tasks (and I would expect much larger differences if there were). It's possible that we have some pathological cases where the order of summation makes a difference. I would think we'd want to turn off IMT for the PR tests, possibly turn it off by default when we aren't multi-threaded. Setting the "SERIAL" fit option is also a possibility, especially given that we know the default MINUIT minimizer can get into trouble due to its reliance on thread local storage. |
I think the "SERIAL" fit option would be good enough for both cases (I doubt the threading would bring any benefit in practice, and for |
I also believe that "SERIAL" is the best way to go. |
Adding "SERIAL" to the fit in HIPixelMedianVtxProducer also resolves the one difference I saw there, so I'll make a PR with both of those. |
If a |
There may be further cases of fits (or similar irreproducibilities), see #23160. |
@dan131riley ROOT does chunk based on the number of CPUs. The chunk size for the integration is set here, in the implementation of the chunking. So different chunk sizes mean different parts being summed together in different orders. |
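A self-contained sketch (invented data, not ROOT's actual chunking code) of why the chunk size alone can shift a result: the same values summed in differently sized chunks group the floating-point roundings differently.

```cpp
// Sketch with invented data: summing the same numbers with different
// chunk sizes gives (slightly) different floating-point totals,
// because the partial sums group the rounding errors differently.
#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

double chunkedSum(const std::vector<double>& v, std::size_t chunk) {
  double total = 0.;
  for (std::size_t begin = 0; begin < v.size(); begin += chunk) {
    double partial = 0.;  // one "task" worth of work
    const std::size_t end = std::min(begin + chunk, v.size());
    for (std::size_t i = begin; i < end; ++i)
      partial += v[i];
    total += partial;     // combine the partial results
  }
  return total;
}

int main() {
  std::mt19937 rng(1);
  std::uniform_real_distribution<double> dist(0., 1.);
  std::vector<double> values(100000);
  for (auto& x : values) x = dist(rng);

  // Two chunk sizes, standing in for machines whose CPU counts lead
  // the chunking logic to split the work differently.
  std::printf("chunked by 1000: %.17g\n", chunkedSum(values, 1000));
  std::printf("chunked by 4096: %.17g\n", chunkedSum(values, 4096));
  return 0;
}
```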
@Dr15Jones the ROOT use of the number of CPUs for chunking explains it, it didn't really make sense that TBB would use the number of physical CPUs and not the configured thread limit. PR #23230 submitted. |
I was looking at diffs in #23258 made with CMSSW_10_2_X_2018-05-23-1100 (after #23230 was merged)
|
@dan131riley In the JIRA ticket (https://sft.its.cern.ch/jira/browse/ROOT-9422) @lmoneta mentions (and asks): "A small numerical difference is expected, not a large one. It is possible that in some cases the fit converges and in others not, if the fit is rather unstable. Do you have a reproducible test showing a large difference?" |
@pcanal I was not able to create a reproducible test case in the time budgeted for it. I suspect we're seeing an amplification of small differences, where we get small differences in one step that are amplified in a subsequent step, but I haven't traced through the operations to see how plausible that is. These are also low statistics, so they could be ill conditioned. So at this point I don't think we have convincing evidence of a ROOT issue. (I see you already fixed the automatic chunking to use the IMT pool size.) |
This is probably ready to close now |
There are somewhat random changes in hiSelectedPixelVertex in reco in the HI workflows.
This started showing up in 20-40% of jenkins tests starting from CMSSW_10_2_X_2018-04-26-2300 (8 out of about 35 tests have this issue, while at least 60 tests before this IB do not have it).
This coincides with merging of cms-sw/cmsdist#3957, which advanced from ROOT 6.10.09 to 6.12.07.
HI vertex reco uses either hiPixelAdaptiveVertex (which comes from hiPixelMedianVertex, which is similar to the PrimaryVertexProducer used in the pp reco) or HIPixelMedianVtxProducer, which has some minuit-based fitting.
I suspect that the problem is in RecoHI/HiTracking/plugins/HIPixelMedianVtxProducer.cc
@Dr15Jones @dan131riley please take a look at HIPixelMedianVtxProducer to see if it has some known features that may have changed with the new ROOT.