
reproducibility problems in hiSelectedPixelVertex RECO and in DQM fits in 102X #23105

Closed
slava77 opened this issue May 1, 2018 · 42 comments

@slava77
Contributor

slava77 commented May 1, 2018

There are somewhat random changes in hiSelectedPixelVertex in the RECO step of the HI workflows.
This started showing up in 20-40% of jenkins tests starting from CMSSW_10_2_X_2018-04-26-2300 (8 out of about 35 tests have this issue, while at least 60 tests before this IB do not have it).
This coincides with the merging of cms-sw/cmsdist#3957, which advanced ROOT from 6.10.09 to 6.12.07.

HI vertex reco uses either hiPixelAdaptiveVertex (which comes from hiPixelMedianVertex, which is similar to the PrimaryVertexProducer used in the pp reco) or HIPixelMedianVtxProducer, which has some Minuit-based fitting.
I suspect that the problem is in RecoHI/HiTracking/plugins/HIPixelMedianVtxProducer.cc
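For context, the kind of Minuit-based Gaussian fit done there looks roughly like the sketch below (not the actual producer code; the function and variable names are illustrative):

// Rough sketch only: a Gaussian fit around a median z estimate, as HIPixelMedianVtxProducer does.
#include "TF1.h"
#include "TH1F.h"

double fitVertexZ(TH1F& histo, double med) {
  TF1 f1("f1", "gaus");             // Gaussian model, minimized with Minuit by default
  histo.Fit(&f1, "QN");             // Q: quiet, N: do not store/draw the fitted function
  return med + f1.GetParameter(1);  // fitted mean, offset from the median seed
}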

@Dr15Jones @dan131riley please take a look at HIPixelMedianVtxProducer to see if it has some known features that may have changed with the new ROOT.

@cmsbuild
Contributor

cmsbuild commented May 1, 2018

A new Issue was created by @slava77 Slava Krutelyov.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@slava77
Contributor Author

slava77 commented May 1, 2018

assign reconstruction,core

@cmsbuild
Contributor

cmsbuild commented May 1, 2018

New categories assigned: core,reconstruction

@Dr15Jones, @smuzaffar, @slava77, @perrotta you have been requested to review this pull request/issue and eventually sign. Thanks

@Dr15Jones
Contributor

@smuzaffar @slava77 by Jenkins tests, do you mean the RelVals or the tests done for pull requests? I had thought the pull request tests were always run single-threaded.

@slava77
Contributor Author

slava77 commented May 1, 2018 via email

@makortel
Contributor

makortel commented May 2, 2018

Since "minuit" and "fitting" were mentioned, let me note that e.g. in #23065 (some of?) the *_Mean and *_Sigma histograms produced by DQMGenericClient start to show differences as well (with their input histograms staying the same). See here #23065 (comment) and onwards for more discussion there.

Just thinking if the root cause for these irreproducibilities could be the same as with hiSelectedPixelVertex.

@slava77 changed the title from "reproducibility problems in hiSelectedPixelVertex in 102X" to "reproducibility problems in hiSelectedPixelVertex RECO and in DQM fits in 102X" on May 2, 2018
@slava77
Contributor Author

slava77 commented May 2, 2018

@makortel
@makortel
I confirm there are also issues with the reproducibility of DQM plots. This is happening in many workflows, which makes checking PRs much more inconvenient.

@dan131riley

The only substantive minuit change that I see is cms-sw/root@5efb8d6#diff-3e4e3008064c935331e2a484cd9157a1 but that shouldn't make any difference for single-threaded jobs, and looking at the logs confirms that these jobs were single-threaded.

Expanding scope to all of root/math, I still don't see any changes that I would expect to affect single-threaded jobs.

@Dr15Jones
Contributor

So the DQMGenericClient histograms appear to come from here

histoY->Fit(fitFcn,"QR0","",x1,x2);
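For readers less familiar with ROOT fit option strings, a minimal sketch of what such a call does (the real fitFcn and range are defined elsewhere in DQMGenericClient; the Gaussian here is only a stand-in):

// Illustrative only: "Q" = quiet, "R" = use the specified range, "0" = do not draw the result.
#include "TF1.h"
#include "TH1.h"

void rangedFit(TH1* histoY, double x1, double x2) {
  TF1 fitFcn("fitFcn", "gaus", x1, x2);
  histoY->Fit(&fitFcn, "QR0", "", x1, x2);  // Minuit-based fit restricted to [x1, x2]
}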

@Dr15Jones
Contributor

So it appears the other fit @slava77 was thinking about is the one in HIPixelMedianVtxProducer.

@Dr15Jones
Contributor

@slava77 @makortel do either of you have an exact workflow showing the problem? I'm thinking a run of valgrind is in order.

@slava77
Contributor Author

slava77 commented May 2, 2018

140.53 step2 is the place where it happens for the reco values.

DQM diffs in jenkins show up in many places (in 25 out of 29 running workflows in one of the cases that I checked).

@makortel
Contributor

makortel commented May 2, 2018

@Dr15Jones for the DQM, see e.g. the workflow 10024.0 HARVESTING step.

@makortel
Contributor

makortel commented May 2, 2018

@dan131riley Earlier these histograms had exactly the same values, unless any of the input histograms to these fits changed. In that case there have been numerical changes also in the fit results of unrelated histograms (because of some internal state of Minuit, as far as I've understood).

@Dr15Jones
Contributor

@pcanal FYI

@Dr15Jones
Contributor

One thing I noticed is that in both places that seem to be showing differences, neither of them checks whether the fit failed.

@dan131riley

Between the versions there are substantial changes to the mathcore Fitter and hist TF1 classes, mostly for IMT and vectorization. I can certainly imagine those could change what gets left in the results of failed fits.

@dan131riley

That theory isn't looking good...I modified HIPixelMedianVtxProducer with

auto status = histo.Fit(&f1,"QN");
if (status) {
  edm::LogSystem("HIPixelMedianVtxProducer") << "Fit failed " << med + f1.GetParameter(1) << " +- " << f1.GetParError(1) << " cm";
}

and ran 140.53 step2 in CMSSW_10_2_0_pre1 and CMSSW_10_2_0_pre1_ROOT612. I got two fit failures (on the same event) in each, and they all printed the same values (to the resolution of the printout) in both releases:

Fit failed 7.6079 +- 0.145983 cm

Next step, print the result on every event and see if they ever differ, or if this is just the wrong tree.
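A sketch of that per-event printout with boosted precision (formatting is illustrative and reuses med and f1 from the snippet above; the message text matches the diff in the next comment):

// Illustrative: print every fit result with enough digits to expose tiny differences.
#include <iomanip>
#include <sstream>

std::ostringstream msg;
msg << std::setprecision(19) << "Fit result " << med + f1.GetParameter(1)
    << " +- " << f1.GetParError(1) << " cm";
edm::LogSystem("HIPixelMedianVtxProducer") << msg.str();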

@dan131riley

Looking at all fits and boosting the precision I get one difference:

-Fit result 6.041964245417202228 +- 0.001913964442836382784 cm
+Fit result 6.041964245416591162 +- 0.001913964442221971485 cm

I have no idea if that's large enough to produce the differences @slava77 saw.

@slava77
Contributor Author

slava77 commented May 2, 2018 via email

@dan131riley

The DQMGenericClient fit is actually from here:

FitSlicesYTool fitTool(srcME);

@slava77
Contributor Author

slava77 commented May 3, 2018

curiously:
in https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-05-02-1100+23119/26369/ there are no changes either in the DQM or in the HI plots.
OTOH, the same IB but a different PR, https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-05-02-1100+23065/26360/, has differences both in the HI and in the DQM plots in many workflows (not related to the PR).

It looks like the origin of the change is not completely random and is likely a feature of the (local) CMSSW build, with perhaps library memory re-mapping making a consistent difference or no difference at all.

@dan131riley

dan131riley commented May 3, 2018 via email

@slava77
Contributor Author

slava77 commented May 3, 2018 via email

@smuzaffar
Contributor

We can run the comparison for the same IB twice to see if it generates random results.

@dan131riley

Running valgrind on step4 gives thousands of uninitialized-value uses, with tracebacks like

==406690==    by 0x1BFE5AA3: Comp2RefChi2::setMessage()
==406690==    by 0x1BFED16D: MonitorElement::runQTests()
==406690==    by 0x1BFF7458: DQMStore::runQTests()
==406690==    by 0x3845FFC0: QualityTester::performTests()
==406690==    by 0x4B9254A: edm::EDAnalyzer::doEndLuminosityBlock(edm::LuminosityBlockPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*)

because a Comp2RefChi2 is getting used without being initialized. Modifying

int Ndof_; double chi2_;

to initialize these to 0 makes all of them go away, but doesn't change the results, and there are no further hints in the valgrind memcheck results.
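In code, the change described amounts to in-class initializers, roughly like this (a sketch; the class name comes from the valgrind trace, everything else is omitted):

class Comp2RefChi2 {
  // ...
  int Ndof_ = 0;      // previously left uninitialized until the quality test ran
  double chi2_ = 0.;  // same here
};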

Using ROOT 6.12 on step4 (always with the same step3 input) I reproduce the comparison failures with a 6.10 baseline, but I don't get the same differences as the PR tests. I do get consistent differences running on a single system/architecture, but I get different changes running on different systems/architectures. For example, comparing a run with 6.12 on a Skylake machine vs. a Haswell system I get comparison failures in the same places but they are numerically different and not always in the same bins. Don't know yet if it's different on every system or tied to the architecture. I don't see that kind of variation on 6.10.

The changes are all too small to discern on the comparison graphs, but still disturbing.

@dan131riley

Adding

if (startDir == "Tracking/TrackFromPV/generalFromPVPt09_trackingParticleRecoAsssociation" && namePrefix == "h_dxypulleta") {
  srcME->getTH2F()->Print("all");
  meanME->getTH1()->Print("all");
  sigmaME->getTH1()->Print("all");
}

to DQMGenericClient::computeResolution and running on two different systems, srcME is identical to the precision of the printouts, while meanME and sigmaME are indeed different:

-TH1.Print Name  = h_dxypulleta_Mean, Entries= 60, Total sum= 20.963
+TH1.Print Name  = h_dxypulleta_Mean, Entries= 60, Total sum= 22.5034

-TH1.Print Name  = h_dxypulleta_Sigma, Entries= 60, Total sum= 277.407
+TH1.Print Name  = h_dxypulleta_Sigma, Entries= 60, Total sum= 277.383

and there are lots of differences in individual bins, some pretty significant:

- fSumw[16]=19.2041, x=-1.45, error=37.0525
- fSumw[17]=8.21385, x=-1.35, error=29.8819
+ fSumw[16]=19.21, x=-1.45, error=34.4841
+ fSumw[17]=8.2056, x=-1.35, error=25.2786

Running twice on the same machine I get no differences. Next step is to package up srcME and see if I can make a small test case.
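The kind of small standalone test case meant here could look like the sketch below (the file and histogram names are hypothetical): dump srcME's TH2F to a file, then run FitSlicesY on it on each machine and compare the resulting mean/sigma histograms.

// Hypothetical standalone check, to be run on different machines / ROOT builds.
#include "TDirectory.h"
#include "TFile.h"
#include "TH1.h"
#include "TH2F.h"

void fitSlicesTest() {
  TFile f("srcME_dump.root");                             // hypothetical dump of srcME
  TH2F* h = (TH2F*)f.Get("h_dxypulleta");                 // the 2D pull-vs-eta histogram
  h->FitSlicesY();                                        // default: per-x-bin Gaussian fits
  TH1* mean  = (TH1*)gDirectory->Get("h_dxypulleta_1");   // _1 holds the fitted means
  TH1* sigma = (TH1*)gDirectory->Get("h_dxypulleta_2");   // _2 holds the fitted sigmas
  mean->Print("all");
  sigma->Print("all");
}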

@dan131riley

The inconsistencies between different platforms go away if I either change FitSlicesYTool to do

h->FitSlicesY(nullptr, 0, -1, 0, "QNR SERIAL");

instead of

h->FitSlicesY();

or if I add to the config

process.add_(cms.Service("InitRootHandlers", EnableIMT = cms.untracked.bool(False)))

to turn off IMT. This is curious, since these jobs are single-threaded. InitRootHandlers does call ROOT::EnableImplicitMT() regardless of whether the job is multi-threaded (@Dr15Jones is this a bug or a feature?). I don't see extra threads being created, but apparently just having IMT on changes the fit evaluation in system-dependent ways.
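For reference, the ROOT-level switch that the service parameter maps to looks roughly like this (a sketch, based on the observation above that InitRootHandlers calls ROOT::EnableImplicitMT()):

// Sketch of the ROOT calls behind the EnableIMT service parameter.
#include "TROOT.h"

void toggleImt(bool enable) {
  if (enable)
    ROOT::EnableImplicitMT();    // what InitRootHandlers does by default
  else
    ROOT::DisableImplicitMT();   // equivalent in effect to EnableIMT = False
}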

My test program does not, so far, reproduce this behavior.

@dan131riley

For non-threaded jobs, cmsRun does set the number of TBB threads to 1, so there aren't additional threads created. IMT does get used in ROOT::Fit::FitUtil::EvaluateChi2 to create a bunch of tasks evaluating the chi2, and this is new in 6.12. It's at least plausible that, even with only one thread, the order of task execution could still vary from machine to machine based on some internal TBB data structure--perhaps a table that gets sized according to the number of processors. I don't see any random numbers being used in the part that gets broken into TBB tasks (and I would expect much larger differences if there were). It's possible that we have some pathological cases where the order of summation makes a difference.
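A tiny illustration of that last point with made-up numbers: summing the same four values in two different chunkings gives different results, because floating-point addition is not associative.

// Prints "1 vs 0": the same data, summed in different chunkings, gives different totals.
#include <cstdio>

int main() {
  const double v[4] = {1e16, 1.0, -1e16, 1.0};
  double oneChunk  = ((v[0] + v[1]) + v[2]) + v[3];  // single chunk, left to right
  double twoChunks = (v[0] + v[1]) + (v[2] + v[3]);  // two chunks of two, then combined
  std::printf("%.17g vs %.17g\n", oneChunk, twoChunks);
  return 0;
}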

I would think we'd want to turn off IMT for the PR tests, possibly turn it off by default when we aren't multi-threaded. Setting the "SERIAL" fit option is also a possibility, especially given that we know the default MINUIT minimizer can get into trouble due to its reliance on thread local storage.

@makortel
Contributor

I think the "SERIAL" fit option would be good enough for both cases (I doubt the threading would bring any benefit in practice, and for DQMGenericClient HARVESTING is supposed to be run single-threaded anyway).

@Dr15Jones
Contributor

I also believe that "SERIAL" is the best way to go.

@dan131riley

Adding "SERIAL" to the fit in HIPixelMedianVtxProducer also resolves the one difference I saw there, so I'll make a PR with both of those.
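Concretely, the change amounts to something like this one-liner (a sketch, reusing the fit call quoted earlier in this thread):

// Request a serial (non-IMT) fit explicitly; everything else stays the same.
auto status = histo.Fit(&f1, "QN SERIAL");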

@Dr15Jones
Contributor

"the order of task execution could still vary from machine to machine based on some internal TBB data structure--perhaps a table that gets sized according to the number of processors"

If a parallel_for is being used by ROOT for this code, then I would definitely bet that TBB would 'chunk' up the processing range of the for loop differently based on the number of cores on the machine.
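A hedged sketch of that chunking effect using TBB directly (this is not the ROOT code; the values and grain size are made up): summing the same vector with different grain sizes forms different partial sums before they are combined.

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <functional>
#include <vector>

// Sum v in TBB chunks of roughly `grain` elements; different grain sizes can give
// slightly different floating-point totals because the partial sums combine differently.
double chunkedSum(const std::vector<double>& v, std::size_t grain) {
  return tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, v.size(), grain),
      0.0,
      [&](const tbb::blocked_range<std::size_t>& r, double acc) {
        for (std::size_t i = r.begin(); i != r.end(); ++i) acc += v[i];
        return acc;
      },
      std::plus<double>());
}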

@makortel
Contributor

There may be further cases of fits (or similar irreproducibilities), see #23160.

@Dr15Jones
Contributor

@dan131riley ROOT does chunk based on the number of CPUs. The chunk size for the integration is set here:
https://github.com/root-project/root/blob/5b25608c138fa826cb8bca0bebe26d3774b02179/math/mathcore/src/FitUtil.cxx#L393

where the implementation of setAutomaticChunking is here:

https://github.com/root-project/root/blob/5b25608c138fa826cb8bca0bebe26d3774b02179/math/mathcore/src/FitUtil.cxx#L1756

So different chunk sizes mean different parts being summed together in different orders.

@dan131riley

@Dr15Jones the ROOT use of the number of CPUs for chunking explains it; it didn't really make sense that TBB would use the number of physical CPUs and not the configured thread limit. PR #23230 submitted.

@dan131riley

@makortel the 140.53 diffs are actually what led @slava77 to open this issue, and the hope is that the HIPixelMedianVtxProducer patch will resolve it (it does fix the only difference I saw in the step2 vertexing, but I didn't run the entire workflow to verify).

@slava77
Contributor Author

slava77 commented May 24, 2018

@pcanal
Contributor

pcanal commented Jun 11, 2018

@dan131riley In the JIRA ticket (https://sft.its.cern.ch/jira/browse/ROOT-9422) @lmoneta mentions (and asks):

A small numerical difference is expected, not a large one. It is possible that in some cases the fit converges and in others not, if the fit is rather unstable. Do you have a reproducible test showing a large difference?

@dan131riley

@pcanal I was not able to create a reproducible test case in the time budgeted for it. I suspect we're seeing an amplification of small differences, where we get small differences in one step that are amplified in a subsequent step, but I haven't traced through the operations to see how plausible that is. These fits also have low statistics, so they could be ill-conditioned. So at this point I don't think we have convincing evidence of a ROOT issue. (I see you already fixed the automatic chunking to use the IMT pool size.)

@slava77
Contributor Author

slava77 commented Oct 4, 2018

This is probably ready to close now, after the cases mentioned in #23105 (comment) are addressed.
