
reproducibility problems in hiSelectedPixelVertex RECO and in DQM fits in 102X #23105

Closed
slava77 opened this issue May 1, 2018 · 42 comments

@slava77
Contributor

slava77 commented May 1, 2018

There are somewhat random changes in hiSelectedPixelVertex in the RECO step of the HI workflows.
This started showing up in 20-40% of jenkins tests starting from CMSSW_10_2_X_2018-04-26-2300 (8 out of about 35 tests have this issue, while at least 60 tests before this IB do not have it).
This coincides with the merging of cms-sw/cmsdist#3957, which advanced ROOT from 6.10.09 to 6.12.07.

HI vertex reco uses either hiPixelAdaptiveVertex (which comes from hiPixelMedianVertex, which is similar to the PrimaryVertexProducer used in the pp reco) or HIPixelMedianVtxProducer, which has some Minuit-based fitting.
I suspect that the problem is in RecoHI/HiTracking/plugins/HIPixelMedianVtxProducer.cc
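For context, the kind of Minuit-based Gaussian fit done there looks roughly like the sketch below (not the actual producer code; the function and variable names are illustrative):

// Rough sketch only: a Gaussian fit around a median z estimate, as HIPixelMedianVtxProducer does.
#include "TF1.h"
#include "TH1F.h"

double fitVertexZ(TH1F& histo, double med) {
  TF1 f1("f1", "gaus");             // Gaussian model, minimized with Minuit by default
  histo.Fit(&f1, "QN");             // Q: quiet, N: do not store/draw the fitted function
  return med + f1.GetParameter(1);  // fitted mean, offset from the median seed
}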

@Dr15Jones @dan131riley please take a look at HIPixelMedianVtxProducer to see if it has some known features that may have changed with the new ROOT.

@cmsbuild
Contributor

cmsbuild commented May 1, 2018

A new Issue was created by @slava77 Slava Krutelyov.

@davidlange6, @Dr15Jones, @smuzaffar, @fabiocos can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@slava77
Contributor Author

slava77 commented May 1, 2018

assign reconstruction,core

@cmsbuild
Contributor

cmsbuild commented May 1, 2018

New categories assigned: core,reconstruction

@Dr15Jones, @smuzaffar, @slava77, @perrotta you have been requested to review this pull request/issue and eventually sign. Thanks

@Dr15Jones
Contributor

@smuzaffar @slava77 by Jenkins tests, do you mean the RelVals or the tests done for pull requests? I had thought the pull request tests were always run single-threaded.

@slava77
Contributor Author

slava77 commented May 1, 2018 via email

@makortel
Contributor

makortel commented May 2, 2018

Since "minuit" and "fitting" were mentioned, let me note that e.g. in #23065 (some of?) the *_Mean and *_Sigma histograms produced by DQMGenericClient start to show differences as well (with their input histograms staying the same). See here #23065 (comment) and onwards for more discussion there.

Just thinking if the root cause for these irreproducibilities could be the same as with hiSelectedPixelVertex.

@slava77 changed the title from "reproducibility problems in hiSelectedPixelVertex in 102X" to "reproducibility problems in hiSelectedPixelVertex RECO and in DQM fits in 102X" on May 2, 2018
@slava77
Contributor Author

slava77 commented May 2, 2018

@makortel
@makortel
I confirm there are also issues with the reproducibility of DQM plots. This is happening in many workflows, which makes checking PRs much more inconvenient.

@dan131riley

The only substantive minuit change that I see is cms-sw/root@5efb8d6#diff-3e4e3008064c935331e2a484cd9157a1 but that shouldn't make any difference for single-threaded jobs, and looking at the logs confirms that these jobs were single-threaded.

Expanding scope to all of root/math, I still don't see any changes that I would expect to affect single-threaded jobs.

@Dr15Jones
Contributor

So the DQMGenericClient histograms appear to come from here

histoY->Fit(fitFcn,"QR0","",x1,x2);
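For readers less familiar with ROOT fit option strings, a minimal sketch of what such a call does (the real fitFcn and range are defined elsewhere in DQMGenericClient; the Gaussian here is only a stand-in):

// Illustrative only: "Q" = quiet, "R" = use the specified range, "0" = do not draw the result.
#include "TF1.h"
#include "TH1.h"

void rangedFit(TH1* histoY, double x1, double x2) {
  TF1 fitFcn("fitFcn", "gaus", x1, x2);
  histoY->Fit(&fitFcn, "QR0", "", x1, x2);  // Minuit-based fit restricted to [x1, x2]
}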

@Dr15Jones
Contributor

So it appears the other fit @slava77 was thinking about is the one in HIPixelMedianVtxProducer.

@Dr15Jones
Contributor

@slava77 @makortel do either of you have an exact workflow showing the problem? I'm thinking a run of valgrind is in order.

@slava77
Contributor Author

slava77 commented May 2, 2018

140.53 step2 is the place where it happens for the reco values.

DQM diffs in jenkins show up in many places (in 25 out of 29 running workflows in one of the cases that I checked).

@makortel
Contributor

makortel commented May 2, 2018

@Dr15Jones for the DQM, see e.g. the workflow 10024.0 HARVESTING step.

@makortel
Contributor

makortel commented May 2, 2018

@dan131riley Earlier these histograms had exactly the same values, unless any of the input histograms to these fits changed. In that case there have been numerical changes also in the fit results of unrelated histograms (because of some internal state of Minuit, as far as I've understood).

@Dr15Jones
Contributor

@pcanal FYI

@Dr15Jones
Contributor

One thing I noticed is that in both places that seem to be showing differences, neither of them checks whether the fit failed.

@dan131riley

Between the versions there are substantial changes to the mathcore Fitter and hist TF1 classes, mostly for IMT and vectorization. I can certainly imagine those could change what gets left in the results of failed fits.

@dan131riley

That theory isn't looking good...I modified HIPixelMedianVtxProducer with

auto status = histo.Fit(&f1,"QN");
if (status) {
  edm::LogSystem("HIPixelMedianVtxProducer") << "Fit failed " << med + f1.GetParameter(1) << " +- " << f1.GetParError(1) << " cm";
}

and ran 140.53 step2 in CMSSW_10_2_0_pre1 and CMSSW_10_2_0_pre1_ROOT612. I got two fit failures (on the same event) in each, and they all printed the same values (to the resolution of the printout) in both releases:

Fit failed 7.6079 +- 0.145983 cm

Next step, print the result on every event and see if they ever differ, or if this is just the wrong tree.
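A sketch of that per-event printout with boosted precision (formatting is illustrative and reuses med and f1 from the snippet above; the message text matches the diff in the next comment):

// Illustrative: print every fit result with enough digits to expose tiny differences.
#include <iomanip>
#include <sstream>

std::ostringstream msg;
msg << std::setprecision(19) << "Fit result " << med + f1.GetParameter(1)
    << " +- " << f1.GetParError(1) << " cm";
edm::LogSystem("HIPixelMedianVtxProducer") << msg.str();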

@dan131riley

Looking at all fits and boosting the precision I get one difference:

-Fit result 6.041964245417202228 +- 0.001913964442836382784 cm
+Fit result 6.041964245416591162 +- 0.001913964442221971485 cm

I have no idea if that's large enough to produce the differences @slava77 saw.

@slava77
Contributor Author

slava77 commented May 2, 2018 via email

@dan131riley

The DQMGenericClient fit is actually from here:

FitSlicesYTool fitTool(srcME);

@slava77
Contributor Author

slava77 commented May 3, 2018

curiously:
in https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-05-02-1100+23119/26369/ there are no changes either in the DQM or in the HI plots.
OTOH, the same IB but a different PR, https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_10_2_X_2018-05-02-1100+23065/26360/, has differences both in the HI and in the DQM plots in many workflows (not related to the PR).

It looks like the origin of the change is not completely random and is likely a feature of the (local) CMSSW build, with perhaps library memory re-mapping making a consistent difference or no difference at all.

@dan131riley

dan131riley commented May 3, 2018 via email

@slava77
Contributor Author

slava77 commented May 3, 2018 via email

@smuzaffar
Contributor

We can run the comparison for the same IB twice to see if it generates random results.

@dan131riley

Running valgrind on step4 gives thousands of uninitialized-value uses, with tracebacks like

==406690==    by 0x1BFE5AA3: Comp2RefChi2::setMessage()
==406690==    by 0x1BFED16D: MonitorElement::runQTests()
==406690==    by 0x1BFF7458: DQMStore::runQTests()
==406690==    by 0x3845FFC0: QualityTester::performTests()
==406690==    by 0x4B9254A: edm::EDAnalyzer::doEndLuminosityBlock(edm::LuminosityBlockPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*)

because a Comp2RefChi2 is getting used without being initialized. Modifying

int Ndof_; double chi2_;

to initialize these to 0 makes all of them go away, but doesn't change the results, and there are no further hints in the valgrind memcheck results.
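In code, the change described amounts to in-class initializers, roughly like this (a sketch; the class name comes from the valgrind trace, everything else is omitted):

class Comp2RefChi2 {
  // ...
  int Ndof_ = 0;      // previously left uninitialized until the quality test ran
  double chi2_ = 0.;  // same here
};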

Using ROOT 6.12 on step4 (always with the same step3 input) I reproduce the comparison failures with a 6.10 baseline, but I don't get the same differences as the PR tests. I do get consistent differences running on a single system/architecture, but I get different changes running on different systems/architectures. For example, comparing a run with 6.12 on a Skylake machine vs. a Haswell system I get comparison failures in the same places but they are numerically different and not always in the same bins. Don't know yet if it's different on every system or tied to the architecture. I don't see that kind of variation on 6.10.

The changes are all too small to discern on the comparison graphs, but still disturbing.

@dan131riley

Adding

if (startDir == "Tracking/TrackFromPV/generalFromPVPt09_trackingParticleRecoAsssociation" && namePrefix == "h_dxypulleta") {
  srcME->getTH2F()->Print("all");
  meanME->getTH1()->Print("all");
  sigmaME->getTH1()->Print("all");
}

to DQMGenericClient::computeResolution and running on two different systems, srcME is identical to the precision of the printouts, while meanME and sigmaME are indeed different:

-TH1.Print Name  = h_dxypulleta_Mean, Entries= 60, Total sum= 20.963
+TH1.Print Name  = h_dxypulleta_Mean, Entries= 60, Total sum= 22.5034

-TH1.Print Name  = h_dxypulleta_Sigma, Entries= 60, Total sum= 277.407
+TH1.Print Name  = h_dxypulleta_Sigma, Entries= 60, Total sum= 277.383

and there are lots of differences in individual bins, some pretty significant:

- fSumw[16]=19.2041, x=-1.45, error=37.0525
- fSumw[17]=8.21385, x=-1.35, error=29.8819
+ fSumw[16]=19.21, x=-1.45, error=34.4841
+ fSumw[17]=8.2056, x=-1.35, error=25.2786

Running twice on the same machine I get no differences. Next step is to package up srcME and see if I can make a small test case.
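The kind of small standalone test case meant here could look like the sketch below (the file and histogram names are hypothetical): dump srcME's TH2F to a file, then run FitSlicesY on it on each machine and compare the resulting mean/sigma histograms.

// Hypothetical standalone check, to be run on different machines / ROOT builds.
#include "TDirectory.h"
#include "TFile.h"
#include "TH1.h"
#include "TH2F.h"

void fitSlicesTest() {
  TFile f("srcME_dump.root");                             // hypothetical dump of srcME
  TH2F* h = (TH2F*)f.Get("h_dxypulleta");                 // the 2D pull-vs-eta histogram
  h->FitSlicesY();                                        // default: per-x-bin Gaussian fits
  TH1* mean  = (TH1*)gDirectory->Get("h_dxypulleta_1");   // _1 holds the fitted means
  TH1* sigma = (TH1*)gDirectory->Get("h_dxypulleta_2");   // _2 holds the fitted sigmas
  mean->Print("all");
  sigma->Print("all");
}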

@dan131riley

The inconsistencies between different platforms go away if I either change FitSlicesYTool to do

h->FitSlicesY(nullptr, 0, -1, 0, "QNR SERIAL");

instead of

h->FitSlicesY();

or if I add to the config

process.add_(cms.Service("InitRootHandlers", EnableIMT = cms.untracked.bool(False)))

to turn off IMT. This is curious, since these jobs are single-threaded. InitRootHandlers does call ROOT::EnableImplicitMT() regardless of whether the job is multi-threaded (@Dr15Jones is this a bug or a feature?). I don't see extra threads being created, but apparently just having IMT on changes the fit evaluation in system-dependent ways.
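For reference, the ROOT-level switch that the service parameter maps to looks roughly like this (a sketch, based on the observation above that InitRootHandlers calls ROOT::EnableImplicitMT()):

// Sketch of the ROOT calls behind the EnableIMT service parameter.
#include "TROOT.h"

void toggleImt(bool enable) {
  if (enable)
    ROOT::EnableImplicitMT();    // what InitRootHandlers does by default
  else
    ROOT::DisableImplicitMT();   // equivalent in effect to EnableIMT = False
}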

My test program does not, so far, reproduce this behavior.

@dan131riley

For non-threaded jobs, cmsRun does set the number of TBB threads to 1, so there aren't additional threads created. IMT does get used in ROOT::Fit::FitUtil::EvaluateChi2 to create a bunch of tasks evaluating the chi2, and this is new in 6.12. It's at least plausible that, even with only one thread, the order of task execution could still vary from machine to machine based on some internal TBB data structure--perhaps a table that gets sized according to the number of processors. I don't see any random numbers being used in the part that gets broken into TBB tasks (and I would expect much larger differences if there were). It's possible that we have some pathological cases where the order of summation makes a difference.
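A tiny illustration of that last point with made-up numbers: summing the same four values in two different chunkings gives different results, because floating-point addition is not associative.

// Prints "1 vs 0": the same data, summed in different chunkings, gives different totals.
#include <cstdio>

int main() {
  const double v[4] = {1e16, 1.0, -1e16, 1.0};
  double oneChunk  = ((v[0] + v[1]) + v[2]) + v[3];  // single chunk, left to right
  double twoChunks = (v[0] + v[1]) + (v[2] + v[3]);  // two chunks of two, then combined
  std::printf("%.17g vs %.17g\n", oneChunk, twoChunks);
  return 0;
}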

I would think we'd want to turn off IMT for the PR tests, possibly turn it off by default when we aren't multi-threaded. Setting the "SERIAL" fit option is also a possibility, especially given that we know the default MINUIT minimizer can get into trouble due to its reliance on thread local storage.

@makortel
Contributor

I think the "SERIAL" fit option would be good enough for both cases (I doubt the threading would bring any benefit in practice, and for DQMGenericClient HARVESTING is supposed to be run single-threaded anyway).

@Dr15Jones
Contributor

I also believe that "SERIAL" is the best way to go.

@dan131riley

Adding "SERIAL" to the fit in HIPixelMedianVtxProducer also resolves the one difference I saw there, so I'll make a PR with both of those.
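Concretely, the change amounts to something like this one-liner (a sketch, reusing the fit call quoted earlier in this thread):

// Request a serial (non-IMT) fit explicitly; everything else stays the same.
auto status = histo.Fit(&f1, "QN SERIAL");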

@Dr15Jones
Contributor

"the order of task execution could still vary from machine to machine based on some internal TBB data structure--perhaps a table that gets sized according to the number of processors"

If a parallel_for is being used by ROOT for this code, then I would definitely bet that TBB would 'chunk' up the processing range of the for loop differently based on the number of cores on the machine.
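A hedged sketch of that chunking effect using TBB directly (this is not the ROOT code; the values and grain size are made up): summing the same vector with different grain sizes forms different partial sums before they are combined.

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>
#include <functional>
#include <vector>

// Sum v in TBB chunks of roughly `grain` elements; different grain sizes can give
// slightly different floating-point totals because the partial sums combine differently.
double chunkedSum(const std::vector<double>& v, std::size_t grain) {
  return tbb::parallel_reduce(
      tbb::blocked_range<std::size_t>(0, v.size(), grain),
      0.0,
      [&](const tbb::blocked_range<std::size_t>& r, double acc) {
        for (std::size_t i = r.begin(); i != r.end(); ++i) acc += v[i];
        return acc;
      },
      std::plus<double>());
}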

@makortel
Contributor

There may be further cases of fits (or similar irreproducibilities), see #23160.

@Dr15Jones
Contributor

@dan131riley ROOT does chunk based on the number of CPUs. The chunk size for the integration is set here:
https://github.com/root-project/root/blob/5b25608c138fa826cb8bca0bebe26d3774b02179/math/mathcore/src/FitUtil.cxx#L393

where the implementation of setAutomaticChunking is here:

https://github.com/root-project/root/blob/5b25608c138fa826cb8bca0bebe26d3774b02179/math/mathcore/src/FitUtil.cxx#L1756

So different chunk sizes mean different parts being summed together in different orders.

@dan131riley

@Dr15Jones the ROOT use of the number of CPUs for chunking explains it; it didn't really make sense that TBB would use the number of physical CPUs and not the configured thread limit. PR #23230 submitted.

@dan131riley

@makortel the 140.53 diffs are actually what led @slava77 to open this issue, and the hope is that the HIPixelMedianVtxProducer patch will resolve it (it does fix the only difference I saw in the step2 vertexing, but I didn't run the entire workflow to verify).

@slava77
Contributor Author

slava77 commented May 24, 2018

@pcanal
Contributor

pcanal commented Jun 11, 2018

@dan131riley In the JIRA ticket (https://sft.its.cern.ch/jira/browse/ROOT-9422) @lmoneta mentions (and asks):

A small numerical difference is expected, not a large one. It is possible that in some cases the fit converges and in others not, if the fit is rather unstable. Do you have a reproducible test showing a large difference?

@dan131riley

@pcanal I was not able to create a reproducible test case in the time budgeted for it. I suspect we're seeing an amplification of small differences, where we get small differences in one step that are amplified in a subsequent step, but I haven't traced through the operations to see how plausible that is. These fits also have low statistics, so they could be ill-conditioned. So at this point I don't think we have convincing evidence of a ROOT issue. (I see you already fixed the automatic chunking to use the IMT pool size.)

@slava77
Contributor Author

slava77 commented Oct 4, 2018

This is probably ready to close now, after the cases mentioned in #23105 (comment) are addressed.
