Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segmentation fault in HcalOfflineHarvesting in recent IB #29605

Closed
Dr15Jones opened this issue Apr 30, 2020 · 17 comments · Fixed by #29738
Closed

Segmentation fault in HcalOfflineHarvesting in recent IB #29605

Dr15Jones opened this issue Apr 30, 2020 · 17 comments · Fixed by #29738

Comments

@Dr15Jones
Copy link
Contributor

We are seeing segmentation faults in 3 of the integration build workflows. The stack trace is

#4  <signal handler called>
#5  0x00002ab08a2ea5b8 in hcaldqm::DigiRunSummary::endLuminosityBlock(dqm::implementation::IBooker&amp, dqm::implementation::IGetter&amp, edm::LuminosityBlock const&amp, edm::EventSetup const&amp) () from /cvmfs/cms-ib.cern.ch/nweek-02626/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_1_X_2020-04-30-1100/lib/slc7_amd64_gcc820/libDQMHcalTasks.so
#6  0x00002ab08a22d7d8 in HcalOfflineHarvesting::_dqmEndLuminosityBlock(dqm::implementation::IBooker&amp, dqm::implementation::IGetter&amp, edm::LuminosityBlock const&amp, edm::EventSetup const&amp) () from /cvmfs/cms-ib.cern.ch/nweek-02626/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_1_X_2020-04-30-1100/lib/slc7_amd64_gcc820/pluginDQMHcalTasksAuto.so
#7  0x00002ab08a1fa0ca in DQMEDHarvester::endLuminosityBlockProduce(edm::LuminosityBlock&amp, edm::EventSetup const&amp) () from /cvmfs/cms-ib.cern.ch/nweek-02626/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_1_X_2020-04-30-1100/lib/slc7_amd64_gcc820/pluginDQMHcalTasksAuto.so

The likely culprits are either #29544 or #29586. The former changed code in that same package while the latter could be causing a module that was not being run before to now be run.

@cmsbuild
Copy link
Contributor

A new Issue was created by @Dr15Jones Chris Jones.

@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

assign dqm

@cmsbuild
Copy link
Contributor

New categories assigned: dqm

@jfernan2,@andrius-k,@schneiml,@fioriNTU,@kmaeshima you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

Workflows in question are 4.28, 134.807, and 136.757.

Given that the segfault occurs in the HARVESTING step, I would guess #29586 to be unlikely the cause (still possible if running "unexpected" module in DQM step could somehow trigger a segfault in HARVESTING). Also all DQM modules use the Accumulator extension (right?), so they should have been running also before #29586 .

@makortel
Copy link
Contributor

I ran 4.28 on my developer area of CMSSW_11_1_X_2020-04-21-23000+#29586 and it succeeded. So #29544 looks more probable cause.

@jfernan2
Copy link
Contributor

@lwang046 It seems your PR is causing trouble that we did not detect.
Can you please have a look?

@Dr15Jones
Copy link
Contributor Author

The code crashes here
https://github.com/cms-sw/cmssw/blob/master/DQM/HcalTasks/src/DigiRunSummary.cc#L112

because meNumEvents is nullptr. The variable is set in the line just above
https://github.com/cms-sw/cmssw/blob/master/DQM/HcalTasks/src/DigiRunSummary.cc#L111

so the request from the IGetter is failing. The value for _subsystem appears to be 'Hcal' making the requested MonitorElement 'Hcal/RunInfo/NumberOfEvents'.

@lwang046
Copy link
Contributor

lwang046 commented May 1, 2020

Hi @jfernan2 @Dr15Jones, may I ask is there some principle differences between these 4 workflows and the other workflows? I'm a bit puzzled why the other workflows can get this ME but these 4 cannot. The problem indeed is related to getBinContent, when I comment out this meNumEvents related codes, another newly booked ME cOccupancy_depth also has issue to getBinContent. Both of these are lumi based histograms. I met this same bug about a months ago, but then one day after a certain IB this bug suddenly disappeared for all the other runTheMatrix test workflows. Do you remember if there was a PR during this time that might affect getting lumi based histograms or have similar effects?

@schneiml
Copy link
Contributor

schneiml commented May 1, 2020

Hi @lwang046 ,

A few comments:

  • Harvesting code must always check the return value of IGetter::get for nullptr. Lots of code does not do that today, and this causes crashes every now and then. Histograms can be missing for all sorts of reasons, typically DQM configuration changes in the RECO step, and harvesting should not crash in this situation.
  • There are a lot of changes in CMSSW_11_1 compared to earlier releases due to the "new DQMStore" work. However, this is all more than a month ago. See also the new README [1].
  • My guess is that the problem is due to changes in the DQM or HARVESTING sequences. It does happen that runTheMatrix workflows use misconfigured sequences that work by accident, but fail on seemingly unrelated changes.

[1] https://github.com/cms-sw/cmssw/tree/master/DQMServices/Core

@makortel
Copy link
Contributor

makortel commented May 1, 2020

@schneiml

  • Harvesting code must always check the return value of IGetter::get for nullptr. Lots of code does not do that today, and this causes crashes every now and then. Histograms can be missing for all sorts of reasons, typically DQM configuration changes in the RECO step, and harvesting should not crash in this situation.

Just a thought, would it make sense for IGetter::get to throw an exception (e.g. with the ME name) in case the requested ME is missing?

@lwang046
Copy link
Contributor

lwang046 commented May 4, 2020

Dear @schneiml , @makortel, in an attempt to fix this issue, I found that the null MEs were booked using the the scope concept, auto scope = DQMStore::IBooker::UseLumiScope(ib);. May I ask if this scope has compatibility issue with DQMOneEDAnalyzer + edm::LuminosityBlockCache?

@schneiml
Copy link
Contributor

schneiml commented May 4, 2020

@makortel We used to have an API that did that, but nobody used it, so I removed it. We should not change the default behaviour, since a lot of code handles missing MEs just fine (and I think there are valid reasons for MEs to be missing in harvesting, e.g. conditional booking in RECO). Also for general robustness the HARVESTING code should not blow up just because RECO config changed (pragmatically; arguably we should always make sure that the harvesting is configured correctly to match the RECO config, but in practice we'd blow up Tier0 a lot if we assumed that).

@lwang046 Reading a lumi-saved ME in endLumi harvesting code really should work. Since my recent PR (#29321) it should also be possible to per-lumi save MEs in DQMOneEDAnalyzer. Let me have a look.

@schneiml
Copy link
Contributor

schneiml commented May 4, 2020

Ok, this looks a lot like a bug with #29321, maybe caused by the fact that these WF now actually do concurrent lumis? Investigating...

@schneiml
Copy link
Contributor

schneiml commented May 4, 2020

Ok, the issue is actually pretty obvious if you think about it the right way:

The symptom is that the per-lumi MEs in question (NumberOfEvents) are not present in all lumisections.

The reason for that is that they are booked in a DQMOneEDAnalyzer, which switches lumisections per-event, to not rely on lumisection transitions. However, this means that the per-lumi MEs are never created for the lumisections that never "happend" for this Module: the empty lumisections with no events.

Now, there are a lot of possible solutions:

  • Consider it a feature. If the ME is not present, that is because there was no data for this lumisection. (But that is inconsistent with the behaviour of DQMEDAnalyzer).
  • Fix this by creating per-lumi MEs for each module centrally, on globalBeginLumi/globalEndLumi. This means more work for the DQMStore on the edm service callbacks. I think we should avoid that (are there any implications for parallel running on edm service callbacks?), but it is the cleanest solution: initialize the per lumi MEs for all modules in the global transition callback; enterLumi can then assume the global ME to be there already.
  • Fix this in DQMOneEDAnalyzer by doing an enterLumi in global lumi transitions, in addition to the per-event switching. But to get exposed to the global transitions we need to declare a cache, which conflicts with derived classes declaring a cache. Could probably be done with a bit of fiddling around with templates.
  • Fix this only for the affected modules, by calling enterLumi from subsystem code.

I don't like the first option much, because it violates the invariant "all lumis in a run have the same MEs". We don't really rely on that anywhere (and legacy/harvesting modules can always violate it), but it is nice to have. I like the second option, though we should not go for that if @makortel or @Dr15Jones have concerns about it. The last option might be a minimally-invasive fix, but it is not as clean as I'd like.

@makortel
Copy link
Contributor

makortel commented May 4, 2020

@schneiml Thanks for the detailed analysis! I share the dislike of option 1 pretty much for the same reasons. About the option 2

  • Fix this by creating per-lumi MEs for each module centrally, on globalBeginLumi/globalEndLumi. This means more work for the DQMStore on the edm service callbacks. I think we should avoid that (are there any implications for parallel running on edm service callbacks?), but it is the cleanest solution: initialize the per lumi MEs for all modules in the global transition callback; enterLumi can then assume the global ME to be there already.

I did not fully understand your concern. In beginLumi (endLumi) transitions (callbacks) the modules are run concurrently (following the data dependencies).

@Dr15Jones
Copy link
Contributor Author

(are there any implications for parallel running on edm service callbacks?)

A service must be able to handle all ActivityRegistry service callbacks in a thread-safe manner. For this case, multiple module level 'endLuminosityBlock' callbacks (i.e. PreModuleGlobalBeginLumi and PostModuleGlobalBeginLumi) can be happening to the Services simultaneously since multiple modules could be running concurrently as well as multiple LuminosityBlocks could be ending concurrently.

@schneiml
Copy link
Contributor

schneiml commented May 4, 2020

@makortel As Chris correctly guessed this is about these callbacks: https://github.com/cms-sw/cmssw/blob/master/DQMServices/Core/src/DQMStore.cc#L680-L694

From @Dr15Jones answer I'd say it is ok to do more work there (apart from the edm::Service interface maybe being deprecated in general at some point?). I wanted to get away without these callbacks (apart from for supporting legacy modules), but the cleanupLumi ones are needed anyways, so I think it should be ok to also have a initLumi (symmetrical to cleanupLumi) there. I have that mostly implemented by now, just need to get it to actually work...

Note that while these callbacks are threadsafe, they are of course threadsafe by taking a lock for essentially the entire duration. I don't expect initLumi to take very long, but it might need to initialize a few thousand TH1s in rare cases.

schneiml added a commit to schneiml/cmssw that referenced this issue May 5, 2020
As reported in cms-sw#29605, it can happen that a DQMOneEDAnalyzer does not
produce it's per lumi MEs, because there were no events in the
Lumisection and it only calls enterLumi as needed per event.

To prevent this, we need to make sure that lumi ME are always created
for every ME, but we can't have lumitransitions in DQMOneEDAnalyzer, so
it needs to happen in a global callback. But there we can't safely use
enterLumi, since that would corrupt the module's local MEs.

The solution is to have a new method, initLumi, dedicated to
initializing global MEs, but not touching local MEs. This is actually a
nice thing in general, since now initLumi/cleanupLumi form a symmetrical
pair (create/destroy global MEs), as well as enterLumi/leavLumi (update
local MEs). All of these are idempotent, as before.

There are a bunch of corner cases aorund booking: initLumi *must* be
called before enterLumi, but the global callback that triggers it
globally might (will) happen before booking has happened for the module.
So we also need to do it after booking. There is a race related to lumi
MEs if globalBeginLumi can happen *before* beginRun finishes for all
plugins. This should not happen for now.
@silviodonato silviodonato reopened this May 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants