
Number of wiregroups is incorrectly overwritten in CMSSW_11_1_X_2020-07-25-1100 #30921

Closed
dildick opened this issue Jul 26, 2020 · 22 comments

@dildick
Contributor

dildick commented Jul 26, 2020

I tested the CSC trigger emulator for #30914 (CMSSW_11_1_X_2020-07-25-1100) on an 11_1_0_pre8 relval sample (root://cmsxrootd-site.fnal.gov//store/relval/CMSSW_11_1_0_pre8/RelValZMM_14/GEN-SIM-DIGI-RAW/111X_mcRun3_2021_realistic_v4-v1/20000/E1B44039-321A-284E-91FB-97CD10F43A48.root). I noticed that the preTrigger() function was crashing in the CSCAnodeLCTProcessor because of an invalid number of wiregroups. I added a few more print-outs, and it seems that the line numWireGroups = cscChamber_->layer(1)->geometry()->numberOfWireGroups() (https://github.com/cms-sw/cmssw/blob/CMSSW_11_1_X/L1Trigger/CSCTriggerPrimitives/src/CSCAnodeLCTProcessor.cc#L227) returns the nominal wiregroup values for every chamber except ME-1/1/11, where numberOfWireGroups returned 196609 in Run 1, Event 800, LumiSection 8. Perhaps something in the event setup was corrupted?

This results in an error:

----- Begin Fatal Exception 26-Jul-2020 18:45:28 CDT-----------------------
An exception of category 'BadAlloc' occurred while
   [0] Processing  Event run: 1 lumi: 8 event: 800 stream: 0
   [1] Running path 'L1simulation_step'
   [2] Calling method for module CSCTriggerPrimitivesProducer/'simCscTriggerPrimitiveDigis'
Exception Message:
A std::bad_alloc exception was thrown.
The job has probably exhausted the virtual memory available to the process.
----- End Fatal Exception -------------------------------------------------
@cmsbuild
Contributor

A new Issue was created by @dildick Sven Dildick.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@Dr15Jones
Contributor

assign geometry

@cmsbuild
Contributor

New categories assigned: geometry

@Dr15Jones,@cvuosalo,@mdhildreth,@makortel,@ianna,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cvuosalo
Contributor

@slomeo Could you please look at this bug? Or maybe mention it to your Muon colleagues?

@dildick
Contributor Author

dildick commented Jul 27, 2020

Deeper investigation reveals it's not the numberOfWireGroups() function.

Rather, the numWireGroups member, which is set in the first event for each ALCT processor, gets corrupted in a seemingly random event and reset to 196609. The corrupted value is not caught by this block https://github.com/cms-sw/cmssw/blob/master/L1Trigger/CSCTriggerPrimitives/src/CSCAnodeLCTProcessor.cc#L223-L244 or by this sanity check https://github.com/cms-sw/cmssw/blob/master/L1Trigger/CSCTriggerPrimitives/src/CSCAnodeLCTProcessor.cc#L246-L254.

@dildick
Contributor Author

dildick commented Jul 27, 2020

Weird. The number of wiregroups also gets corrupted in CMSSW_11_1_X_2020-07-27-1100 on /store/relval/CMSSW_11_0_0/RelValZMM_14/GEN-SIM-DIGI-RAW/110X_mcRun4_realistic_v2_2026D49noPU-v1/20000/FD2697C5-D45B-4B47-8066-0D79A874E7F5.root, although now it's ME-4/1/16 which is assigned 196609 wiregroups.

@cvuosalo
Contributor

@dildick If there is incorrect overwriting of memory, it can have seemingly random and unpredictable effects. Trying to find the instance of overwriting can be very difficult because the bad effect may be far separated from the cause.

@Dr15Jones
Contributor

The best tool to look for memory overwrite is ASAN. My suggestion is to find a recent ASAN IB and run the job using that release.

@dildick dildick changed the title CSCLayerGeometry returns invalid number of wiregroups for ME1/1 in CMSSW_11_1_X_2020-07-25-1100 Number of wiregroups is incorrectly overwritten in CMSSW_11_1_X_2020-07-25-1100 Jul 27, 2020
@cvuosalo
Contributor

@dildick Can you continue to debug this problem? You know most about it. It is looking less like a geometry issue.

@dildick
Contributor Author

dildick commented Jul 28, 2020

@cvuosalo I found a simple way around the problem, namely remove this line: https://github.com/cms-sw/cmssw/blob/master/L1Trigger/CSCTriggerPrimitives/src/CSCAnodeLCTProcessor.cc#L223. Basically, call the geometry functions for every ALCT processor, for every event.

Now, the CLCT processor code has a similar block, https://github.com/cms-sw/cmssw/blob/master/L1Trigger/CSCTriggerPrimitives/src/CSCCathodeLCTProcessor.cc#L223. Should I be concerned that the number of strips may be incorrectly overwritten as well? I don't know.

@cvuosalo
Contributor

assign l1

@cmsbuild
Contributor

New categories assigned: l1

@benkrikler,@rekovic you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cvuosalo
Contributor

I think this issue also involves the L1 code.

@cvuosalo
Contributor

@dildick You could change that if in both places in this way:

 if (numStrips == 0) {

to

 if (numStrips <= 0 || numStrips > CSCConstants::MAX_NUM_STRIPS) {

This way you avoid unnecessary calls for every event but prevent the bad values from being used.
However, until you find the memory overwriting, the program will continue to behave in a random way. You can protect the numStrips variable, but then some other variable will eventually get overwritten and cause random behavior.

@dildick
Contributor Author

dildick commented Jul 29, 2020

@Dr15Jones @smuzaffar Is there a recent 11_1_X ASAN build I can test?

@makortel
Contributor

I don't think we have ASAN builds in 11_1_X, only in 11_2_X.

@dildick
Contributor Author

dildick commented Jul 29, 2020

Ok. I squashed the 11_2_X version into a single commit (dildick@1eefd3e), cherry-picked it into the CMSSW_11_2_ASAN_X_2020-07-24-2300 build, and tried to compile, which failed [1]. Where do I go from here to further understand this?

[1]

==13285==ERROR: AddressSanitizer failed to allocate 0xdfff0001000 (15392894357504) bytes at address 2008fff7000 (errno: 12)
==13285==ReserveShadowMemoryRange failed while trying to map 0xdfff0001000 bytes. Perhaps you're using ulimit -v
/bin/sh: line 1: 13285 Aborted                 LD_PRELOAD=/cvmfs/cms-ib.cern.ch/nweek-02638/slc7_amd64_gcc820/external/gcc/8.2.0-bcolbf/lib64/libasan.so edmWriteConfigs -p /uscms_data/d3/dildick/work/ForEfe/CMSSW_11_2_ASAN_X_2020-07-24-2300/tmp/slc7_amd64_gcc820/src/L1Trigger/CSCTriggerPrimitives/plugins/CSCTriggerPrimitivesPlugins/libCSCTriggerPrimitivesPlugins.so
gmake: *** [config/SCRAM/GMake/Makefile.rules:1678: lib/slc7_amd64_gcc820/CSCTriggerPrimitivesPlugins.edmplugin] Error 1
gmake: *** Waiting for unfinished jobs....
gmake: *** [There are compilation/build errors. Please see the detail log above.] Error 2

@Dr15Jones
Contributor

@dildick I've found I can't use the FNAL machines for this because of their restriction on virtual memory. Try using a CERN development machine.
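The ReserveShadowMemoryRange failure above is consistent with such a cap: ASAN reserves roughly 16 TB of virtual address space for its shadow memory at startup, and the error message itself hints at `ulimit -v`. On a machine without the site restriction, checking and lifting the limit looks like this (whether `unlimited` is allowed depends on the site's hard limit):

```shell
# ASAN maps ~0xdfff0001000 bytes (~16 TB) of virtual address space for its
# shadow region; any `ulimit -v` cap below that aborts the process at startup.
ulimit -v             # show the current cap ("unlimited" or a value in kB)
ulimit -v unlimited   # lift the soft limit; fails if a lower hard limit is set
```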

@dildick
Contributor Author

dildick commented Aug 1, 2020

A solution was provided in #30914 and #30909

@dildick dildick closed this as completed Aug 1, 2020
@slomeo
Contributor

slomeo commented Aug 3, 2020

> @slomeo Could you please look at this bug? Or maybe mention it to your Muon colleagues?

@cvuosalo: I'm sorry for my late answer, but I was on vacation. Is it still necessary to contact my Muon colleagues?

@cvuosalo
Contributor

cvuosalo commented Aug 3, 2020

+1

@cvuosalo
Contributor

cvuosalo commented Aug 3, 2020

@slomeo It looks like this issue is resolved, so nothing more needs to be done.
