
large start-of-job memory spikes #28780

Closed
davidlange6 opened this issue Jan 23, 2020 · 14 comments · Fixed by #31220

Comments

@davidlange6
Contributor

Hi,
running a random reco workflow, I notice there is a (new to me) memory spike:

[plot: memory usage over the job, showing a large spike at the start]

Ignore the two lines; the test I was doing was unrelated.

Anyway, I also happened to catch this in IgProf. Is there some way to limit the memory this XML unpacking can take? I presume neither the input file nor the output data structure is anywhere near this big.

69.3 3'279'594'277 661'891 createGBRForest(edm::FileInPath const&) [19]
59.9 2'836'863'761 384'000 edm::WorkerMaker::makeModule(edm::ParameterSet const&) const [20]
59.9 2'836'863'432 383'995 lowptgsfeleseed::HeavyObjectCache::HeavyObjectCache(edm::ParameterSet const&) [21]
52.7 2'494'229'198 387'472 tinyxml2::XMLDocument::Parse(char const*, unsigned long) [22]
34.6 1'639'647'264 398'718 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*) [23]
34.6 1'639'647'264 398'718 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'2 [24]
34.6 1'639'647'264 398'718 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'3 [25]
34.6 1'638'268'224 398'380 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'4 [26]
34.5 1'632'935'248 397'099 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'5 [27]
34.3 1'622'567'168 394'608 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'6 [28]
34.0 1'607'661'296 391'057 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'7 [29]
33.6 1'591'944'368 387'253 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'8 [30]
33.2 1'573'121'936 382'727 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'9 [31]
32.7 1'547'707'040 376'534 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'10 [32]
31.9 1'509'950'288 367'307 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'11 [33]
30.7 1'455'860'976 354'097 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'12 [34]
30.1 1'425'838'160 346'875 tinyxml2::XMLElement::ParseDeep(char*, tinyxml2::StrPair*, int*) [35]
30.1 1'425'838'160 346'875 tinyxml2::XMLElement::ParseAttributes(char*, int*) [36]
30.1 1'425'838'160 346'875 tinyxml2::XMLElement::CreateAttribute() [37]
29.0 1'371'421'120 333'412 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'13 [38]
26.4 1'249'606'000 303'721 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'14 [39]
22.8 1'081'623'120 262'619 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'15 [40]
19.6 927'220'461 82 reco::details::readGzipFile(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) [41]
18.2 864'046'000 209'733 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'16 [42]
12.7 599'349'392 146'039 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'17 [43]

@cmsbuild
Contributor

A new Issue was created by @davidlange6 David Lange.

@Dr15Jones, @smuzaffar, @silviodonato, @makortel, @davidlange6, @fabiocos can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@slava77
Contributor

slava77 commented Jan 23, 2020

@davidlange6
was the memory profile from a single-threaded job running some production-like reco?
I'm trying to understand whether this jump above the baseline at the start of the job is a concern for production.

See
#24432 (comment)
for CPU- and memory-related notes on the tinyxml2 parts. The additional allocations needed inside tinyxml2 compared to the TMVAReader were observed during integration and seemed acceptable given the other gains.

@davidlange6
Contributor Author

davidlange6 commented Jan 23, 2020 via email

@makortel
Contributor

assign reconstruction, analysis

@cmsbuild
Contributor

New categories assigned: analysis,reconstruction

@slava77,@santocch,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor

@davidlange6 Is the y-axis of your plot RSS in the units of bytes?

@davidlange6
Contributor Author

davidlange6 commented Jan 23, 2020 via email

@makortel
Contributor

@Dr15Jones and I took a look based on the IgProf stack trace. The lowptgsfeleseed::HeavyObjectCache is used as a GlobalCache here

class LowPtGsfElectronSeedProducer final
    : public edm::stream::EDProducer<edm::GlobalCache<lowptgsfeleseed::HeavyObjectCache> > {

Its constructor uses createGBRForest() to read the models:

for (auto& weights : conf.getParameter<std::vector<std::string> >("ModelWeights")) {
  models_.push_back(createGBRForest(edm::FileInPath(weights)));
}

which itself calls the init() function:

if (weightsFile[0] == '/') {
  gbrForest = init(weightsFile, varNames);
} else {
  edm::FileInPath weightsFileEdm(weightsFile);
  gbrForest = init(weightsFileEdm.fullPath(), varNames);
}

which then parses the XML document with DOM:

tinyxml2::XMLDocument xmlDoc;
using namespace reco::details;
if (hasEnding(weightsFileFullPath, ".xml")) {
  xmlDoc.LoadFile(weightsFileFullPath.c_str());
} else if (hasEnding(weightsFileFullPath, ".gz") || hasEnding(weightsFileFullPath, ".gzip")) {
  char* buffer = readGzipFile(weightsFileFullPath);
  xmlDoc.Parse(buffer);
  free(buffer);
}

Based on the cfi file

ModelWeights = cms.vstring([
    'RecoEgamma/ElectronIdentification/data/LowPtElectrons/RunII_Autumn18_LowPtElectrons_unbiased.xml.gz',
    'RecoEgamma/ElectronIdentification/data/LowPtElectrons/RunII_Autumn18_LowPtElectrons_displaced_pt_eta_biased.xml.gz',
]),

I guess the XML files in question are
https://github.com/cms-data/RecoEgamma-ElectronIdentification/blob/master/LowPtElectrons/RunII_Autumn18_LowPtElectrons_unbiased.xml.gz
https://github.com/cms-data/RecoEgamma-ElectronIdentification/blob/master/LowPtElectrons/RunII_Autumn18_LowPtElectrons_displaced_pt_eta_biased.xml.gz
whose uncompressed sizes are 298 MB each. It is not really surprising that reading such large documents with DOM leads to large memory use, even if only temporarily.

Looking at the body of the init() function, it seems that the code reads

  • MethodSetup.GeneralInfo, which appears to be small
  • MethodSetup.Options, which appears to be small
  • all BinaryTree elements of MethodSetup.Weights in a loop; these are large
    • the first file has 490 BinaryTree elements, and the second one 783

We wonder whether a SAX parser could be used here to avoid holding the entire XML document in memory as a DOM. From a quick look at the code, SAX does not look impossible, but it would likely require some amount of work; a rough sketch of the idea follows.
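
For illustration only (this is not the existing code): a minimal sketch of a streaming, SAX-style pass over the weights XML using Expat, which keeps memory flat by reacting to elements as they appear instead of building a DOM. The "BinaryTree" element name is taken from the discussion above; the chunk size and reading from stdin are arbitrary choices for the sketch.

#include <expat.h>
#include <cstdio>
#include <cstring>
#include <iostream>

// Minimal streaming pass: count BinaryTree elements without building a DOM.
// A real implementation would build one GBRForest tree per element here.
struct ParseState {
  int binaryTrees = 0;
};

static void XMLCALL startElement(void* userData, const XML_Char* name, const XML_Char** /*attrs*/) {
  auto* state = static_cast<ParseState*>(userData);
  if (std::strcmp(name, "BinaryTree") == 0) {
    ++state->binaryTrees;
  }
}

static void XMLCALL endElement(void* /*userData*/, const XML_Char* /*name*/) {}

int main() {
  ParseState state;
  XML_Parser parser = XML_ParserCreate(nullptr);
  XML_SetUserData(parser, &state);
  XML_SetElementHandler(parser, startElement, endElement);

  // Feed the (already decompressed) XML in fixed-size chunks; only the chunk
  // plus the parser's own bookkeeping lives in memory at any one time.
  char buffer[64 * 1024];
  size_t len;
  while ((len = std::fread(buffer, 1, sizeof(buffer), stdin)) > 0) {
    if (XML_Parse(parser, buffer, static_cast<int>(len), XML_FALSE) == XML_STATUS_ERROR) {
      std::cerr << "XML parse error\n";
      break;
    }
  }
  XML_Parse(parser, nullptr, 0, XML_TRUE);  // signal end of document
  XML_ParserFree(parser);
  std::cout << "BinaryTree elements: " << state.binaryTrees << '\n';
  return 0;
}

The trade-off is that the forest-building logic would have to be restructured around callbacks instead of navigating a document tree, which is the "some amount of work" mentioned above.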

@slava77
Contributor

slava77 commented Jan 23, 2020

@guitargeek
in case you are interested/available, please take a look.
Thank you.

@slava77
Contributor

slava77 commented Jan 23, 2020

@Dr15Jones and I took a look based on the IgProf stack trace. The lowptgsfeleseed::HeavyObjectCache is used as a GlobalCache here

So this is then independent of the number of threads.
In a normal 8-thread job with roughly a 10 GB per-job allocation, I would expect that this bump of 2.x GB is not a likely cause of trouble.
Still, an almost 3 GB bump in memory use seems excessive.

@slava77
Contributor

slava77 commented Jan 23, 2020

@mverzett
please take note, especially if some future updates may need a more complicated training.

@bendavid
Contributor

Can't this be solved just by pre-running the conversion to GBRForest and reading the ROOT file/conditions object instead of the XML file?
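
As a rough sketch of what that pre-conversion could look like, assuming GBRForest can be written to and read back from a TFile via its ROOT dictionary; the header paths, file names, and object key below are illustrative, not an actual implementation:

#include "CommonTools/MVAUtils/interface/GBRForestTools.h"  // createGBRForest
#include "CondFormats/GBRForest/interface/GBRForest.h"       // header location may differ by release
#include "FWCore/ParameterSet/interface/FileInPath.h"
#include "TFile.h"

#include <memory>
#include <string>

// One-time conversion: pay the expensive TMVA-XML parse once, offline,
// and persist the compact GBRForest object.
void convertXmlToRoot(const std::string& xmlPath, const std::string& rootPath) {
  std::unique_ptr<const GBRForest> forest = createGBRForest(edm::FileInPath(xmlPath));
  TFile out(rootPath.c_str(), "RECREATE");
  out.WriteObject(forest.get(), "gbrForest");
  out.Close();
}

// At job start-up: read the forest back with no XML parsing at all.
std::unique_ptr<const GBRForest> loadForestFromRoot(const std::string& rootPath) {
  TFile in(rootPath.c_str(), "READ");
  GBRForest* forest = nullptr;
  in.GetObject("gbrForest", forest);  // caller owns the returned object
  return std::unique_ptr<const GBRForest>(forest);
}

Storing the converted forest in the conditions DB instead of a plain ROOT file would follow the same idea, with the object delivered through the EventSetup rather than read from a local file.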

@guitargeek
Contributor

guitargeek commented Jan 23, 2020

I don't think improving the XML parsing is worth it, because this XML is already a workaround itself.

All these new egamma MVAs are trained with xgboost, which can save the model as a text file. Then, we manually converted these text files into the TMVA XML format because that's what cmssw could deal with.

I think the simpler solution here would be either to just serialize the GBRForests to ROOT files so that they can be loaded much faster, or to load the model from the original xgboost txt file (if @mverzett still has it) with a library that is very fast at parsing it, like my own one here:
https://github.com/guitargeek/XGBoost-FastForest

This FastForest is basically an even more optimized spin-off of the GBRForest, but it parses xgboost txt instead of TMVA xml. We could just copy-paste that code into cmssw.
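
For illustration, a rough usage sketch along the lines of the XGBoost-FastForest README; the exact API (fastforest::load_txt, the feature-name vector, calling the forest on a float array) should be checked against the repository, and the model path and feature names here are placeholders:

#include "fastforest.h"

#include <cmath>
#include <iostream>
#include <string>
#include <vector>

int main() {
  // Feature names as used in the xgboost training (placeholders here).
  std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4"};

  // Parse the xgboost text dump once; this is far lighter than a DOM parse of TMVA XML.
  const auto forest = fastforest::load_txt("model.txt", features);

  // Inputs must be given in the same order as the feature names above.
  std::vector<float> input{0.1f, 0.2f, 0.3f, 0.4f, 0.5f};
  const float raw = forest(input.data());
  const float prob = 1.f / (1.f + std::exp(-raw));  // sigmoid, if a probability is wanted
  std::cout << "raw = " << raw << ", prob = " << prob << '\n';
  return 0;
}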

@slava77
Contributor

slava77 commented Jan 23, 2020

Can't this be solved just by pre-running the conversion to GBRForest and reading the ROOT file/conditions object instead of the XML file?

I like this proposal/option.
Perhaps the DB/conditions is a better option.
