large start-of-job memory spikes #28780
A new Issue was created by @davidlange6 David Lange. @Dr15Jones, @smuzaffar, @silviodonato, @makortel, @davidlange6, @fabiocos can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
@davidlange6 See |
Single thread job, yes.
Step3 of workflow 136.888. I'll let you decide how reco-like it was..
I recall some production jobs being reported as killed for similar behavior; perhaps this is a reason behind that. But I don't have any specific threads to point to just now. [But OK, the underlying issue is perhaps more that data-intensive info is stored in XML :)]
On Jan 23, 2020, at 4:23 PM, Slava Krutelyov <notifications@github.com> wrote:
@davidlange6
was the memory profile for a single thread job, some production-like reco?
I'm trying to understand if this jump at the start above the baseline is a concern for production.
See #24432 (comment) for CPU- and memory-related notes on the tinyxml2 parts. The additional allocations needed inside tinyxml2 compared to the TMVAReader were observed during integration and seemed acceptable considering the other gains.
|
assign reconstruction, analysis |
@davidlange6 Is the y-axis of your plot RSS in the units of bytes? |
Yes (as reported by /proc/<pid>/status at one-second intervals).
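For reference, a minimal standalone sketch of such a sampler (illustrative only, not the actual monitoring script used for the plot):

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

// Poll VmRSS (reported in kB) from /proc/<pid>/status once per second.
int main(int argc, char** argv) {
  const std::string path = "/proc/" + std::string(argc > 1 ? argv[1] : "self") + "/status";
  while (true) {
    std::ifstream status(path);
    if (!status)
      break;  // target process has exited (or the pid was wrong)
    for (std::string line; std::getline(status, line);)
      if (line.rfind("VmRSS:", 0) == 0)  // e.g. "VmRSS:   123456 kB"
        std::cout << line << '\n';
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  return 0;
}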
|
@Dr15Jones and I took a look based on the IgProf stack trace. The module in question is LowPtGsfElectronSeedProducer (cmssw/RecoEgamma/EgammaElectronProducers/plugins/LowPtGsfElectronSeedProducer.h, lines 28 to 29 in 1ddd87e). Its constructor uses createGBRForest() to read the models (cmssw/RecoEgamma/EgammaElectronProducers/plugins/LowPtGsfElectronSeedHeavyObjectCache.cc, lines 115 to 117 in 1ddd87e), which itself calls the init() function (cmssw/CommonTools/MVAUtils/src/GBRForestTools.cc, lines 258 to 262 in 1ddd87e), which then parses the XML document with DOM (cmssw/CommonTools/MVAUtils/src/GBRForestTools.cc, lines 123 to 133 in 1ddd87e).
Based on the configuration (cmssw/RecoEgamma/EgammaElectronProducers/python/lowPtGsfElectronSeeds_cfi.py, lines 24 to 27 in 1ddd87e), I guess the XML files in question are
https://github.com/cms-data/RecoEgamma-ElectronIdentification/blob/master/LowPtElectrons/RunII_Autumn18_LowPtElectrons_unbiased.xml.gz
https://github.com/cms-data/RecoEgamma-ElectronIdentification/blob/master/LowPtElectrons/RunII_Autumn18_LowPtElectrons_displaced_pt_eta_biased.xml.gz
whose uncompressed sizes are 298 MB each. It is not really surprising that reading such large documents with DOM leads to large memory use, even if only temporarily.
Looking at the body of the XML files, we wonder whether a SAX parser could be used here to avoid building the entire XML document in memory as a DOM tree. From a quick look at the code, SAX does not look impossible, but it would likely require some amount of work; a sketch of the idea is below.
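For illustration, a streaming skeleton with expat (tinyxml2 itself offers no SAX interface, so this is only a sketch of the approach; the "Node" element name and the handler bodies are assumptions about the TMVA weight-file layout, not the actual GBRForestTools code):

#include <expat.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

static std::size_t nNodes = 0;  // toy payload: just count <Node> elements

// Called once per opening tag; only the current element's name and
// attributes are in memory, never a DOM tree of the whole document.
static void XMLCALL startElement(void* /*userData*/, const XML_Char* name, const XML_Char** /*atts*/) {
  if (std::strcmp(name, "Node") == 0)
    ++nNodes;  // a real version would read the attributes into the GBRForest here
}

static void XMLCALL endElement(void* /*userData*/, const XML_Char* /*name*/) {}

// Feed the (already decompressed) file through the parser in fixed-size chunks.
bool parseStreaming(std::FILE* f) {
  XML_Parser p = XML_ParserCreate(nullptr);
  XML_SetElementHandler(p, startElement, endElement);
  char buf[64 * 1024];
  bool ok = true;
  std::size_t n = 0;
  do {
    n = std::fread(buf, 1, sizeof(buf), f);
    if (XML_Parse(p, buf, static_cast<int>(n), n == 0) == XML_STATUS_ERROR) {
      ok = false;
      break;
    }
  } while (n > 0);
  XML_ParserFree(p);
  return ok;
}
|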
@guitargeek |
So, this then is independent of the number of threads. |
@mverzett |
Can't this be solved just by pre-running the conversion to GBRForest and reading the root file/conditions object instead of the xml file? |
I don't think improving the XML parsing is worth it, because this XML is already a workaround itself. All these new egamma MVAs are trained with xgboost, which can save the model as a text file. We then manually converted these text files into the TMVA XML format because that's what cmssw could deal with. I think the simpler solution here would be either to just serialize the GBRForests to ROOT files such that they can be loaded much faster, or to load the model from the original xgboost txt file (if @mverzett still has it) with a library that is very fast at parsing it, like my own one here: This FastForest is basically an even more optimized spin-off of the GBRForest, but it parses xgboost txt instead of TMVA XML. We could just copy-paste that code into cmssw. A sketch of the ROOT-serialization option is below.
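A minimal sketch of the ROOT-serialization idea, assuming a ROOT dictionary exists for GBRForest (the file names and object key here are illustrative, not an existing cmssw workflow):

#include <memory>
#include "TFile.h"
#include "CommonTools/MVAUtils/interface/GBRForestTools.h"

// One-time conversion: pay the slow XML parse once, keep a binary copy.
void convertToRoot(const edm::FileInPath& xmlPath, const char* rootPath) {
  std::unique_ptr<const GBRForest> forest = createGBRForest(xmlPath);
  TFile out(rootPath, "RECREATE");
  out.WriteObject(forest.get(), "gbrForest");
}

// At job start: read the binary copy instead of DOM-parsing 298 MB of XML.
std::unique_ptr<GBRForest> loadFromRoot(const char* rootPath) {
  TFile in(rootPath, "READ");
  return std::unique_ptr<GBRForest>(in.Get<GBRForest>("gbrForest"));
}
|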
I like this proposal/option. |
Hi,
running a random reco workflow, I notice there is a (new to me) spike at the start of the job.
[RSS vs. time plot; per the discussion above, the y-axis is RSS in bytes, sampled from /proc/<pid>/status at one-second intervals]
Ignore the two lines - the test I was doing was unrelated..
Anyway, I also randomly caught this in IgProf. Is there some way to limit the memory this XML unpacking can take? I presume neither the input file nor the output data structure is nearly this big..
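(For reading the IgProf dump below: in a MEM_TOTAL report the columns are, as far as I can tell, percentage of total allocated memory, cumulative bytes allocated, number of allocations, and the function.)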
69.3 3'279'594'277 661'891 createGBRForest(edm::FileInPath const&) [19]
59.9 2'836'863'761 384'000 edm::WorkerMaker::makeModule(edm::ParameterSet const&) const [20]
59.9 2'836'863'432 383'995 lowptgsfeleseed::HeavyObjectCache::HeavyObjectCache(edm::ParameterSet const&) [21]
52.7 2'494'229'198 387'472 tinyxml2::XMLDocument::Parse(char const*, unsigned long) [22]
34.6 1'639'647'264 398'718 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*) [23]
34.6 1'639'647'264 398'718 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'2 [24]
34.6 1'639'647'264 398'718 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'3 [25]
34.6 1'638'268'224 398'380 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'4 [26]
34.5 1'632'935'248 397'099 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'5 [27]
34.3 1'622'567'168 394'608 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'6 [28]
34.0 1'607'661'296 391'057 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'7 [29]
33.6 1'591'944'368 387'253 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'8 [30]
33.2 1'573'121'936 382'727 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'9 [31]
32.7 1'547'707'040 376'534 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'10 [32]
31.9 1'509'950'288 367'307 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'11 [33]
30.7 1'455'860'976 354'097 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'12 [34]
30.1 1'425'838'160 346'875 tinyxml2::XMLElement::ParseDeep(char*, tinyxml2::StrPair*, int*) [35]
30.1 1'425'838'160 346'875 tinyxml2::XMLElement::ParseAttributes(char*, int*) [36]
30.1 1'425'838'160 346'875 tinyxml2::XMLElement::CreateAttribute() [37]
29.0 1'371'421'120 333'412 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'13 [38]
26.4 1'249'606'000 303'721 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'14 [39]
22.8 1'081'623'120 262'619 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'15 [40]
19.6 927'220'461 82 reco::details::readGzipFile(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) [41]
18.2 864'046'000 209'733 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'16 [42]
12.7 599'349'392 146'039 tinyxml2::XMLNode::ParseDeep(char*, tinyxml2::StrPair*, int*)'17 [43]