Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GBRForest implementation without TMVA #24432

Merged
merged 19 commits into from
Oct 31, 2018
Merged

GBRForest implementation without TMVA #24432

merged 19 commits into from
Oct 31, 2018

Conversation

guitargeek
Copy link
Contributor

@guitargeek guitargeek commented Aug 31, 2018

Final Summary

  • Separated XML file parsing from the GBRForest data class itself and moved it to GBRForestTools where the parser was completely rewritten. The XML file is now parsed directly instead of translating a TMVA::Reader. This reduces startup time and CPU usage since the tree structure is only iterated over once.
  • Created new package CommonTools/MVAUtils to resolve cyclic dependencies which houses:
    • The new GBRForestTools which set up a GBRForest from (gzipped) XML files by parsing it with tinyxml2
    • GBRForestWriter to read GBRForests from DB (previously in CommonTools/Utils)
    • TMVAEvaluator and TMVAZipReader (previously in CommonTools/Utils)
  • Clean up of MVA related code in PFEgammaAlgo
  • Changed all CMSSW code which uses GBRForests to make use of this new infrastructure.
  • External PR Update data-RecoParticleFlow-PFProducer.spec cmsdist#4332 translates old text-based weights files to XML format such that the new GBRForestTools can use them

Original Message

Hi all,

recently I re-implemented the GBRForest/GBRTree for a few reasons:

  1. Unexpected behaviour with large weight files [1]
  2. Could only be constructed from TMVA readers, which need quite some boilerplate code to set up
  3. TMVA is a very heavy dependency

Reason 1. triggered my work here, because my Bayesian optimization studies for the electron ID resulted in > 20 MB weight files which I could not readily use with the current implementation.

I don't know if you want to merge it to master, but at least it would be nice to get it tested once at least so I feel confident in the results of my private studies where I use this implementation. But the bigger weight files will come sooner or later also to CMSSW :-)

Notes:

  • The implementation relies on tinyxml2 [2]. Tinyxml 1 is in the cmsdist, but it would be nice to also have a recent version. For now, I just copy-pasted the tinyxml2 source code but it would be nicer to have it in the build configuration.
  • I could not add new data members to the GBRForest, because that would break the de-serialization of old GBRForests. I would have liked to, e.g. to store the variable names, but it's OK like this. Users who want to get the variable names can just pass a string vector to the constructor and they get stored there.
  • I could not reuse the readGzipFile function in CommonTools/Utils [3], because adding that package to the BuildFile of CondFormats/EgammaObjects would introduce a cyclic dependency. It's maybe not optimal that CommonTools/Utils houses both really low-level functionality like the reading of gzip files and higher level business that requires GBRForests itself.
  • In PFProducer, there were really old TMVA weight files in the old .txt format [4]. To read them with the new GBRForest, I had to port them to the newer XML format for which I wrote a little tool [5]. The converted XML files are shipped with this PR, but if we should actually merge this there should be probably a separate PR for cms-data [4].
  • All the code touched in this PR is actually run in the RelVal tests, which is nice.
  • No changes expected; purely technical.

Cheers,
Jonas

[1] #24395
[2] https://github.com/leethomason/tinyxml2
[3] https://github.com/cms-sw/cmssw/blob/master/CommonTools/Utils/src/TMVAZipReader.cc#L21
[4] https://github.com/cms-data/RecoParticleFlow-PFProducer
[5] https://gist.github.com/guitargeek/a00c86ffd697c8c8aac5e6797a352a3c

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-24432/6260

Code check has found code style and quality issues which could be resolved by applying a patch in https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-24432/6260/git-diff.patch
e.g. curl https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-24432/6260/git-diff.patch | patch -p1

You can run scram build code-checks to apply code checks directly

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @guitargeek (Jonas Rembser) for master.

It involves the following packages:

CommonTools/Utils
CondFormats/EgammaObjects
PhysicsTools/PatAlgos
RecoEgamma/EgammaTools
RecoEgamma/ElectronIdentification
RecoEgamma/PhotonIdentification
RecoJets/JetProducers
RecoParticleFlow/PFProducer
RecoParticleFlow/PFTracking
Utilities/General

@perrotta, @smuzaffar, @Dr15Jones, @arunhep, @tocheng, @monttj, @cmsbuild, @gpetruc, @franzoni, @slava77, @ggovi, @pohsun, @arizzi, @lpernie can you please review it and eventually sign? Thanks.
@TaiSakuma, @gouskos, @yslai, @jainshilpi, @rappoccio, @Martin-Grunewald, @ahinzmann, @varuns23, @seemasharmafnal, @mmarionncern, @imarches, @makortel, @smoortga, @acaudron, @lgray, @jdolen, @drkovalskyi, @ferencek, @Sam-Harper, @rovere, @jdamgov, @nhanvtran, @gkasieczka, @tocheng, @schoef, @mmusich, @clelange, @HeinerTholen, @wddgit, @JyothsnaKomaragiri, @mverzett, @cbernet, @gpetruc, @mariadalfonso, @pvmulder, @bachtis this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@slava77
Copy link
Contributor

slava77 commented Aug 31, 2018

@cmsbuild please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 31, 2018

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-any-integration/30190/console Started: 2018/08/31 23:25

@cmsbuild
Copy link
Contributor

-1

Tested at: 0430eeb

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-24432/30190/summary.html

I found follow errors while testing this PR

Failed tests: Build HeaderConsistency

  • Build:

I found an error when building:

>> Compiling  /build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_3_X_2018-08-31-1100/src/RecoEgamma/EgammaElectronAlgos/src/TrajSeedMatcher.cc 
>> Compiling  /build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_3_X_2018-08-31-1100/src/RecoEgamma/EgammaElectronAlgos/src/SiStripElectronAlgo.cc 
In file included from /build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_3_X_2018-08-31-1100/src/RecoEgamma/ElectronIdentification/interface/ElectronMVAEstimator.h:5:0,
                 from /build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_3_X_2018-08-31-1100/src/RecoEgamma/EgammaElectronAlgos/interface/GsfElectronAlgoHeavyObjectCache.h:5,
                 from /build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_3_X_2018-08-31-1100/src/RecoEgamma/EgammaElectronAlgos/src/GsfElectronAlgoHeavyObjectCache.cc:1:
/build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_3_X_2018-08-31-1100/poison/RecoEgamma/EgammaTools/interface/GBRForestTools.h:1:2: error: #error THIS FILE HAS BEEN REMOVED FROM THE PACKAGE.
 #error THIS FILE HAS BEEN REMOVED FROM THE PACKAGE.
  ^~~~~
In file included from /build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_3_X_2018-08-31-1100/src/RecoEgamma/ElectronIdentification/interface/SoftElectronMVAEstimator.h:9:0,
                 from /build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_3_X_2018-08-31-1100/src/RecoEgamma/EgammaElectronAlgos/interface/GsfElectronAlgoHeavyObjectCache.h:6,
                 from /build/cmsbld/jenkins/workspace/ib-any-integration/CMSSW_10_3_X_2018-08-31-1100/src/RecoEgamma/EgammaElectronAlgos/src/GsfElectronAlgoHeavyObjectCache.cc:1:


@cmsbuild
Copy link
Contributor

Comparison not run due to Build errors (RelVals and Igprof tests were also skipped)

@fabiocos
Copy link
Contributor

@lpernie @tocheng @ggovi @thomreis could you please check and comment or sign? We would like to avoid to rebase this large PR if possible

@fabiocos
Copy link
Contributor

@lpernie @tocheng @ggovi @thomreis ping

@ggovi
Copy link
Contributor

ggovi commented Oct 30, 2018

+1

@@ -5,7 +5,7 @@
<use name="FWCore/Utilities"/>
<use name="DataFormats/Candidate"/>
<use name="DataFormats/PatCandidates"/>
<use name="CommonTools/Utils"/>
<use name="CommonTools/MVAUtils"/>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peruzzim @fgolf please notice this change

@thomreis
Copy link
Contributor

+1

@fabiocos
Copy link
Contributor

@smuzaffar I see an error report in https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-24432/31245/headers_chks.log is it an internal failure of the check? I think we already saw something of this kind...

@kpedro88
Copy link
Contributor

+upgrade

@fabiocos
Copy link
Contributor

+1

@fabiocos
Copy link
Contributor

merge

@lpernie @tocheng please check, this was approved by @ggovi, please sign it even after merge or comment for further updates

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.