Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LowPtElectrons: updated BDT model and code base for ID #31080

Merged
merged 10 commits into from
Aug 19, 2020

Conversation

bainbrid
Copy link
Contributor

@bainbrid bainbrid commented Aug 6, 2020

This PR concerns changes to the BDT model used for the identification of LowPtElectrons.

This PR requires the following external: cms-data/RecoEgamma-ElectronIdentification#14


  1. The LowPtGsfElectronIDProducer.cc class has been adapted to handle changes to the list of features used by the new model in LowPtElectrons: updated BDT model for ID cms-data/RecoEgamma-ElectronIdentification#14

  2. We have reverse engineered some recent changes made to this package in the CMSSW_11_0_X cycle. The changes involve moving functionality back to the interface and src directories. Utility methods defined in LowPtGsfElectronFeatures.h guarantee the correct extraction of features (from both AOD and MINIAOD data tiers) for both evaluation (in CMSSW and outside) and training purposes (using workflows defined outside the cms-sw repo).

  3. These changes require the following external: LowPtElectrons: updated BDT model for ID cms-data/RecoEgamma-ElectronIdentification#14, which contains a "placeholder" model for the time being. The model weights will be updated again in a future PR once MC samples are available (in progress) for a final retraining.

Validation has been performed with the "placeholder" model and the performance is consistent with our expectations.

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 6, 2020

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 6, 2020

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-31080/17619

  • This PR adds an extra 32KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 6, 2020

A new Pull Request was created by @bainbrid for master.

It involves the following packages:

RecoEgamma/EgammaElectronProducers

@perrotta, @jpata, @cmsbuild, @slava77 can you please review it and eventually sign? Thanks.
@Sam-Harper, @jainshilpi, @rovere, @lgray, @sobhatta, @lecriste, @afiqaize, @varuns23 this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@bainbrid
Copy link
Contributor Author

bainbrid commented Aug 6, 2020

@crovelli

@bainbrid
Copy link
Contributor Author

bainbrid commented Aug 6, 2020

@cavalari

@slava77
Copy link
Contributor

slava77 commented Aug 6, 2020

@slava77
Copy link
Contributor

slava77 commented Aug 6, 2020

@cmsbuild please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 6, 2020

The tests are being triggered in jenkins.
Tested with other pull request(s) cms-data/RecoEgamma-ElectronIdentification#14

@cmsbuild
Copy link
Contributor

cmsbuild commented Aug 7, 2020

+1
Tested at: d553c18
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-9c4a8a/8642/summary.html
CMSSW: CMSSW_11_2_X_2020-08-06-1100
SCRAM_ARCH: slc7_amd64_gcc820

@bainbrid
Copy link
Contributor Author

@slava77 @afiqaize @SohamBhattacharya

Looking at this, I do not see where lowPtGsfElectronID is used; a module simply inserted in the task but which products are not consumed will not run.

Ah, true, my mistake - it is used using a standalone test we have privately, which fooled me.

My suggestion was to add it to the list of the electron IDs/userFloats in the slimmedElectrons.

I'm not sure how to do this. If we add a userFloat as you suggest, then everything else in this PR can remain the same?

OTOH, the main reason why I asked about this was because I thought that lowPtGsfElectronID is not already running (#31080 (comment) says " we have yet to schedule the module to run the new ID model in this PR." ). From the reco comparisons e.g. in https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_11_2_X_2020-08-06-1100+9c4a8a/38242/validateJR/all_OldVSNew_RunSinglePh2016Bwf136p731/ I see that this module is already running in default reco.

Yes, this is true, but this is part of the RECO sequence, while we wish to execute the ID as part of the PAT sequence. So the reRECO test you refer to can bre used to validate the performance within the RECO sequence (once we replace the placeholder model), but - most importantly - we wish to test the ID as part of the PAT sequence using MINIAOD inputs (i.e. same as for the training).

Am I correct?

@jpata
Copy link
Contributor

jpata commented Aug 14, 2020

@bainbrid if you plan to run it in PAT but don't have the final model yet, I think it would be useful to already introduce the relevant python conf as suggested above to the workflows, so everything can be tested end to end.

Perhaps we can get a feeling for the timing as well, even though the model is a placeholder.

@bainbrid
Copy link
Contributor Author

bainbrid commented Aug 14, 2020 via email

@jpata
Copy link
Contributor

jpata commented Aug 14, 2020

@slava77
Copy link
Contributor

slava77 commented Aug 14, 2020

My suggestion was to add it to the list of the electron IDs/userFloats in the slimmedElectrons.

I'm not sure how to do this. If we add a userFloat as you suggest, then everything else in this PR can remain the same?

The configuration logic for electron modifiers starts around

process.slimmedElectrons.modifierConfig.modifications = egamma_modifications

This would include both VID and more direct insertions in RecoEgamma/EgammaTools/python/egammaObjectModificationsInMiniAOD_cff.py
egamma experts may be more specific with a pointer.
@jainshilpi @lsoffi @afiqaize @SohamBhattacharya

@slava77
Copy link
Contributor

slava77 commented Aug 14, 2020

My suggestion was to add it to the list of the electron IDs/userFloats in the slimmedElectrons.

I'm not sure how to do this. If we add a userFloat as you suggest, then everything else in this PR can remain the same?

The configuration logic for electron modifiers starts around

process.slimmedElectrons.modifierConfig.modifications = egamma_modifications

This would include both VID and more direct insertions in RecoEgamma/EgammaTools/python/egammaObjectModificationsInMiniAOD_cff.py
egamma experts may be more specific with a pointer.
@jainshilpi @lsoffi @afiqaize @SohamBhattacharya

It looks like my pointers are more specific to the standard slimmedElectrons
slimmedLowPtElectrons_cfi.py seems more appropriate in this context

@bainbrid
Copy link
Contributor Author

@jainshilpi @lsoffi @afiqaize @SohamBhattacharya Is somebody available to take a look at this? I suspect it's straightforward for an expert to figure out how to implement this, but would take me some time to understand the MINIAOD tools. (I'm happy to implement if somebody can give pointers.) In short, our IDProducer (if called by something!) takes the slimmedLowPtElectrons collection as input and produces a ValueMap of floats, keyed off the slimmed collection. See my attempt (so far) here.

@jpata
Copy link
Contributor

jpata commented Aug 18, 2020

@bainbrid the main consideration right now is that the new code is included in the jenkins tests. As has been pointed out by Slava, looks like the LowPtGsfElectronIDProducer is included in the RECO sequences [1] and gives some differences with respect to the baseline [2].
We can move forward with this from RECO if we understand that these differences are acceptable. Can you confirm? If the code for testing it in PAT is not immediately available, this can come later, too.

[1] #31080 (comment)
[2] https://cmssdt.cern.ch/SDT/jenkins-artifacts/baseLineComparisons/CMSSW_11_2_X_2020-08-13-1100+9c4a8a/38323/validateJR/all_OldVSNew_TTbarwf25p0/all_OldVSNew_TTbarwf25p0c_floatedmValueMap_lowPtGsfElectronID__RECO_obj_values_.png

@bainbrid
Copy link
Contributor Author

@jpata @slava77 In Slava's comment, I followed the links to this comparison. It seems a little different to the one you point me too - perhaps just higher stats? (I will comment on this distribution.)

I can say the following: the new distribution is shifted to lower values, which would be bad for signal and good for background. It's not clear to me if we are looking at signal or bkgd or both (presumably a mix, most likely mainly bkgd), as it depends on the MC sample used for the test.

However, the new code we are testing here is likely to be incompatible with the model weights stored by default in cms-data, so a shift to lower values is probably my expectation.

If you also included cms-data/RecoEgamma-ElectronIdentification#14 in this test (please can you confirm?), then the weights would be compatible with the new code. For this scenario, I cannot say with certainty what I would expect to see, on balance probably not a shift to lower values.

In summary: a change is highly likely, a shift to lower values is quite possible, and the values don't seem crazy.

@jpata
Copy link
Contributor

jpata commented Aug 18, 2020

Thanks @bainbrid.

The test was indeed done with cms-data/RecoEgamma-ElectronIdentification#14 as can be checked in #31080 (comment).

The comparison Slava posted earlier in #31080 (comment) is with RunSinglePh2016B, therefore higher stats for low-pt electrons.

My understanding is then:

  • all the new code you added is evaluated in the tests
  • it changes the output in an expected, non-final way
  • the final model is in development, as are the python sequences to add it to PAT.

Did I get it right?

@jainshilpi
Copy link
Contributor

@jainshilpi @lsoffi @afiqaize @SohamBhattacharya Is somebody available to take a look at this? I suspect it's straightforward for an expert to figure out how to implement this, but would take me some time to understand the MINIAOD tools. (I'm happy to implement if somebody can give pointers.) In short, our IDProducer (if called by something!) takes the slimmedLowPtElectrons collection as input and produces a ValueMap of floats, keyed off the slimmed collection. See my attempt (so far) here.

@jainshilpi @lsoffi @afiqaize @SohamBhattacharya Is somebody available to take a look at this? I suspect it's straightforward for an expert to figure out how to implement this, but would take me some time to understand the MINIAOD tools. (I'm happy to implement if somebody can give pointers.) In short, our IDProducer (if called by something!) takes the slimmedLowPtElectrons collection as input and produces a ValueMap of floats, keyed off the slimmed collection. See my attempt (so far) here.

@bainbrid apologies for missing this. We need to check with our ID and reco experts. will get back to you offline on this.

@bainbrid
Copy link
Contributor Author

My understanding is then:

  • all the new code you added is evaluated in the tests

@jpata

Yes.

  • it changes the output in an expected, non-final way

Yes, changes are expected.

  • the final model is in development, as are the python sequences to add it to PAT.

Yes.

Did I get it right?

Yes, thanks!

@jpata
Copy link
Contributor

jpata commented Aug 18, 2020

Thanks! Can I ask the EGamma contacts who were already pinged @afiqaize @SohamBhattacharya to confirm that this change to the BDT output is fine in the mean time?

@SohamBhattacharya
Copy link
Contributor

SohamBhattacharya commented Aug 18, 2020 via email

@bainbrid
Copy link
Contributor Author

Thanks! Can I ask the EGamma contacts who were already pinged @afiqaize @SohamBhattacharya to confirm that this change to the BDT output is fine in the mean time?

@jpata - I'm afraid I'm probably best placed to comment on the BDT output ... but I'm afraid I cannot give a definitive answer.

The reason in short is that the reference histogram is using a rather suboptimal model and so is a poor choice for reference (but the only one we have). Naviely, I'd expect to do better than this.

However, the model we are testing is only a placeholder and so - without quite some digging - I cannot be sure if it should give better performance than the (suboptimal) default or not.

Further, the distributions are an (unknown) mixture of signal and background, so this complicates the interpretation.

Finally, I'm happy to move forward. The new distribution appears reasonably healthy in that it is smooth and there are no unusual spikes (e.g. at 0 or 1).

So green light from me. We'll know more when we have the PAT sequence and the new model.

@jpata
Copy link
Contributor

jpata commented Aug 18, 2020

Thanks @bainbrid. As you confirm it for EGamma, we can approve this PR from reco. I just want to make sure that nothing else depends on the BDT output in the mean time.

@jpata
Copy link
Contributor

jpata commented Aug 18, 2020

+1

  • tests pass with changes to the lowPtGsfElectronID output as expected using a placeholder model (confirmed by the author for EGamma)
  • updates the LowPtGsfElectronIDProducer and corresponding supporting code
  • model with final retraining and PAT sequences will come in a later PR

@SohamBhattacharya
Copy link
Contributor

Afiq and I confirm that [1] is fine as simply modifying the standard sequence to produce a new collection would be the easiest way to do this.
Since @bainbrid has already greenlit the BDT output, we're okay with these changes.

[1] https://github.com/cms-sw/cmssw/blob/355e62b7237b9e6d054c7d0fc64e1c01cc062ad4/PhysicsTools/PatAlgos/python/slimming/slimmedLowPtElectrons_cff.py

Again, sorry for the delay.

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@qliphy
Copy link
Contributor

qliphy commented Aug 19, 2020

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants