Fixing cuda crashes #483

Merged
merged 9 commits into from
Jun 15, 2020


vkhristenko

PR description:

A couple of things:

  • EventFilter/EcalRawToDigi depended, directly and indirectly, on RecoLocalCalo/EcalRecAlgos and Producers... -> dependency removed
  • EventFilter/EcalRawToDigi: all CUDA kernels/device functions moved into the plugins
  • RecoLocalCalo/EcalRecAlgos and Producers: all CUDA kernels/device functions moved into the plugins
  • RecoLocalCalo/HcalRecAlgos and Producers: all CUDA kernels/device functions moved into the plugins

The problem was caused by device-side linking across packages where it was not needed, which left symbols in the .so that were not supposed to be there. For example, running just the ECAL raw-to-digi step plus the full HCAL without this PR reproduces the same issue, and running nm on pluginEventFilterEcalRawToDigi.so shows symbols from RecoLocalCalo/EcalRecAlgos.
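
The nm check described above can be scripted. The following is a minimal sketch and not part of the PR: the helper names (`parse_nm`, `symbols_matching`, `nm_demangled`) and the `ecal::multifit` search string used in the usage note are assumptions chosen for illustration. Only the `nm -C` invocation (GNU binutils, demangled output) is taken as given.

```python
import re
import subprocess

# One line of `nm` output: an optional hex address, a one-character symbol
# type, then the (possibly demangled) symbol name.
_NM_LINE = re.compile(r"^(?:[0-9A-Fa-f]+)?\s+(\S)\s+(.+)$")


def parse_nm(output):
    """Return a list of (type, name) pairs parsed from `nm` text output."""
    symbols = []
    for line in output.splitlines():
        match = _NM_LINE.match(line)
        if match:
            symbols.append((match.group(1), match.group(2)))
    return symbols


def symbols_matching(output, needle):
    """Return the names of all symbols whose name contains `needle`."""
    return [name for _, name in parse_nm(output) if needle in name]


def nm_demangled(path):
    """Run `nm -C` on a shared object and return its text output."""
    result = subprocess.run(
        ["nm", "-C", path], check=True, capture_output=True, text=True
    )
    return result.stdout
```

For example, `symbols_matching(nm_demangled("pluginEventFilterEcalRawToDigi.so"), "ecal::multifit")` would list symbols that contain that (hypothetical) namespace string. An empty list is what one would hope to see after the dependency is removed.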

Furthermore, I think the following situation will fail:

  • we have 2 shared objects, A and B, both with host and device functions
  • B depends on A on both the host and the device side, so B is linked against A: dynamically for host code and statically for device code
  • dlopen both; running any kernel from B should then fail. This is basically what we have now. Am I right here? I do not understand whether this is actually supposed to work or not...

PR validation:

Using the executables provided as part of the release.
@mariadalfonso I do not know if you remember...
One thing to note for HCAL: I do not know what was done for the CPU, because whenever I validated CPU vs GPU I used a different branch (with a couple of fixes applied). Note that this was documented in previous PRs.

@mariadalfonso

mariadalfonso commented Jun 12, 2020

> @mariadalfonso I do not know if you remember...
> One thing to note for HCAL: I do not know what was done for the CPU, because whenever I validated CPU vs GPU I used a different branch (with a couple of fixes applied). Note that this was documented in previous PRs.

Both CPU fixes are in the main release.
A few things need to be customized at HLT (such as always using the 8-pulse fit, plus applyLegacyHBMCorrection=False) to get equivalent results. :)

@vkhristenko
Author

@mariadalfonso perfect! :)

@fwyzard

fwyzard commented Jun 12, 2020

@mariadalfonso

> A few things need to be customized at HLT (such as always using the 8-pulse fit, plus applyLegacyHBMCorrection=False) to get equivalent results. :)

For the HLT menu to use:

  • should I start from the "frozen" OnLine_HLT_2018.py menu, or from the "current" OnLine_HLT_GRun.py menu?
  • from the HCAL point of view, do I need to apply any customisation to either of them?
  • from the HCAL point of view, do they run correctly on 2018 data?

@mariadalfonso

mariadalfonso commented Jun 12, 2020

> @mariadalfonso
>
> > A few things need to be customized at HLT (such as always using the 8-pulse fit, plus applyLegacyHBMCorrection=False) to get equivalent results. :)
>
> For the HLT menu to use:
>
>   • should I start from the "frozen" OnLine_HLT_2018.py menu, or from the "current" OnLine_HLT_GRun.py menu?
>   • from the HCAL point of view, do I need to apply any customisation to either of them?
>   • from the HCAL point of view, do they run correctly on 2018 data?

  1. better to start from the "frozen" OnLine_HLT_2018.py
  2. apply the customization synchronizeHCALHLTofflineRun2(process) from https://github.com/cms-sw/cmssw/blob/CMSSW_11_1_0_pre8/HLTrigger/Configuration/python/customizeHLTforCMSSW.py#L87
  3. apply the additional customization below (this is what the GPU code implements, since this is what we have in Run 3)
    for producer in producers_by_type(process, "HBHEPhase1Reconstructor"):
        producer.algorithm.applyLegacyHBMCorrection = cms.bool( False )
        producer.algorithm.chiSqSwitch = cms.double(-1)
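
Points 2 and 3 above could be collected into a single customisation function. This is only a sketch under stated assumptions: the function name `customizeHcalForGpuComparison` is hypothetical, and it assumes that `synchronizeHCALHLTofflineRun2` modifies `process` in place and that `producers_by_type` is importable from `HLTrigger.Configuration.common`, as in the CMSSW release linked above.

```python
def customizeHcalForGpuComparison(process):
    """Hypothetical helper: apply points 2 and 3 to a process that was
    loaded from the "frozen" OnLine_HLT_2018.py menu (point 1)."""
    # CMSSW-only imports are kept inside the function, so this file can be
    # imported outside a CMSSW environment; calling it still requires CMSSW.
    import FWCore.ParameterSet.Config as cms
    from HLTrigger.Configuration.common import producers_by_type
    from HLTrigger.Configuration.customizeHLTforCMSSW import (
        synchronizeHCALHLTofflineRun2,
    )

    # Point 2: the customisation linked above (assumed to act in place).
    synchronizeHCALHLTofflineRun2(process)

    # Point 3: the extra settings quoted above, applied to every
    # HBHEPhase1Reconstructor module in the process.
    for producer in producers_by_type(process, "HBHEPhase1Reconstructor"):
        producer.algorithm.applyLegacyHBMCorrection = cms.bool(False)
        producer.algorithm.chiSqSwitch = cms.double(-1)

    return process
```

In a configuration file this would be used as `process = customizeHcalForGpuComparison(process)` after the menu has been loaded.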

With the 3 points above you get correct results on 2018 data, and the two floats energy-cpu and energy-gpu should be identical.
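
A check of that last statement could look like the sketch below. It assumes the two energy values have already been extracted into plain Python sequences in the same rechit order; the function name `compare_energies` and that input layout are assumptions, not something defined in this PR.

```python
def compare_energies(energy_cpu, energy_gpu):
    """Return the indices at which the CPU and GPU energies differ.

    Both arguments are sequences of floats in the same rechit order.
    The comparison is exact, since the values are expected to be identical.
    """
    if len(energy_cpu) != len(energy_gpu):
        raise ValueError("CPU and GPU collections have different sizes")
    return [
        index
        for index, (cpu, gpu) in enumerate(zip(energy_cpu, energy_gpu))
        if cpu != gpu
    ]
```

An empty list means the two collections agree exactly. Any returned index points at a rechit worth inspecting.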

@fwyzard added the ECAL (ECAL-related developments) and HCAL (HCAL-related developments) labels on Jun 13, 2020
@fwyzard

fwyzard commented Jun 13, 2020

@vkhristenko I have a few fixes:

  • fix a few #include statements that were still pointing to the header files in the old location
  • fix include guards and #include order
  • apply clang-format changes

Could you apply them with

curl -L https://github.com/cms-patatrack/cmssw/files/4774221/diff.txt | patch -p1

?

The diff is attached: diff.

@vkhristenko
Author

@fwyzard applied

- recHitsM0TokenOut_{produces<OProductType>("recHitsM0LabelOut")},
- recHitsLegacyTokenOut_{produces<HBHERecHitCollection>("recHitsLegacyLabelOut")} {}
+ recHitsM0TokenOut_{produces<OProductType>(ps.getParameter<std::string>("recHitsM0LabelOut"))},
+ recHitsLegacyTokenOut_{produces<HBHERecHitCollection>(ps.getParameter<std::string>("recHitsLegacyLabelOut"))} {}
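
The effect of this change can be illustrated with a toy model. It assumes that the first pair of lines above is the removed version and the second pair the added one. The two functions and the example label values are hypothetical and only mimic which string ends up as the product instance label. They are not CMSSW code.

```python
def instance_labels_before(pset):
    """Old behaviour: the parameter names themselves are used as labels,
    so whatever the configuration sets is ignored."""
    return ("recHitsM0LabelOut", "recHitsLegacyLabelOut")


def instance_labels_after(pset):
    """New behaviour: the configured values of the two parameters are used."""
    return (pset["recHitsM0LabelOut"], pset["recHitsLegacyLabelOut"])


# Hypothetical configuration values, for illustration only.
example_pset = {
    "recHitsM0LabelOut": "recHitsM0HBHE",
    "recHitsLegacyLabelOut": "recHitsLegacyHBHE",
}
```

With `example_pset`, a consumer asking for the configured label `recHitsM0HBHE` would only find it among the labels returned by `instance_labels_after`.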

With these changes, I finally have the full HLT running 👍

@fwyzard fwyzard merged commit c198dff into cms-patatrack:master Jun 15, 2020
@vkhristenko vkhristenko deleted the fix_crash branch June 15, 2020 10:44
fwyzard added a commit that referenced this pull request Oct 7, 2020
Move ECAL and HCAL CUDA code to plugins.
General cleanup: remove unused code, apply clang-format and various include changes.
Fix product labels for HCAL rechits on CPU.

Co-authored-by: Andrea Bocci <andrea.bocci@cern.ch>
fwyzard added 13 more commits with the same message that referenced this pull request: Oct 8 (3), Nov 9 (2), Nov 12 (2), Nov 16 (2), Nov 26 (2), Dec 25 and Dec 29, 2020.
Labels
bug-fix ECAL ECAL-related developments fixed HCAL HCAL-related developments