[BUG] `dp test` raises OOM | 2.0.0 (#1149)
Comments
See #748 (comment)
We may want an automatic adjustment of the testing batch size...
Here is something that can be referred to: Lightning-AI/pytorch-lightning#1638
Is it now possible to complete an “all-data” dp test?
njzjz added a commit to njzjz/deepmd-kit that referenced this issue (Sep 22, 2021):
Resolves deepmodeling#1149. We start nbatch * natoms at 1024 (or we can set a different number) and iteratively multiply it by 2 until catching the OOM error. A small issue is that catching the TF OOM error is a bit slow; that is a TF problem and I don't know how to resolve it. Luckily we only need to catch it once.
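Below is a minimal sketch of the doubling strategy the commit note describes, assuming a hypothetical `evaluate(nframes)` callable that runs one test batch through the TensorFlow session. All names are illustrative; this is not deepmd-kit's actual implementation.

```python
import tensorflow as tf


def auto_batch_size(evaluate, natoms, start=1024):
    """Grow the atom budget (nbatch * natoms) by factors of 2 until
    TensorFlow raises its OOM error, then return the batch size for the
    last budget that fit in memory. `evaluate` is a hypothetical
    stand-in for running one test batch."""
    budget = start
    last_good = None
    while True:
        nframes = max(budget // natoms, 1)  # frames per batch for this budget
        try:
            evaluate(nframes)
        except tf.errors.ResourceExhaustedError:
            # TF's OOM exception; catching it is slow, but as the commit
            # note says, it only has to happen once.
            break
        last_good = budget
        budget *= 2
        # A real implementation would also cap the budget at the dataset
        # size so the loop terminates even when memory never runs out.
    return max(last_good // natoms, 1) if last_good is not None else 1


# Usage sketch: pick a batch size once, then run the whole test with it.
# nframes = auto_batch_size(run_one_batch, natoms=192)
```

Doubling from a small starting budget keeps the search logarithmic in the final batch size, so the slow OOM catch fires at most once per test run.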
Summary
Running `dp test` with deepmd-kit 2.0.0 raises an OOM error. This was not seen with the previous version on the same system.
Deepmd-kit version, installation method, input file, running commands, error log, etc.
version: 2.0.0_release
error:
command: (a hedged example invocation is sketched below)
platform: ALI-EHPC
machine type: P100_4_30 & T4_4_15 (training is OK, but testing raises OOM)
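For reference, a sketch of a typical `dp test` invocation; the model file name, system path, and frame count are placeholders, not taken from the original report:

```shell
# Test a frozen model against a data system; -n limits the number of
# tested frames. Paths below are placeholders.
dp test -m frozen_model.pb -s /path/to/test/system -n 100
```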
Steps to Reproduce
Further Information, Files, and Links