[BUG] VRAM is wasted when running Lammps with multiple GPUs #4171

Open
Entropy-Enthalpy opened this issue Sep 30, 2024 · 3 comments · Fixed by #4172

Entropy-Enthalpy commented Sep 30, 2024

Bug summary

I have been using DP for a long time, and every version I have used shows the same issue: when running a LAMMPS MD simulation on multiple GPUs via mpirun, each MPI rank consumes VRAM on all GPUs, even though each rank actually computes on only one GPU.

For example, in the picture below, I requested 4 V100-SXM2-16GB GPUs for a single MD job and started 4 MPI ranks. Each rank holds about 0.3 GiB on every GPU it does not compute on, so each GPU has (4-1)*0.3=0.9 GiB of VRAM "wasted". For an 8-GPU job, this would "waste" (8-1)*0.3=2.1 GiB of VRAM per GPU. If MPS is used, the "wasted" VRAM would be doubled.
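The arithmetic above can be written out as a small sketch. It assumes one MPI rank per GPU and takes the 0.3 GiB per idle context from the numbers reported here; the function name is hypothetical, and the MPS doubling simply restates the observation above rather than a measured model.

```python
def wasted_vram_gib(n_ranks, per_context_gib=0.3, mps=False):
    """Estimate VRAM wasted on each GPU by ranks that do not compute on it."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    # Each GPU hosts one useful context plus (n_ranks - 1) idle ones.
    wasted = (n_ranks - 1) * per_context_gib
    # The report states that the waste doubles when MPS is used.
    return 2 * wasted if mps else wasted


for n in (4, 8):
    print(f"{n} GPUs: {wasted_vram_gib(n):.1f} GiB wasted per GPU")
```

The per-context figure will differ with GPU model, driver, and TensorFlow version, so it is left as a parameter.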

image

On the surface, the issue seems to arise because the TensorFlow gpu_device runtime performs a "create device" operation for every GPU in every MPI rank (as the logs show), but I don't know how to avoid it. Note that TensorFlow cannot "see" GPUs on other nodes, so the issue does not occur when running LAMMPS MD across multiple nodes with only one GPU per node.
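One possible workaround, not confirmed in this thread and not the fix linked to this issue, is to restrict each rank to a single visible GPU before the process starts, so the runtime has only one device to create. The sketch below assumes Open MPI (which sets `OMPI_COMM_WORLD_LOCAL_RANK` per process) and relies on the CUDA `CUDA_VISIBLE_DEVICES` variable; the script name `bind_gpu.py`, the helper names, and the default of 4 GPUs per node are hypothetical. Whether DeePMD-kit's own device selection behaves well with one visible device per rank is also an assumption.

```python
import os
import sys


def visible_device_for_rank(env, default_gpus):
    """Return the single GPU id that this rank should see."""
    # Open MPI exports the node-local rank of each process.
    local_rank = int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
    # Respect a device list already set by the scheduler, if any.
    devices = [d for d in env.get("CUDA_VISIBLE_DEVICES", "").split(",") if d]
    if not devices:
        devices = [str(i) for i in range(default_gpus)]
    return devices[local_rank % len(devices)]


def build_env(env, default_gpus=4):
    """Copy the environment and expose exactly one GPU to this rank."""
    new_env = dict(env)
    new_env["CUDA_VISIBLE_DEVICES"] = visible_device_for_rank(env, default_gpus)
    return new_env


if __name__ == "__main__" and len(sys.argv) > 1:
    # Replace this process with the real command, e.g. lmp_mpi -in input.lammps
    os.execvpe(sys.argv[1], sys.argv[1:], build_env(os.environ))
```

A hypothetical launch line would be `mpirun -np 4 python bind_gpu.py lmp_mpi -in input.lammps`. Because each rank then sees its GPU as device 0, any code that picks a device by global index would need to tolerate that.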

DeePMD-kit Version

3.0.0b4

Backend and its version

TensorFlow v2.15.2, LAMMPS 29Aug2024

How did you download the software?

Offline packages

Input Files, Running Commands, Error Log, etc.

Running Commands:
mpirun -np 4 lmp_mpi -in input.lammps

Part of Log:

...
2024-10-01 03:13:12.619343: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14529 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:84:00.0, compute capability: 7.0
2024-10-01 03:13:12.620016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14529 MB memory:  -> device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 7.0
2024-10-01 03:13:12.620570: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 14529 MB memory:  -> device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c4:00.0, compute capability: 7.0
2024-10-01 03:13:12.621108: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 14529 MB memory:  -> device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c5:00.0, compute capability: 7.0
2024-10-01 03:13:12.640945: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14529 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:84:00.0, compute capability: 7.0
2024-10-01 03:13:12.641605: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14529 MB memory:  -> device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 7.0
2024-10-01 03:13:12.642124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 14529 MB memory:  -> device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c4:00.0, compute capability: 7.0
2024-10-01 03:13:12.642635: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 14529 MB memory:  -> device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c5:00.0, compute capability: 7.0
2024-10-01 03:13:12.659556: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14529 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:84:00.0, compute capability: 7.0
2024-10-01 03:13:12.660457: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14529 MB memory:  -> device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 7.0
2024-10-01 03:13:12.661253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14529 MB memory:  -> device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:84:00.0, compute capability: 7.0
2024-10-01 03:13:12.661270: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 14529 MB memory:  -> device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c4:00.0, compute capability: 7.0
2024-10-01 03:13:12.662060: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14529 MB memory:  -> device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:85:00.0, compute capability: 7.0
2024-10-01 03:13:12.662095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 14529 MB memory:  -> device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c5:00.0, compute capability: 7.0
2024-10-01 03:13:12.662639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:2 with 14529 MB memory:  -> device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c4:00.0, compute capability: 7.0
2024-10-01 03:13:12.663289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:3 with 14529 MB memory:  -> device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:c5:00.0, compute capability: 7.0
...
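To check a log like the one above, the "Created device" lines can be tallied per GPU index. This is a minimal sketch: the regular expression is written against the line format shown in this excerpt, the function name is hypothetical, and the sample lines are abbreviated copies of lines from the log.

```python
import re
from collections import Counter

# Matches e.g. "Created device /job:localhost/replica:0/task:0/device:GPU:2 with ..."
CREATED = re.compile(r"Created device \S+/device:GPU:(\d+) ")


def count_created_devices(log_text):
    """Count 'Created device' log lines per GPU index."""
    return Counter(int(m.group(1)) for m in CREATED.finditer(log_text))


# Abbreviated sample lines in the format of the excerpt above.
sample = (
    "gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14529 MB memory:\n"
    "gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 14529 MB memory:\n"
    "gpu_device.cc:1886] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 14529 MB memory:\n"
)
counts = count_created_devices(sample)
print(dict(sorted(counts.items())))
```

Applied to the full excerpt, each of GPU 0 to GPU 3 appears four times, which matches 4 ranks each creating a device on all 4 GPUs.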

Steps to Reproduce

N/A

Further Information, Files, and Links

No response

njzjz added a commit to njzjz/deepmd-kit that referenced this issue Sep 30, 2024
Fix deepmodeling#4171.

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
@njzjz njzjz linked a pull request Sep 30, 2024 that will close this issue
github-merge-queue bot pushed a commit that referenced this issue Oct 6, 2024
Fix #4171.

## Summary by CodeRabbit

- **New Features**
	- Enhanced GPU selection logic for improved resource management.
	- Added support for single-frame and multi-frame computations with new parameters for atom energy and virial calculations.
	- Extended functionality for mixed-type computations in the model.

- **Bug Fixes**
	- Improved error handling during initialization and model execution.
	- Added output tensor dimension validations to ensure expected structures are maintained.

- **Documentation**
	- Clarified output tensor validation to ensure expected dimensions are maintained.

---------

Signed-off-by: Jinzhe Zeng <jinzhe.zeng@rutgers.edu>
@njzjz njzjz closed this as completed Oct 7, 2024
Entropy-Enthalpy (Author) commented

I found a similar issue with the PyTorch backend, but there only GPU 0's VRAM is "wasted".

For an 8-GPU job, it looks like this:
Image

DeePMD-kit Version

source:             v3.0.0b4-17-g8174cf11
source branch:      devel
source commit:      8174cf11
source commit at:   2024-10-11 03:20:55 +0000

LAMMPS version

LAMMPS 29Aug2024 update1

Backend stack

PyTorch 2.4.1
cuDNN 9.3.0
NVHPC 24.5 (nompi)
OpenMPI 5.0.5 (CUDA-Aware)
UCX 1.17.0 (CUDA + GDRCopy)

@njzjz njzjz reopened this Oct 12, 2024
njzjz (Member) commented Oct 13, 2024

For PyTorch, I guess c10::cuda::set_device should work. This API is not documented, though.

related discussion: https://discuss.pytorch.org/t/cuda-extension-with-multiple-gpus/160053/6

Entropy-Enthalpy (Author) commented

> For PyTorch, I guess c10::cuda::set_device should work. This API is not documented, though.
>
> related discussion: https://discuss.pytorch.org/t/cuda-extension-with-multiple-gpus/160053/6

As a user, I just know that source/api_cc/src/DeepPotPT.cc might need to be modified, but I don't know how... 🥺
