GPU Memory Leak on Loading Pre-Trained Checkpoint #6515

Closed
bilzard opened this issue Feb 2, 2022 · 8 comments · Fixed by #6516
Labels
bug Something isn't working

Comments

@bilzard
Contributor

bilzard commented Feb 2, 2022

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Training

Bug

Training YOLOv5 from a checkpoint file (*.pt) consumes more GPU memory than training from a pre-trained weight name (e.g. yolov5l).

Environment

  • YOLO: YOLOv5 (latest; how do I check the YOLO version?)
  • CUDA: 11.6 (Tesla T4, 15360MiB)
  • OS: Ubuntu 18.04.6 LTS (Bionic Beaver)
  • Python: 3.8.12

Minimal Reproducible Example

In the training commands below, case 2 requires more GPU memory than case 1.

# 1. train from pre-trained model
train.py ... --weights yolov5l

# 2. train from pre-trained checkpoint
train.py ... --weights pre_trained_checkpoint.pt
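
One way to quantify the difference from inside the training process (this helper is illustrative and not part of the YOLOv5 repo) is to print PyTorch's CUDA allocator statistics right after the model and checkpoint have been set up, in both cases:

import torch

def report_cuda_memory(tag: str) -> None:
    # Print how much CUDA memory PyTorch has allocated and reserved so far.
    torch.cuda.synchronize()
    allocated_mib = torch.cuda.memory_allocated() / 2 ** 20
    reserved_mib = torch.cuda.memory_reserved() / 2 ** 20
    print(f'{tag}: allocated {allocated_mib:.0f} MiB, reserved {reserved_mib:.0f} MiB')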

Additional

As reported on the PyTorch forum[1], loading a state dict directly onto a CUDA device causes a memory leak. We should load it into CPU memory instead:

state_dict = torch.load(directory, map_location=lambda storage, loc: storage)
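
For context, a minimal sketch of the full pattern (illustrative only; the stand-in model and the assumption that the file holds a plain state_dict are mine, and the actual YOLOv5 checkpoint layout may differ):

import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for the real network
ckpt = torch.load('pre_trained_checkpoint.pt', map_location='cpu')  # deserialize into CPU RAM only
model.load_state_dict(ckpt)  # copy the weights while everything is still on the CPU
model.to('cuda')             # move the model to the GPU only after loading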

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
@bilzard bilzard added the bug Something isn't working label Feb 2, 2022
@glenn-jocher
Member

@bilzard thanks for the PR! Would it make more sense (less code or easier to understand) to just load directly on CPU with one of these other options? i.e.

state_dict = torch.load(directory, map_location=lambda storage, loc: storage)
state_dict = torch.load(directory)  # option 2
state_dict = torch.load(directory, map_location=torch.device('cpu'))  # option 3

@bilzard
Contributor Author

bilzard commented Feb 4, 2022

@glenn-jocher Option 2 shouldn't work because by default the tensors are loaded onto the GPU.
Option 3 seems OK according to the official documentation[1]; let me check if it works.

When you call torch.load() on a file which contains GPU tensors, those tensors will be loaded to GPU by default. You can call torch.load(.., map_location='cpu') and then load_state_dict() to avoid GPU RAM surge when loading a model checkpoint.
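
To make the quoted behaviour concrete, a quick check (the checkpoint path and the 'model' key are my assumptions about the file layout, not verified here):

import torch

ckpt_default = torch.load('pre_trained_checkpoint.pt')                  # tensors reload onto the GPU they were saved from
ckpt_cpu = torch.load('pre_trained_checkpoint.pt', map_location='cpu')  # tensors stay in CPU RAM
print(next(ckpt_default['model'].parameters()).device)  # e.g. cuda:0
print(next(ckpt_cpu['model'].parameters()).device)      # cpu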

@bilzard
Contributor Author

bilzard commented Feb 4, 2022

I confirmed that option 3 works on my server (GPU memory usage didn't increase).

$ watch nvidia-smi # map_location = cpu, --weights='yolov5l'
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01    Driver Version: 510.39.01    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   61C    P0    41W /  70W |  14603MiB / 15360MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      4509      C   python                          14599MiB |
+-----------------------------------------------------------------------------+

$ watch nvidia-smi # map_location = cpu, --weights=path_to_pretrained.pt
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.39.01    Driver Version: 510.39.01    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0    71W /  70W |  14541MiB / 15360MiB |     91%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2669      C   python                          14537MiB |
+-----------------------------------------------------------------------------+

@bilzard
Contributor Author

bilzard commented Feb 4, 2022

@glenn-jocher

I think we should also fix the code for loading models from Torch Hub, but I don't know how to test it.
What should I do for that?

https://github.com/ultralytics/yolov5/blob/master/hubconf.py#L52

@bilzard
Contributor Author

bilzard commented Feb 4, 2022

FYI: I fixed the code using option 3.

@glenn-jocher
Member

@glenn-jocher

I think we should also fix the code for loading models from Torch Hub, but I don't know how to test it. What should I do for that?

https://github.com/ultralytics/yolov5/blob/master/hubconf.py#L52

Good question. For PyTorch Hub we may want to leave it as-is for startup speed. Since PyTorch Hub models may be used in APIs like https://ultralytics.com/yolov5 that are only called once, response time may be more important than slightly reducing CUDA memory usage.

Another point is that simple inference uses much less CUDA memory than training, maybe only about 1/3 to 1/2 of training memory. But I'm not sure; it would need some study.

@bilzard
Contributor Author

bilzard commented Feb 5, 2022

OK. We'd need to study the response time impact of changing the loading method, so I'll leave it as-is for now.
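
If it helps, a hedged sketch for measuring the startup and time-to-first-result impact (the hub call and image URL mirror the usual YOLOv5 README usage; treat them as illustrative):

import time
import torch

t0 = time.perf_counter()
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')  # download/load the model
print(f'hub load: {time.perf_counter() - t0:.2f} s')

t0 = time.perf_counter()
results = model('https://ultralytics.com/images/zidane.jpg')  # first inference
print(f'first inference: {time.perf_counter() - t0:.2f} s')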

@glenn-jocher
Member

@bilzard yes, that's correct. For training, an extra second during initialization won't matter, nor will it in val.py, but for detect.py and PyTorch Hub we probably want to prioritize the fastest time to first results.
