
Error when training: RuntimeError: Caught RuntimeError in pin memory thread for device 0 #533

farbgeist opened this issue Jul 6, 2024 · 2 comments

@farbgeist

I get the following error when training with the recommended Docker image and the recommended installation steps from the README:

python train.py --batch 16 --epochs 25 --img 640 --device 0 --min-items 0 --close-mosaic 15 --data ../generated_training_images_root_yoloV9/data.yaml --weights /workspace/weights/gelan-c.pt --cfg models/detect/gelan-c.yaml --hyp hyp.scratch-high.yaml
train: weights=/workspace/weights/gelan-c.pt, cfg=models/detect/gelan-c.yaml, data=../generated_training_images_root_yoloV9/data.yaml, hyp=hyp.scratch-high.yaml, epochs=25, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, flat_cos_lr=False, fixed_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, min_items=0, close_mosaic=15, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
YOLOv5 🚀 1e33dbb Python-3.8.12 torch-1.11.0a0+b6df043 CUDA:0 (NVIDIA GeForce RTX 3090, 24575MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, cls_pw=1.0, dfl=1.5, obj_pw=1.0, iou_t=0.2, anchor_t=5.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.9, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.15, copy_paste=0.3
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLO 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLO 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=5

                 from  n    params  module                                  arguments
  0                -1  1      1856  models.common.Conv                      [3, 64, 3, 2]
  1                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  2                -1  1    212864  models.common.RepNCSPELAN4              [128, 256, 128, 64, 1]
  3                -1  1    164352  models.common.ADown                     [256, 256]
  4                -1  1    847616  models.common.RepNCSPELAN4              [256, 512, 256, 128, 1]
  5                -1  1    656384  models.common.ADown                     [512, 512]
  6                -1  1   2857472  models.common.RepNCSPELAN4              [512, 512, 512, 256, 1]
  7                -1  1    656384  models.common.ADown                     [512, 512]
  8                -1  1   2857472  models.common.RepNCSPELAN4              [512, 512, 512, 256, 1]
  9                -1  1    656896  models.common.SPPELAN                   [512, 512, 256]
 10                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 11           [-1, 6]  1         0  models.common.Concat                    [1]
 12                -1  1   3119616  models.common.RepNCSPELAN4              [1024, 512, 512, 256, 1]
 13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 14           [-1, 4]  1         0  models.common.Concat                    [1]
 15                -1  1    912640  models.common.RepNCSPELAN4              [1024, 256, 256, 128, 1]
 16                -1  1    164352  models.common.ADown                     [256, 256]
 17          [-1, 12]  1         0  models.common.Concat                    [1]
 18                -1  1   2988544  models.common.RepNCSPELAN4              [768, 512, 512, 256, 1]
 19                -1  1    656384  models.common.ADown                     [512, 512]
 20           [-1, 9]  1         0  models.common.Concat                    [1]
 21                -1  1   3119616  models.common.RepNCSPELAN4              [1024, 512, 512, 256, 1]
 22      [15, 18, 21]  1   5494495  models.yolo.DDetect                     [5, [256, 512, 512]]
gelan-c summary: 621 layers, 25440927 parameters, 25440911 gradients, 103.2 GFLOPs

Transferred 931/937 items from /workspace/weights/gelan-c.pt
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 154 weight(decay=0.0), 161 weight(decay=0.0005), 160 bias
train: Scanning /workspace/generated_training_images_root_yoloV9/train/labels.cache... 221 images, 0 backgrounds, 0 corrupt: 100%|██████████| 221/221 00:00
val: Scanning /workspace/generated_training_images_root_yoloV9/valid/labels.cache... 221 images, 0 backgrounds, 0 corrupt: 100%|██████████| 221/221 00:00
Plotting labels to runs/train/exp17/labels.jpg...
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp17
Starting training for 25 epochs...

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

0%| | 0/14 00:00
Traceback (most recent call last):
  File "train.py", line 634, in <module>
    main(opt)
  File "train.py", line 528, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 277, in train
    for i, (imgs, targets, paths, _) in pbar:  # batch -------------------------------------------------------------
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 438, in reraise
    raise exception
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 34, in _pin_memory_loop
    data = pin_memory(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in pin_memory
    return [pin_memory(sample) for sample in data]
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in <listcomp>
    return [pin_memory(sample) for sample in data]
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 50, in pin_memory
    return data.pin_memory()
RuntimeError: CUDA error: out of memory

It is NOT working with a smaller batch size either; reducing the batch does not change the error.
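The traceback fails inside PyTorch's pin-memory thread, i.e. while allocating page-locked host memory, before anything is copied to the GPU. A minimal sketch (my own script, not from the repo) to check whether pinned-memory allocation itself fails in this container/WSL 2 setup:

# check_pin_memory.py -- try to pin roughly one batch worth of host memory
import torch

print(torch.__version__, "CUDA available:", torch.cuda.is_available())

# about 75 MB, roughly one batch of 16 float32 images at 3x640x640
t = torch.empty(16, 3, 640, 640)

# pin_memory() allocates page-locked host memory; this is the call that
# raises "CUDA error: out of memory" in the traceback above
p = t.pin_memory()
print("pinned:", p.is_pinned())

If even this small allocation fails, the problem is pinned host memory in the WSL 2/Docker environment rather than GPU memory or batch size.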

farbgeist (Author) commented Jul 6, 2024

The command to start the Docker container:

docker run --gpus=all --name yolov9 -it -v ./data/training/generated/generated_training_images_root_yoloV9/:/workspace/generated_training_images_root_yoloV9/ -v ./data/jupyter/yoloV9/:/workspace/ --shm-size=64g nvcr.io/nvidia/pytorch:21.11-py3

After that I ran the recommended steps:

apt update
apt install -y zip htop screen libgl1-mesa-glx
pip install seaborn thop
cd /yolov9

So for me the standard installation is broken. I am using an RTX 3090 on Ubuntu 22.04 inside WSL 2 on Windows 11 with the newest NVIDIA drivers.
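A generic workaround sketch (plain PyTorch, not the repo's dataloader code): building the DataLoader with pin_memory=False skips the page-locked host allocation that fails here, at the cost of slightly slower host-to-device copies. Newer YOLOv5 dataloaders also read a PIN_MEMORY environment variable; whether this fork does the same would need checking in utils/dataloaders.py.

# generic PyTorch sketch (not yolov9 code): pin_memory=False avoids the
# page-locked host allocation performed by the pin-memory thread
import torch
from torch.utils.data import DataLoader, TensorDataset

ds = TensorDataset(torch.randn(32, 3, 640, 640), torch.zeros(32, dtype=torch.long))
dl = DataLoader(ds, batch_size=16, num_workers=2, pin_memory=False)

device = torch.device("cuda:0")
for imgs, targets in dl:
    imgs = imgs.to(device)  # non_blocking copies only help when memory is pinned
    break
print("loaded one batch without pinned memory")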

farbgeist (Author) commented

No one else with this error?
