Error when training: RuntimeError: Caught RuntimeError in pin memory thread for device 0 #533

farbgeist · 2024-07-06T17:23:42Z

I get the following error after using the recommended Docker image and following the recommended installation steps from the README:

python train.py --batch 16 --epochs 25 --img 640 --device 0 --min-items 0 --close-mosaic 15 --data ../generated_training_images_root_yoloV9/data.yaml --weights /workspace/weights/gelan-c.pt --cfg models/detect/gelan-c.yaml --hyp hyp.scratch-high.yaml
train: weights=/workspace/weights/gelan-c.pt, cfg=models/detect/gelan-c.yaml, data=../generated_training_images_root_yoloV9/data.yaml, hyp=hyp.scratch-high.yaml, epochs=25, batch_size=16, imgsz=640, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, noplots=False, evolve=None, bucket=, cache=None, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs/train, name=exp, exist_ok=False, quad=False, cos_lr=False, flat_cos_lr=False, fixed_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, seed=0, local_rank=-1, min_items=0, close_mosaic=15, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
YOLOv5 🚀 1e33dbb Python-3.8.12 torch-1.11.0a0+b6df043 CUDA:0 (NVIDIA GeForce RTX 3090, 24575MiB)

hyperparameters: lr0=0.01, lrf=0.01, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=7.5, cls=0.5, cls_pw=1.0, dfl=1.5, obj_pw=1.0, iou_t=0.2, anchor_t=5.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=0.0, translate=0.1, scale=0.9, shear=0.0, perspective=0.0, flipud=0.0, fliplr=0.5, mosaic=1.0, mixup=0.15, copy_paste=0.3
ClearML: run 'pip install clearml' to automatically track, visualize and remotely train YOLO 🚀 in ClearML
Comet: run 'pip install comet_ml' to automatically track and visualize YOLO 🚀 runs in Comet
TensorBoard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
Overriding model.yaml nc=80 with nc=5

                 from  n    params  module                                  arguments
  0                -1  1      1856  models.common.Conv                      [3, 64, 3, 2]
  1                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  2                -1  1    212864  models.common.RepNCSPELAN4              [128, 256, 128, 64, 1]
  3                -1  1    164352  models.common.ADown                     [256, 256]
  4                -1  1    847616  models.common.RepNCSPELAN4              [256, 512, 256, 128, 1]
  5                -1  1    656384  models.common.ADown                     [512, 512]
  6                -1  1   2857472  models.common.RepNCSPELAN4              [512, 512, 512, 256, 1]
  7                -1  1    656384  models.common.ADown                     [512, 512]
  8                -1  1   2857472  models.common.RepNCSPELAN4              [512, 512, 512, 256, 1]
  9                -1  1    656896  models.common.SPPELAN                   [512, 512, 256]
 10                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 11           [-1, 6]  1         0  models.common.Concat                    [1]
 12                -1  1   3119616  models.common.RepNCSPELAN4              [1024, 512, 512, 256, 1]
 13                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 14           [-1, 4]  1         0  models.common.Concat                    [1]
 15                -1  1    912640  models.common.RepNCSPELAN4              [1024, 256, 256, 128, 1]
 16                -1  1    164352  models.common.ADown                     [256, 256]
 17          [-1, 12]  1         0  models.common.Concat                    [1]
 18                -1  1   2988544  models.common.RepNCSPELAN4              [768, 512, 512, 256, 1]
 19                -1  1    656384  models.common.ADown                     [512, 512]
 20           [-1, 9]  1         0  models.common.Concat                    [1]
 21                -1  1   3119616  models.common.RepNCSPELAN4              [1024, 512, 512, 256, 1]
 22      [15, 18, 21]  1   5494495  models.yolo.DDetect                     [5, [256, 512, 512]]
gelan-c summary: 621 layers, 25440927 parameters, 25440911 gradients, 103.2 GFLOPs

Transferred 931/937 items from /workspace/weights/gelan-c.pt
AMP: checks passed ✅
optimizer: SGD(lr=0.01) with parameter groups 154 weight(decay=0.0), 161 weight(decay=0.0005), 160 bias
train: Scanning /workspace/generated_training_images_root_yoloV9/train/labels.cache... 221 images, 0 backgrounds, 0 corrupt: 100%|██████████| 221/221 00:00
val: Scanning /workspace/generated_training_images_root_yoloV9/valid/labels.cache... 221 images, 0 backgrounds, 0 corrupt: 100%|██████████| 221/221 00:00
Plotting labels to runs/train/exp17/labels.jpg...
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp17
Starting training for 25 epochs...

  Epoch    GPU_mem   box_loss   cls_loss   dfl_loss  Instances       Size

0%| | 0/14 00:00
Traceback (most recent call last):
  File "train.py", line 634, in <module>
    main(opt)
  File "train.py", line 528, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 277, in train
    for i, (imgs, targets, paths, _) in pbar: # batch -------------------------------------------------------------
  File "/opt/conda/lib/python3.8/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/opt/conda/lib/python3.8/site-packages/torch/_utils.py", line 438, in reraise
    raise exception
RuntimeError: Caught RuntimeError in pin memory thread for device 0.
Original Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 34, in _pin_memory_loop
    data = pin_memory(data)
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in pin_memory
    return [pin_memory(sample) for sample in data]
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 58, in <listcomp>
    return [pin_memory(sample) for sample in data]
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/pin_memory.py", line 50, in pin_memory
    return data.pin_memory()
RuntimeError: CUDA error: out of memory

Reducing the batch size does NOT help; the same error occurs.
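For scale, the batch that the pin-memory thread was handling is small compared with the 24 GiB card. Below is a rough sketch of the arithmetic, assuming the dataloader hands over images as uint8 tensors of shape (batch, 3, 640, 640); that is an assumption about the collate output, not something the log shows. It also includes `try_pin`, a hypothetical diagnostic helper (not part of the repo) that attempts a pinned host allocation of the same size outside the training loop:

```python
def pinned_batch_bytes(batch, channels=3, height=640, width=640, bytes_per_elem=1):
    """Bytes needed to pin one image batch (uint8 assumed: 1 byte per element)."""
    return batch * channels * height * width * bytes_per_elem


def try_pin(nbytes):
    """Attempt a pinned host allocation of nbytes and return a short status string.

    Hypothetical diagnostic helper, not part of the YOLOv9 repo.
    """
    try:
        import torch  # imported lazily so the arithmetic above works without it
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "CUDA not available"
    try:
        # Same call that fails at the bottom of the traceback: Tensor.pin_memory()
        torch.empty(nbytes, dtype=torch.uint8).pin_memory()
        return "ok"
    except RuntimeError as exc:
        return f"failed: {exc}"


if __name__ == "__main__":
    for batch in (16, 8, 1):
        nbytes = pinned_batch_bytes(batch)
        print(f"batch={batch}: {nbytes / 2**20:.2f} MiB -> {try_pin(nbytes)}")
```

Under that assumption a batch of 16 comes to 19,660,800 bytes (18.75 MiB). If `try_pin` fails even for a batch of 1, that would point at pinned host memory in this environment rather than at the batch size, which would fit the error persisting with smaller batches.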


farbgeist · 2024-07-06T17:25:31Z

The command I used to start the Docker container:

docker run --gpus=all --name yolov9 -it -v ./data/training/generated/generated_training_images_root_yoloV9/:/workspace/generated_training_images_root_yoloV9/ -v ./data/jupyter/yoloV9/:/workspace/ --shm-size=64g nvcr.io/nvidia/pytorch:21.11-py3

After that, I ran the recommended steps:

apt update
apt install -y zip htop screen libgl1-mesa-glx
pip install seaborn thop
cd /yolov9

So for me, the standard installation is broken. I am using an RTX 3090 on Ubuntu 22.04 inside WSL 2 on Windows 11 with the latest NVIDIA drivers.
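Because this setup stacks a Docker container on top of WSL 2, it may help anyone trying to reproduce the report to confirm from inside the container which kernel is in use. Here is a minimal sketch, assuming the usual convention that WSL kernels carry "microsoft" in their release string; `is_wsl_release` is a hypothetical helper name:

```python
import platform


def is_wsl_release(release):
    """Return True if a kernel release string looks like a WSL kernel."""
    return "microsoft" in release.lower()


if __name__ == "__main__":
    release = platform.uname().release
    print(f"kernel release: {release}")
    print(f"running under WSL: {is_wsl_release(release)}")
```

Containers share the host kernel, so the release string seen inside the container is that of the WSL 2 kernel, not of the image's userland.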

farbgeist · 2024-07-09T14:52:26Z

No one else with that error?
