polygon_train.py - Expected all tensors to be on the same device #31

Open · sac3tf opened this issue Aug 8, 2022 · 10 comments

sac3tf commented Aug 8, 2022

When following the tutorial, I ran the following line of code:

```
!python polygon_train.py --weights polygon-yolov5s-ucas.pt --cfg polygon_yolov5s_ucas.yaml \
    --data polygon_ucas.yaml --hyp hyp.ucas.yaml --img-size 1024 \
    --epochs 3 --batch-size 12 --noautoanchor --polygon --cache
```

I then received the below error while training was starting:

```
Warning: "polygon_inter_union_cuda" and "polygon_b_inter_union_cuda" are not installed.
The Exception is: /usr/local/lib/python3.7/dist-packages/polygon_inter_union_cuda-0.0.0-py3.7-linux-x86_64.egg/polygon_inter_union_cuda.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEE.
YOLOv5 🚀 v1.0-27-g42d6884 torch 1.9.0+cu111 CUDA:0 (Tesla P100-PCIE-16GB, 16280.875MB)

Namespace(adam=False, artifact_alias='latest', batch_size=12, bbox_interval=-1, bucket='', cache_images=True, cfg='./models/polygon_yolov5s_ucas.yaml', data='./data/polygon_ucas.yaml', device='', entity=None, epochs=3, evolve=False, exist_ok=False, global_rank=-1, hyp='./data/hyp.ucas.yaml', image_weights=False, img_size=[1024, 1024], label_smoothing=0.0, linear_lr=False, local_rank=-1, multi_scale=False, name='exp', noautoanchor=True, nosave=False, notest=False, polygon=True, project='runs/train', quad=False, rect=False, resume=False, save_dir='runs/train/exp', save_period=-1, single_cls=False, sync_bn=False, total_batch_size=12, upload_dataset=False, weights='polygon-yolov5s-ucas.pt', workers=8, world_size=1)
tensorboard: Start with 'tensorboard --logdir runs/train', view at http://localhost:6006/
hyperparameters: lr0=0.01, lrf=0.2, momentum=0.937, weight_decay=0.0005, warmup_epochs=3.0, warmup_momentum=0.8, warmup_bias_lr=0.1, box=0.05, cls=0.5, cls_pw=1.0, obj=1.0, obj_pw=1.0, iou_t=0.1, anchor_t=4.0, fl_gamma=0.0, hsv_h=0.015, hsv_s=0.7, hsv_v=0.4, degrees=30.0, translate=0.1, scale=0.0, shear=5.0, perspective=0.0005, flipud=0.5, fliplr=0.5, mosaic=0.0, mixup=0.0
wandb: Install Weights & Biases for YOLOv5 logging with 'pip install wandb' (recommended)

             from  n    params  module                                  arguments                     

0 -1 1 3520 models.common.Focus [3, 32, 3]
1 -1 1 18560 models.common.Conv [32, 64, 3, 2]
2 -1 1 18816 models.common.C3 [64, 64, 1]
3 -1 1 73984 models.common.Conv [64, 128, 3, 2]
4 -1 1 156928 models.common.C3 [128, 128, 3]
5 -1 1 295424 models.common.Conv [128, 256, 3, 2]
6 -1 1 625152 models.common.C3 [256, 256, 3]
7 -1 1 1180672 models.common.Conv [256, 512, 3, 2]
8 -1 1 656896 models.common.SPP [512, 512, [5, 9, 13]]
9 -1 1 1182720 models.common.C3 [512, 512, 1, False]
10 -1 1 131584 models.common.Conv [512, 256, 1, 1]
11 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
12 [-1, 6] 1 0 models.common.Concat [1]
13 -1 1 361984 models.common.C3 [512, 256, 1, False]
14 -1 1 33024 models.common.Conv [256, 128, 1, 1]
15 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
16 [-1, 4] 1 0 models.common.Concat [1]
17 -1 1 90880 models.common.C3 [256, 128, 1, False]
18 -1 1 147712 models.common.Conv [128, 128, 3, 2]
19 [-1, 14] 1 0 models.common.Concat [1]
20 -1 1 296448 models.common.C3 [256, 256, 1, False]
21 -1 1 590336 models.common.Conv [256, 256, 3, 2]
22 [-1, 10] 1 0 models.common.Concat [1]
23 -1 1 1182720 models.common.C3 [512, 512, 1, False]
24 [17, 20, 23] 1 29667 models.yolo.Polygon_Detect [2, [[31, 30, 28, 49, 50, 31], [46, 45, 58, 58, 74, 74], [94, 94, 115, 115, 151, 151]], [128, 256, 512]]
/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
Model Summary: 283 layers, 7077027 parameters, 7077027 gradients, 16.5 GFLOPs

Transferred 360/362 items from polygon-yolov5s-ucas.pt
Scaled weight_decay = 0.00046875
Optimizer groups: 62 .bias, 62 conv.weight, 59 other
albumentations: MedianBlur(always_apply=False, p=0.05, blur_limit=(3, 7)), ToGray(always_apply=False, p=0.1), RandomBrightnessContrast(always_apply=False, p=0.35, brightness_limit=(-0.2, 0.2), contrast_limit=(-0.2, 0.2), brightness_by_max=True), CLAHE(always_apply=False, p=0.2, clip_limit=(1, 4.0), tile_grid_size=(8, 8)), InvertImg(always_apply=False, p=0.3)
train: Scanning '../UCAS50/train' images and labels...40 found, 0 missing, 0 empty, 0 corrupted: 100% 40/40 [00:00<00:00, 158.38it/s]
train: New cache created: ../UCAS50/train.cache
train: Caching images (0.1GB): 100% 40/40 [00:00<00:00, 78.13it/s]
val: Scanning '../UCAS50/val.cache' images and labels... 9 found, 0 missing, 0 empty, 0 corrupted: 100% 9/9 [00:00<?, ?it/s]
val: Caching images (0.0GB): 100% 9/9 [00:00<00:00, 20.73it/s]
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
Plotting labels...
Image sizes 1024 train, 1024 test
Using 4 dataloader workers
Logging results to runs/train/exp
Starting training for 3 epochs...

 Epoch   gpu_mem       box       obj       cls     total    labels  img_size

0% 0/4 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "polygon_train.py", line 551, in <module>
    train(hyp, opt, device, tb_writer, polygon=opt.polygon)
  File "polygon_train.py", line 312, in train
    loss, loss_items = compute_loss(pred, targets.to(device))  # loss scaled by batch_size
  File "/content/PolygonObjectDetection/polygon-yolov5/utils/loss.py", line 274, in __call__
    iou = polygon_bbox_iou(pbox, tbox[i], CIoU=True, device=device, ordered=True)  # iou(prediction, target)
  File "/content/PolygonObjectDetection/polygon-yolov5/utils/general.py", line 961, in polygon_bbox_iou
    alpha = v / (v - iou + (1 + eps))
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
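
For reference, the failing expression in the traceback can be reproduced in isolation by mixing a CUDA tensor with a CPU tensor. This is only a minimal sketch of the symptom, not the repository's code; `v`, `iou`, and `eps` are stand-ins for the variables named in utils/general.py:

```
import torch

# Minimal reproduction of the device mismatch (not the repo's code):
# `v` lives on the GPU while `iou` is a CPU tensor, mirroring the traceback above.
assert torch.cuda.is_available()            # needs a visible CUDA device
v = torch.rand(8, device="cuda:0")
iou = torch.rand(8)                          # stays on the CPU, e.g. from a CPU fallback path
eps = 1e-9
alpha = v / (v - iou + (1 + eps))            # RuntimeError: Expected all tensors to be on the same device
```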


sac3tf commented Aug 8, 2022

I also wanted to clarify that I ran the setup.py script, which threw no errors, yet the error output above still reports two missing functions:

"polygon_inter_union_cuda" and "polygon_b_inter_union_cuda" are not installed. The Exception is: /usr/local/lib/python3.7/dist-packages/polygon_inter_union_cuda-0.0.0-py3.7-linux-x86_64.egg/polygon_inter_union_cuda.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefIlEENS2_8optionalINS2_10ScalarTypeEEENS5_INS2_6LayoutEEENS5_INS2_6DeviceEEENS5_IbEE.

I am running this on Google Colab.
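
A quick way to check whether the compiled extension actually imports in the running interpreter (a minimal check; the module name is taken from the warning above):

```
import importlib
import sys
import torch

# Try to import the compiled extension that the warning refers to.
try:
    ext = importlib.import_module("polygon_inter_union_cuda")
    print("imported from:", ext.__file__)
except Exception as exc:   # e.g. ImportError: ... undefined symbol ...
    print("import failed:", exc)

print("python:", sys.version)
print("torch:", torch.__version__)
```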

XinzeLee (Owner) commented Aug 9, 2022

Please read the README.md carefully; it guides you through installing the functions polygon_inter_union_cuda and polygon_b_inter_union_cuda (which are CUDA extensions). You are getting these errors because you haven't installed them.


sac3tf commented Aug 9, 2022

I did follow the README.md. I completed the setup.py step, followed everything step by step, and still received the error above, as noted in my comments.


scocke commented Aug 9, 2022

On the example Colab that is provided, these are the instructions at the top:

```
### For colab, run the following codes
from google.colab import drive
drive.mount('/content/gdrive')

# cd to your directory
%cd /content/gdrive/MyDrive/Your_Dir

# cd to polygon-yolov5
%cd polygon-yolov5

# install requirements
!pip install -r requirements.txt

# install cuda extensions for polygon box iou computation
%cd utils/iou_cuda
!python setup.py install

# cd back
%cd ..
%cd ..
```

Following those steps and then continuing on to run the subsequent code blocks ends in the error that was originally posted. It seems setup.py is not installing the required packages.

XinzeLee (Owner) commented

I have used the Colab to run the code, and there is no error related to the installation of the CUDA extension functions.

Anyway, all the problems mentioned are caused by the fact that 'polygon_inter_union_cuda' and 'polygon_b_inter_union_cuda' are not installed. Ensure that your setup.py installation raises no error. If errors are reported during the extension installation, check the compatibility of your environment and resolve them. I don't think the errors mentioned persist if the installation is successful.

pocca2048 commented

I solved this issue. In my case, it happened because the module was being installed into a different path.
My solution was to install with pip instead of setup.py:

```
python setup.py sdist
pip install .
```


scocke commented Aug 10, 2022

@pocca2048 I tried that solution and unfortunately I am still getting the same error, even though I receive a message saying it was successfully installed. This is the output after installation (I don't see any issues):

```
running sdist
running egg_info
creating polygon_inter_union_cuda.egg-info
writing polygon_inter_union_cuda.egg-info/PKG-INFO
writing dependency_links to polygon_inter_union_cuda.egg-info/dependency_links.txt
writing requirements to polygon_inter_union_cuda.egg-info/requires.txt
writing top-level names to polygon_inter_union_cuda.egg-info/top_level.txt
writing manifest file 'polygon_inter_union_cuda.egg-info/SOURCES.txt'
/usr/local/lib/python3.7/dist-packages/torch/utils/cpp_extension.py:411: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
warnings.warn(msg.format('we could not find ninja.'))
writing manifest file 'polygon_inter_union_cuda.egg-info/SOURCES.txt'
running check
warning: check: missing required meta-data: url

warning: check: missing meta-data: either (author and author_email) or (maintainer and maintainer_email) must be supplied

creating polygon_inter_union_cuda-0.0.0
creating polygon_inter_union_cuda-0.0.0/polygon_inter_union_cuda.egg-info
copying files to polygon_inter_union_cuda-0.0.0...
copying README.txt -> polygon_inter_union_cuda-0.0.0
copying extensions.cpp -> polygon_inter_union_cuda-0.0.0
copying inter_union_cuda.cu -> polygon_inter_union_cuda-0.0.0
copying inter_union_cuda.h -> polygon_inter_union_cuda-0.0.0
copying setup.py -> polygon_inter_union_cuda-0.0.0
copying utils.h -> polygon_inter_union_cuda-0.0.0
copying polygon_inter_union_cuda.egg-info/PKG-INFO -> polygon_inter_union_cuda-0.0.0/polygon_inter_union_cuda.egg-info
copying polygon_inter_union_cuda.egg-info/SOURCES.txt -> polygon_inter_union_cuda-0.0.0/polygon_inter_union_cuda.egg-info
copying polygon_inter_union_cuda.egg-info/dependency_links.txt -> polygon_inter_union_cuda-0.0.0/polygon_inter_union_cuda.egg-info
copying polygon_inter_union_cuda.egg-info/requires.txt -> polygon_inter_union_cuda-0.0.0/polygon_inter_union_cuda.egg-info
copying polygon_inter_union_cuda.egg-info/top_level.txt -> polygon_inter_union_cuda-0.0.0/polygon_inter_union_cuda.egg-info
Writing polygon_inter_union_cuda-0.0.0/setup.cfg
creating dist
Creating tar archive
removing 'polygon_inter_union_cuda-0.0.0' (and everything under it)
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Processing /content/PolygonObjectDetection/polygon-yolov5/utils/iou_cuda
DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
pip 21.3 will remove support for this functionality. You can find discussion regarding this at pypa/pip#7555.
Requirement already satisfied: torch>=1.0.0a0 in /usr/local/lib/python3.7/dist-packages (from polygon-inter-union-cuda==0.0.0) (1.12.0+cu113)
Requirement already satisfied: torchvision in /usr/local/lib/python3.7/dist-packages (from polygon-inter-union-cuda==0.0.0) (0.13.0+cu113)
Requirement already satisfied: pillow in /usr/local/lib/python3.7/dist-packages (from polygon-inter-union-cuda==0.0.0) (7.1.2)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from polygon-inter-union-cuda==0.0.0) (2.23.0)
Requirement already satisfied: typing-extensions in /usr/local/lib/python3.7/dist-packages (from torch>=1.0.0a0->polygon-inter-union-cuda==0.0.0) (4.1.1)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->polygon-inter-union-cuda==0.0.0) (2022.6.15)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->polygon-inter-union-cuda==0.0.0) (3.0.4)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->polygon-inter-union-cuda==0.0.0) (2.10)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->polygon-inter-union-cuda==0.0.0) (1.24.3)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/dist-packages (from torchvision->polygon-inter-union-cuda==0.0.0) (1.21.6)
Building wheels for collected packages: polygon-inter-union-cuda
Building wheel for polygon-inter-union-cuda (setup.py) ... done
Created wheel for polygon-inter-union-cuda: filename=polygon_inter_union_cuda-0.0.0-cp37-cp37m-linux_x86_64.whl size=1168138 sha256=1e0e17a9cac92df5ad5cb042e184f4a09e3d1bc51dd0ea8e496fd5b4c5510bf6
Stored in directory: /tmp/pip-ephem-wheel-cache-1wyp8mxw/wheels/35/a7/05/b7c5647249146303debde2a1fc09d250a2c7afeac6f716c060
Successfully built polygon-inter-union-cuda
Installing collected packages: polygon-inter-union-cuda
Successfully installed polygon-inter-union-cuda-0.0.0
```

pocca2048 commented

@scocke How about checking where it is installed and whether that path is in sys.path?
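
A small check along those lines (a sketch; the module name is taken from the egg/wheel shown above):

```
import sys
import importlib.util

# Where would Python load the extension from, if anywhere?
spec = importlib.util.find_spec("polygon_inter_union_cuda")
print("resolved to:", spec.origin if spec else "not found")

# Directories Python actually searches; the install location must be one of these.
for path in sys.path:
    print(path)
```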


nsabir2011 commented Aug 28, 2022

@XinzeLee
I have also faced issues relating to the installation of polygon_inter_union_cuda.
On my laptop with a 1660 Ti, I got an error when installing the package: the compiler does not support sm_80. I then checked and found that the installed CUDA 10.2 only supports up to sm_75, so I removed the following nvcc args from setup.py:

```
'-gencode=arch=compute_80,code=sm_80',
'-gencode=arch=compute_86,code=sm_86',
'-gencode=arch=compute_86,code=compute_86'
```

Then I proceeded to train a small model and everything worked fine.

However, I also tried to run it on an RTX 3060 machine and faced another problem with this installation.
Running python setup.py install worked fine without any error, but running polygon_train.py showed the following error:

```
Warning: "polygon_inter_union_cuda" and "polygon_b_inter_union_cuda" are not installed.
The Exception is: libcudart.so.10.2: cannot open shared object file: No such file or directory.
```

It was looking for CUDA 10.2 files, so I uninstalled polygon_inter_union_cuda and installed it again after removing the following nvcc args from setup.py:

```
'-gencode=arch=compute_37,code=sm_37',
'-gencode=arch=compute_60,code=sm_60', '-gencode=arch=compute_61,code=sm_61',
'-gencode=arch=compute_70,code=sm_70', '-gencode=arch=compute_72,code=sm_72',
'-gencode=arch=compute_75,code=sm_75', '-gencode=arch=compute_80,code=sm_80',
'-gencode=arch=compute_86,code=compute_86'
```

This time I kept only '-gencode=arch=compute_86,code=sm_86'. I then trained a small model and everything worked just fine.
I think setup.py should be modified to build only for the architecture supported by the GPU.
You can check that using PyTorch:

```
>>> import torch
>>> torch.cuda.get_device_capability()
(8, 6)
>>> sm = torch.cuda.get_device_capability()
>>> arg = '-gencode=arch=compute_{0}{1},code=sm_{0}{1}'
>>> arg.format(*sm)
'-gencode=arch=compute_86,code=sm_86'
```
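
Building on that, one possible way to generate the flag inside setup.py at install time (only a sketch, assuming the extension is built with torch.utils.cpp_extension.CUDAExtension and that a GPU is visible while installing; the source file names are taken from the sdist log above, everything else is illustrative):

```
# setup.py sketch: build only for the compute capability of the GPU present at install time
from setuptools import setup
import torch
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

major, minor = torch.cuda.get_device_capability()   # e.g. (8, 6) on an RTX 3060
gencode = f"-gencode=arch=compute_{major}{minor},code=sm_{major}{minor}"

setup(
    name="polygon_inter_union_cuda",
    ext_modules=[
        CUDAExtension(
            name="polygon_inter_union_cuda",
            sources=["extensions.cpp", "inter_union_cuda.cu"],
            extra_compile_args={"cxx": [], "nvcc": [gencode]},
        )
    ],
    cmdclass={"build_ext": BuildExtension},
)
```

Another way to restrict the target architectures is to set the TORCH_CUDA_ARCH_LIST environment variable (e.g. TORCH_CUDA_ARCH_LIST="8.6") before building, which PyTorch's extension builder respects.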

XinzeLee (Owner) commented

@nsabir2011
Thank you so much for your comments! Yes, we have to select the appropriate architecture for our GPU model; otherwise there will be errors.
