
Error just after training is completed #136

Open
neel04 opened this issue Nov 13, 2021 · 4 comments

@neel04

neel04 commented Nov 13, 2021

Hi, thanks for such a great repo! 🤗
I wanted to train YOLOR on my own custom data. This is the command I am using in Colab:

#Set Epochs
EPOCHS = 30
BATCH_SIZE = 32
!python /content/yolor/train.py --batch-size $BATCH_SIZE --img 512 512 --data /content/yolor/data/coco.yaml --cfg /content/yolor/cfg/yolor_p6.cfg --weights '' --device 0 --name yolor_p6 --epochs $EPOCHS --adam --cache-images

However, just after training finishes I get this error:

Using torch 1.7.0 CUDA:0 (Tesla P100-PCIE-16GB, 16280MB)

Namespace(adam=True, batch_size=16, bucket='', cache_images=False, cfg='/content/yolor/cfg/yolor_p6.cfg', data='/content/yolor/data/coco.yaml', device='0', epochs=50, evolve=False, exist_ok=False, global_rank=-1, hyp='./yolor/data/hyp.scratch.1280.yaml', image_weights=False, img_size=[512, 512], local_rank=-1, log_imgs=16, multi_scale=False, name='yolor_p6', noautoanchor=False, nosave=False, notest=False, project='runs/train', rect=False, resume=False, save_dir='runs/train/yolor_p6', single_cls=False, sync_bn=False, total_batch_size=16, weights='', workers=8, world_size=1)
Start Tensorboard with "tensorboard --logdir runs/train", view at http://localhost:6006/
Hyperparameters {'lr0': 0.01, 'lrf': 0.2, 'momentum': 0.937, 'weight_decay': 0.0005, 'warmup_epochs': 3.0, 'warmup_momentum': 0.8, 'warmup_bias_lr': 0.1, 'box': 0.05, 'cls': 0.5, 'cls_pw': 1.0, 'obj': 1.0, 'obj_pw': 1.0, 'iou_t': 0.2, 'anchor_t': 4.0, 'fl_gamma': 0.0, 'hsv_h': 0.015, 'hsv_s': 0.7, 'hsv_v': 0.4, 'degrees': 0.0, 'translate': 0.5, 'scale': 0.5, 'shear': 0.0, 'perspective': 0.0, 'flipud': 0.0, 'fliplr': 0.5, 'mosaic': 1.0, 'mixup': 0.0}
Model Summary: 665 layers, 37265016 parameters, 37265016 gradients, 81.564040600 GFLOPS
Optimizer groups: 145 .bias, 145 conv.weight, 149 other


Scanning images: 100% 5400/5400 [00:00<00:00, 5509.28it/s]
Scanning labels /content/train_yolo/labels/val.cache3 (5400 found, 0 missing, 0 empty, 2 duplicate, for 5400 images): 5400it [00:00, 7996.85it/s]
Scanning images: 100% 301/301 [00:00<00:00, 3736.10it/s]
Scanning labels /content/val_yolo/labels/val.cache3 (301 found, 0 missing, 0 empty, 0 duplicate, for 301 images): 301it [00:00, 3673.14it/s]
NumExpr defaulting to 4 threads.
Images sizes do not match. This will causes images to be display incorrectly in the UI.
Image sizes 512 train, 512 test
Using 4 dataloader workers
Logging results to runs/train/yolor_p6
Starting training for 50 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
      0/49     9.66G   0.05845   0.05395  0.001509    0.1139        16       512: 100% 338/338 [02:54<00:00,  1.94it/s]

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
      1/49     9.79G   0.04292   0.04701  0.001316   0.09124        25       512: 100% 338/338 [02:36<00:00,  2.16it/s]

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
      2/49     9.79G   0.03832   0.04282   0.00132   0.08246        16       512: 100% 338/338 [02:30<00:00,  2.24it/s]

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
      3/49     9.79G    0.0347   0.03967  0.001317   0.07569        21       512: 100% 338/338 [02:28<00:00,  2.27it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 10/10 [00:05<00:00,  1.83it/s]
                 all         301         392       0.201       0.794       0.317        0.19

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
      4/49     6.29G   0.03113   0.03582  0.001313   0.06826        30       512: 100% 338/338 [02:28<00:00,  2.27it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 10/10 [00:03<00:00,  2.88it/s]
                 all         301         392       0.212       0.887       0.381       0.233

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
      5/49     6.29G   0.02992   0.03439  0.001319   0.06563        19       512: 100% 338/338 [02:28<00:00,  2.28it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95: 100% 10/10 [00:03<00:00,  2.94it/s]
                 all         301         392       0.219       0.897       0.402       0.278

...........

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
     49/49     6.29G   0.01612   0.02269  0.001305   0.04012        18       512: 100% 338/338 [02:28<00:00,  2.27it/s]
               Class      Images     Targets           P           R      mAP@.5  mAP@.5:.95:   0% 0/10 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/content/yolor/train.py", line 537, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "/content/yolor/train.py", line 344, in train
    log_imgs=opt.log_imgs if wandb else 0)
  File "/content/yolor/test.py", line 167, in test
    "domain": "pixel"} for *xyxy, conf, cls in pred.tolist()]
  File "/content/yolor/test.py", line 167, in <listcomp>
    "domain": "pixel"} for *xyxy, conf, cls in pred.tolist()]
TypeError: list indices must be integers or slices, not float

This is a sample of what the bounding box labels in the dataset look like:

1 0.547852 0.704590 0.484375 0.590820

where 1 is presumably the class index.
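(If these are standard YOLO-format labels, each line should be class_id x_center y_center width height with all coordinates normalized to [0, 1]; a minimal parsing sketch, purely for illustration:)

# minimal sketch: parse one YOLO-format label line
# assumed format: class_id x_center y_center width height, normalized to [0, 1]
line = "1 0.547852 0.704590 0.484375 0.590820"
cls_id, *coords = line.split()
cls_id = int(cls_id)               # index into the class names list
cx, cy, w, h = map(float, coords)  # fractions of image width / height
print(cls_id, cx, cy, w, h)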

Does anyone know what may be the cause of this error?

@neel04
Author

neel04 commented Nov 13, 2021

I am also getting this, strangely enough, after some modifications 🤔

FileNotFoundError: [Errno 2] No such file or directory: '/content/runs/train/yolor_p6/precision-recall_curve.png'

@dripdropdr

Did you solve this? I have the same problem, too. :(

@zggg1p

zggg1p commented Apr 27, 2022

I have the same problem, too. :(

@Timmimim

Timmimim commented Aug 5, 2022

In case anybody still encounters the same issue:

I fixed the initial issue of float indices by casting cls to an integer where it is used as a list index:

[...]
"box_caption": "%s %.3f" % (names[int(cls)], conf),
[...]

Before moving on, while we are in the W&B logging part (lines 161-169):
The names variable might need to be a dict for some versions of wandb (its bounding-box overlay expects class_labels as a dict mapping integer class ids to names; at least for me a plain list raised an error). To fix the issue before it arises, edit the code to something like this:

# W&B logging
if plots and len(wandb_images) < log_imgs:
    box_data = [{"position": {"minX": xyxy[0], "minY": xyxy[1], "maxX": xyxy[2], "maxY": xyxy[3]},
                 "class_id": int(cls),
                 "box_caption": "%s %.3f" % (names[int(cls)], conf),
                 "scores": {"class_score": conf},
                 "domain": "pixel"} for *xyxy, conf, cls in pred.tolist()]
    # if necessary, create a dict using list indices as keys, so it can be queried almost exactly like a list
    if type(names) == type([]):
        names_dict = {idx: val for idx, val in enumerate(names)}
        boxes = {"predictions": {"box_data": box_data, "class_labels": names_dict}}
    else:
        boxes = {"predictions": {"box_data": box_data, "class_labels": names}}
    wandb_images.append(wandb.Image(img[si], boxes=boxes, caption=path.name))

The second issue (FileNotFoundError: [Errno 2] No such file or directory: 'runs/train/<run_dir>/precision-recall_curve.png') is the result of no validation being run when training for fewer than 3 epochs; this is governed by train.py line 336 (if epoch >= 3:).
If the test() method from test.py has never been called, the relevant images are never created, so the file does not exist.
At least that was the case for me. I assume you reduced your number of epochs for test runs?
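A simple workaround (just a sketch; the path handling and logging call here are illustrative, not the exact train.py code) is to check that the plot actually exists before logging it:

import os
import wandb

# sketch: only log the PR curve if test() actually produced it
pr_curve = os.path.join(opt.save_dir, "precision-recall_curve.png")  # save_dir as parsed in train.py
if os.path.exists(pr_curve):
    wandb.log({"precision-recall_curve": wandb.Image(pr_curve)})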

I encountered several more issues down the road. I had compatibility issues with PyTorch v1.12, which were easily resolved thanks to the code provided in #270.

I had to adjust the number of classes for my custom data, and subsequently the number of filters in several layers of the architecture as described in the respective <architecture>.cfg files. Examples can be found in #16 and #251.
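For reference, the usual rule (assuming the standard 3 anchors per detection scale, as in the stock cfg) is that every detection layer needs classes set to your class count, and the convolutional layer feeding it needs filters = (classes + 5) * 3. A small helper to compute the value, as a sketch:

# sketch: compute the 'filters' value for the conv layer that feeds each
# detection layer in a YOLO-style .cfg, assuming 3 anchors per scale
def detection_filters(num_classes, anchors_per_scale=3):
    # 5 = 4 box coordinates + 1 objectness score per anchor
    return (num_classes + 5) * anchors_per_scale

print(detection_filters(1))    # single-class dataset -> 18
print(detection_filters(80))   # COCO's 80 classes -> 255 (the stock value)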

Finally, there was another issue in utils/plot.py that kept me busy, which might also be a compatibility issue with PyTorch v1.12. I kept getting (seemingly illogical) errors about a list-type object that was actually a CUDA tensor when it should not have been; somewhere under the hood, some data is not properly converted. In the method output_to_target() (lines 89-108), the target variable is not a simple list but a CUDA tensor (or contains CUDA tensors), and these must be moved to CPU memory. I ended up editing the _tensor.py file in my PyTorch installation.
The Tensor class has a method __array__() used for implicit type casting (lines 753-761 in my installation). I added the following code ahead of the if-clauses handling the two possible return statements:

if self.is_cuda:
    self = self.cpu()
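A less invasive alternative (a sketch only; the output variable name is illustrative, assuming the offending values are the per-image detection tensors handed to the plotting code) is to move them to CPU before they reach output_to_target(), instead of patching PyTorch itself:

import torch

# sketch: detach and move any CUDA detection tensors to host memory
# before handing them to the plotting code
output = [o.detach().cpu() if isinstance(o, torch.Tensor) else o
          for o in output]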

I hope that covers all issues you might have. I thought it would be good to write a small summary of my problems today, so others won't have to waste half a day.
Have a good one! :)
