
Joint dataset training question #6904

Closed
HeChengHui opened this issue Mar 8, 2022 · 43 comments
Labels
question · Stale

Comments

@HeChengHui


Question

I would like to try out joint dataset training as seen here, with COCO + VisDrone2019-DET. However, I am not sure if I should start with pre-trained weights (YOLOv5m6, s6, n6) or start from scratch (if that is possible).

Additional

No response

@HeChengHui added the question label Mar 8, 2022
@HeChengHui
Author

HeChengHui commented Mar 8, 2022

@glenn-jocher Thank you for referring to that link.

I would like to ask if it would be fine to use the pre-trained m6, s6, and n6 models as the starting point for training on COCO+VisDrone. From what I understand, the pre-trained models are trained on COCO. Would doing so cause any unwanted behaviors down the road?

Furthermore, would using a pre-trained model to train on another dataset leave the final model bloated with extra parameters from its pre-training dataset? If that is the case, is it possible to train from scratch?

@glenn-jocher
Member

@HeChengHui yes you can use YOLOv5 pretrained models to start training any dataset or combination of datasets.
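For reference, both starting points are exposed through train.py; rough example commands (the dataset yaml name and image size are illustrative):

# fine-tune from COCO-pretrained weights
python train.py --data coco_visdrone.yaml --weights yolov5m6.pt --img 1280

# train from random initialization, using a model config instead of weights
python train.py --data coco_visdrone.yaml --weights '' --cfg yolov5m6.yaml --img 1280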

@HeChengHui
Author

HeChengHui commented Mar 15, 2022

@glenn-jocher
Thanks for the forum link! I managed to merge all the images of COCO and VisDrone, with a labels folder containing only 1 class.
I tried training with the following code:
python train.py --device 0 --weights '' --cfg yolov5s_cocoVisdrone.yaml --data coco_visdrone.yaml --batch-size -1 --epochs 300 --evolve --hyp hyp.scratch-low_cocoVisdrone.yaml --data VisDrone.yaml --imgsz 1920 --cache --name yolov5s_cocoVisdrone
but ran into a CUDA out-of-memory error even with --batch-size -1:

train: weights='', cfg=yolov5s_cocoVisdrone.yaml, data=VisDrone.yaml, hyp=hyp.scratch-low_cocoVisdrone.yaml, epochs=300, batch_size=-1, imgsz=1920, rect=False, resume=False, nosave=False, noval=False, noautoanchor=False, evolve=300, bucket=, cache=ram, image_weights=False, device=0, multi_scale=False, single_cls=False, optimizer=SGD, sync_bn=False, workers=8, project=runs\train, name=yolov5s_cocoVisdrone, exist_ok=False, quad=False, cos_lr=False, label_smoothing=0.0, patience=100, freeze=[0], save_period=-1, local_rank=-1, entity=None, upload_dataset=False, bbox_interval=-1, artifact_alias=latest
github: up to date with https://github.com/ultralytics/yolov5
YOLOv5  v6.1-36-gc09fb2a torch 1.10.2 CUDA:0 (NVIDIA GeForce RTX 3080 Laptop GPU, 16384MiB)
Overriding model.yaml nc=1 with nc=10
Overriding model.yaml anchors with anchors=3

                 from  n    params  module                                  arguments
  0                -1  1      3520  models.common.Conv                      [3, 32, 6, 2, 2]
  1                -1  1     18560  models.common.Conv                      [32, 64, 3, 2]
  2                -1  1     18816  models.common.C3                        [64, 64, 1]
  3                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  4                -1  2    115712  models.common.C3                        [128, 128, 2]
  5                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  6                -1  3    625152  models.common.C3                        [256, 256, 3]
  7                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  8                -1  1   1182720  models.common.C3                        [512, 512, 1]
  9                -1  1    656896  models.common.SPPF                      [512, 512, 5]
 10                -1  1    131584  models.common.Conv                      [512, 256, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  1    361984  models.common.C3                        [512, 256, 1, False]
 14                -1  1     33024  models.common.Conv                      [256, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 4]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    147712  models.common.Conv                      [128, 128, 3, 2]
 19          [-1, 14]  1         0  models.common.Concat                    [1]
 20                -1  1    296448  models.common.C3                        [256, 256, 1, False]
 21                -1  1    590336  models.common.Conv                      [256, 256, 3, 2]
 22          [-1, 10]  1         0  models.common.Concat                    [1]
 23                -1  1   1182720  models.common.C3                        [512, 512, 1, False]
 24      [17, 20, 23]  1     40455  models.yolo.Detect                      [10, [[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]], [128, 256, 512]]
Model Summary: 270 layers, 7046599 parameters, 7046599 gradients, 15.9 GFLOPs

AutoBatch: Computing optimal batch size for --imgsz 1920
AutoBatch: CUDA:0 (NVIDIA GeForce RTX 3080 Laptop GPU) 16.00G total, 0.06G reserved, 0.05G allocated, 15.88G free
      Params      GFLOPs  GPU_mem (GB)  forward (ms) backward (ms)                   input                  output
     7046599       143.3         1.755         46.88         31.25      (1, 3, 1920, 1920)                    list
     7046599       286.7         3.511         52.06          57.3      (2, 3, 1920, 1920)                    list
     7046599       573.4         7.348         93.72         109.4      (4, 3, 1920, 1920)                    list
     7046599        1147        14.577         195.8         208.4      (8, 3, 1920, 1920)                    list
CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 16.00 GiB total capacity; 13.79 GiB already allocated; 0 bytes free; 13.93 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
OMP: Error #15: Initializing libiomp5md.dll, but found libiomp5md.dll already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.

Could this be some error in my settings, or just a case of AutoBatch not working here?
Furthermore, why is it saying Overriding model.yaml nc=1 with nc=10? I had nc set to 1 in both my --cfg and --data yaml files.

@glenn-jocher
Member

glenn-jocher commented Mar 15, 2022

@HeChengHui your data.yaml has nc=10, so it's using that rather than the conflicting nc from your model.yaml.

Based on your partially completed AutoBatch results it seems like your card can support maybe --batch 4 or --batch 8. Experiment to see what works.
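For context, the nc in the --data yaml is what the model is built with; the parsed arguments above show data=VisDrone.yaml (from the duplicated --data flag), and VisDrone has 10 classes, hence the override message. A minimal sketch of the precedence (illustrative, not the repository's exact code):

# Illustrative sketch: the class count from data.yaml wins over the one in model.yaml
import yaml

def resolve_nc(data_yaml, model_yaml):
    with open(data_yaml) as f:
        data_nc = int(yaml.safe_load(f)['nc'])
    with open(model_yaml) as f:
        model_nc = int(yaml.safe_load(f)['nc'])
    if data_nc != model_nc:
        print(f'Overriding model.yaml nc={model_nc} with nc={data_nc}')
    return data_nc

# resolve_nc('VisDrone.yaml', 'yolov5s_cocoVisdrone.yaml')  # -> 10 here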

@HeChengHui
Author

HeChengHui commented Mar 15, 2022

@glenn-jocher
Ah, my bad, I accidentally added two --data arguments.

After setting up the environment again, AutoBatch seems to work: I tried --batch-size 16 and it implemented a size of 7.

@HeChengHui
Author

@glenn-jocher
I tried training using:
python train.py --device 0 --weights yolov5s.pt --data mot_visdrone.yaml --batch-size -1 --epochs 300 --hyp hyp.scratch-low_MOTvisdrone.yaml --imgsz 960 --cache disk --name yolov5s_motVisdrone --evolve

Even at epoch 34, my metrics (mAP, recall, precision logged to wandb) are all 0. Could this be due to the hyperparameter evolution?

@glenn-jocher
Member

@HeChengHui I don't know what you mean by zero mAP on epoch 34 of 300, as --evolve does not compute mAP until the final epoch. Also note that --batch 1 is extremely small and not recommended.

@HeChengHui
Author

@glenn-jocher

I don't know what you mean by zero mAP on epoch 34 of 300

I was referring to the metrics shown in wandb. Are they only evaluated after 300 epochs instead of every epoch?

Also note that --batch 1 is extremely small and not recommended.

Does --batch-size -1 not help to find the best batch size?

@glenn-jocher
Member

Does --batch-size -1 not help to find the best batch size?

Oh yes! Didn't notice the -1. -1 will implement AutoBatch to automatically find the best batch size. But yes, during evolution mAP is only evaluated on the final epoch, so there's no way to know its value until a generation is finished.

@HeChengHui
Author

@glenn-jocher
I see! No wonder it is not showing anything. If I want further training after 300 epochs, is it correct to rerun python train.py --device 0 --weights yolov5s.pt --data mot_visdrone.yaml --batch-size -1 --epochs 300 --hyp hyp.scratch-low_MOTvisdrone.yaml --imgsz 960 --cache disk --name yolov5s_motVisdrone --evolve, but change the weights to the best resulting model?

@glenn-jocher
Member

@HeChengHui it sounds like you should just train normally rather than using --evolve. --evolve is intended to take several weeks with significant resources, and it does not return a model; it only returns evolved hyperparameters for your base scenario that you can then use to train a model.

If you just want to train a model don't use --evolve.
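In other words, the plain training run would be your command with the --evolve flag removed, e.g.:

python train.py --device 0 --weights yolov5s.pt --data mot_visdrone.yaml --batch-size -1 --epochs 300 --hyp hyp.scratch-low_MOTvisdrone.yaml --imgsz 960 --cache disk --name yolov5s_motVisdrone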

@HeChengHui
Author

@glenn-jocher
I see. But since I have already started the process, would it be more beneficial to use the hyperparameters resulting from those 300 epochs to train?

@glenn-jocher
Member

glenn-jocher commented Mar 21, 2022

@HeChengHui you're not understanding evolution. One training is one generation. Evolution relies on many (hundreds of) generations to evolve optimal hyperparameters. See the hyperparameter evolution tutorial for details:

YOLOv5 Tutorials

Good luck 🍀 and let us know if you have any other questions!
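For reference, the intended workflow looks roughly like this (epoch count and output path are illustrative and may differ by version):

# 1) evolve: many short trainings, one generation each, fitness measured on the final epoch
python train.py --data mot_visdrone.yaml --weights yolov5s.pt --epochs 10 --evolve 300

# 2) then train a model normally with the evolved hyperparameters
python train.py --data mot_visdrone.yaml --weights yolov5s.pt --epochs 300 --hyp runs/evolve/exp/hyp_evolve.yaml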

@HeChengHui
Author

HeChengHui commented Mar 24, 2022

@glenn-jocher

While looking through the different model configurations under models/hub, I am interested in using p2.yaml, since it also includes an extra-small (P2) detection output, which could be useful for my use case of aerial tracking.
To further decrease the network size and increase speed, I deleted the large detection head as follows:

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)

   [-1, 1, Conv, [128, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 2], 1, Concat, [1]],  # cat backbone P2
   [-1, 1, C3, [128, False]],  # 21 (P2/4-xsmall)

   [-1, 1, Conv, [128, 3, 2]],
   [[-1, 18], 1, Concat, [1]],  # cat head P3
   [-1, 3, C3, [256, False]],  # 24 (P3/8-small)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 27 (P4/16-medium)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 30 (P5/32-large)

   [[21, 24, 27, 30], 1, Detect, [nc, anchors]],  # Detect(P2, P3, P4, P5)
  ]

I would like to clarify the purpose of the # 17 (P3/8-small) block, and whether deleting it will affect the performance of the model.

@glenn-jocher
Member

@HeChengHui sure, you can delete larger output blocks if you don't need them. Results will naturally vary based on your dataset and training settings like --img-size.

Another option for small object detection would just be to train and detect at larger --img-size with the normal P5 models.
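For example (image size and paths are illustrative; the P5 models are trained at 640 by default):

python train.py --img 1280 --weights yolov5s.pt --data mot_visdrone.yaml
python detect.py --img 1280 --weights runs/train/exp/weights/best.pt --source path/to/images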

@HeChengHui
Author

@glenn-jocher

I see.
How about deleting

   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)

Would that affect the performance? The purpose is to reduce model size and increase speed.

@glenn-jocher
Member

@HeChengHui you can delete anything you want, but if you delete intermediate layers you need to correctly reconnect the remaining layers, i.e. the [[-1, 18], 1, Concat, [1]]  # cat head P3 layer depends on layer 18, which will no longer be there if you delete layer 17, etc.

@HeChengHui
Author

@glenn-jocher
Thank you for the suggestion. I tried combining the p2 and v5s configurations as follows:

# YOLOv5 v6.0 head with (P2, P3, P4) outputs
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13

   [-1, 1, Conv, [128, 1, 1]],  #14
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],  #15
   [[-1, 2], 1, Concat, [1]],  # cat backbone P2
   [-1, 1, C3, [128, False]],  # 17 (P2/4-xsmall)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 4], 1, Concat, [1]],  # cat head P3
   [-1, 3, C3, [512, False]],  # 20 (P3/8-small)

   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 23 (P4/16-medium)

   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P2, P3, P4)
  ]

Would this be a valid configuration?

@glenn-jocher
Member

@HeChengHui you can run any model yaml through yolo.py to verify it works and profile it etc.

python models/yolo.py --cfg yolov5s.yaml

@HeChengHui
Author

@glenn-jocher

ohh thank you.

                 from  n    params  module                                  arguments
  0                -1  1      7040  models.common.Conv                      [3, 64, 6, 2, 2]
  1                -1  1     73984  models.common.Conv                      [64, 128, 3, 2]
  2                -1  3    156928  models.common.C3                        [128, 128, 3]
  3                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
  4                -1  6   1118208  models.common.C3                        [256, 256, 6]
  5                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
  6                -1  9   6433792  models.common.C3                        [512, 512, 9]
  7                -1  1   4720640  models.common.Conv                      [512, 1024, 3, 2]
  8                -1  3   9971712  models.common.C3                        [1024, 1024, 3]
  9                -1  1   2624512  models.common.SPPF                      [1024, 1024, 5]
 10                -1  1    525312  models.common.Conv                      [1024, 512, 1, 1]
 11                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 12           [-1, 6]  1         0  models.common.Concat                    [1]
 13                -1  3   2757632  models.common.C3                        [1024, 512, 3, False]
 14                -1  1     65792  models.common.Conv                      [512, 128, 1, 1]
 15                -1  1         0  torch.nn.modules.upsampling.Upsample    [None, 2, 'nearest']
 16           [-1, 2]  1         0  models.common.Concat                    [1]
 17                -1  1     90880  models.common.C3                        [256, 128, 1, False]
 18                -1  1    295424  models.common.Conv                      [128, 256, 3, 2]
 19           [-1, 4]  1         0  models.common.Concat                    [1]
 20                -1  3    690688  models.common.C3                        [512, 256, 3, False]
 21                -1  1   1180672  models.common.Conv                      [256, 512, 3, 2]
 22          [-1, 14]  1         0  models.common.Concat                    [1]
 23                -1  3   2561024  models.common.C3                        [640, 512, 3, False]
 24      [17, 20, 23]  1    229245  Detect                                  [80, [[0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5], [0, 1, 2, 3, 4, 5]], [128, 256, 512]]
Traceback (most recent call last):
  File "models/yolo.py", line 309, in <module>
    model = Model(opt.cfg).to(device)
  File "models/yolo.py", line 112, in __init__
    m.stride = torch.tensor([s / x.shape[-2] for x in self.forward(torch.zeros(1, ch, s, s))])  # forward
  File "models/yolo.py", line 126, in forward
    return self._forward_once(x, profile, visualize)  # single-scale inference, train
  File "models/yolo.py", line 149, in _forward_once
    x = m(x)  # run
  File "C:\Users\chenghui\anaconda3\envs\yolov5\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\YoloV5\yolov5\models\common.py", line 275, in forward
    return torch.cat(x, self.d)
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 32 but got size 64 for tensor number 1 in the list.

Seems like something went wrong with the concat layer? Any advice on how to debug this?

@glenn-jocher
Member

@HeChengHui we don't provide support for model customizations, sorry. Perhaps a community member can assist.
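For anyone hitting a similar Concat size mismatch: the usual cause is concatenating feature maps that live at different strides (here, an upsampled stride-8 map with the stride-4 backbone P2 output). A rough, generic way to spot this from the yaml is to walk the layers and track each one's stride; the sketch below is illustrative, assumes the standard YOLOv5 config layout, handles only the module types used in these configs, and the file name is hypothetical.

# Walk a YOLOv5 model yaml and track each layer's stride so that Concat layers
# mixing different strides stand out. Illustrative helper, not part of the repo.
import yaml

def layer_strides(cfg_path):
    with open(cfg_path) as f:
        d = yaml.safe_load(f)
    strides = []
    for f_, n, m, args in d['backbone'] + d['head']:
        prev = f_ if isinstance(f_, int) else f_[0]
        if m == 'Conv' and len(args) >= 3 and args[2] == 2:   # stride-2 conv halves resolution
            s = (strides[prev] if strides else 1) * 2
        elif m == 'nn.Upsample':
            s = strides[prev] // 2                            # 2x upsample doubles resolution
        elif m == 'Concat':
            srcs = [strides[i] for i in f_]
            if len(set(srcs)) > 1:
                print(f'layer {len(strides)}: Concat over mismatched strides {srcs}')
            s = srcs[0]
        else:                                                 # C3, SPPF, 1x1 Conv, Detect, ... keep stride
            s = strides[prev]
        strides.append(s)
    return strides

# layer_strides('models/yolov5s-p24.yaml')  # would flag the Concat mixing stride 8 and stride 4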

@HeChengHui
Author

HeChengHui commented Mar 24, 2022

@glenn-jocher
Alright, thanks for the help!

@HeChengHui
Author

@glenn-jocher

I am training a model using python train.py --device 0 --weights '' --cfg yolov5s-p234.yaml --data mot_visdrone.yaml --batch-size -1 --epochs 300 --hyp hyp.scratch-low_MOTvisdrone.yaml --imgsz 960 --cache disk --name yolov5s_motVisdrone_p234.
The batch size is 24 after autobatch.

After training for 60 epochs, it suddenly failed with a CUDA OOM error. I looked around, and it seems I might need to lower the batch size. However, is there a way to lower the batch size while resuming training? Or must I restart from scratch?

@glenn-jocher
Member

@HeChengHui hi sorry to hear that! That's very strange. Is there any other GPU memory usage on the instance?

AutoBatch seeks to set a batch size for 90% CUDA memory utilization, but perhaps we should reduce the default value to 85%.

You cannot modify any parameters on resume, but you can go into train.py and customize the code to force it to a different batch size, i.e.:

yolov5/train.py, lines 70 to 73 (commit 7a2a118):

save_dir, epochs, batch_size, weights, single_cls, evolve, data, cfg, resume, noval, nosave, workers, freeze = \
    Path(opt.save_dir), opt.epochs, opt.batch_size, opt.weights, opt.single_cls, opt.evolve, opt.data, opt.cfg, \
    opt.resume, opt.noval, opt.nosave, opt.workers, opt.freeze
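For example, a temporary override right after that unpacking (illustrative hack, remove it once the resumed run finishes):

batch_size = 16  # force a smaller batch than the checkpointed value; pick what fits in GPU memory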

@HeChengHui
Author

HeChengHui commented Mar 25, 2022

@glenn-jocher
Yes, it is quite weird indeed given how it survived 60 epochs. I have tried --resume after restarting my laptop, but the same error still occurs.
I have also checked with wandb, and my GPU Memory Allocated (%) stayed at 94.52% throughout before it crashed.

You can not modify any parameters on resume, but you can go into train.py and customize the code to force it to a different batch size

Alright thank you!

@HeChengHui
Author

@glenn-jocher
Hello, I have 2 questions.

  1. An error occurred when fusing layers for a custom FPN config:
300 epochs completed in 42.485 hours.
Optimizer stripped from runs\train\yolov5s_motVisdrone_fpn234\weights\last.pt, 5.2MB
Optimizer stripped from runs\train\yolov5s_motVisdrone_fpn234\weights\best.pt, 5.2MB

Validating runs\train\yolov5s_motVisdrone_fpn234\weights\best.pt...
Fusing layers...
Traceback (most recent call last):
  File "train.py", line 643, in <module>
    main(opt)
  File "train.py", line 539, in main
    train(opt.hyp, opt, device, callbacks)
  File "train.py", line 434, in train
    model=attempt_load(f, device).half(),
  File "D:\YoloV5\yolov5\models\experimental.py", line 98, in attempt_load
    model.append(ckpt.fuse().eval() if fuse else ckpt.eval())  # fused or un-fused model in eval mode
  File "D:\YoloV5\yolov5\models\yolo.py", line 225, in fuse
    m.conv = fuse_conv_and_bn(m.conv, m.bn)  # update conv
  File "D:\YoloV5\yolov5\utils\torch_utils.py", line 202, in fuse_conv_and_bn
    fusedconv.weight.copy_(torch.mm(w_bn, w_conv).view(fusedconv.weight.shape))
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

wandb: Waiting for W&B process to finish, PID 15008... (failed 1). Press ctrl-c to abort syncing.
wandb:
wandb: Run history:
wandb:        metrics/mAP_0.5 ▁▃▄▅▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇█████████████████
wandb:   metrics/mAP_0.5:0.95 ▁▃▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇███████████████
wandb:      metrics/precision ▁▃▄▅▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇█▇▇▇████▇▇█▇▇██▇███
wandb:         metrics/recall ▁▄▅▅▆▆▆▇▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇█▇▇▇▇▇███████████
wandb:         train/box_loss █▆▄▄▄▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁
wandb:         train/cls_loss ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:         train/obj_loss █▆▅▄▄▄▄▃▃▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁
wandb:           val/box_loss █▆▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:           val/cls_loss ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:           val/obj_loss █▅▃▃▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
wandb:                  x/lr0 ████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁
wandb:                  x/lr1 ████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁
wandb:                  x/lr2 ████▇▇▇▇▇▆▆▆▆▆▆▅▅▅▅▅▅▄▄▄▄▄▃▃▃▃▃▃▂▂▂▂▂▁▁▁
wandb:
wandb: Run summary:
wandb:             best/epoch 299
wandb:           best/mAP_0.5 0.53313
wandb:      best/mAP_0.5:0.95 0.2063
wandb:         best/precision 0.65371
wandb:            best/recall 0.47009
wandb:        metrics/mAP_0.5 0.53313
wandb:   metrics/mAP_0.5:0.95 0.2063
wandb:      metrics/precision 0.65371
wandb:         metrics/recall 0.47009
wandb:         train/box_loss 0.02225
wandb:         train/cls_loss 0.0
wandb:         train/obj_loss 0.03103
wandb:           val/box_loss 0.04784
wandb:           val/cls_loss 0.0
wandb:           val/obj_loss 0.01575
wandb:                  x/lr0 0.00017
wandb:                  x/lr1 0.00017
wandb:                  x/lr2 0.00017
wandb:
wandb: Synced 6 W&B file(s), 325 media file(s), 0 artifact file(s) and 0 other file(s)
wandb: Synced yolov5s_motVisdrone_fpn234: https://wandb.ai/d3cpt/train/runs/1mxmji3y
wandb: Find logs at: .\wandb\run-20220328_031510-1mxmji3y\logs\debug.log

It seems to exit okay, but I am not sure if the error is going to cause any problems.

  2. I was running tests on weights trained to epochs 118, 150 and 155 (out of 300) without realising that the optimizer was not stripped. I would like to check the effects of not stripping the optimizer besides a bigger model size (slower inference speed?).
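For reference, an unstripped checkpoint mainly carries extra optimizer/EMA state, so the file is larger; it generally should not change detection results or inference speed, since the detection weights are loaded the same way either way. A saved checkpoint can also be stripped after the fact with the repo's own utility; a minimal sketch (path is illustrative, run from the yolov5 root):

from utils.general import strip_optimizer

# drops optimizer/EMA state from the checkpoint to shrink it for deployment
strip_optimizer('runs/train/yolov5s_motVisdrone_fpn234/weights/epoch118.pt')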

@glenn-jocher
Member

glenn-jocher commented Mar 29, 2022

@HeChengHui it appears you may have environment problems. Please ensure you meet all dependency requirements if you are attempting to run YOLOv5 locally. If in doubt, create a new virtual Python 3.9 environment, clone the latest repo (code changes daily), and pip install requirements.txt again from scratch.

💡 ProTip! Try one of our verified environments below if you are having trouble with your local environment.

Requirements

Python>=3.7.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

Models and datasets download automatically from the latest YOLOv5 release when first requested.

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@HeChengHui
Author

@glenn-jocher
I am sure that the environment is correct because I managed to successfully train 2 prior models for 300 epochs. Any advice on how to remedy this error, or do I have to restart training?

@glenn-jocher
Member

glenn-jocher commented Mar 29, 2022

I am sure that the environment is correct

If you trained successfully on another environment, then the independent variable that has changed is your environment, not YOLOv5. Logically you should start examining your environment for issues, or use a working one:

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@HeChengHui
Author

HeChengHui commented Mar 29, 2022

If you trained successfully on another environment, then the independent variable that has changed is your environment, not YOLOv5. Logically you should start examining your environment for issues, or use a working one:

Sorry, I meant that I have also managed to train 2 models with no errors in the same environment.
Does that error make the final model invalid?

@glenn-jocher
Member

glenn-jocher commented Mar 29, 2022

@HeChengHui your environment is up to you. If you have a reproducible error specific to YOLOv5, then please submit a bug report with code to reproduce.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible to produce the problem
  • Complete – Provide all parts someone else needs to reproduce the problem
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

For Ultralytics to provide assistance your code should also be:

  • Current – Verify that your code is up-to-date with GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been solved in master.
  • Unmodified – Your problem must be reproducible using official YOLOv5 code without changes. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

@HeChengHui
Author

@glenn-jocher
I would like to check if changing the activation function requires --weights '' --cfg yolov5s.yaml, or if I can just use --weights yolov5s.pt.

@glenn-jocher
Member

@HeChengHui default activation function for YOLOv5 is SiLU:

self.act = nn.SiLU() if act is True else (act if isinstance(act, nn.Module) else nn.Identity())
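So a model-wide change means editing that line in models/common.py before training; for example, to try LeakyReLU (illustrative, any nn.Module activation works here):

self.act = nn.LeakyReLU(0.1) if act is True else (act if isinstance(act, nn.Module) else nn.Identity())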

@HeChengHui
Author

HeChengHui commented Apr 7, 2022

@glenn-jocher

Sorry, I was asking more specifically: after changing the activation function, do I need to train the model from scratch using --weights '' --cfg yolov5s.yaml, or can I just use a pretrained one like --weights yolov5s.pt?

@glenn-jocher
Member

@HeChengHui I don't understand your question. Nothing is changeable about a trained model. Any changes you make to modules it uses will result in errors or worse results.

@HeChengHui
Author

@glenn-jocher
I am not trying to change the AF after training. I meant to ask: before training starts, do I need to train the model from scratch using the cfg file, or can I use --weights yolov5s.pt with the new AF?

@glenn-jocher
Member

glenn-jocher commented Apr 7, 2022

@HeChengHui oh, you can do both, depending on whether you want to start from pretrained weights or not. See the Train Custom Data tutorial for details:


YOLOv5 Tutorials

Good luck 🍀 and let us know if you have any other questions!

@HeChengHui
Author

@glenn-jocher
ok! thanks for the clarification.

@HeChengHui
Author

@glenn-jocher

  1. How do I check the activation of a pretrained model?

  2. In loss.py, I see that the loss function is weighted as follows:
    self.balance = {3: [4.0, 1.0, 0.4]}.get(det.nl, [4.0, 1.0, 0.25, 0.06, 0.02]) # P3-P7
    Does this mean that P2 is not considered when calculating loss? Because I would like to use P2 as a detection head for xsmall objects.
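On the first point, printing the loaded checkpoint's model (e.g. print(torch.load('yolov5s.pt')['model']) from inside the yolov5 directory) lists every module, including the activation layers. On the second point, reading that line suggests a 4-output (P2-P5) head still gets a weight for every output: det.nl = 4 misses the {3: ...} key, so the 5-element default is used and indexed per output layer; its values were tuned for P3-P7 heads (per the comment), so they may not be ideal for a P2 head. A quick illustrative check:

# Illustrative: how the objectness balance resolves for 3 vs 4 detection layers
for nl in (3, 4):
    balance = {3: [4.0, 1.0, 0.4]}.get(nl, [4.0, 1.0, 0.25, 0.06, 0.02])
    print(nl, balance[:nl])
# 3 [4.0, 1.0, 0.4]
# 4 [4.0, 1.0, 0.25, 0.06]   -> the first (P2) output is still weighted, not ignored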

@HeChengHui
Author

@glenn-jocher

Hi, I would like to clarify the purpose of the test split during training. My understanding is that validation is done on the validation split. Does the test split contribute in any way?

@glenn-jocher
Member

The test split is not used during training.

@github-actions
Contributor

github-actions bot commented May 17, 2022

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.


Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
