
Training is getting slower and slower #11490

Closed
1 task done
ylc2580 opened this issue May 5, 2023 · 6 comments
Labels
question (Further information is requested), Stale

Comments

@ylc2580

ylc2580 commented May 5, 2023

Search before asking

Question

Hi,
When I am training, it always gets stuck here. Could you help me?

My training server hardware is:
2× RTX 3090, Xeon Silver 4210

AutoAnchor: 4.64 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to runs/train/exp/labels.jpg...
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp
Starting training for 300 epochs...

  Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
  0/299      23.5G     0.1087    0.02937    0.07594        135        640:   7%|▋         | 48/730 [02:25<20:25,  1.80s/it]libpng warning: Incorrect sBIT chunk length
  0/299      23.5G     0.1086    0.02932    0.07605        159        640:   7%|▋         | 52/730 [02:36<22:38,  2.00s/it]libpng warning: Incorrect sBIT chunk length
  0/299      23.5G     0.1061    0.02694    0.07568        141        640:  25%|██▌       | 185/730 [07:19<16:35,  1.83s/it]libpng warning: Incorrect sBIT chunk length
  0/299      23.5G     0.1056    0.02668    0.07558        149        640:  28%|██▊       | 204/730 [07:54<12:16,  1.40s/it]libpng warning: Incorrect sBIT chunk length
  0/299      23.5G    0.09795    0.02256    0.07217         36        640: 100%|██████████| 730/730 [22:33<00:00,  1.85s/it]
             Class     Images  Instances          P          R      mAP50   mAP50-95:   0%|          | 0/365 [00:00<?, ?it/s]libpng warning: Incorrect sBIT chunk length
             Class     Images  Instances          P          R      mAP50   mAP50-95:  43%|████▎     | 157/365 [04:00<04:23,  1.27s/it]libpng warning: Incorrect sBIT chunk length
             Class     Images  Instances          P          R      mAP50   mAP50-95:  53%|█████▎    | 195/365 [05:01<03:42,  1.31s/it]WARNING ⚠️ NMS time limit 9.500s exceeded
             Class     Images  Instances          P          R      mAP50   mAP50-95:  99%|█████████▊| 360/365 [10:39<00:39,  7.91s/it]WARNING ⚠️ NMS time limit 9.500s exceeded
             Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 365/365 [10:58<00:00,  1.80s/it]

Additional

No response

ylc2580 added the question label on May 5, 2023
@glenn-jocher
Member

@ylc2580 hi there,

Based on the output you shared, your training appears to be progressing normally. The libpng warnings about the sBIT chunk length do not indicate a problem with the training process. The NMS time limit warning is also not necessarily a concern unless you are seeing drastically poor performance or other issues during training.

Since you have a relatively powerful server setup, I recommend ensuring that your dataset is properly formatted and optimized for training with YOLOv5. Additionally, you may want to experiment with adjusting your hyperparameters to see if you can improve performance.
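
These warnings usually come from a few PNG files with a malformed sBIT chunk. They are harmless, but if you want to silence them you can re-encode the affected images. A minimal sketch using Pillow (the directory path is a placeholder, not from your setup):

```python
# Re-encode PNGs with Pillow to drop malformed ancillary chunks such as a bad sBIT chunk.
# "datasets/my_data/images" is a hypothetical path -- point it at your own image folder.
from pathlib import Path
from PIL import Image

image_dir = Path("datasets/my_data/images")

for png in image_dir.rglob("*.png"):
    with Image.open(png) as img:
        img.load()       # force a full decode before the file is overwritten
        img.save(png)    # re-save cleanly; the corrupt chunk is not written back
```

Note that this overwrites the images in place, so keep a backup if that matters.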

Let us know if you have any further questions or concerns.

Best,
Glenn

@ylc2580
Author

ylc2580 commented May 5, 2023

When training, power draw is about 300 W and GPU utilization is about 90%, but when calculating mAP, power draw drops to about 25 W and GPU utilization is 0%.

When I trained with YOLOX, this did not happen.

I used the default hyperparameters, and the validation set has about 30,000 images. Could that be the reason?

@glenn-jocher
Member

@ylc2580 hello,

Based on your description, the issue may be related to your hyperparameters and the size of your validation set. The default hyperparameters might not be optimal for your specific use-case, and a validation set of roughly 30,000 images is quite large; these factors can lead to the drop in GPU utilization you see during the mAP calculation stage.

To address this, I suggest experimenting with different hyperparameters to find the best settings for your use-case. Additionally, you can try reducing the size of your validation set to see if that improves GPU utilization during the mAP calculation stage.
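
For reference, one way to shrink the mAP pass is to validate on a random subset of your images. A minimal sketch, assuming your dataset YAML points `val:` at a .txt list of image paths (one common YOLOv5 layout); the file names here are placeholders:

```python
# Subsample an existing YOLOv5 validation list to speed up the per-epoch mAP pass.
# "val.txt" and "val_small.txt" are placeholder names -- adapt them to your dataset.
import random

random.seed(0)  # reproducible subset

with open("val.txt") as f:
    paths = [line.strip() for line in f if line.strip()]

subset = random.sample(paths, k=min(3000, len(paths)))  # e.g. keep ~3k of the 30k images

with open("val_small.txt", "w") as f:
    f.write("\n".join(subset) + "\n")
```

You would then point the `val:` entry of your dataset YAML at the new list.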

If these suggestions do not resolve the issue, please provide more details such as the version of YOLOv5 and the specific command used to run the training/validation.

I hope this helps. Let us know if you need further assistance.

Best,
Glenn Jocher

@ylc2580
Author

ylc2580 commented May 5, 2023 via email

@glenn-jocher
Member

Hello @ylc2580,

Thank you for your response. I recommend trying different hyperparameters to optimize training for your specific use-case. Additionally, reducing the size of your validation set might help improve GPU utilization during the mAP calculation stage.
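
If you prefer not to touch your original dataset config, you can write a modified copy that points `val` at the reduced list. A sketch assuming PyYAML (already a YOLOv5 dependency) and placeholder file names:

```python
# Write a copy of the dataset YAML whose `val` entry points at the reduced image list,
# leaving the original config untouched. File names below are placeholders.
import yaml

with open("data/my_data.yaml") as f:          # hypothetical original config
    cfg = yaml.safe_load(f)

cfg["val"] = "val_small.txt"                  # the subsampled list from the earlier sketch

with open("data/my_data_small_val.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```

You would then pass this copy to train.py via its --data argument.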

Please let us know if you encounter any other issues or have further questions.

Best regards,
Glenn Jocher

@github-actions
Contributor

github-actions bot commented Jun 5, 2023

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions bot added the Stale label on Jun 5, 2023
github-actions bot closed this as not planned on Jun 16, 2023