
Training is getting slower and slower #11490

Closed
1 task done
ylc2580 opened this issue May 5, 2023 · 6 comments
Labels
question (Further information is requested), Stale

Comments

@ylc2580

ylc2580 commented May 5, 2023

Search before asking

Question

Hi,
When I am training, it always gets stuck here. Could you help me?

My training server hardware is:
2× RTX 3090, Xeon Silver 4210

AutoAnchor: 4.64 anchors/target, 1.000 Best Possible Recall (BPR). Current anchors are a good fit to dataset ✅
Plotting labels to runs/train/exp/labels.jpg...
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to runs/train/exp
Starting training for 300 epochs...

  Epoch    GPU_mem   box_loss   obj_loss   cls_loss  Instances       Size
  0/299      23.5G     0.1087    0.02937    0.07594        135        640:   7%|▋         | 48/730 [02:25<20:25,  1.80s/it]libpng warning: Incorrect sBIT chunk length
  0/299      23.5G     0.1086    0.02932    0.07605        159        640:   7%|▋         | 52/730 [02:36<22:38,  2.00s/it]libpng warning: Incorrect sBIT chunk length
  0/299      23.5G     0.1061    0.02694    0.07568        141        640:  25%|██▌       | 185/730 [07:19<16:35,  1.83s/it]libpng warning: Incorrect sBIT chunk length
  0/299      23.5G     0.1056    0.02668    0.07558        149        640:  28%|██▊       | 204/730 [07:54<12:16,  1.40s/it]libpng warning: Incorrect sBIT chunk length
  0/299      23.5G    0.09795    0.02256    0.07217         36        640: 100%|██████████| 730/730 [22:33<00:00,  1.85s/it]
             Class     Images  Instances          P          R      mAP50   mAP50-95:   0%|          | 0/365 [00:00<?, ?it/s]libpng warning: Incorrect sBIT chunk length
             Class     Images  Instances          P          R      mAP50   mAP50-95:  43%|████▎     | 157/365 [04:00<04:23,  1.27s/it]libpng warning: Incorrect sBIT chunk length
             Class     Images  Instances          P          R      mAP50   mAP50-95:  53%|█████▎    | 195/365 [05:01<03:42,  1.31s/it]WARNING ⚠️ NMS time limit 9.500s exceeded
             Class     Images  Instances          P          R      mAP50   mAP50-95:  99%|█████████▊| 360/365 [10:39<00:39,  7.91s/it]WARNING ⚠️ NMS time limit 9.500s exceeded
             Class     Images  Instances          P          R      mAP50   mAP50-95: 100%|██████████| 365/365 [10:58<00:00,  1.80s/it]

Additional

No response

ylc2580 added the question label on May 5, 2023
@glenn-jocher
Member

@ylc2580 hi there,

Based on the output you shared, your training appears to be progressing normally. The libpng warnings about the sBIT chunk length do not indicate a problem with the training process. The NMS time limit warning is also not necessarily a concern unless you are seeing drastically poor performance or other issues during training.

Since you have a relatively powerful server setup, I recommend ensuring that your dataset is properly formatted and optimized for training with YOLOv5. Additionally, you may want to experiment with adjusting your hyperparameters to see if you can improve performance.
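
These warnings usually come from a few PNG files with a malformed sBIT chunk. They are harmless, but if you want to silence them you can re-encode the affected images. A minimal sketch using Pillow (the directory path is a placeholder, not from your setup):

```python
# Re-encode PNGs with Pillow to drop malformed ancillary chunks such as a bad sBIT chunk.
# "datasets/my_data/images" is a hypothetical path -- point it at your own image folder.
from pathlib import Path
from PIL import Image

image_dir = Path("datasets/my_data/images")

for png in image_dir.rglob("*.png"):
    with Image.open(png) as img:
        img.load()       # force a full decode before the file is overwritten
        img.save(png)    # re-save cleanly; the corrupt chunk is not written back
```

Note that this overwrites the images in place, so keep a backup if that matters.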

Let us know if you have any further questions or concerns.

Best,
Glenn

@ylc2580
Author

ylc2580 commented May 5, 2023

When training, power draw is about 300 W and GPU utilization is about 90%, but when calculating mAP, power draw drops to about 25 W and GPU utilization is 0%.

When I trained with YOLOX, this did not happen.

I used the default hyperparameters, and the validation set has about 30,000 images. Could that be the reason?

@glenn-jocher
Member

@ylc2580 hello,

Based on your description, the issue may be related to your hyperparameters and the size of your validation set. The default hyperparameters might not be optimal for your specific use-case, and a validation set of roughly 30,000 images is quite large; these factors can lead to the drop in GPU utilization you see during the mAP calculation stage.

To address this, I suggest experimenting with different hyperparameters to find the best settings for your use-case. Additionally, you can try reducing the size of your validation set to see if that improves GPU utilization during the mAP calculation stage.
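
For reference, one way to shrink the mAP pass is to validate on a random subset of your images. A minimal sketch, assuming your dataset YAML points `val:` at a .txt list of image paths (one common YOLOv5 layout); the file names here are placeholders:

```python
# Subsample an existing YOLOv5 validation list to speed up the per-epoch mAP pass.
# "val.txt" and "val_small.txt" are placeholder names -- adapt them to your dataset.
import random

random.seed(0)  # reproducible subset

with open("val.txt") as f:
    paths = [line.strip() for line in f if line.strip()]

subset = random.sample(paths, k=min(3000, len(paths)))  # e.g. keep ~3k of the 30k images

with open("val_small.txt", "w") as f:
    f.write("\n".join(subset) + "\n")
```

You would then point the `val:` entry of your dataset YAML at the new list.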

If these suggestions do not resolve the issue, please provide more details such as the version of YOLOv5 and the specific command used to run the training/validation.

I hope this helps. Let us know if you need further assistance.

Best,
Glenn Jocher

@ylc2580
Author

ylc2580 commented May 5, 2023 via email

@glenn-jocher
Member

Hello @ylc2580,

Thank you for your response. I recommend trying different hyperparameters to optimize training for your specific use-case. Additionally, reducing the size of your validation set might help improve GPU utilization during the mAP calculation stage.
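
If you prefer not to touch your original dataset config, you can write a modified copy that points `val` at the reduced list. A sketch assuming PyYAML (already a YOLOv5 dependency) and placeholder file names:

```python
# Write a copy of the dataset YAML whose `val` entry points at the reduced image list,
# leaving the original config untouched. File names below are placeholders.
import yaml

with open("data/my_data.yaml") as f:          # hypothetical original config
    cfg = yaml.safe_load(f)

cfg["val"] = "val_small.txt"                  # the subsampled list from the earlier sketch

with open("data/my_data_small_val.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```

You would then pass this copy to train.py via its --data argument.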

Please let us know if you encounter any other issues or have further questions.

Best regards,
Glenn Jocher

@github-actions
Contributor

github-actions bot commented Jun 5, 2023

👋 Hello there! We wanted to give you a friendly reminder that this issue has not had any recent activity and may be closed soon, but don't worry - you can always reopen it if needed. If you still have any questions or concerns, please feel free to let us know how we can help.

Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLO 🚀 and Vision AI ⭐

github-actions bot added the Stale label on Jun 5, 2023
github-actions bot closed this as not planned on Jun 16, 2023