🌟💡 YOLOv5 Study: batch size #2377

Closed · glenn-jocher opened this issue Mar 5, 2021 · 10 comments
Labels: documentation (Improvements or additions to documentation)

glenn-jocher (Member) commented Mar 5, 2021

Study

I did a quick study to examine the effect of varying batch size on YOLOv5 training. The study trained YOLOv5s on COCO for 300 epochs with --batch-size at 8 different values: [16, 20, 32, 40, 64, 80, 96, 128].

We've tried to make the train code batch-size agnostic, so that users get similar results at any batch size. This means users on an 11 GB 2080 Ti should be able to produce the same results as users on a 24 GB 3090 or a 40 GB A100, with smaller GPUs simply using smaller batch sizes.

We do this by scaling loss with batch size, and also by scaling weight decay with batch size. At batch sizes smaller than 64 we accumulate gradients over several batches before optimizing, and at batch sizes of 64 and larger we optimize after every batch.
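For reference, a minimal PyTorch-style sketch of this idea (gradient accumulation toward a nominal batch size of 64, with weight decay scaled by the effective batch size). The names `nbs`, `model` and `loader` here are illustrative rather than the exact variables in train.py, and the loss-scaling detail is omitted for brevity.

```python
# Minimal sketch (not the exact train.py code): accumulate gradients so the
# effective batch size is ~64 regardless of the per-step batch size, and scale
# weight decay with that effective batch size.
import torch

nbs = 64                                        # nominal batch size the hyperparameters target
batch_size = 16                                 # whatever fits on your GPU
accumulate = max(round(nbs / batch_size), 1)    # optimizer step every `accumulate` batches

model = torch.nn.Linear(10, 1)                  # stand-in for the detection model
weight_decay = 5e-4 * batch_size * accumulate / nbs   # scale decay with effective batch size
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937,
                            weight_decay=weight_decay)

loader = [(torch.randn(batch_size, 10), torch.randn(batch_size, 1)) for _ in range(8)]  # dummy data

optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(model(x), y)  # placeholder for the YOLO loss
    loss.backward()                                   # gradients accumulate across batches
    if (i + 1) % accumulate == 0:                     # step only once the nominal batch is reached
        optimizer.step()
        optimizer.zero_grad()
```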

Results

Initial results vary significantly with batch size, but final results are nearly identical (good!).
[screenshot: training results plots for all batch sizes]

Closeup of mAP@0.5:0.95:
[screenshot: mAP@0.5:0.95 closeup]

One oddity that stood out is val objectness loss, which did vary with batch size. I'm not sure why, as val-box and val-cls did not vary much, and neither did the 3 train losses. I don't know what this means, or whether there is any cause for concern (or room for improvement).
[screenshot: val objectness loss curves]

glenn-jocher added the documentation label, removed the question label, and self-assigned this issue on Mar 5, 2021
@abhiagwl4262

@glenn-jocher Maybe when we train for a large number of epochs we don't see a significant difference. I experimented with batch sizes of 32 and 48 and got better results with the larger batch size when training for 50 epochs. This happened on multiple datasets.

@glenn-jocher (Member Author)

@abhiagwl4262 we always recommend training at the largest batch size possible, not so much for better performance (the above results don't indicate higher performance at higher batch sizes) but certainly for faster training and better resource utilization.

Multi-GPU may add another angle to this story though, as larger batch sizes there may contribute to better results, at least in early training, since the batchnorm statistics are split among your CUDA devices.
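As an aside, here is a hedged sketch of what synchronizing those batchnorm statistics looks like in PyTorch DDP. This is standard PyTorch usage rather than a copy of the YOLOv5 launcher, and it assumes the script is started with torchrun so that LOCAL_RANK is set.

```python
# Sketch: convert BatchNorm layers to SyncBatchNorm so statistics are reduced
# across all DDP processes instead of being computed per-GPU.
import os
import torch
import torch.distributed as dist

def setup_ddp_model(model: torch.nn.Module, sync_bn: bool = True) -> torch.nn.Module:
    local_rank = int(os.environ["LOCAL_RANK"])      # set by torchrun
    dist.init_process_group(backend="nccl")         # NCCL backend for multi-GPU CUDA training
    torch.cuda.set_device(local_rank)
    model = model.to(local_rank)
    if sync_bn:
        # Every BatchNorm*d layer is replaced with a SyncBatchNorm equivalent
        model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
    return torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```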

@abhiagwl4262

@glenn-jocher Is a high batch size good even for a very small dataset, e.g. 200 images per class?

@glenn-jocher (Member Author)

@abhiagwl4262 maybe, as long as you maintain a similar number of iterations. For very small datasets this may require significantly increasing training epochs, i.e. to several thousand, or until you observe overfitting.
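A rough back-of-the-envelope illustration of keeping the iteration count constant (the numbers are hypothetical, not from the study above):

```python
# Illustrative only: with a tiny dataset, raising the batch size shrinks the
# iterations per epoch, so the epoch count must grow to keep total iterations similar.
import math

dataset_size = 200                 # e.g. a very small dataset
ref_batch, ref_epochs = 16, 300    # reference schedule
new_batch = 64                     # larger batch size being considered

ref_iters = ref_epochs * math.ceil(dataset_size / ref_batch)             # 300 * 13 = 3900
new_epochs = math.ceil(ref_iters / math.ceil(dataset_size / new_batch))  # 3900 / 4 = 975
print(f"{ref_iters} iterations -> train ~{new_epochs} epochs at batch {new_batch}")
```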

cszer commented Mar 8, 2021

Hey, this is a good thing to study. But I should note that results with SyncBN are not reproducible for me. I trained the YOLOv5m model on 8 Tesla A100 GPUs with batch size 256, because DDP only supported the Gloo backend for me and GPU 0 was loaded 50% more than the others (CUDA 11). It would be good to compare SyncBN training with plain BN training.

glenn-jocher (Member Author) commented Mar 8, 2021

@cszer thanks for the comments! Yes, a --sync-bn study would be interesting as well. What are your observations with and without --sync-bn?

Excess CUDA device 0 memory usage was previously related to too-large batch sizes on device 0 when testing, but this bug was fixed on February 6th as part of PR #2148. If your results are from before that then you may want to update your code and see if the problem has been fixed.

cszer commented Mar 9, 2021

1-2 mAP@0.5:0.95 lower on COCO

@glenn-jocher (Member Author)

@cszer oh wow, that's a significant difference. Do you mean that you see a drop of 1-2 mAP on COCO when not using --sync-bn on an 8x A100 YOLOv5m training at --batch 256? That's much larger than I would have expected. Did you train for 300 epochs?

@abhiagwl4262

@glenn-jocher One very strange observation: I am able to run batch size 48 on a single GPU, but I am not able to run batch size 64 even on 2 GPUs. Is there some bug in the multi-GPU implementation?

@glenn-jocher (Member Author)

@abhiagwl4262 if you believe you have a reproducible problem, please raise a new issue using the 🐛 Bug Report template, providing screenshots and a minimum reproducible example to help us better understand and diagnose your problem. Thank you!

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
