When training with multiple GPUs, my validation mAP value is very low. #1436

Closed
buaazsj opened this issue Aug 7, 2020 · 11 comments
Labels: question (further information is requested), Stale

@buaazsj

buaazsj commented Aug 7, 2020

❔Question

When training with multiple GPUs, my validation mAP value is very low.
But when the training is interrupted and then resumed, it becomes normal. Why?

Additional context

@glenn-jocher
Member

@buaazsj there are a few issues open regarding --resume that talk about this, you might want to search the issues a bit.

@akshaychawla

akshaychawla commented Aug 14, 2020

Is there any update on this issue? I've been facing the same problem: low mAP when training in a 2- or 4-GPU setting. Training on 1 GPU works perfectly fine.

Edit: when training in a multi-GPU setting, the training losses (GIoU, cls, and obj) are the same as in the 1-GPU setting. It's only the validation loss and mAP that are reduced.

Edit 2: OK, so this is weird. Evaluation using test.test during multi-GPU training gives low mAP. BUT if you separately evaluate the checkpoint that was saved to disk by running python test.py ..., for the same epoch, you get the correct mAP! By correct, I mean the mAP is similar to training in the 1-GPU setting. I'm testing this on yolov3-tiny.cfg and will report more when training finishes in 1-2 days.
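For reference, a minimal sketch of the two evaluation paths being compared above; the helper names (model_factory, evaluate_fn) are hypothetical placeholders, not the repo's API. Path (a) evaluates the live DDP-wrapped training model, while path (b) rebuilds a plain model from the checkpoint saved for the same epoch, as python test.py --weights <ckpt> does.

import torch

def eval_live_ddp(ddp_model, evaluate_fn):
    # (a) in-training evaluation of the DDP-wrapped model (the path reporting low mAP)
    ddp_model.eval()
    with torch.no_grad():
        return evaluate_fn(ddp_model)

def eval_from_checkpoint(model_factory, ckpt_path, evaluate_fn, device="cuda:0"):
    # (b) standalone evaluation: build a fresh, non-DDP model and load the saved weights
    model = model_factory().to(device)
    ckpt = torch.load(ckpt_path, map_location=device)
    state_dict = ckpt["model"] if isinstance(ckpt, dict) and "model" in ckpt else ckpt
    model.load_state_dict(state_dict)
    model.eval()
    with torch.no_grad():
        return evaluate_fn(model)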

@glenn-jocher
Member

glenn-jocher commented Aug 14, 2020

@akshaychawla hello, thank you for your interest in our work! This issue seems to lack the minimum requirements for a proper response, or is insufficiently detailed for us to help you. Please note that most technical problems are due to:

  • Your changes to the default repository. If your issue is not reproducible in a fresh git clone of this repository we cannot debug it. Before going further, run this code and ensure your issue persists:
sudo rm -rf yolov5  # remove existing
git clone https://github.com/ultralytics/yolov5 && cd yolov5  # clone latest
python detect.py  # verify detection
# CODE TO REPRODUCE YOUR ISSUE HERE
  • Your custom data. If your issue is not reproducible with COCO or COCO128 data we cannot debug it. Visit our Custom Training Tutorial for guidelines on training your custom data. Examine train_batch0.jpg and test_batch0.jpg for a sanity check of your training and testing data.

  • Your environment. If your issue is not reproducible in one of the verified environments below we cannot debug it. If you are running YOLOv5 locally, ensure your environment meets all of the requirements.txt dependencies specified below.

If none of these apply to you, we suggest you close this issue and raise a new one using the Bug Report template, providing screenshots and minimum viable code to reproduce your issue. Thank you!

Requirements

Python 3.8 or later with all requirements.txt dependencies installed, including torch>=1.6. To install run:

$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are passing. These tests evaluate proper operation of basic YOLOv5 functionality, including training (train.py), testing (test.py), inference (detect.py) and export (export.py) on macOS, Windows, and Ubuntu.

@kongbia

kongbia commented Aug 17, 2020

> Is there any update on this issue? I've been facing the same problem: low mAP when training in a 2- or 4-GPU setting. Training on 1 GPU works perfectly fine.
>
> Edit: when training in a multi-GPU setting, the training losses (GIoU, cls, and obj) are the same as in the 1-GPU setting. It's only the validation loss and mAP that are reduced.
>
> Edit 2: OK, so this is weird. Evaluation using test.test during multi-GPU training gives low mAP. BUT if you separately evaluate the checkpoint that was saved to disk by running python test.py ..., for the same epoch, you get the correct mAP! By correct, I mean the mAP is similar to training in the 1-GPU setting. I'm testing this on yolov3-tiny.cfg and will report more when training finishes in 1-2 days.

I'm running into the same problem. Have you solved it?

@glenn-jocher
Member

glenn-jocher commented Aug 17, 2020

Ultralytics has open-sourced YOLOv5 at https://github.com/ultralytics/yolov5, featuring faster, lighter and more accurate object detection. YOLOv5 is recommended for all new projects.



** GPU Speed measures end-to-end time per image averaged over 5000 COCO val2017 images using a V100 GPU with batch size 32, and includes image preprocessing, PyTorch FP16 inference, postprocessing and NMS. EfficientDet data from [google/automl](https://github.com/google/automl) at batch size 8.
  • August 13, 2020: v3.0 release: nn.Hardswish() activations, data autodownload, native AMP.
  • July 23, 2020: v2.0 release: improved model definition, training and mAP.
  • June 22, 2020: PANet updates: new heads, reduced parameters, improved speed and mAP 364fcfd.
  • June 19, 2020: FP16 as new default for smaller checkpoints and faster inference d4c6674.
  • June 9, 2020: CSP updates: improved speed, size, and accuracy (credit to @WongKinYiu for CSP).
  • May 27, 2020: Public release. YOLOv5 models are SOTA among all known YOLO implementations.
  • April 1, 2020: Start development of future compound-scaled YOLOv3/YOLOv4-based PyTorch models.

Pretrained Checkpoints

Model          APval  APtest  AP50  Speed (GPU)  FPS (GPU)  Params  FLOPS
YOLOv5s        37.0   37.0    56.2  2.4 ms       416        7.5M    13.2B
YOLOv5m        44.3   44.3    63.2  3.4 ms       294        21.8M   39.4B
YOLOv5l        47.7   47.7    66.5  4.4 ms       227        47.8M   88.1B
YOLOv5x        49.2   49.2    67.7  6.9 ms       145        89.0M   166.4B
YOLOv5x + TTA  50.8   50.8    68.9  25.5 ms      39         89.0M   354.3B
YOLOv3-SPP     45.6   45.5    65.2  4.5 ms       222        63.0M   118.0B

** APtest denotes COCO test-dev2017 server results, all other AP results in the table denote val2017 accuracy.
** All AP numbers are for single-model single-scale without ensemble or test-time augmentation. Reproduce by python test.py --data coco.yaml --img 640 --conf 0.001
** SpeedGPU measures end-to-end time per image averaged over 5000 COCO val2017 images using a GCP n1-standard-16 instance with one V100 GPU, and includes image preprocessing, PyTorch FP16 image inference at --batch-size 32 --img-size 640, postprocessing and NMS. Average NMS time included in this chart is 1-2ms/img. Reproduce by python test.py --data coco.yaml --img 640 --conf 0.1
** All checkpoints are trained to 300 epochs with default settings and hyperparameters (no autoaugmentation).
** Test Time Augmentation (TTA) runs at 3 image sizes. Reproduce by python test.py --data coco.yaml --img 832 --augment

For more information and to get started with YOLOv5 please visit https://github.com/ultralytics/yolov5. Thank you!

@akshaychawla

akshaychawla commented Aug 18, 2020

Here are the results:

[image: results]

Issue: when training with multiple GPUs using DDP, validation mAP is lower than expected.

Observation: if we test a serialized version of the DDP model (i.e. save it to disk with torch.save and then load it back with torch.load), the validation mAP is fine.

Fix: during training, instead of running test.test on the model currently being trained under DDP, write a temporary checkpoint to disk and have test create a new model, load that checkpoint, and run validation with it.

Why does it work? No idea; dumb luck, I guess. If I had to guess, it looks like when we move model.module.yolo_layers to model.yolo_layers after DDP init and then switch to eval with model.eval(), eval mode may not be applied to the yolo_layers since they're referenced outside model.module. Or it might have something to do with the test dataloader.

The minor changes required for train.py & test.py can be viewed here: akshaychawla@4e5eaab
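A minimal sketch of that workaround, assuming a hypothetical evaluate_checkpoint() helper that does what test.py does when given a weights path (the actual train.py/test.py changes are in the commit linked above):

import os
import tempfile

import torch

def validate_via_checkpoint(ddp_model, evaluate_checkpoint):
    # unwrap DDP and dump the underlying weights to a temporary checkpoint
    raw_model = ddp_model.module if hasattr(ddp_model, "module") else ddp_model
    fd, tmp_path = tempfile.mkstemp(suffix=".pt")
    os.close(fd)
    torch.save({"model": raw_model.state_dict()}, tmp_path)
    try:
        # the test routine builds a brand-new model, loads tmp_path into it,
        # and runs validation on that copy instead of on the live DDP model
        results = evaluate_checkpoint(tmp_path)
    finally:
        os.remove(tmp_path)
    return results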

Another observation: when I increased the batch size to 128 and trained the model, I expected it to perform worse, since a higher batch size means fewer gradient updates and usually lower performance. BUT this repository supports gradient accumulation via loss *= batch_size / 64, which I think was originally designed for cases where batch_size < 64. For batch_size > 64 it has the effect of scaling up the loss, which is equivalent to scaling up the learning rate hyp['lr0'] to accommodate larger batch sizes. More importantly, it sped up training (a sketch of this logic follows the table below).

GPUs  Time to 300 epochs (hrs)
1     42.843
2     36.740
4     26.455
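A simplified sketch of the accumulation and scaling logic described above; the loop structure and names here are illustrative assumptions, and only the loss *= batch_size / 64 scaling mirrors the line discussed in the comment.

def train_one_epoch(model, dataloader, optimizer, compute_loss, batch_size, nominal_bs=64):
    # gradient accumulation: step the optimizer once every `accumulate` mini-batches
    accumulate = max(round(nominal_bs / batch_size), 1)
    optimizer.zero_grad()
    for i, (imgs, targets) in enumerate(dataloader):
        loss = compute_loss(model(imgs), targets)
        # bs < 64: loss is scaled down and gradients accumulate over ~64 images per step
        # bs > 64: accumulate == 1 and the loss is scaled up, acting like a larger learning rate
        loss *= batch_size / nominal_bs
        loss.backward()
        if (i + 1) % accumulate == 0:
            optimizer.step()
            optimizer.zero_grad()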

Edit: logs, results and checkpoints are available at https://drive.google.com/drive/folders/11CN50wq-e0COr9RnWPtBHgj8z-AfFBq6?usp=sharing

@glenn-jocher
Member

glenn-jocher commented Aug 18, 2020

@akshaychawla thanks so much for the detailed analysis! It looks like you've put in a lot of work and arrived at a very useful insight.

I would highly recommend you try YOLOv5; it has a multitude of feature additions, improvements and bug fixes above and beyond this YOLOv3 repo. DDP is functioning correctly there, and we use it to train the largest official YOLOv5x model with no problems.
https://github.com/ultralytics/yolov5

@akshaychawla

@glenn-jocher Thank YOU for building, open-sourcing and then maintaining this and the yolov5 repository!

This is a very well-written piece of software and I have learnt so much from it. I would love to transition to YOLOv5, but reviewer 3 will ask me to compare against the "existing state of the art" before a borderline reject, so my hands are tied to v3 for now 🤷‍♂️

@goldwater668

goldwater668 commented Aug 28, 2020

@glenn-jocher
I trained with yolov3-spp-1cls.cfg in the yolov3 repo and with yolov5x in the yolov5 repo, and got the results below; yolov3 is better than yolov5. Why?
yolov3:
[image]

yolov5:
[image]

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@glenn-jocher
Member

@goldwater668 It's great to hear about the comparison! YOLOv3 in the YOLOv5 repository features an updated architecture that may outperform YOLOv5 in certain scenarios. YOLOv5, however, offers a significant number of improvements and optimizations over YOLOv3. I would recommend reviewing the recent updates and optimizations in YOLOv5 to ensure an apples-to-apples comparison. Thank you!
