
Unexpected behavior in DDP mode with dataloader workers #5628

Closed
wico-silva opened this issue Nov 12, 2021 · 9 comments · Fixed by #5631
Labels
bug Something isn't working

Comments

@wico-silva

Search before asking

  • I have searched the YOLOv5 issues and found no similar bug report.

YOLOv5 Component

Multi-GPU

Bug

In distributed multi-GPU training we specify the total batch size on the command line, which is then divided per process here: https://github.com/ultralytics/yolov5/blob/master/train.py#L222

I would expect the same to be done for the number of workers, but that's not the case.

So if I run:
python -m torch.distributed.run --nproc_per_node 2 train.py --batch-size 16 --workers 8

I expect two processes, each with batch size 8 and 4 dataloader workers.
Instead, I get two processes, each with batch size 8 and 8 dataloader workers, which is dangerous because that may be more workers than my CPU can handle, and training gets extremely slow.
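
To make the expected split concrete, here is a rough sketch using the values from the command above (illustrative only, not actual train.py code):

# Hypothetical sketch of the split I would expect (illustrative values).
WORLD_SIZE = 2          # --nproc_per_node 2
total_batch_size = 16   # --batch-size 16
total_workers = 8       # --workers 8

batch_size_per_rank = total_batch_size // WORLD_SIZE  # 8: this split does happen
workers_per_rank = total_workers // WORLD_SIZE        # 4: this split does NOT happen

# Actual current behavior: each of the 2 processes starts all 8 workers,
# i.e. 16 dataloader workers in total.
print(batch_size_per_rank, workers_per_rank)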

Environment

  • YOLO: master
  • Ubuntu 18.04
  • Python 3.6.9

Minimal Reproducible Example

No response

Additional

No response

Are you willing to submit a PR?

  • Yes I'd like to help by submitting a PR!
wico-silva added the bug label on Nov 12, 2021
@github-actions
Contributor

github-actions bot commented Nov 12, 2021

👋 Hello @wico-silva, thank you for your interest in YOLOv5 🚀! Please visit our ⭐️ Tutorials to get started, where you can find quickstart guides for simple tasks like Custom Data Training all the way to advanced concepts like Hyperparameter Evolution.

If this is a 🐛 Bug Report, please provide screenshots and minimum viable code to reproduce your issue, otherwise we cannot help you.

If this is a custom training ❓ Question, please provide as much information as possible, including dataset images, training logs, screenshots, and a public link to online W&B logging if available.

For business inquiries or professional support requests please visit https://ultralytics.com or email Glenn Jocher at glenn.jocher@ultralytics.com.

Requirements

Python>=3.6.0 with all requirements.txt installed including PyTorch>=1.7. To get started:

$ git clone https://github.com/ultralytics/yolov5
$ cd yolov5
$ pip install -r requirements.txt

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Status

CI CPU testing

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training (train.py), validation (val.py), inference (detect.py) and export (export.py) on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

@glenn-jocher
Member

glenn-jocher commented Nov 12, 2021

@wico-silva --workers are per RANK if you use DDP. This allows you to scale training without having to worry about manually specifying --workers, i.e.

python -m torch.distributed.run --nproc_per_node 2 train.py --batch-size 16
python -m torch.distributed.run --nproc_per_node 4 train.py --batch-size 32
python -m torch.distributed.run --nproc_per_node 8 train.py --batch-size 64
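
In other words, the per-GPU batch size is split from the world total while the per-GPU worker count stays fixed. A rough sketch of the arithmetic, assuming the default of 8 workers per process (illustrative only):

# Per-rank convention described above, assuming 8 workers per process.
for nproc, batch in [(2, 16), (4, 32), (8, 64)]:
    batch_per_gpu = batch // nproc   # --batch-size is the world total, split across ranks
    workers_per_gpu = 8              # --workers applies to each rank as-is
    total_workers = workers_per_gpu * nproc
    print(f"{nproc} GPUs: batch/GPU={batch_per_gpu}, total workers={total_workers}")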

@wico-silva
Author

But then why not apply that same logic to batch size? My main point is that workers and batch size should follow the same convention, either per rank or per world. Currently, workers is per rank and batch size is per world.

At the very least, this difference should be documented clearly in the command-line argument help.

@glenn-jocher
Member

@wico-silva thanks for the feedback! Have you used other tools that follow one convention or the other for these two settings?

@glenn-jocher
Member

@wico-silva we can definitely put some more checks in place to prevent excess workers and to improve the console output for clarity, but I doubt it's a good idea to modify default behavior with so many users already using existing YOLOv5 DDP conventions.

@wico-silva
Author

What I see the most is definitely workers and batch size per rank, because that's just what naturally happens without any extra code.

It's also the most natural convention, given that torch.distributed.run just takes the script train.py and spawns multiple processes. So it makes sense that whatever arguments come after
python -m torch.distributed.run --nproc_per_node X train.py
behave the same as if we were training on a single GPU.

On the other hand, if we look at the PyTorch ImageNet example, we see that they specify workers and batch size per world. They then divide both per rank here: https://github.com/pytorch/examples/blob/master/imagenet/main.py#L145
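
Roughly, the division in that example looks like this (paraphrased, not a verbatim copy of the linked code):

# Paraphrase of the per-rank division in the ImageNet example (imagenet/main.py).
ngpus_per_node = 2
batch_size = 16   # specified for the whole world
workers = 8       # specified for the whole world

# Both values are divided per rank before building the dataloaders:
batch_size_per_rank = int(batch_size / ngpus_per_node)
workers_per_rank = int((workers + ngpus_per_node - 1) / ngpus_per_node)
print(batch_size_per_rank, workers_per_rank)  # 8, 4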

I guess what's important is being consistent. But I also understand that this would be an annoying change for many, so a compromise would be to document this behavior in the command-line help.

@glenn-jocher
Member

@wico-silva got it! I've opened PR #5631 that ticks a few of these boxes. It retains the current behavior, but it updates the command-line help and, most importantly, smartly caps vCPU usage by world size, so the danger of a DDP user accidentally using too many workers is removed. This is probably the best compromise.
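
The cap works along these lines: the per-rank worker count is bounded by the CPUs available per rank. A minimal sketch of that idea (illustrative, not necessarily the exact code in the PR):

import os

def capped_workers(requested: int, batch_size: int, world_size: int) -> int:
    """Illustrative cap: use no more workers per rank than the CPUs available
    to that rank, the per-rank batch size, or the requested --workers value."""
    cpus_per_rank = os.cpu_count() // world_size
    return min(cpus_per_rank, batch_size, requested)

# Example: on a 12-CPU machine with 2 DDP processes, batch size 8 per rank and
# --workers 8, this returns min(6, 8, 8) == 6 workers per rank instead of 8.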

@wico-silva
Author

Thanks for the quick replies and for addressing it so fast. I'll close the issue.

@glenn-jocher
Member

@wico-silva PR #5631 is merged. Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐
