Unexpected behavior in DDP mode with dataloader workers #5628
@wico-silva --workers is applied per RANK when you use DDP. This allows you to scale training without having to manually adjust --workers, e.g.:

python -m torch.distributed.run --nproc_per_node 2 train.py --batch-size 16
python -m torch.distributed.run --nproc_per_node 4 train.py --batch-size 32
python -m torch.distributed.run --nproc_per_node 8 train.py --batch-size 64
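To make the arithmetic concrete, here is a minimal sketch (not YOLOv5's actual code) of what each spawned process ends up with under this convention, assuming WORLD_SIZE is read from the environment variable that torch.distributed.run sets:

```python
import os

# torch.distributed.run sets WORLD_SIZE for every process it spawns;
# default to 1 so the sketch also runs outside DDP.
world_size = int(os.environ.get("WORLD_SIZE", 1))

total_batch_size = 16  # --batch-size: interpreted as the GLOBAL batch
workers = 8            # --workers: interpreted PER RANK

per_rank_batch = total_batch_size // world_size  # divided across ranks
per_rank_workers = workers                       # NOT divided: each rank gets all 8

print(f"this rank: batch={per_rank_batch}, dataloader workers={per_rank_workers}")
```

With --nproc_per_node 2, each process reports batch=8 and workers=8, i.e. 16 dataloader workers in total.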
But then why not apply that same logic to the batch size? My main point is that workers and batch size should follow the same convention: either both per rank or both for the world. Currently, workers is per rank and batch size is for the world. At the very least, this difference in logic should be documented clearly in the command-line arguments.
@wico-silva thanks for the feedback! Have you used other tools that follow one convention or the other for these two settings?
@wico-silva we can definitely put some more checks in place to prevent excess workers and to improve the console output for clarity, but I doubt it's a good idea to modify the default behavior with so many users already relying on the existing YOLOv5 DDP conventions.
What I see the most is definitely workers and batch size per rank, because that's just what naturally happens without any extra code. It's also what makes the most sense given that each DDP process runs independently. On the other hand, if we look at the PyTorch ImageNet example, we see that they take workers and batch size for the world and then divide both per rank here: https://github.com/pytorch/examples/blob/master/imagenet/main.py#L145

I guess what's important is being consistent. But I also understand that this would be an annoying change for many, so a compromise would be to document this behavior in the command-line help.
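For reference, the world-to-rank division in the linked ImageNet example can be expressed roughly like this (a paraphrase, not the verbatim source; split_per_rank is a hypothetical helper name):

```python
def split_per_rank(batch_size: int, workers: int, ngpus_per_node: int):
    """Turn world-level --batch-size/--workers into per-rank values,
    in the spirit of pytorch/examples imagenet/main.py."""
    per_rank_batch = batch_size // ngpus_per_node
    # ceil-divide so the ranks together still provide ~`workers` workers
    per_rank_workers = (workers + ngpus_per_node - 1) // ngpus_per_node
    return per_rank_batch, per_rank_workers


print(split_per_rank(batch_size=16, workers=8, ngpus_per_node=2))  # (8, 4)
```

The ceil-division for workers keeps the total worker count close to the requested value even when it doesn't divide evenly.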
@wico-silva got it! I've opened PR #5631, which ticks a few of these boxes. It retains the current behavior, but it updates the command-line help and, most importantly, caps the vCPU usage smartly by world size, so the danger of a DDP user accidentally spawning too many workers is removed. This is probably the best compromise.
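A per-rank cap along these lines can be sketched as follows (an illustration of the idea, not the exact PR diff; num_workers is a hypothetical helper):

```python
import os

# WORLD_SIZE is set by torch.distributed.run; 1 for single-GPU/CPU runs
WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))


def num_workers(batch_size: int, requested: int) -> int:
    # Per-rank worker cap: never more than the CPUs this rank can claim,
    # the per-rank batch size, or what the user asked for.
    return min(os.cpu_count() // WORLD_SIZE, batch_size if batch_size > 1 else 0, requested)


print(num_workers(batch_size=8, requested=8))
```

Dividing os.cpu_count() by WORLD_SIZE ensures the ranks together never request more dataloader workers than there are CPUs.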
Thanks for the quick replies and for addressing it so fast. I'll close the issue.
@wico-silva PR #5631 is merged. Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐
Search before asking
YOLOv5 Component
Multi-GPU
Bug
In distributed multi-GPU training we specify the total batch size on the command line, which then gets divided per process here: https://github.com/ultralytics/yolov5/blob/master/train.py#L222
I would expect the same to be done for the number of workers but that's not the case.
So if I run:
python -m torch.distributed.run --nproc_per_node 2 train.py --batch-size 16 --workers 8
I expect two processes, each with batch size 8 and 4 dataloader workers.
Instead, I get two processes, each with batch size 8 and 8 dataloader workers, which is dangerous because that may be more workers than my CPU can handle, and training becomes extremely slow.
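For context, the division referenced above boils down to something like the following sketch (simplified from the linked train.py logic; the environment-variable names are the ones torch.distributed.run sets, everything else is illustrative):

```python
import os

RANK = int(os.environ.get("RANK", -1))             # set by torch.distributed.run
WORLD_SIZE = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes

batch_size = 16  # --batch-size (global)
workers = 8      # --workers

if RANK != -1:  # DDP mode
    batch_size //= WORLD_SIZE  # the global batch is split across ranks...
# ...but `workers` is left untouched, so each of the WORLD_SIZE
# processes starts `workers` dataloader workers of its own.
```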
Environment
Minimal Reproducible Example
No response
Additional
No response
Are you willing to submit a PR?