Multi-GPU Training 🌟 #475

NanoCode012 · 2020-07-22T11:36:27Z

📚 This guide explains how to properly use multiple GPUs to train a dataset with YOLOv5 🚀 on single or multiple machine(s). UPDATED 25 December 2022.

Before You Start

Clone repo and install requirements.txt in a Python>=3.7.0 environment, including PyTorch>=1.7. Models and datasets download automatically from the latest YOLOv5 release.

git clone https://github.com/ultralytics/yolov5  # clone
cd yolov5
pip install -r requirements.txt  # install

💡 ProTip! Docker Image is recommended for all Multi-GPU trainings. See Docker Quickstart Guide
💡 ProTip! torch.distributed.run replaces torch.distributed.launch in PyTorch>=1.9. See docs for details.

Training

Select a pretrained model to start training from. Here we select YOLOv5s, the smallest and fastest model available. See our README table for a full comparison of all models. We will train this model with Multi-GPU on the COCO dataset.

Single GPU

$ python train.py  --batch 64 --data coco.yaml --weights yolov5s.pt --device 0

Multi-GPU DataParallel Mode (⚠️ not recommended)

You can pass more device IDs to --device to use multiple GPUs in DataParallel mode.

$ python train.py  --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1

This method is slow and barely speeds up training compared to using just 1 GPU.

Multi-GPU DistributedDataParallel Mode (✅ recommended)

Prefix the command with python -m torch.distributed.run --nproc_per_node, followed by the usual arguments.

$ python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --weights yolov5s.pt --device 0,1

--nproc_per_node specifies how many GPUs you would like to use. In the example above, it is 2.
--batch is the total batch-size. It will be divided evenly across the GPUs. In the example above, it is 64/2=32 per GPU.

The code above will use GPUs 0... (N-1).
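The batch split described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not code from train.py; the helper name per_gpu_batch is hypothetical.

```python
def per_gpu_batch(total_batch: int, num_gpus: int) -> int:
    """Return the batch size each GPU receives when --batch is split evenly.

    Hypothetical helper for illustration; it mirrors the rule in this guide
    that --batch is the total batch size and must divide evenly by the GPUs.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if total_batch % num_gpus != 0:
        raise ValueError(
            f"--batch {total_batch} must be a multiple of the number of GPUs ({num_gpus})"
        )
    return total_batch // num_gpus


if __name__ == "__main__":
    # The example above: --batch 64 with --nproc_per_node 2
    print(per_gpu_batch(64, 2))
```

Checking this before launching catches a --batch value that cannot be divided evenly, such as 64 on 3 GPUs.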

Use specific GPUs

You can do so by simply passing --device followed by your specific GPUs. For example, in the code below, we will use GPUs 2,3.

$ python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights '' --device 2,3

Use SyncBatchNorm

SyncBatchNorm could increase accuracy for multi-GPU training; however, it will slow down training by a significant factor. It is only available for Multiple GPU DistributedDataParallel training.

It is best used when the batch-size on each GPU is small (<= 8).

To use SyncBatchNorm, simply pass --sync-bn to the command as below:

$ python -m torch.distributed.run --nproc_per_node 2 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights '' --sync-bn
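The rule of thumb above (SyncBatchNorm helps mostly when each GPU sees 8 images or fewer) can be written as a small check. This is a sketch of the guideline only; the function name and the default threshold argument are assumptions, and nothing like it is claimed to exist in the repository.

```python
def sync_bn_worthwhile(total_batch: int, num_gpus: int, threshold: int = 8) -> bool:
    """Apply the guide's rule of thumb for when --sync-bn is likely to help.

    Hypothetical helper: returns True only for multi-GPU runs whose per-GPU
    batch size is at or below the threshold (8 in the guide).
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    per_gpu = total_batch // num_gpus
    return num_gpus > 1 and per_gpu <= threshold


if __name__ == "__main__":
    print(sync_bn_worthwhile(64, 2))  # 32 images per GPU
    print(sync_bn_worthwhile(16, 4))  # 4 images per GPU
```

With --batch 64 on 2 GPUs each GPU gets 32 images, so by this rule the slowdown from --sync-bn is probably not worth it; with --batch 16 on 4 GPUs it may be.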

Use multiple machines

This is only available for Multiple GPU DistributedDataParallel training.

Before we continue, make sure the files on all machines are the same: dataset, codebase, etc. Afterwards, make sure the machines can communicate with each other.

You will have to choose a master machine (the machine that the others will talk to). Note down its address (master_addr) and choose a port (master_port). I will use master_addr = 192.168.1.1 and master_port = 1234 for the example below.

To use it, run the following:

# On master machine 0
$ python -m torch.distributed.run --nproc_per_node G --nnodes N --node_rank 0 --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''

# On machine R
$ python -m torch.distributed.run --nproc_per_node G --nnodes N --node_rank R --master_addr "192.168.1.1" --master_port 1234 train.py --batch 64 --data coco.yaml --cfg yolov5s.yaml --weights ''

where G is the number of GPUs per machine, N is the number of machines, and R is the machine number from 0 to N-1.
For example, with two machines that have two GPUs each, G = 2 and N = 2, and the second machine uses R = 1 in the command above.

Training will not start until all N machines are connected. Output will only be shown on master machine!
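To make the roles of G, N and R concrete, here is a minimal sketch that prints the command for every machine. It only builds strings from the flags shown above; the function name build_node_command is hypothetical, and the address and port are the example values from this guide.

```python
import shlex
from typing import List


def build_node_command(node_rank: int, nnodes: int, gpus_per_node: int,
                       master_addr: str, master_port: int,
                       train_args: List[str]) -> List[str]:
    """Build the torch.distributed.run command for one machine (hypothetical helper)."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in 0..nnodes-1")
    return [
        "python", "-m", "torch.distributed.run",
        "--nproc_per_node", str(gpus_per_node),  # G
        "--nnodes", str(nnodes),                 # N
        "--node_rank", str(node_rank),           # R
        "--master_addr", master_addr,
        "--master_port", str(master_port),
        "train.py", *train_args,
    ]


if __name__ == "__main__":
    args = ["--batch", "64", "--data", "coco.yaml", "--cfg", "yolov5s.yaml", "--weights", ""]
    # Two machines with two GPUs each: G = 2, N = 2, R = 0 and 1
    for rank in range(2):
        print(f"# On machine {rank}")
        print(shlex.join(build_node_command(rank, 2, 2, "192.168.1.1", 1234, args)))
```

Only --node_rank differs between machines; every other argument must be identical on all of them.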

Notes

Windows support is untested, Linux is recommended.
--batch must be a multiple of the number of GPUs.
GPU 0 will take slightly more memory than the other GPUs as it maintains EMA and is responsible for checkpointing etc.
If you get RuntimeError: Address already in use, it could be because you are running multiple trainings at a time. To fix this, simply use a different port number by adding --master_port like below,

$ python -m torch.distributed.run --master_port 1234 --nproc_per_node 2 ...
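If you would rather not guess a port number, one common approach is to ask the operating system for an unused one and pass it as --master_port. The sketch below uses only the Python standard library; find_free_port is a hypothetical helper name, and there is a small race window because another process could claim the port between this check and the launch.

```python
import socket


def find_free_port() -> int:
    """Ask the OS for a currently unused TCP port (hypothetical helper).

    Binding to port 0 lets the OS pick a free port; the socket is closed
    again on exit from the with-block, so the port is only *likely* to
    still be free when training starts.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


if __name__ == "__main__":
    port = find_free_port()
    print(f"python -m torch.distributed.run --master_port {port} --nproc_per_node 2 ...")
```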

Results

DDP profiling results on an AWS EC2 P4d instance with 8x A100 SXM4-40GB for YOLOv5l for 1 COCO epoch.

Profiling code

# prepare
t=ultralytics/yolov5:latest && sudo docker pull $t && sudo docker run -it --ipc=host --gpus all -v "$(pwd)"/coco:/usr/src/coco $t
pip3 install torch==1.9.0+cu111 torchvision==0.10.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
cd .. && rm -rf app && git clone https://github.com/ultralytics/yolov5 -b master app && cd app
cp data/coco.yaml data/coco_profile.yaml

# profile
python train.py --batch-size 16 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0 
python -m torch.distributed.run --nproc_per_node 2 train.py --batch-size 32 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0,1   
python -m torch.distributed.run --nproc_per_node 4 train.py --batch-size 64 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0,1,2,3  
python -m torch.distributed.run --nproc_per_node 8 train.py --batch-size 128 --data coco_profile.yaml --weights yolov5l.pt --epochs 1 --device 0,1,2,3,4,5,6,7

GPUs (A100)	batch-size	CUDA_mem device0 (G)	COCO train	COCO val
1x	16	26GB	20:39	0:55
2x	32	26GB	11:43	0:57
4x	64	26GB	5:57	0:55
8x	128	26GB	3:09	0:57
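The speedup implied by the table can be worked out directly. The sketch below assumes the COCO train column is in minutes:seconds (the table does not state the unit) and copies the four training times from the rows above; the function names are hypothetical.

```python
from typing import Dict, Tuple

# Training times copied from the table above, assumed to be minutes:seconds
TRAIN_TIMES: Dict[int, str] = {1: "20:39", 2: "11:43", 4: "5:57", 8: "3:09"}


def to_seconds(mm_ss: str) -> int:
    """Convert a 'minutes:seconds' string to seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


def scaling(times: Dict[int, str]) -> Dict[int, Tuple[float, float]]:
    """Return {gpus: (speedup vs 1 GPU, efficiency = speedup / gpus)}."""
    base = to_seconds(times[1])
    result = {}
    for gpus, t in times.items():
        speedup = base / to_seconds(t)
        result[gpus] = (speedup, speedup / gpus)
    return result


if __name__ == "__main__":
    for gpus, (speedup, efficiency) in scaling(TRAIN_TIMES).items():
        print(f"{gpus}x GPUs: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

Under that assumption, 8 GPUs finish the epoch roughly 6.6 times faster than 1 GPU, which is about 82% of ideal linear scaling.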

FAQ

If an error occurs, please read the checklist below first! (It could save you time.)

Checklist

Have you properly read this post?
Have you tried to reclone the codebase? The code changes daily.
Have you tried to search for your error? Someone may have already encountered it in this repo or in another and have the solution.
Have you installed all the requirements listed on top (including the correct Python and PyTorch versions)?
Have you tried in other environments listed in the "Environments" section below?
Have you tried with another dataset like coco128 or coco2017? It will make it easier to find the root cause.

If you went through all the above, feel free to raise an Issue by giving as much detail as possible following the template.

Environments

YOLOv5 may be run in any of the following up-to-date verified environments (with all dependencies including CUDA/CUDNN, Python and PyTorch preinstalled):

Notebooks with free GPU:
Google Cloud Deep Learning VM. See GCP Quickstart Guide
Amazon Deep Learning AMI. See AWS Quickstart Guide
Docker Image. See Docker Quickstart Guide

Status

If this badge is green, all YOLOv5 GitHub Actions Continuous Integration (CI) tests are currently passing. CI tests verify correct operation of YOLOv5 training, validation, inference, export and benchmarks on MacOS, Windows, and Ubuntu every 24 hours and on every commit.

Credits

I would like to thank @MagicFrogSJTU, who did all the heavy lifting, and @glenn-jocher for guiding us along the way.


MagicFrogSJTU · 2020-07-22T14:35:44Z

There will be multiple/redundant outputs. It does not affect training. This is a WIP.

I suggest we use "will be fixed in the future" instead of "WIP". Many users probably don't know what WIP means.
By the way, explain all the abbreviations. We must assume users know nothing!

Multiple GPUs DistributedDataParallel Mode (Recommended!!)

I suggest we explicitly make it clear that DDP is faster than DP. Use this title:

Multiple GPUs DistributedDataParallel Mode (Faster than DP, Recommended!!)

The tutorial is excellent! Good job!

NanoCode012 · 2020-07-24T07:13:18Z

store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use

Hello @feizhouxiaozhu , I think this may be because you are running multiple trainings at a time, and they are communicating to the same port. To fix this, you can run in a different port.
Using the example from above, add --master_port ####, where #### is a random port number.

$ python -m torch.distributed.launch --master_port 42342 --nproc_per_node 2 ...

Please tell me if this fixed the problem. If it doesn't, can you tell us how to replicate this problem?

NanoCode012 · 2020-07-24T08:03:19Z

Hmm, I'm not sure why that is. @feizhouxiaozhu , could you try to re-clone the repo then try again?

If error still occurs, could you try to run on coco128? Run the code below in terminal.

cd yolov5
python3 -c "from utils.google_utils import *; gdrive_download('1n_oKgR81BJtqk75b00eAjdv03qVCQn2f', 'coco128.zip')" && mv -n ./coco128 ../
export PYTHONPATH="$PWD"
python -m torch.distributed.launch --master_port 9990 --nproc_per_node 2 train.py --weights yolov5s.pt --cfg yolov5s.yaml --epochs 1 --img 320

I'm currently running 8 GPU DDP custom data training, and there is no issue.

Edit: Reply was removed. @feizhouxiaozhu , is the problem solved?

cesarandreslopez · 2020-07-24T10:56:03Z

Excellent guide guys, thank you so much! I was training on a DGX1 and was wondering why there wasn't much of a speed difference.

glenn-jocher · 2020-07-24T17:57:46Z

@cesarandreslopez oh wow, lucky you. Are you seeing faster speeds now with the updated multi gpu training?

cesarandreslopez · 2020-07-25T03:02:34Z

@glenn-jocher in DataParallel mode, each epoch (about 51000 images, yolov5l.yaml) was taking about 6 and a half minutes on the DGX1.

on DistributedDataParallel Mode with SyncBatchNorm I am seeing about 3 minutes and 10 seconds, so quite an improvement.

I've seen no improvement in Testing speed.

On @NanoCode012's guide there is this note:

--batch-size is now the Total batch-size. It will be divided evenly to each GPU. In the example above, it is 64/2=32 per GPU.

Based on that I assumed that batch size could be something like --batch 1024, (128 per GPU), but I kept getting Cuda out of memory after an epoch was completed and it started to test, so I eventually just went with --batch 128.

Apparent GPU use during training and testing.

During training, GPU 0 seems to have considerably higher memory use than the other GPUs (which limits the batch size to about what a single GPU could handle). The processing itself seems to be distributed across all GPUs.

GPU consumption during testing looks like this, where GPU 0 has very high memory use but doesn't seem to be processing, while the other 7 GPUs seem busy with the amount of memory expected for a batch of that size:

Our training size for this example is about 51000 images and our testing sample is about 5100. Testing takes about 4 minutes and a half, an epoch on training takes about 3 minutes and 10 seconds

Given the amount of time this spends on testing I am wondering if it is possible or even useful to set testing every n epochs. We are currently studying up on this repository and will understand it enough soon to be able to offer PRs.

@glenn-jocher Happy to provide you remote access to the machine for your tests and so on. It's the least we can do! Just PM me.

NanoCode012 · 2020-07-25T03:49:00Z

Hi @cesarandreslopez , nice numbers!

The reason GPU 0 has higher memory is because it has to communicate with other GPUs to coordinate. In my test, however, I don’t see that vast of a difference in GPU memory like you do. The latest one is 31 GB (GPU 0) and 20 GB (others). Maybe SyncBN is increasing GPU load, or the dataloaders for testing(?).

Batch size is indeed divided evenly. Is it possible to run batch size 128 on your single GPU? That is quite large for yolov5l.

Testing is done on only 1 GPU (GPU 0 tests while the other GPUs continue training), so that may be why you experience slow testing times. Work is currently underway to use multiple GPUs there.

It is an interesting concept to test every n epochs and can certainly be done. However, maybe randomness will cause you to miss the “best” epoch, so I’m not sure if it’s good.

Edit: If you would like to do so, it’s on line 339 in train.py; add an (epoch % interval == 0) condition.

Edit2: How is speed without SyncBN? Since the individual batch size is around 128/8 > 8, I’m not sure if accuracy would be affected.

Edit3: If you have multiple machines you want to run this training on, there is an experimental PR you could try.
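The "test every n epochs" idea from the first edit above could look roughly like the sketch below. It is not the code in train.py; should_test and its interval parameter are hypothetical, and the final epoch is always tested so that a last mAP is reported.

```python
def should_test(epoch: int, epochs: int, interval: int = 1, notest: bool = False) -> bool:
    """Decide whether to run testing after this epoch (hypothetical helper).

    epoch is zero-based. The final epoch is always tested; otherwise testing
    runs only on epochs where epoch % interval == 0, unless notest is set.
    """
    final_epoch = epoch + 1 == epochs
    if final_epoch:
        return True
    if notest:
        return False
    return epoch % interval == 0


if __name__ == "__main__":
    # With 10 epochs and interval 5, print the epochs that would be tested
    print([e for e in range(10) if should_test(e, 10, interval=5)])
```

Any code that runs after testing (checkpointing, best-fitness tracking) has to cope with epochs where no new test results exist.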

glenn-jocher · 2020-07-25T03:56:45Z

@cesarandreslopez ok got it, thanks for the feedback. I think I know why your testing is CUDA OOM. Before the DDP updates train and test.py shared the same batch-size (default 32), it seems likely this is still the case, except that test.py is inheriting global batch size instead of local batch size. So I suspect you should be able to train with much larger batch sizes once this bug is fixed. @NanoCode012 does that make sense about the global vs local batch sizes being passed to test.py?

Testing every n epochs is a good idea. You can currently use python train.py --notest to train without testing until the very final epoch, but we don't have a middle ground. Testing may not benefit as much from multi-gpu compared to training, because NMS ops run sequentially rather than in parallel, and tend to dominate testing time. An alternative to testing every n epochs is simply to supply a higher --conf-thres to test at. Default is 0.001, perhaps setting to 0.01 will halve your testing time.

That's a very generous offer! I'm pretty busy these days so I can't take you up on it immediately, but I'll keep that in mind in the future, thank you! It would definitely be nice to have access to something like that.

NanoCode012 · 2020-07-25T04:03:52Z

@glenn-jocher , I just noticed that! That may be why the memory is so different. But now it’s up to optimizations. For small “total batch size”, it makes sense to pass in the entire thing. For large “total”, it doesn’t make sense.

I think one easy solution is to let the user pass in one argument, “--test-total”, to test with their total batch size vs their divided batch size. But it can get confusing for newcomers.

Edit: What do you think?

glenn-jocher · 2020-07-25T04:04:16Z

@NanoCode012 if we replace total_batch_size with batch_size on L194:

yolov5/train.py

Lines 191 to 196 in fd532d9

    
    # Testloader
    if rank in [-1, 0]:
        # local_rank is set to -1. Because only the first process is expected to do evaluation.
        testloader = create_dataloader(test_path, imgsz_test, total_batch_size, gs, opt, hyp=hyp, augment=False,
                                       cache=opt.cache_images, rect=True, local_rank=-1, world_size=opt.world_size)[0]

And L341 would that solve @cesarandreslopez issue about testing OOM?

yolov5/train.py

Lines 339 to 348 in fd532d9

    
    if not opt.notest or final_epoch:  # Calculate mAP
        results, maps, times = test.test(opt.data,
                                         batch_size=total_batch_size,
                                         imgsz=imgsz_test,
                                         save_json=final_epoch and opt.data.endswith(os.sep + 'coco.yaml'),
                                         model=ema.ema.module if hasattr(ema.ema, 'module') else ema.ema,
                                         single_cls=opt.single_cls,
                                         dataloader=testloader,
                                         save_dir=log_dir)

NanoCode012 · 2020-07-25T04:10:58Z

If we do so, datasets for testing could take num_gpu times longer. (I remember training/testing with total batchsize 16 for coco taking 1h) .

I think giving the user an option is good, but testing with the total batch size should be on by default. Only when the user hits OOM should they need to configure it. Does “--notest-total” sound good?

glenn-jocher · 2020-07-25T04:16:27Z

@NanoCode012 ok got it. I think the most common use case is for users to maximize training cuda mem, so since test.py is currently restricted to single-gpu it would make sense to default it to batch_size rather than total_batch_size. But I suppose we should wait for @MagicFrogSJTU work on test.py before really modifying, since it will get a makeover shortly here. I think it's best to try and simplify the options when possible so it 'just works' as steve jobs would say, so let's avoid adding extra arguments if possible.

@cesarandreslopez I think for the time being you could apply the L194 and L341 fix described above, we have a few more significant PRs due in the coming week, so a more permanent fix for this should be included in those.

MagicFrogSJTU · 2020-07-25T04:33:16Z

@NanoCode012 does that make sense about the global vs local batch sizes being passed to test.py?

@glenn-jocher
After my fix, train.py will run testing in parallel, and the global batch size will be split into smaller local batch sizes at test time, just as in training. Problem solved.

cesarandreslopez · 2020-07-25T13:33:31Z

@glenn-jocher please note that when --notest is used on the current master branch it will crash after completing the first epoch.

Traceback (most recent call last):
  File "train.py", line 469, in <module>
    train(hyp, tb_writer, opt, device)
  File "train.py", line 371, in train
    with open(results_file, 'r') as f:  # create checkpoint
FileNotFoundError: [Errno 2] No such file or directory: 'runs/exp0/results.txt'

I tried doing a touch results.txt under the runs/expN/ folder, which avoids the error above, but then a new one appears:

Traceback (most recent call last):
  File "train.py", line 469, in <module>
    train(hyp, tb_writer, opt, device)
  File "train.py", line 380, in train
    if (best_fitness == fi) and not final_epoch:
UnboundLocalError: local variable 'fi' referenced before assignment

so adding --notest to the command above, in yolov5 will not work right now. (this does work on yolov3 on previous tests).

Edit 1: @NanoCode012 if I follow your suggestion:

Edit: If you would like to do so, it’s on line 339 in Train.py, add a (epoch%interval==0) condition

The same error described here for --notest will appear.

glenn-jocher · 2020-07-25T17:28:12Z

@cesarandreslopez should be fixed following PR #518. Tested on single-GPU and CPU.

twangnh · 2020-07-30T03:17:01Z

hi! @glenn-jocher for multi-gpu training, if using smaller batch size than 64, could you suggest the hyperparameter to adjust like the learning rate?

MagicFrogSJTU · 2020-07-30T03:25:57Z

hi! @glenn-jocher for multi-gpu training, if using smaller batch size than 64, could you suggest the hyperparameter to adjust like the learning rate?

Internally, batch size is kept at least 64. Gradient accumulation will be used if a batch size smaller than 64 is given. Therefore, no adjustment is needed if you use a smaller batch size.
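A minimal sketch of that gradient-accumulation rule, assuming the number of accumulation steps is the nominal batch size of 64 divided by the given batch size, rounded and never below 1. The exact expression in train.py may differ between versions, and the names below are hypothetical.

```python
NOMINAL_BATCH_SIZE = 64  # the minimum effective batch size mentioned in the reply above


def accumulation_steps(total_batch_size: int, nominal: int = NOMINAL_BATCH_SIZE) -> int:
    """Number of batches whose gradients are accumulated before one optimizer step.

    Assumed rule: round(nominal / batch), but at least 1.
    """
    return max(round(nominal / total_batch_size), 1)


def effective_batch_size(total_batch_size: int, nominal: int = NOMINAL_BATCH_SIZE) -> int:
    """Images contributing to each optimizer step under the assumed rule."""
    return total_batch_size * accumulation_steps(total_batch_size, nominal)


if __name__ == "__main__":
    for batch in (16, 32, 64, 128):
        print(batch, accumulation_steps(batch), effective_batch_size(batch))
```

Under this rule, --batch 16 accumulates over 4 batches and still updates the weights once per 64 images, which is why the learning rate does not need to be rescaled for smaller batches.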

liumingjune · 2020-07-30T13:46:24Z

Hello, I have the following problem when using multi-GPU training, which is done according to your command line.

#not working on multi-GPU training.

NanoCode012 · 2020-07-30T13:52:03Z

Hello @liumingjune, could you provide us the exact line you used?

EDIT: Also, did you use the latest repo? I think this can be the reason.

liumingjune · 2020-07-30T13:56:01Z

Hello @liumingjune, could you provide us the exact line you used? Looking at the screenshot, did you pass in --local_rank argument?

Thank you for your reply. My command line is

python -m torch.distributed.launch --nproc_per_node 4 train.py --device 0,1,2,3
I have 4 GPUs in total.

NanoCode012 · 2020-07-30T14:10:28Z

Hi @liumingjune , could you try to pull or clone the repo again? I saw that your hyp values are old, and train function is missing some arguments.

I ran

git clone https://github.com/ultralytics/yolov5.git && cd yolov5
python -m torch.distributed.launch --nproc_per_node 4 train.py --device 0,1,2,3

and there were no problems.

liumingjune · 2020-07-30T14:15:58Z

Hi @liumingjune , could you try to pull or clone the repo again? I saw that your hyp values are old, and train function is missing some arguments.

I ran
git clone https://github.com/ultralytics/yolov5.git && cd yolov5
python -m torch.distributed.launch --nproc_per_node 4 train.py --device 0,1,2,3
and there were no problems.

OK, I will try. Maybe that's the reason. My version is a clone of YOLOv5 from when it first appeared. Thanks a lot!

liumingjune · 2020-08-03T01:32:59Z

Hello, I want to know the difference between the current version and the version just released before, because I find that the form of data set preparation is different. The previous one is to prepare the data set path and the training file, verify the file. I need to manually separate out the training data and the validation data. This is not friendly to large data volumes.

mesllo-bc · 2023-01-20T10:52:40Z

Just to confirm - so if I choose a batch size of 8 on the train.py arguments, then if I'm using 4 GPUs then the model is theoretically backpropagating over 2 examples every step and the final results we see are for a model trained with a batch size of 2, or does the model gather the gradients together at every step to get the result from each GPU and backpropagate with a batch size of 8?
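On the question above: PyTorch's DistributedDataParallel averages gradients across processes during the backward pass, so each optimizer step uses information from all GPUs. The toy sketch below (plain Python, a one-parameter model with mean-squared-error loss, not YOLOv5's loss) illustrates why averaging the gradients of equal-sized shards gives the same gradient as one pass over the full batch. All names and the sample data are made up for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input x, target y)


def mse_grad(w: float, batch: List[Sample]) -> float:
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def ddp_style_grad(w: float, batch: List[Sample], num_gpus: int) -> float:
    """Split the batch into equal shards, compute a gradient per shard,
    then average the shard gradients (what an all-reduce mean would do)."""
    if len(batch) % num_gpus != 0:
        raise ValueError("batch size must be a multiple of num_gpus")
    per_gpu = len(batch) // num_gpus
    shards = [batch[i * per_gpu:(i + 1) * per_gpu] for i in range(num_gpus)]
    local_grads = [mse_grad(w, shard) for shard in shards]
    return sum(local_grads) / num_gpus


if __name__ == "__main__":
    data = [(float(i), 3.0 * i + 1.0) for i in range(8)]  # batch of 8
    print(mse_grad(0.5, data))           # one GPU, batch 8
    print(ddp_style_grad(0.5, data, 4))  # four GPUs, 2 samples each
```

In this toy setting a total batch of 8 over 4 GPUs behaves like batch size 8, not 2. One caveat that the sketch does not model: BatchNorm statistics are still computed per GPU unless --sync-bn is used.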

iann838 · 2023-02-17T20:26:59Z

@glenn-jocher Training with torch.distributed.run gets increasingly slower and at some point there is an error:

[E ProcessGroupNCCL.cpp:737] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17509, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800918 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:737] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17509, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800918 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:737] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17509, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800923 milliseconds before timing out.
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/AmazonSageMaker-tools/yolov5/train.py", line 636, in <module>
    main(opt)
  File "/home/ec2-user/SageMaker/AmazonSageMaker-tools/yolov5/train.py", line 529, in main
    train(opt.hyp, opt, device, callbacks)
  File "/home/ec2-user/SageMaker/AmazonSageMaker-tools/yolov5/train.py", line 397, in train
    dist.broadcast_object_list(broadcast_list, 0)  # broadcast 'stop' to all ranks
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1890, in broadcast_object_list
    broadcast(object_tensor, src=src, group=group)
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 3.  Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17509, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800923 milliseconds before timing out.
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/AmazonSageMaker-tools/yolov5/train.py", line 636, in <module>
    main(opt)
  File "/home/ec2-user/SageMaker/AmazonSageMaker-tools/yolov5/train.py", line 529, in main
    train(opt.hyp, opt, device, callbacks)
  File "/home/ec2-user/SageMaker/AmazonSageMaker-tools/yolov5/train.py", line 397, in train
Traceback (most recent call last):
  File "/home/ec2-user/SageMaker/AmazonSageMaker-tools/yolov5/train.py", line 636, in <module>
    dist.broadcast_object_list(broadcast_list, 0)  # broadcast 'stop' to all ranks
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1890, in broadcast_object_list
        main(opt)broadcast(object_tensor, src=src, group=group)

  File "/home/ec2-user/SageMaker/AmazonSageMaker-tools/yolov5/train.py", line 529, in main
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    train(opt.hyp, opt, device, callbacks)
  File "/home/ec2-user/SageMaker/AmazonSageMaker-tools/yolov5/train.py", line 397, in train
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 1.  Original reason for failure was: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17509, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800918 milliseconds before timing out.
    dist.broadcast_object_list(broadcast_list, 0)  # broadcast 'stop' to all ranks
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1890, in broadcast_object_list
    broadcast(object_tensor, src=src, group=group)
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1193, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 2.  Original reason for failure was: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17509, OpType=BROADCAST, Timeout(ms)=1800000) ran for 1800918 milliseconds before timing out.
                 Class     Images  Instances          P          R      mAP50   WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 13660 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 13661) of binary: /home/ec2-user/anaconda3/envs/pytorch_p39/bin/python
Traceback (most recent call last):
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/distributed/run.py", line 765, in <module>
    main()
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/ec2-user/anaconda3/envs/pytorch_p39/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-02-17_20:19:34
  host      : ip-172-16-99-252.ec2.internal
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 13662)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2023-02-17_20:19:34
  host      : ip-172-16-99-252.ec2.internal
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 13663)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-02-17_20:19:34
  host      : ip-172-16-99-252.ec2.internal
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 13661)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

The hardware is an AWS g5.12xlarge with 4 GPUs. Training starts relatively fast but gets slower over time, and it crashes with this error after some epochs (I've seen it at epochs 6, 7 and 9 with a dataset size of 70k and batch size 384).

wofvh · 2023-02-23T09:14:04Z

Hi, I want to run YOLOv5 on a Qualcomm Adreno GPU. Is it possible?

neo133 · 2023-03-04T07:56:37Z

Hi @glenn-jocher,
if I have 1 GPU per machine and a total of 2 machines in the same network, would DDP work? I ask because the tutorial says it needs multiple GPUs on multiple machines.

tommyshelby4 · 2024-02-09T16:05:47Z

Hi guys! I am using YOLOV5 for training and up to now I had only one GPU available. I recently integrated a second (identical) GPU in my machine and I try to run my code with Multi-GPU training and especially in DistributedDataParallel mode, which is currently recommended. I use a Jupyter notebook to run my experiments. Up to now I had been using the following code snippet:

import train
opt = {
    'weights': f"yolov5{YOLO_SIZE}.pt",
    'cfg': '',
    'data': CONF_FILE_PATH,
    'epochs': EPOCHS,
    'batch_size': BATCH_SIZE,
    'imgsz': IMG_SIZE,
    'rect': False,
    'resume': False,
    'nosave': False,
    'noval': False,
    'noautoanchor': False,
    'noplots': False,
    'bucket': '',
    'cache': None,
    'image_weights': False,
    'multi_scale': False,
    'nproc_per_node': 2,
    'single_cls': False,
    'optimizer': 'SGD',
    'sync_bn': False,
    'workers': 8,
    'project': TRAIN_PATH,
    'name': MODEL_NAME,
    'exist_ok': True,
    'quad': False,
    'cos_lr': False,
    'label_smoothing': 0.0,
    'patience': PATIENCE,
    'freeze': [0],
    'save_period': -1,
    'seed': SEED,
    'local_rank': -1,
    'entity': None,
    'upload_dataset': False,
    'bbox_interval': -1,
    'artifact_alias': 'latest',
    'device': ''
    }
train.main(argparse.Namespace(**opt))

Currently, I use the same opt dictionary with the only deviation being the device value as described in https://docs.ultralytics.com/yolov5/tutorials/multi_gpu_training/#multi-gpu-distributeddataparallel-mode-recommended, where I set device to '0,1'. I have a doubt on whether it is right or wrong. The code I currently run is:

import torch.distributed.run as dist_run
dist_run.main(train.main(argparse.Namespace(**opt)), nproc_per_node=2)

The code runs but I do not seem to get parallelization. I do not want to run the process from the command window as suggested in the link above. Does anybody have any idea on how I would actually get parallel execution in the context I am describing? My code is currently running properly in this setting - but I get no acceleration compared to the single GPU setting - and in the end of the execution I get the error in the image below:

Any assistance would be greatly appreciated!
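One point about the snippet above: in dist_run.main(train.main(...), nproc_per_node=2), Python evaluates train.main(...) first, inside the notebook's single process, so training runs once without any DDP workers before dist_run.main is even called. torch.distributed.run is a launcher that starts one worker process per GPU, so a possible workaround from a notebook is to start it as a subprocess. The sketch below only builds and (optionally) runs that command; build_ddp_command and launch are hypothetical helpers, the working directory "yolov5" is an assumption, and the flag names are the ones used in this guide.

```python
import subprocess
import sys
from typing import Dict, List


def build_ddp_command(nproc_per_node: int, train_args: Dict[str, object]) -> List[str]:
    """Build a torch.distributed.run command line for train.py (hypothetical helper)."""
    cmd = [sys.executable, "-m", "torch.distributed.run",
           "--nproc_per_node", str(nproc_per_node), "train.py"]
    for flag, value in train_args.items():
        cmd += [f"--{flag}", str(value)]
    return cmd


def launch(cmd: List[str], cwd: str = "yolov5") -> int:
    """Run the command in the repository directory and return its exit code.

    cwd is an assumed location of the cloned repository.
    """
    return subprocess.run(cmd, cwd=cwd, check=False).returncode


if __name__ == "__main__":
    cmd = build_ddp_command(2, {"batch": 64, "data": "coco.yaml",
                                "weights": "yolov5s.pt", "device": "0,1"})
    print(" ".join(cmd))  # inspect the command; call launch(cmd) to actually start training
```

This keeps the notebook workflow while still letting the launcher create the two worker processes that DDP needs.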

NanoCode012 mentioned this issue Jul 22, 2020

Improvement of DDP is needed! #463

Closed

glenn-jocher added the documentation Improvements or additions to documentation label Jul 22, 2020

glenn-jocher assigned NanoCode012 Jul 22, 2020

This comment has been minimized.

Sign in to view

NanoCode012 changed the title ~~Multi-GPU Train Tutorial~~ Multi-GPU Training Jul 24, 2020

NanoCode012 mentioned this issue Jul 25, 2020

--notest bug fix #518

Merged

ultralytics deleted a comment from NanoCode012 Jul 25, 2020

glenn-jocher mentioned this issue Oct 12, 2022

How Train many models (yolov5) on the same time with different inputs #9374

Closed

This was referenced Oct 24, 2022

Use Yolo for anomaly detection #9906

Closed

Problem on running Hyperparameter Evolution on Big Dataset #9916

Closed

I want to pass the image read by opencv to the model I/F #9913

Closed

GPU utilization is low when training on COCO dataset #9929

Closed

This was referenced Nov 1, 2022

GPU not utilizing 100% memory #9949

Closed

Number of Classes #10054

Closed

This was referenced Nov 8, 2022

Multigpu training becomes slower in Kaggle #10078

Closed

Yolo v3 take a lot of time to train on custom data ultralytics/yolov3#1458

Closed

Yolov5 cannot detection a video (tfjs) #7416

Closed

How do I speed up training my model #10156

Closed

This was referenced Nov 17, 2022

How can we know how many resources my training is using? #10186

Closed

Why is yolov5 training slow? #10254

Closed

ccssu mentioned this issue Nov 28, 2022

Accuracy drops severely when increasing batch_size with 8-GPU DDP Oneflow-Inc/one-yolov5#80

Open

This was referenced Dec 3, 2022

How to train objects365 without auto download the dataset. #4658

Closed

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED #10408

Closed

How to freeze backbone and unfreeze it after a specific epoch? #10416

Closed

This was referenced Dec 12, 2022

GPU usage is 1%, the GPU mem use 100% #10477

Closed

Cuda tried allocating an enormous amount of memory (1936GiB) #10528

Closed

glenn-jocher mentioned this issue Jan 18, 2023

Training a few epoch memory suddenly OOM ultralytics/ultralytics#467

Closed

marigoold mentioned this issue Feb 8, 2023

Cannot reproduce the 64.1 mAP on COCO dataset by yolov5m #10905

Closed

alicera mentioned this issue Feb 10, 2023

cli command vs python ultralytics/ultralytics#923

Closed

glenn-jocher mentioned this issue Feb 28, 2023

Layer Wise Training #11081

Closed

glenn-jocher mentioned this issue Mar 27, 2023

CUDA out of memory issue ultralytics/ultralytics#1654

Closed

Gracebaytech mentioned this issue Feb 15, 2024

Troubleshooting Error Messages and Optimization in YOLOv5 Training Process #12733

Closed

