
Is dataset.indices broadcast necessary? #1820

Closed
ardeal opened this issue Aug 21, 2021 · 10 comments
Labels: question (Further information is requested), Stale

Comments

ardeal commented Aug 21, 2021

Hi,

In train.py of the yolov3 code:

            # Broadcast if DDP
            if rank != -1:
                indices = (torch.tensor(dataset.indices) if rank == 0 else torch.zeros(dataset.n)).int()
                dist.broadcast(indices, 0)
                if rank != 0:
                    dataset.indices = indices.cpu().numpy()

Since the dataloader already uses sampler = torch.utils.data.distributed.DistributedSampler(dataset) if rank != -1 else None, I think the manual broadcast of dataset.indices is not necessary in this case, as DistributedSampler will broadcast the dataset automatically. Is that right?
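
For context, the sampler part of the dataloader looks roughly like this (only a sketch; dataset, batch_size and rank come from the surrounding training script):

from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# In DDP mode each rank builds its own DataLoader, and the DistributedSampler
# hands each rank a different subset of sample indices every epoch.
sampler = DistributedSampler(dataset) if rank != -1 else None
dataloader = DataLoader(dataset,
                        batch_size=batch_size,
                        sampler=sampler,
                        shuffle=(sampler is None))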

ardeal added the question (Further information is requested) label on Aug 21, 2021
glenn-jocher (Member) commented

@ardeal I'm not sure. This would need some empirical results to verify (testing). Can you try to run the code both ways and check if dataset indices are updated on all workers?

ardeal (Author) commented Aug 23, 2021

Hi,
@glenn-jocher ,

I have reviewed the torch.distributed tutorial, and I am currently asking the question on the PyTorch forum:
https://discuss.pytorch.org/t/when-will-dist-all-reduce-will-be-called/129918/6

According to torch's explanation, everything is done by torch itself. The only thing the user needs to do is feed the model into DDP.

Please check this code: https://github.com/rwightman/pytorch-image-models/blob/master/train.py
In the above code, the dataset is not broadcast; however, it calls all_reduce to sync the loss (line 689 of that file).
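
The usual pattern for that (just a sketch in my own words, not copied from that file; loss and world_size are assumed to exist in the training loop):

import torch.distributed as dist

def average_across_ranks(value, world_size):
    # Sum the tensor from every rank, then divide to get the mean (for logging only).
    t = value.clone()
    dist.all_reduce(t, op=dist.ReduceOp.SUM)
    return t / world_size

# inside the training loop:
# reduced_loss = average_across_ranks(loss.detach(), world_size)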

What confuses me is:

  1. When should the dataset be broadcast?
  2. When should all_reduce be called?
  3. If manual all_reduce calls and DDP's weight syncing are both in place, will they conflict?

I am studying torch.distributed; once I understand it clearly, I will tell you.

glenn-jocher (Member) commented

@ardeal ok got it. Yes please let us know what you find.

One point is that the code you show is only used with --image-weights. This is because the image weights are reassigned after every validation at the end of each training epoch, and then dataset.indices is updated, which determines which images are loaded the next epoch.

Normally dataset.indices is just range(n) for n images, and the array never changes.

ardeal (Author) commented Aug 23, 2021

@glenn-jocher

Regarding "which determines which images are loaded the next epoch": according to my understanding, this is handled by torch.distributed, so we don't need to do it manually. Perhaps my understanding is wrong; that is why I am confirming it on the PyTorch forum.

You could review the pytorch-image-models repo linked above, which does as much data augmentation as your repo but does not broadcast the dataset.

The reason I asked about broadcasting the dataset is that your repo and pytorch-image-models handle it differently: one broadcasts and the other does not.

ardeal (Author) commented Aug 25, 2021

@glenn-jocher ,

Please check the answer on the PyTorch forum:
https://discuss.pytorch.org/t/when-will-dist-all-reduce-will-be-called/129918/11

According to the answer, you don't need to broadcast the dataset.
My understanding is:

  1. The dataset and weights will be broadcast by DDP automatically if we run a standard model.
  2. If something else needs to be broadcast or synced, we can call the broadcast or all_reduce function (see the sketch below).
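
For case 2, besides the tensor-based dist.broadcast already used in train.py, newer PyTorch versions (1.8+) also provide dist.broadcast_object_list for arbitrary picklable Python objects. A minimal sketch, assuming an initialized process group:

import torch.distributed as dist

# Broadcast rank 0's dataset.indices (a plain Python list) to every other rank.
objects = [dataset.indices if rank == 0 else None]
dist.broadcast_object_list(objects, src=0)
if rank != 0:
    dataset.indices = objects[0]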

@glenn-jocher
Copy link
Member

glenn-jocher commented Aug 26, 2021

@ardeal thanks for the explanation! So is your recommendation that we update this section:

yolov3/train.py

Lines 252 to 265 in 1be3170

        # Update image weights (optional)
        if opt.image_weights:
            # Generate indices
            if rank in [-1, 0]:
                cw = model.class_weights.cpu().numpy() * (1 - maps) ** 2 / nc  # class weights
                iw = labels_to_image_weights(dataset.labels, nc=nc, class_weights=cw)  # image weights
                dataset.indices = random.choices(range(dataset.n), weights=iw, k=dataset.n)  # rand weighted idx
            # Broadcast if DDP
            if rank != -1:
                indices = (torch.tensor(dataset.indices) if rank == 0 else torch.zeros(dataset.n)).int()
                dist.broadcast(indices, 0)
                if rank != 0:
                    dataset.indices = indices.cpu().numpy()

To this?

        # Update image weights (optional)
        if opt.image_weights:
            # Generate indices
            if RANK in [-1, 0]:
                cw = model.class_weights.cpu().numpy() * (1 - maps) ** 2 / nc  # class weights
                iw = labels_to_image_weights(dataset.labels, nc=nc, class_weights=cw)  # image weights
                dataset.indices = random.choices(range(dataset.n), weights=iw, k=dataset.n)  # rand weighted idx

or this?

        # Update image weights (optional)
        if opt.image_weights:
            cw = model.class_weights.cpu().numpy() * (1 - maps) ** 2 / nc  # class weights
            iw = labels_to_image_weights(dataset.labels, nc=nc, class_weights=cw)  # image weights
            dataset.indices = random.choices(range(dataset.n), weights=iw, k=dataset.n)  # rand weighted idx

glenn-jocher (Member) commented

@ardeal I've eliminated the DDP portion of the image_weights code in YOLOv5 PR ultralytics/yolov5#4579, as the two are incompatible (I'd forgotten this in our earlier discussions), so this issue should be resolved there now, and these updates will be backpropagated to YOLOv3 eventually.

One question I had for you along the DDP topic is my new YOLOv5 EarlyStopping PR ultralytics/yolov5#4576. It works well for Single-GPU, but in DDP mode the stop variable is not broadcast correctly to all ranks, and training hangs rather than ending. Do you have any ideas on how to implement this correctly?

The idea is that the stopper returns stop=True after patience is exceeded, and then we break the training loop, but I think only the RANK -1 and 0 processes are actually breaking.

ardeal (Author) commented Aug 29, 2021

Hi @glenn-jocher ,
I agree with eliminating the DDP portion of the image_weights code.

For the issue of syncing the stop variable, I think 2 experiments need to be done:

  1. Print the stop signal/variable on each GPU to confirm that every rank receives it.
  2. If each GPU receives the stop variable, I think the loop on each GPU should exit. If some GPUs don't exit, I can't figure out the reason. One suggestion: in the master process, run

import torch.distributed as dist

torch.cuda.empty_cache()
dist.destroy_process_group()

glenn-jocher (Member) commented

@ardeal thanks! Yes, the problem is that the other RANKs do not receive the stop variable. It's not clear to me how to broadcast the Python variable correctly.
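
I'm wondering if something like this is the right approach (rough sketch, untested; assumes RANK and an initialized process group):

import torch
import torch.distributed as dist

def broadcast_stop(stop, rank):
    # Share rank 0's stop decision so every rank breaks the training loop together.
    if rank != -1:
        flag = torch.tensor([int(stop)], device='cuda')  # NCCL backend needs a CUDA tensor
        dist.broadcast(flag, src=0)  # every rank receives rank 0's value
        stop = bool(flag.item())
    return stop

# every rank, once per epoch:
# stop = broadcast_stop(stop, RANK)
# if stop:
#     break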

The dist.destroy_process_group() call needs to take place outside of the train function; I've placed it in the caller, otherwise the process hangs and training does not finish.

https://github.com/ultralytics/yolov5/blob/bbfafeabdbf7785f8da5e4f9880df27869a71218/train.py#L507-L512

github-actions commented

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.


Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv3 🚀 and Vision AI ⭐!
