Unable to resume training when using DDP #851

Closed
bsugerman opened this issue Aug 26, 2020 · 16 comments · Fixed by #1810

@bsugerman

I'm running training on 2 GPUs without any problems as follows:

 python -m torch.distributed.launch --nproc_per_node 2 --master_port 1111  train.py --cfg yolov5s.yaml --weights '' --epochs 3 --batch-size 12 --workers 64 --device 0,1 --data data/coco128.yaml

However, if I have to kill the job (so someone else can use the GPUs for a bit), I cannot restart the training. I've tried

python -m torch.distributed.launch --nproc_per_node 2 --master_port 1111  train.py --cfg yolov5s.yaml --weights '' --epochs 3 --batch-size 12 --workers 64 --device 0,1 --data data/coco128.yaml --resume 

and a number of variants, leaving out different arguments related to multi-GPU training. The training gets to:

Transferred 370/370 items from ./runs/exp1/weights/last.pt
Using DDP

sits for a few seconds, then prints a second

Using DDP

and then just hangs. On the GPUs, 3 processes are started: two on GPU 0, each using 2250 MB, and one on GPU 1 using 965 MB.

Any ideas?

bsugerman added the bug label Aug 26, 2020
@bsugerman

I should note that resumption works fine when using just 1 GPU.

@glenn-jocher

@bsugerman I have not tested --resume on multi-GPU, but one problem I see above is that --resume cannot modify the original arguments. The only supported use cases are
python train.py --resume or
python train.py --resume path/to/last.pt

You might want to try this with your DDP command.

@ljmiao

ljmiao commented Sep 12, 2020

I also hit this problem when I try to resume training with DDP. Can you fix the bug? @glenn-jocher

@bsugerman

I have tried both of these, and neither works. I have isolated the problem to the line in train.py that wraps the model in DDP:

    # DDP mode
    if cuda and rank != -1:
        model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank)

This is where the program seems to hang, for me at least.
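
For context, wrapping a model in DDP is a collective operation: every launched process has to reach this constructor on its own GPU before any of them can proceed, so a single process pointed at the wrong device is enough to make the rest hang here. A minimal sketch of the setup that torch.distributed.launch expects (the toy model and standalone script are illustrative, not the actual train.py code):

    import argparse

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=-1)  # filled in per process by torch.distributed.launch
    opt = parser.parse_args()

    dist.init_process_group(backend='nccl', init_method='env://')  # blocks until every rank has joined
    torch.cuda.set_device(opt.local_rank)                          # each rank pins its own GPU
    device = torch.device('cuda', opt.local_rank)

    model = torch.nn.Linear(10, 10).to(device)                     # toy model for illustration
    model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank)

If --resume ends up giving every process the same local_rank (see the workaround below), both ranks target GPU 0 and this constructor never returns, which would explain the two processes seen on GPU 0 above.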

@tyomj

tyomj commented Oct 1, 2020

A simple workaround, for now, is to comment out the lines in train.py that replace opt:
https://github.com/ultralytics/yolov5/blob/master/train.py#L422-L423
because every time that config is read you get local_rank = 0.
Then run training as follows:
python -m torch.distributed.launch --nproc_per_node 8 train.py --%your opt params% --resume
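
To make the failure mode concrete, here is a small runnable sketch of what the resume branch effectively does (the temporary directory and toy opt.yaml stand in for runs/exp1 and the real saved options; the exact train.py lines are quoted further down in this thread):

    import argparse
    import tempfile
    from pathlib import Path

    import yaml

    # Stand-in for the run directory and the opt.yaml saved by the original training run.
    run_dir = Path(tempfile.mkdtemp())
    with open(run_dir / 'opt.yaml', 'w') as f:
        yaml.safe_dump({'local_rank': 0, 'weights': 'last.pt'}, f)

    # Each resumed process starts with the local_rank that torch.distributed.launch gave it...
    opt = argparse.Namespace(local_rank=1, resume=True)  # e.g. the second DDP process

    # ...but the resume branch replaces the whole Namespace with the saved opt.yaml.
    with open(run_dir / 'opt.yaml') as f:
        opt = argparse.Namespace(**yaml.safe_load(f))

    print(opt.local_rank)  # 0 -- every process now targets GPU 0, and the DDP wrap above hangs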

@github-actions

github-actions bot commented Nov 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label Nov 1, 2020
github-actions bot closed this as completed Nov 6, 2020
@chongkuiqi

> A simple workaround, for now, is to comment out the lines in train.py that replace opt:
> https://github.com/ultralytics/yolov5/blob/master/train.py#L422-L423
> because every time that config is read you get local_rank = 0.
> Then run training as follows:
> python -m torch.distributed.launch --nproc_per_node 8 train.py --%your opt params% --resume

Thanks! But sorry, I'm a little confused about your method to solve the problem. Which lines should I comment out? And how should I configure "--%your opt params%"? Thank you!

@tyomj

tyomj commented Dec 24, 2020

@chongkuiqi
Sorry, I should have provided a link to a specific revision, since the line numbers shift as the file changes.
Here are the lines to comment out:

yolov5/train.py, lines 475 to 476 in 685d601:

    with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
        opt = argparse.Namespace(**yaml.load(f, Loader=yaml.FullLoader))  # replace
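
For clarity, applying the workaround amounts to disabling just that wholesale replacement, roughly like this (a sketch against that revision; the surrounding resume logic is left as it is):

    # Workaround sketch: skip the opt replacement so each process keeps the
    # local_rank that torch.distributed.launch assigned to it.
    # with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
    #     opt = argparse.Namespace(**yaml.load(f, Loader=yaml.FullLoader))  # replace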

@tyomj

tyomj commented Dec 24, 2020

> And how should I configure "--%your opt params%"?
Since you will no longer override these params with the cached opt.yaml, just make sure you pass the same params manually.

@chongkuiqi

> And how should I configure "--%your opt params%"?
> Since you will no longer override these params with the cached opt.yaml, just make sure you pass the same params manually.

Thank you! It works! And now I understand what caused the problem. Thanks for your quick reply!

@glenn-jocher

@chongkuiqi @tyomj hi guys, I'm glad to hear there is a workaround. Is this something that could be coded into a PR to help future users?

Also, might there be a workaround that still uses opt.yaml? opt.yaml carries all of the previous argparse arguments, which need to be applied the same way on --resume.

@glenn-jocher

TODO: multi-gpu --resume workaround/fix

@glenn-jocher

@tyomj do you think local_rank is the only variable that needs to be preserved here with DDP --resume?

@glenn-jocher

glenn-jocher commented Dec 30, 2020

@tyomj what do you think of this proposed solution? It keeps the apriori keys from being lost when opt is replaced from opt.yaml:

        apriori = opt.global_rank, opt.local_rank
        with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
            opt = argparse.Namespace(**yaml.load(f, Loader=yaml.FullLoader))  # replace
        opt.cfg, opt.weights, opt.resume, opt.global_rank, opt.local_rank = '', ckpt, True, *apriori  # reinstate
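
A quick toy check of the same save-and-reinstate pattern outside of train.py (the Namespace values and checkpoint path below are made up for illustration):

    import argparse

    opt = argparse.Namespace(global_rank=1, local_rank=1, cfg='yolov5s.yaml', weights='', resume=True)
    loaded = {'global_rank': 0, 'local_rank': 0, 'cfg': 'yolov5s.yaml', 'weights': '', 'resume': False}  # stand-in for opt.yaml
    ckpt = 'runs/exp1/weights/last.pt'  # hypothetical checkpoint path

    apriori = opt.global_rank, opt.local_rank
    opt = argparse.Namespace(**loaded)  # replace, as on --resume
    opt.cfg, opt.weights, opt.resume, opt.global_rank, opt.local_rank = '', ckpt, True, *apriori  # reinstate
    print(opt.global_rank, opt.local_rank)  # 1 1 -- each process keeps its own ranks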

glenn-jocher reopened this Dec 30, 2020
glenn-jocher linked a pull request Dec 30, 2020 that will close this issue
@glenn-jocher

@tyomj @bsugerman @ljmiao @chongkuiqi I've opened PR #1810 with an attempted DDP --resume fix. Please review, test, and comment on the PR. I'll be doing internal testing as well.

@glenn-jocher

PR #1810 has been merged after successful testing. Tested on a 2x T4 GCP VM. Test scenario:

# 1. Start DDP training
python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 80 --data coco128.yaml --epochs 50 --weights yolov5s.pt

# 2. sudo reboot after about 25 epochs

# 3. Resume DDP training
python -m torch.distributed.launch --nproc_per_node 2 train.py --resume

Test results were as expected: training resumed properly from epoch 25 after the forced reboot. An nvidia-smi check confirmed that both GPUs were in use. Everything seems to work. Note: not tested with W&B enabled.


To --resume with DDP (2 GPUs)

ONLY add --resume or --resume path/to/last.pt. --resume accepts no other arguments.

python -m torch.distributed.launch --nproc_per_node 2 train.py --resume

glenn-jocher removed the TODO label Dec 30, 2020