Unable to resume training when using DDP #851

Closed
bsugerman opened this issue Aug 26, 2020 · 16 comments · Fixed by #1810

@bsugerman

I'm running training on 2 GPUs without any problems as follows:

 python -m torch.distributed.launch --nproc_per_node 2 --master_port 1111  train.py --cfg yolov5s.yaml --weights '' --epochs 3 --batch-size 12 --workers 64 --device 0,1 --data data/coco128.yaml

However, if I have to kill the job (so someone else can use the GPUs for a bit), I cannot restart the training. I've tried

python -m torch.distributed.launch --nproc_per_node 2 --master_port 1111  train.py --cfg yolov5s.yaml --weights '' --epochs 3 --batch-size 12 --workers 64 --device 0,1 --data data/coco128.yaml --resume 

and a number of variants, leaving out different arguments related to multi-GPU training. The training gets to:

Transferred 370/370 items from ./runs/exp1/weights/last.pt
Using DDP

sits for a few seconds, then prints a second

Using DDP

and then just hangs. On the GPUs, 3 processes are started: two on GPU 0, each using 2250 MB, and one on GPU 1 using 965 MB.

Any ideas?

bsugerman added the bug label Aug 26, 2020
@bsugerman

I should note that resumption works fine when using just 1 GPU.

@glenn-jocher

@bsugerman I have not tested --resume on multi-GPU, but one problem I see above is that --resume cannot modify the original arguments. The only supported use cases are
python train.py --resume or
python train.py --resume path/to/last.pt

You might want to try this with your DDP command.

@ljmiao

ljmiao commented Sep 12, 2020

I also hit this problem when I try to resume training with DDP. Can you fix the bug? @glenn-jocher

@bsugerman

I have tried both of these, and neither works. I have isolated the problem to the line in train.py that wraps the model in DDP:

    # DDP mode
    if cuda and rank != -1:
        model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank)

This is where the program seems to hang, for me at least.
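
For context, wrapping a model in DDP is a collective operation: every launched process has to reach this constructor on its own GPU before any of them can proceed, so a single process pointed at the wrong device is enough to make the rest hang here. A minimal sketch of the setup that torch.distributed.launch expects (the toy model and standalone script are illustrative, not the actual train.py code):

    import argparse

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    parser = argparse.ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=-1)  # filled in per process by torch.distributed.launch
    opt = parser.parse_args()

    dist.init_process_group(backend='nccl', init_method='env://')  # blocks until every rank has joined
    torch.cuda.set_device(opt.local_rank)                          # each rank pins its own GPU
    device = torch.device('cuda', opt.local_rank)

    model = torch.nn.Linear(10, 10).to(device)                     # toy model for illustration
    model = DDP(model, device_ids=[opt.local_rank], output_device=opt.local_rank)

If --resume ends up giving every process the same local_rank (see the workaround below), both ranks target GPU 0 and this constructor never returns, which would explain the two processes seen on GPU 0 above.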

@tyomj

tyomj commented Oct 1, 2020

A simple workaround, for now, is to comment out the lines in train.py that replace opt:
https://github.com/ultralytics/yolov5/blob/master/train.py#L422-L423
because every time that config is read you get local_rank = 0.
Then run training as follows:
python -m torch.distributed.launch --nproc_per_node 8 train.py --%your opt params% --resume
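
To make the failure mode concrete, here is a small runnable sketch of what the resume branch effectively does (the temporary directory and toy opt.yaml stand in for runs/exp1 and the real saved options; the exact train.py lines are quoted further down in this thread):

    import argparse
    import tempfile
    from pathlib import Path

    import yaml

    # Stand-in for the run directory and the opt.yaml saved by the original training run.
    run_dir = Path(tempfile.mkdtemp())
    with open(run_dir / 'opt.yaml', 'w') as f:
        yaml.safe_dump({'local_rank': 0, 'weights': 'last.pt'}, f)

    # Each resumed process starts with the local_rank that torch.distributed.launch gave it...
    opt = argparse.Namespace(local_rank=1, resume=True)  # e.g. the second DDP process

    # ...but the resume branch replaces the whole Namespace with the saved opt.yaml.
    with open(run_dir / 'opt.yaml') as f:
        opt = argparse.Namespace(**yaml.safe_load(f))

    print(opt.local_rank)  # 0 -- every process now targets GPU 0, and the DDP wrap above hangs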

@github-actions

github-actions bot commented Nov 1, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label Nov 1, 2020
github-actions bot closed this as completed Nov 6, 2020
@chongkuiqi

> A simple workaround, for now, is to comment out the lines in train.py that replace opt:
> https://github.com/ultralytics/yolov5/blob/master/train.py#L422-L423
> because every time that config is read you get local_rank = 0.
> Then run training as follows:
> python -m torch.distributed.launch --nproc_per_node 8 train.py --%your opt params% --resume

Thanks! But sorry, I'm a little confused about your method to solve the problem. Which lines should I comment out? And how should I configure "--%your opt params%"? Thank you!

@tyomj

tyomj commented Dec 24, 2020

@chongkuiqi
Sorry, I should have provided a link to a specific revision, since the line numbers shift as the file changes.
Here are the lines to comment out:

yolov5/train.py, lines 475 to 476 in 685d601:

    with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
        opt = argparse.Namespace(**yaml.load(f, Loader=yaml.FullLoader))  # replace
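
For clarity, applying the workaround amounts to disabling just that wholesale replacement, roughly like this (a sketch against that revision; the surrounding resume logic is left as it is):

    # Workaround sketch: skip the opt replacement so each process keeps the
    # local_rank that torch.distributed.launch assigned to it.
    # with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
    #     opt = argparse.Namespace(**yaml.load(f, Loader=yaml.FullLoader))  # replace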

@tyomj

tyomj commented Dec 24, 2020

> And how should I configure "--%your opt params%"?
Since you will no longer override these params with the cached opt.yaml, just make sure you pass the same params manually.

@chongkuiqi

> And how should I configure "--%your opt params%"?
> Since you will no longer override these params with the cached opt.yaml, just make sure you pass the same params manually.

Thank you! It works! And now I understand what caused the problem. Thanks for your quick reply!

@glenn-jocher

@chongkuiqi @tyomj hi guys, I'm glad to hear there is a workaround. Is this something that could be coded into a PR to help future users?

Also, might there be a workaround that still uses opt.yaml? opt.yaml carries all of the previous argparse arguments, which need to be applied the same way on --resume.

@glenn-jocher

TODO: multi-gpu --resume workaround/fix

@glenn-jocher

@tyomj do you think local_rank is the only variable that needs to be preserved here with DDP --resume?

@glenn-jocher

glenn-jocher commented Dec 30, 2020

@tyomj what do you think of this proposed solution? It keeps the apriori keys from being lost when opt is replaced from opt.yaml:

        apriori = opt.global_rank, opt.local_rank
        with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
            opt = argparse.Namespace(**yaml.load(f, Loader=yaml.FullLoader))  # replace
        opt.cfg, opt.weights, opt.resume, opt.global_rank, opt.local_rank = '', ckpt, True, *apriori  # reinstate
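
A quick toy check of the same save-and-reinstate pattern outside of train.py (the Namespace values and checkpoint path below are made up for illustration):

    import argparse

    opt = argparse.Namespace(global_rank=1, local_rank=1, cfg='yolov5s.yaml', weights='', resume=True)
    loaded = {'global_rank': 0, 'local_rank': 0, 'cfg': 'yolov5s.yaml', 'weights': '', 'resume': False}  # stand-in for opt.yaml
    ckpt = 'runs/exp1/weights/last.pt'  # hypothetical checkpoint path

    apriori = opt.global_rank, opt.local_rank
    opt = argparse.Namespace(**loaded)  # replace, as on --resume
    opt.cfg, opt.weights, opt.resume, opt.global_rank, opt.local_rank = '', ckpt, True, *apriori  # reinstate
    print(opt.global_rank, opt.local_rank)  # 1 1 -- each process keeps its own ranks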

glenn-jocher reopened this Dec 30, 2020
glenn-jocher linked a pull request Dec 30, 2020 that will close this issue
@glenn-jocher

@tyomj @bsugerman @ljmiao @chongkuiqi I've opened PR #1810 with an attempted DDP --resume fix. Please review, test, and comment on the PR. I'll be doing internal testing as well.

@glenn-jocher

PR #1810 has been merged after successful testing. Tested on a 2x T4 GCP VM. Test scenario:

# 1. Start DDP training
python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 80 --data coco128.yaml --epochs 50 --weights yolov5s.pt

# 2. sudo reboot after about 25 epochs

# 3. Resume DDP training
python -m torch.distributed.launch --nproc_per_node 2 train.py --resume

Test results were as expected: training resumed properly from epoch 25 after the forced reboot. An nvidia-smi check confirmed that both GPUs were in use. Everything seems to work. Note: not tested with W&B enabled.


To --resume with DDP (2 GPUs)

ONLY add --resume or --resume path/to/last.pt. --resume accepts no other arguments.

python -m torch.distributed.launch --nproc_per_node 2 train.py --resume

glenn-jocher removed the TODO label Dec 30, 2020