Unable to resume training when using DDP #851
I should note that resumption works fine when using just 1 GPU.
@bsugerman I have not tested --resume on multi-GPU, but one problem I see above is that --resume is incapable of modifying the original arguments. The only supported use cases are a bare --resume or --resume path/to/last.pt. You might want to try this with your DDP command.
I also hit this problem when I try to resume training with DDP. Can you fix the bug? @glenn-jocher
I have tried these, and it doesn't work. I have isolated the problem to the opt-replacement line in train.py; this is where the program seems to hang, for me at least.
A simple workaround, for now, is to comment out the lines with the opt replacement in train.py and then launch with --resume plus your original opt params.
Thanks! But sorry, I'm a little confused about your method for solving the problem. Which lines should I comment out? And how do I configure "--%your opt params%"? Thank you!
@chongkuiqi Lines 475 to 476 of train.py at commit 685d601.
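For readers without the repo checked out, those two lines are the opt replacement discussed above. A rough paraphrase, reconstructed from the snippet quoted later in this thread rather than copied verbatim from commit 685d601, looks like this:

```python
# Rough paraphrase of the opt-replacement lines in train.py (not an exact copy of commit 685d601).
import argparse
import yaml
from pathlib import Path

ckpt = 'runs/exp0/weights/last.pt'  # hypothetical path to the checkpoint being resumed
with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
    opt = argparse.Namespace(**yaml.load(f, Loader=yaml.FullLoader))  # replace opt with the saved run's args
opt.cfg, opt.weights, opt.resume = '', ckpt, True  # launcher-supplied global_rank/local_rank are lost here
```

Commenting out the replacement keeps whatever arguments you pass on the command line, which is why the workaround asks you to repeat your original opt params alongside --resume.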
@chongkuiqi Re: how to configure "--%your opt params%": pass the same arguments you used for the original training run alongside --resume.
Thank you! It works! And now I understand what caused the problem. Thanks for your quick reply!
@chongkuiqi @tyomj hi guys, I'm glad to hear there is a workaround. Is this something that might be codeable into a PR to help future users? Also, might there be a workaround that still includes opt.yaml? opt.yaml carries all the previous argparser arguments, which need to be applied the same way on --resume.
TODO: multi-GPU --resume workaround/fix
@tyomj what do you think of this proposed solution? This preserves the launcher-supplied global_rank and local_rank across the opt replacement:

```python
apriori = opt.global_rank, opt.local_rank
with open(Path(ckpt).parent.parent / 'opt.yaml') as f:
    opt = argparse.Namespace(**yaml.load(f, Loader=yaml.FullLoader))  # replace
opt.cfg, opt.weights, opt.resume, opt.global_rank, opt.local_rank = '', ckpt, True, *apriori  # reinstate
```
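One likely reason the ranks need to be reinstated: torch.distributed.launch passes a distinct --local_rank to each spawned process, while the opt.yaml written by the original run holds a single stale value, so blindly restoring it would point every process at the same device. A minimal sketch of that per-process argument, assuming a standard argparse setup rather than the exact train.py definitions:

```python
# Minimal sketch (assumed, not copied from train.py): each DDP process receives its own
# --local_rank from torch.distributed.launch, and that value must survive the opt.yaml reload.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=-1, help='rank passed in by torch.distributed.launch')
parser.add_argument('--resume', nargs='?', const=True, default=False, help='resume most recent run, or a given last.pt')
opt = parser.parse_args()

# opt.local_rank differs per process (0, 1, ...); the opt.yaml from the original run cannot know that,
# which is why the snippet above saves the ranks before the replacement and reinstates them afterwards.
```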
@tyomj @bsugerman @ljmiao @chongkuiqi I've opened up a PR #1810 with an attempted DDP --resume fix. Please review, test, and comment on the PR. I'll be doing internal testing as well.
PR #1810 is merged after successful testing results. Tested on a 2x T4 GCP VM.

Test scenario:
```bash
# 1. Start DDP training
python -m torch.distributed.launch --nproc_per_node 2 train.py --batch-size 80 --data coco128.yaml --epochs 50 --weights yolov5s.pt
# 2. sudo reboot after about 25 epochs
# 3. Resume DDP training
python -m torch.distributed.launch --nproc_per_node 2 train.py --resume
```

Test results: as expected, training resumed properly from epoch 25 after the forced reboot. An nvidia-smi check confirmed multiple GPUs in use. Everything seems to work. Note: not tested with W&B enabled.

To --resume with DDP (2 GPUs), ONLY add --resume or --resume path/to/last.pt; --resume accepts no other arguments.
```bash
python -m torch.distributed.launch --nproc_per_node 2 train.py --resume
```
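For completeness, here is a small sketch of how the two supported forms (a bare --resume versus --resume path/to/last.pt) could be resolved to a checkpoint; the resolve_checkpoint helper and the runs/ search are assumptions for illustration, not code from the merged PR:

```python
# Hedged sketch: accept either a bare --resume (find the most recent last.pt) or an explicit path.
from pathlib import Path

def resolve_checkpoint(resume, runs_dir='runs'):
    """Return the checkpoint to resume from: an explicit path, or the newest last*.pt under runs/."""
    if isinstance(resume, str):
        return resume  # --resume path/to/last.pt
    candidates = sorted(Path(runs_dir).rglob('last*.pt'), key=lambda p: p.stat().st_mtime)
    if not candidates:
        raise FileNotFoundError(f'no last.pt found under {runs_dir}/ to resume from')
    return str(candidates[-1])  # bare --resume: most recently modified checkpoint
```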
I'm running training on 2 GPUs without any problems, launching with torch.distributed.launch and --nproc_per_node 2.
However, if I have to kill the job (so someone else can use the GPUs for a bit), I cannot restart the training. I've tried adding --resume to the launch command, plus a number of variants that leave out different multi-processing arguments. The training gets partway through startup, sits for a few seconds, prints its startup output a second time, and then just hangs. On the GPUs, 3 processes are started: two on GPU 0, each using 2250 MB, and one on GPU 1 using 965 MB.
Any ideas?