Training with DDP gets replicas mismatch error #5894
Labels: bug (Something isn't working), distributed (Generic distributed-related topic), help wanted (Open to be worked on), priority: 1 (Medium priority task), won't fix (This will not be worked on)
Hi,
I've been getting this replicas mismatch error with DDP training.
Setup: Windows 10, torch 1.7.1, pytorch-lightning 1.1.7, on a 3-GPU machine.
The model was training fine with DDP on another machine with 2 GPUs (same setup: Windows 10, torch 1.7.1, and pl 1.1.7).
The code crashed with a replicas-mismatch error. (Note: the sizes [12, 6] in the error message change between runs; they could be any numbers, e.g. sizes [128, 45].)
I then tried setting accelerator='ddp_spawn', which makes the replicas error disappear. But, just as the documentation warns, ddp_spawn is very unstable for me: smaller batches, lower GPU utilization, longer training time, etc., and training can hardly get past 7-8 epochs because it always mysteriously crashes with a memory error.
So I still need to figure out how to get back to ddp mode.
I asked on the PyTorch forum, and their answer was as follows:
"This happens if the model parameters are not the same across all replicas in DDP. Have you tried printing the sizes of all the params in the model from each rank (using model.parameters())? This would be the first thing to verify mismatched sizes."
I did print the number of model parameters in each process, and they are the same. (The printing happens after model initialization but before Trainer initialization; the Trainer then initializes the underlying DDP, which is where the error happens.)
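For reference, this is roughly the per-rank check the forum suggested; a minimal sketch (using named_parameters() instead of parameters() so the shapes are labeled, and falling back to rank 0 when the process group isn't up yet):

```python
import torch.distributed as dist

def print_param_shapes(model):
    # Rank of the current DDP process (0 if the process group is not initialized yet).
    rank = dist.get_rank() if dist.is_initialized() else 0
    for name, param in model.named_parameters():
        print(f"rank {rank}: {name} -> {tuple(param.shape)}")
```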
I understand that, in ddp mode, the whole script is re-launched in each process, while in ddp_spawn mode, the subprocesses carry on from the state of the main process. Could these different multiprocessing approaches cause the model or model parameters copied to each GPU to end up different?
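To make the question concrete, here is a purely hypothetical sketch (not my actual model) of how the two launch modes could diverge: ddp re-runs the whole script per process, so any unseeded randomness in model construction is re-drawn independently in each replica, while ddp_spawn inherits the already-built state from the parent process.

```python
import random
import torch.nn as nn

class MyModel(nn.Module):  # hypothetical model, for illustration only
    def __init__(self):
        super().__init__()
        # Anything non-deterministic here is re-evaluated independently in
        # every re-launched ddp process, so each replica can end up with a
        # differently shaped layer -- exactly the mismatch DDP complains about.
        hidden_size = random.randint(4, 128)
        self.layer = nn.Linear(16, hidden_size)
```

If that were the mechanism, seeding everything at the top of the script (e.g. pytorch_lightning.seed_everything) should make the shapes match, but I haven't confirmed that's what is happening here.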
Below is how the Lightning Trainer is initialized and fit is called (very standard steps):
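A minimal sketch of those standard steps; the module name and exact arguments here are assumptions, only gpus=3 and accelerator="ddp" reflect the setup described above:

```python
import pytorch_lightning as pl

model = MyLightningModule()  # hypothetical LightningModule defined elsewhere
trainer = pl.Trainer(
    gpus=3,             # the 3-GPU machine described above
    accelerator="ddp",  # the mode that triggers the replicas error
)
trainer.fit(model)
```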
Please help!