
training with ddp get replicas mismatch error #5894

Closed

BlockWaving opened this issue Feb 10, 2021 · 5 comments
Labels
bug (Something isn't working) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on) · priority: 1 (Medium priority task) · won't fix (This will not be worked on)

Comments

BlockWaving commented Feb 10, 2021

Hi,

I've been getting this replicas error with ddp training.
Setup: Windows 10, torch 1.7.1, pytorch-lightning 1.1.7, on a 3-GPU machine.

The model training was working well with ddp on another machine with 2 GPUs (same setup: Windows 10, torch 1.7.1 and pytorch-lightning 1.1.7).

The code crashed after printing the following error message:

self.reducer = dist.Reducer(
RuntimeError: replicas[0][0] in this process with sizes [12, 6] appears not to match sizes of the same param in process 0.

(Note: the sizes [12, 6] in the error message change between runs; they can be any numbers, e.g. [128, 45].)

I then tried setting accelerator='ddp_spawn', which makes the replicas error disappear. But, just as the documentation warns, ddp_spawn comes with serious drawbacks: smaller batches, lower GPU utilization, longer training time, etc., and training can hardly get past 7-8 epochs because it always mysteriously crashes with a memory error.

So I still need to figure out how to get back to ddp mode.

I asked on the PyTorch forum; their answer was as follows:

"This happens if the model parameters are not the same across all replicas in DDP. Have you tried printing the sizes of all the params in the model from each rank (using model.parameters())? This would be the first thing to verify mismatched sizes."

I did print the number of model parameters in each process, and they are the same. (The printing happens after model initialization but before the Trainer initialization; the Trainer then initializes the underlying DDP, which is where the error happens.)
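
To make that concrete, here is a minimal sketch of the check the forum suggests (a hypothetical helper, assuming torch.distributed is already initialized when it runs), so the per-rank outputs can be diffed:

import torch.distributed as dist

def dump_param_shapes(model):
    # Print every parameter's name and shape on this rank; comparing the
    # outputs across processes shows which entry DDP's Reducer considers
    # mismatched.
    rank = dist.get_rank() if dist.is_initialized() else 0
    for name, p in model.named_parameters():
        print(f"rank {rank}: {name} {tuple(p.shape)}")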

I understand that in ddp mode the whole program is relaunched in each process, while in ddp_spawn mode the work carries on in spawned subprocesses. Could these different multiprocessing approaches cause the model or the model parameters copied to each GPU to differ?

Below is how the Lightning Trainer is initialized and fit is called (very standard steps):

self.trainer = pl.Trainer(
    max_epochs=configs["max_epochs"],
    gpus=[0, 1, 3],
    accelerator="ddp",
    weights_summary="top",
    gradient_clip_val=0.1,
    limit_train_batches=30,
    callbacks=[lr_logger, early_stop_callback, checkpoint_callback],
)

model = self.trainer.fit(
    model,
    train_dataloader=self.train_dataloader,
    val_dataloaders=self.val_dataloader,
)

Please help!

BlockWaving added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Feb 10, 2021
tchaton added the priority: 1 (Medium priority task) label on Feb 15, 2021
tchaton (Contributor) commented Feb 15, 2021

Dear @BlockWaving,

Thanks for reporting this bug.
Would this fail if you provide gpus=[0, 1, 2] instead of gpus=[0, 1, 3]?

Best,
T.C

BlockWaving (Author) commented

Thanks for catching the typo; the actual testing was with gpus=[0, 1, 2].

awaelchli (Contributor) commented Mar 8, 2021

I have never seen the error you are reporting.
I would do the following: replace your model with one of ours, e.g. the bug report model.
If the error does not occur with our model, then the problem is with your model.
With these types of problems we can only guess endlessly. We will probably not be able to help much unless we have a script that reproduces the issue.
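
For reference, a minimal repro sketch along these lines (the stand-in model below is illustrative, since the import path of Lightning's own bug report model differs between versions; the Trainer arguments mirror the ones from this issue):

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class TinyModel(pl.LightningModule):
    # Stand-in for the bug report model: a single Linear layer on random data.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

if __name__ == "__main__":
    ds = TensorDataset(torch.randn(64, 32), torch.randn(64, 2))
    trainer = pl.Trainer(gpus=[0, 1, 2], accelerator="ddp", max_epochs=1)
    trainer.fit(TinyModel(), DataLoader(ds, batch_size=8))

If the mismatch does not appear with a model like this, the problem is in how the original model builds and registers its parameters.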

awaelchli added the distributed (Generic distributed-related topic) and information needed labels on Mar 8, 2021
stale bot commented Apr 7, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix (This will not be worked on) label on Apr 7, 2021
stale bot closed this as completed on Apr 18, 2021
pengyuange commented
My code ran into the same problem.
When I looked at my training log, I found a ModuleDict in my module whose keys come from looping over a list.
When I sorted the list, the problem was solved.
Hope this is helpful for you.
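
In other words (a hypothetical sketch, not the original code): building the ModuleDict from a sorted key list makes the parameter registration order, and therefore the per-parameter sizes DDP compares across processes, identical on every rank.

import torch.nn as nn

class Heads(nn.Module):
    # Hypothetical module with one differently-sized head per feature.
    def __init__(self, feature_dims, out_dim):
        super().__init__()
        # feature_dims: dict mapping feature name -> input dimension.
        # Iterating over sorted(feature_dims) gives every rank the same
        # registration order; with an arbitrary order, two ranks can line up
        # differently-sized Linear layers against each other, producing the
        # "replicas[0][0] ... appears not to match" error.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feature_dims[name], out_dim) for name in sorted(feature_dims)}
        )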
