
training with ddp get replicas mismatch error #5894

Closed

BlockWaving opened this issue Feb 10, 2021 · 5 comments
Labels
bug (Something isn't working) · distributed (Generic distributed-related topic) · help wanted (Open to be worked on) · priority: 1 (Medium priority task) · won't fix (This will not be worked on)

Comments

BlockWaving commented Feb 10, 2021

Hi,

I've been getting this replicas error with ddp training.
Setup: Windows 10, torch 1.7.1, pytorch-lightning 1.1.7, on a 3-GPU machine.

The model training was working well with ddp on another machine with 2 GPUs (same setup: Windows 10, torch 1.7.1 and pytorch-lightning 1.1.7).

The code crashed after printing the following error message:

self.reducer = dist.Reducer(
RuntimeError: replicas[0][0] in this process with sizes [12, 6] appears not to match sizes of the same param in process 0.

(Note: the sizes [12, 6] in the error message change between runs; they can be any numbers, e.g. [128, 45].)

I then tried setting accelerator='ddp_spawn', which makes the replicas error disappear. But, just as the documentation warns, ddp_spawn comes with serious drawbacks: smaller batches, lower GPU utilization, longer training time, etc., and training can hardly get past 7-8 epochs because it always mysteriously crashes with a memory error.

So I still need to figure out how to get back to ddp mode.

I asked on the PyTorch forum; their answer was as follows:

"This happens if the model parameters are not the same across all replicas in DDP. Have you tried printing the sizes of all the params in the model from each rank (using model.parameters())? This would be the first thing to verify mismatched sizes."

I did print the number of model parameters in each process, and they are the same. (The printing happens after model initialization but before the Trainer initialization; the Trainer then initializes the underlying DDP, which is where the error happens.)
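
To make that concrete, here is a minimal sketch of the check the forum suggests (a hypothetical helper, assuming torch.distributed is already initialized when it runs), so the per-rank outputs can be diffed:

import torch.distributed as dist

def dump_param_shapes(model):
    # Print every parameter's name and shape on this rank; comparing the
    # outputs across processes shows which entry DDP's Reducer considers
    # mismatched.
    rank = dist.get_rank() if dist.is_initialized() else 0
    for name, p in model.named_parameters():
        print(f"rank {rank}: {name} {tuple(p.shape)}")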

I understand that in ddp mode the whole program is relaunched in each process, while in ddp_spawn mode the work carries on in spawned subprocesses. Could these different multiprocessing approaches cause the model or the model parameters copied to each GPU to differ?

Below is how the Lightning Trainer is initialized and fit is called (very standard steps):

self.trainer = pl.Trainer(
    max_epochs=configs["max_epochs"],
    gpus=[0, 1, 3],
    accelerator="ddp",
    weights_summary="top",
    gradient_clip_val=0.1,
    limit_train_batches=30,
    callbacks=[lr_logger, early_stop_callback, checkpoint_callback],
)

model = self.trainer.fit(
    model,
    train_dataloader=self.train_dataloader,
    val_dataloaders=self.val_dataloader,
)

Please help!

BlockWaving added the bug (Something isn't working) and help wanted (Open to be worked on) labels on Feb 10, 2021
tchaton added the priority: 1 (Medium priority task) label on Feb 15, 2021
tchaton (Contributor) commented Feb 15, 2021

Dear @BlockWaving,

Thanks for reporting this bug.
Would this fail if you provide gpus=[0, 1, 2] instead of gpus=[0, 1, 3]?

Best,
T.C

BlockWaving (Author) commented

Thanks for catching the typo; the actual testing was with gpus=[0, 1, 2].

awaelchli (Contributor) commented Mar 8, 2021

I have never seen the error you are reporting.
I would do the following: replace your model with one of ours, e.g. the bug report model.
If the error does not occur with our model, then the problem is with your model.
With these types of problems we can only guess endlessly. We will probably not be able to help much unless we have a script that reproduces the issue.
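
For reference, a minimal repro sketch along these lines (the stand-in model below is illustrative, since the import path of Lightning's own bug report model differs between versions; the Trainer arguments mirror the ones from this issue):

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class TinyModel(pl.LightningModule):
    # Stand-in for the bug report model: a single Linear layer on random data.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

if __name__ == "__main__":
    ds = TensorDataset(torch.randn(64, 32), torch.randn(64, 2))
    trainer = pl.Trainer(gpus=[0, 1, 2], accelerator="ddp", max_epochs=1)
    trainer.fit(TinyModel(), DataLoader(ds, batch_size=8))

If the mismatch does not appear with a model like this, the problem is in how the original model builds and registers its parameters.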

awaelchli added the distributed (Generic distributed-related topic) and information needed labels on Mar 8, 2021
stale bot commented Apr 7, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix (This will not be worked on) label on Apr 7, 2021
stale bot closed this as completed on Apr 18, 2021
pengyuange commented
My code ran into the same problem.
When I looked at my training log, I found a ModuleDict in my module whose keys come from looping over a list.
When I sorted the list, the problem was solved.
Hope this is helpful for you.
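
In other words (a hypothetical sketch, not the original code): building the ModuleDict from a sorted key list makes the parameter registration order, and therefore the per-parameter sizes DDP compares across processes, identical on every rank.

import torch.nn as nn

class Heads(nn.Module):
    # Hypothetical module with one differently-sized head per feature.
    def __init__(self, feature_dims, out_dim):
        super().__init__()
        # feature_dims: dict mapping feature name -> input dimension.
        # Iterating over sorted(feature_dims) gives every rank the same
        # registration order; with an arbitrary order, two ranks can line up
        # differently-sized Linear layers against each other, producing the
        # "replicas[0][0] ... appears not to match" error.
        self.heads = nn.ModuleDict(
            {name: nn.Linear(feature_dims[name], out_dim) for name in sorted(feature_dims)}
        )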
