DDP training stuck (lightning=1.4.9) #9851
Comments
Dear @yinrong, would it be possible for you to provide a reproducible script, possibly with the BoringModel? Best,
I tested my code with less data (a random 1% subset) and it works well (it quickly runs to epoch 66). So I think a minimal reproducible script is hard to produce. More clues:
@tchaton could you continue to work on this issue?
Hey @yinrong, could you make sure you have the same number of batches for all ranks? DDP will hang if the ranks have an uneven number of batches. Best,
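To illustrate why uneven batch counts cause a hang: in DDP, every rank must participate in each gradient all-reduce, so if one rank runs out of batches before the others, the remaining ranks block forever waiting for it. Below is a minimal pure-Python sketch (no torch dependency; `batches_per_rank` is a hypothetical helper, not a Lightning API) showing how a contiguous split of a dataset can leave ranks with different batch counts, and how dropping the last incomplete batch evens them out. Note that PyTorch's built-in `DistributedSampler` pads the dataset to avoid exactly this, so the problem typically arises with custom samplers or per-rank data filtering.

```python
import math

def batches_per_rank(num_samples, world_size, batch_size, drop_last=False):
    """Hypothetical helper: batches each rank sees when samples are
    split contiguously across ranks (earlier ranks absorb the remainder)."""
    counts = []
    for rank in range(world_size):
        # Samples assigned to this rank; the first (num_samples % world_size)
        # ranks get one extra sample.
        per_rank = num_samples // world_size
        if rank < num_samples % world_size:
            per_rank += 1
        if drop_last:
            counts.append(per_rank // batch_size)       # incomplete batch dropped
        else:
            counts.append(math.ceil(per_rank / batch_size))
    return counts

# 9 samples, 2 ranks, batch size 4: rank 0 gets 5 samples, rank 1 gets 4.
print(batches_per_rank(9, 2, 4))                  # [2, 1] -> uneven, DDP hangs
print(batches_per_rank(9, 2, 4, drop_last=True))  # [1, 1] -> even, no hang
```

In real code the analogous fix is passing `drop_last=True` to the `DataLoader` (or using a sampler that pads ranks to the same length), at the cost of discarding a partial batch per rank.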
To expand on @tchaton's answer: when you randomly select a subset of your dataset, make sure each process chooses the same one. You can set the seed to ensure this:

```python
from pytorch_lightning import seed_everything

seed_everything(1)
```
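To make the failure mode concrete: if each rank draws its own random 1% subset without a shared seed, the ranks train on different data and can end up with different numbers of batches, which triggers the hang described above. A minimal sketch (pure Python for illustration; `select_subset` is a hypothetical helper, not part of Lightning, which seeds Python, NumPy, and torch RNGs via `seed_everything`):

```python
import random

def select_subset(num_samples, fraction, seed):
    """Hypothetical helper mirroring the 'random 1% of the data'
    selection from this thread, using an explicit per-call RNG."""
    rng = random.Random(seed)  # local RNG: every rank can reproduce the draw
    k = max(1, int(num_samples * fraction))
    return sorted(rng.sample(range(num_samples), k))

# With the same seed, every rank selects the identical subset of indices,
# so all ranks see the same number of batches.
assert select_subset(1000, 0.01, seed=1) == select_subset(1000, 0.01, seed=1)
```

The same guarantee is what `seed_everything(1)` provides in practice: run it on every rank before the dataset is built, so any random split or subsampling is identical across processes.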
Same question, have you solved it?
I'm stuck using lightning=1.4.9; the progress bar stays at "96% 4260/4435" forever.