
ddp training stuck (lightning=1.4.9) #9851

Closed
yinrong opened this issue Oct 7, 2021 · 9 comments
Labels: bug (Something isn't working) · distributed (Generic distributed-related topic) · priority: 1 (Medium priority task) · waiting on author (Waiting on user action, correction, or update)

Comments

yinrong commented Oct 7, 2021

I'm stuck using lightning=1.4.9.
The progress bar numbers "96% 4260/4435" stay the same forever.
[screenshot]

tchaton (Contributor) commented Oct 7, 2021

Dear @yinrong,

Would it be possible for you to provide a reproducible script, possibly with the BoringModel?

Best,
T.C

tchaton added the distributed, bug, priority: 1, and waiting on author labels on Oct 7, 2021
yinrong (Author) commented Oct 7, 2021

> Dear @yinrong,
>
> Would it be possible for you to provide a reproducible script, possibly with the BoringModel?
>
> Best, T.C

@tchaton

[screenshot]

I tested my code with less data (a random 1%) and it works well (it quickly runs to epoch 66), so I think a minimal reproducible script will be hard to produce.

More clues:

  1. After restarting Python, it gets stuck at exactly the same place.
  2. The training dataset is randomly sampled, e.g. pick_row() if random() < 0.5 (see the sketch after this list).
  3. When stuck, CPU/GPU are fully utilized.
    [screenshot]
  4. Ctrl+C gets no response. After killing the process, a warning shows: multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
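
For reference, the sampling roughly looks like this (a simplified sketch; rows stands in for my real data source):

import random

def build_dataset(rows):
    # Each DDP process runs this independently, so every rank draws its own
    # random numbers and can end up with a dataset of a different length.
    return [row for row in rows if random.random() < 0.5]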

yinrong (Author) commented Oct 7, 2021

I changed some code. This time it got stuck at epoch-0-val 100%, and after I pressed Ctrl+C:
[screenshot]

yinrong (Author) commented Oct 8, 2021

@tchaton could you continue to work on this issue?

tchaton (Contributor) commented Oct 9, 2021

Hey @yinrong,

Could you make sure you have the same number of batches on all ranks?

Training will hang if the number of batches is uneven across ranks.
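
A minimal sketch of how to check this (assuming torch.distributed has already been initialized by the Trainer; not an official Lightning API):

import torch.distributed as dist

def log_batch_count(dataloader):
    # If the printed counts differ between ranks, the rank with more batches
    # will wait forever in a collective op (e.g. the gradient all_reduce)
    # once the shorter ranks finish their epoch.
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"rank {rank}: {len(dataloader)} batches")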

Best,
T.C

awaelchli (Contributor) commented:

To expand on @tchaton's answer: when you randomly select your dataset, make sure each process chooses the same one. You can set the seed to make sure of this:

from pytorch_lightning import seed_everything
seed_everything(1)
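
With the seed set before the dataset is built, every process draws the same random numbers and therefore selects the same rows (a minimal sketch; rows is a placeholder for your data source):

import random
from pytorch_lightning import seed_everything

rows = list(range(1000))   # placeholder for your data source
seed_everything(1)         # seeds python, numpy and torch on every rank
# All ranks now draw the same random numbers, so they select the identical
# subset and see the same number of batches.
subset = [row for row in rows if random.random() < 0.5]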

yinrong (Author) commented Oct 13, 2021

Making all GPUs (i.e. all ranks) have the same number of batches fixes the issue.
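
For anyone hitting the same thing, a rough sketch of one way to enforce an equal batch count (rows, batch_size and the data values are placeholders): build the identical dataset on every rank (e.g. with seed_everything as above) and trim it so it splits into full batches across ranks.

import torch.distributed as dist

rows = list(range(10_000))  # placeholder: the (identical) dataset on every rank
batch_size = 32             # placeholder: your batch size
world_size = dist.get_world_size() if dist.is_initialized() else 1
# Drop the tail so every rank runs exactly the same number of full batches.
usable = (len(rows) // (world_size * batch_size)) * world_size * batch_size
rows = rows[:usable]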

colligant commented:

seed_everything() fixed my issue as well. I am also randomly sampling from my dataset, but I was confronted with a different traceback - pytorch_lightning.utilities.exceptions.DeadlockException and WorkNCCL(OpType=AllReduce, Timeout(ms)=1800000). Just dropping this here in case it helps anyone.

IamHimon commented:

Same question here. Have you solved it?
