
ddp training stuck (lightning=1.4.9) #9851

Closed
yinrong opened this issue Oct 7, 2021 · 9 comments
Labels: bug (Something isn't working) · distributed (Generic distributed-related topic) · priority: 1 (Medium priority task) · waiting on author (Waiting on user action, correction, or update)

Comments

yinrong commented Oct 7, 2021

I'm stuck using lightning=1.4.9.
The progress bar numbers "96% 4260/4435" stay the same forever.
[screenshot]

tchaton (Contributor) commented Oct 7, 2021

Dear @yinrong,

Would it be possible for you to provide a reproducible script, possibly with the BoringModel?

Best,
T.C

tchaton added the distributed, bug, priority: 1, and waiting on author labels on Oct 7, 2021
yinrong (Author) commented Oct 7, 2021

> Dear @yinrong,
>
> Would it be possible for you to provide a reproducible script, possibly with the BoringModel?
>
> Best, T.C

@tchaton

[screenshot]

I tested my code with less data (a random 1%) and it works well (it quickly runs to epoch 66), so I think a minimal reproducible script will be hard to produce.

More clues:

  1. After restarting Python, it gets stuck at exactly the same place.
  2. The training dataset is randomly sampled, e.g. pick_row() if random() < 0.5 (see the sketch after this list).
  3. When stuck, CPU/GPU are fully utilized.
    [screenshot]
  4. Ctrl+C gets no response. After killing the process, a warning shows: multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
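
For reference, the sampling roughly looks like this (a simplified sketch; rows stands in for my real data source):

import random

def build_dataset(rows):
    # Each DDP process runs this independently, so every rank draws its own
    # random numbers and can end up with a dataset of a different length.
    return [row for row in rows if random.random() < 0.5]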

yinrong (Author) commented Oct 7, 2021

I changed some code. This time it got stuck at epoch-0-val 100%, and after I pressed Ctrl+C:
[screenshot]

yinrong (Author) commented Oct 8, 2021

@tchaton could you continue to work on this issue?

tchaton (Contributor) commented Oct 9, 2021

Hey @yinrong,

Could you make sure you have the same number of batches on all ranks?

Training will hang if the number of batches is uneven across ranks.
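
A minimal sketch of how to check this (assuming torch.distributed has already been initialized by the Trainer; not an official Lightning API):

import torch.distributed as dist

def log_batch_count(dataloader):
    # If the printed counts differ between ranks, the rank with more batches
    # will wait forever in a collective op (e.g. the gradient all_reduce)
    # once the shorter ranks finish their epoch.
    rank = dist.get_rank() if dist.is_initialized() else 0
    print(f"rank {rank}: {len(dataloader)} batches")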

Best,
T.C

awaelchli (Contributor) commented:

To expand on @tchaton's answer: when you randomly select your dataset, make sure each process chooses the same one. You can set the seed to make sure of this:

from pytorch_lightning import seed_everything
seed_everything(1)
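
With the seed set before the dataset is built, every process draws the same random numbers and therefore selects the same rows (a minimal sketch; rows is a placeholder for your data source):

import random
from pytorch_lightning import seed_everything

rows = list(range(1000))   # placeholder for your data source
seed_everything(1)         # seeds python, numpy and torch on every rank
# All ranks now draw the same random numbers, so they select the identical
# subset and see the same number of batches.
subset = [row for row in rows if random.random() < 0.5]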

yinrong (Author) commented Oct 13, 2021

Making all GPUs (i.e. all ranks) have the same number of batches fixes the issue.
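
For anyone hitting the same thing, a rough sketch of one way to enforce an equal batch count (rows, batch_size and the data values are placeholders): build the identical dataset on every rank (e.g. with seed_everything as above) and trim it so it splits into full batches across ranks.

import torch.distributed as dist

rows = list(range(10_000))  # placeholder: the (identical) dataset on every rank
batch_size = 32             # placeholder: your batch size
world_size = dist.get_world_size() if dist.is_initialized() else 1
# Drop the tail so every rank runs exactly the same number of full batches.
usable = (len(rows) // (world_size * batch_size)) * world_size * batch_size
rows = rows[:usable]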

colligant commented:

seed_everything() fixed my issue as well. I am also randomly sampling from my dataset, but I was confronted with a different traceback - pytorch_lightning.utilities.exceptions.DeadlockException and WorkNCCL(OpType=AllReduce, Timeout(ms)=1800000). Just dropping this here in case it helps anyone.

IamHimon commented:

Same question here. Have you solved it?
