Training stuck at 0% after a few epochs while training with DDP #5865
Comments
Related? I'm facing a similar issue; not sure if those might help.
Hi @HareshKarnan, @ndrplz. Folks, may I ask you to run your pipelines with …
@ndrplz I don't know if it is related, but the problem here is that training does happen for the first few epochs - in my case, it ran for 13 epochs and then got stuck at 0% at epoch 14.
Here is my output. It gets stuck at 0% while training in epoch 16.
Run your script with …
I ran the script again as … This time it got stuck at epoch 11.
Same problem here, stuck at 0% at epoch 18.
Mind sharing code, ideally in Colab, to reproduce?
In my case, it got stuck at 0% at epoch 18 with 2 GPUs and DDP before.
I have the same issue after updating to 1.1.8; will try with 1.2.0.dev0 to see if it has the same error. PyTorch 1.7.
1.2.0rc1 also has the issue, 1.1.6 does not.
Would you be able to try master? We've recently consolidated the branches back to master!
Hey everyone, could it be related to this issue and solved by this PR: #6004? I have seen … Best,
Started a run with master + @tchaton's patch, will see how it goes. UPDATE: run stalled at epoch 6 :(
@tchaton any update here?
Also having this problem.
Wanted to add some details.
Thanks! Will take a look and try to resolve it soon.
@HareshKarnan, @talolard, @stillwalker1234 or @genghisun, can any of you please provide a reproducible script/Colab? Do you checkpoint based on a value that does not come from Lightning metrics and can differ across processes? Probably related to #5604 (comment).
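For context on that question, here is a minimal sketch (not code from this thread) of the pattern being asked about: logging the monitored value with `sync_dist=True` so every DDP rank compares the same number in `ModelCheckpoint`. The model, data, and metric names are placeholders, and `accelerator="ddp"` reflects the 1.1/1.2-era Trainer flag discussed here.

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint


class LitClassifier(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self.layer(x), y)

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.layer(x), y)
        # sync_dist=True reduces the value across DDP ranks, so the
        # checkpoint callback sees the same "val_loss" on every process.
        self.log("val_loss", loss, sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


checkpoint_cb = ModelCheckpoint(monitor="val_loss", mode="min")
trainer = pl.Trainer(gpus=2, accelerator="ddp", callbacks=[checkpoint_cb])
```

If the monitored value is instead computed independently on each rank, the processes can disagree on whether to save a checkpoint, which is one way DDP runs can end up waiting on each other.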
I faced the same issue and the problem was resolved by changing the number of workers to 0. This is not an acceptable fix, just a workaround. I assume we have some deadlock when something goes wrong while spawning dataloaders.
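For reference, the workaround described above is simply turning off DataLoader worker processes; a minimal sketch with a placeholder dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))

# num_workers=0 loads batches in the main process, avoiding the suspected
# deadlock in spawned worker processes (at the cost of loading throughput).
train_loader = DataLoader(dataset, batch_size=64, num_workers=0)
```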
Dear @JonasFrey96, @taltalim, @HareshKarnan, would it be possible for you to work on a reproducible script using the BoringModel? Best,
I'm having the same issue; it goes away when using the default … Edit: just wanted to add that I am logging the loss as …
I've been running into this issue with 1.1.8 & 1.2.1 the last few days with DDP and 2 GPUs. Things tried: …
The epoch seems to be random (it might stop after the first, or after several), but it always happens just after validation, at the start of the next epoch (0%). No error messages. Both GPUs are locked at "100%", but the data being sent to GPU 0 (RX in nvtop) is 0 MB/s and the temperatures show that the GPUs are not working hard. 2 CPU cores are locked at 100%.
I think I've isolated the issue from the discussion in #5604 (comment). This issue started when I switched the …
In …
I can see the same issue when running my training script. My …
Edit: response to a deleted question ¯\_(ツ)_/¯
If, for example, x is not a scalar (e.g. you want to calculate IoU or something), you can combine them using something like:
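The snippet that originally followed this comment is not preserved in the thread; below is a minimal sketch of one way to combine non-scalar per-batch results into an epoch-level IoU, assuming the PL 1.x `validation_epoch_end` hook and a hypothetical `self._predict` helper that returns binary masks.

```python
import torch
import pytorch_lightning as pl


class SegModel(pl.LightningModule):
    def validation_step(self, batch, batch_idx):
        preds, target = self._predict(batch)  # hypothetical helper
        # Return raw counts instead of a per-batch IoU so they can be summed
        # over all batches before the final division.
        intersection = ((preds == 1) & (target == 1)).sum()
        union = ((preds == 1) | (target == 1)).sum()
        return {"intersection": intersection, "union": union}

    def validation_epoch_end(self, outputs):
        intersection = torch.stack([o["intersection"] for o in outputs]).sum()
        union = torch.stack([o["union"] for o in outputs]).sum()
        iou = intersection.float() / union.float().clamp(min=1.0)
        # sync_dist=True reduces the logged value across DDP processes.
        self.log("val_iou", iou, sync_dist=True)
```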
At the moment I'm working on a joint segmentation & classification problem and using …
I'm also noticing that there are logger files (tfevents) for each process now. I wonder if #6364 and this issue are related.
At the moment I can't seem to reproduce it using the BoringModel. I will look into that in the coming days.
I encountered a similar issue, training hanging at the end of a validation epoch, when a custom metric is being synced between processes. I used … This is probably not the issue for most people, but I just wanted to point out that it can cause some confusion too. I guess there's no way to check that all processes produce identically-sized tensors. Maybe the Metric API documentation could be updated to note that all state variables must have identical shapes across processes.
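To illustrate the shape constraint described above, here is a minimal sketch of a custom metric with fixed-shape (scalar) states, which sync safely across ranks; it is not the reporter's actual metric. It uses the torchmetrics `Metric` base class (the same API lived under `pytorch_lightning.metrics` in the 1.1/1.2 releases discussed here).

```python
import torch
from torchmetrics import Metric


class SafeAccuracy(Metric):
    def __init__(self):
        super().__init__()
        # Scalar states have the same shape on every rank, so the
        # cross-process reduction cannot mismatch.
        self.add_state("correct", default=torch.tensor(0), dist_reduce_fx="sum")
        self.add_state("total", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, preds: torch.Tensor, target: torch.Tensor):
        self.correct += (preds == target).sum()
        self.total += target.numel()

    def compute(self):
        return self.correct.float() / self.total
```

By contrast, a hand-rolled sync that gathers per-rank tensors whose sizes depend on the local sample count can mismatch when the data is not evenly divisible, which matches the hang described in the comment.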
I hope this helps: I implemented all the reduction and logging in … Looks like PL attempts to do a model checkpoint before running …
I have the same problem when using version 1.2.3.
Hi @ifsheldon, I am facing the same problem; may I ask how you solved it?
An initial test with today's master seems to show this issue is fixed for me.
Looks like this issue is fixed in version 1.2.4.
@thiyagu145 oh really? I'll double-check; I thought it required some changes not in 1.2.4. But that would be great if 1.2.4 fixes it.
Yeah, training completed without any issues.
Yeah, I tried 1.2.4 now and there is no issue anymore.
I can confirm 1.2.4 fixes the issue. I'm wondering which PR fixed this - possibly #6410?
Same issue.
You can downgrade to 1.2.4.
Same issue with 1.8.3.post1.
🐛 Bug
I recently updated to pytorch_lightning 1.1.7 and noticed that after a few epochs of training, the training progress gets stuck at 0% and never advances. When I switch back to 1.1.4, this strange behavior does not occur. I do not know the root cause of this issue.
How you installed PyTorch (conda, pip, source): pip install