Training stopped after 31 epochs without printing any error message #91

Closed
deep-practice opened this issue Jul 23, 2021 · 11 comments

@deep-practice

Ubuntu 18.04
CUDA 10.1
PyTorch 1.7.1

@GOATmessi7
Member

We suspect something is going wrong in the multi-threading of the dataloader in your environment. We are planning to add a subprocess launcher as an alternative, but for now you can check your NCCL configuration and try to resume the training with --resume
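
For the NCCL check, a quick sanity probe like the following (a minimal sketch, not part of YOLOX) prints the NCCL build PyTorch is using and any NCCL_* environment variables that could affect the distributed launcher:

```python
# Minimal NCCL sanity check; assumes a CUDA-enabled PyTorch build.
import os
import torch

print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
print("nccl:", torch.cuda.nccl.version())  # NCCL version PyTorch was built with

# Setting NCCL_DEBUG=INFO makes NCCL log its init, which helps spot silent multi-GPU hangs.
for key in sorted(k for k in os.environ if k.startswith("NCCL_")):
    print(key, "=", os.environ[key])
```

Resuming would then look something like `python tools/train.py -f <your_exp>.py -d <gpus> -b <batch> --resume -c YOLOX_outputs/<exp_name>/latest_ckpt.pth`; the checkpoint path and the flags other than --resume are assumptions here, so check tools/train.py in your checkout.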

@lidehuihxjz

We suspect something is going wrong in the multi-threading of the dataloader in your environment. We are planning to add a subprocess launcher as an alternative, but for now you can check your NCCL configuration and try to resume the training with --resume

I also ran into this problem when training on the VOC dataset. I noticed that memory keeps growing, and training exits once memory is exhausted. Is there a memory leak?

@GOATmessi7
Member

@lidehuihxjz You could try lowering num_workers.
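
In YOLOX the worker count normally comes from the experiment file rather than from a command-line flag; below is a minimal sketch of lowering it in a VOC-style exp (the attribute name data_num_workers is taken from the base Exp in this repo, so double-check it against your local yolox/exp/yolox_base.py):

```python
# Hypothetical exp file sketch: lower the dataloader worker count for VOC training.
from yolox.exp import Exp as BaseExp


class Exp(BaseExp):
    def __init__(self):
        super().__init__()
        self.num_classes = 20        # VOC has 20 classes
        self.data_num_workers = 2    # default is 4; fewer worker processes holding dataset state
```

Each dataloader worker is a separate process with its own view of the dataset, so lowering the count is mostly a way to narrow down whether the growth comes from the workers or from the main process.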

@lidehuihxjz

@lidehuihxjz You could try lowering num_workers.

Hi, I lowered num_workers from 4 to 2, but memory usage still keeps growing.
I'm training yolox_nano on VOC2007; apart from switching the dataloader-related parts to VOC, the code is unchanged.

@GOATmessi7
Member

Is it GPU memory or host memory that is growing? Are there any other abnormal messages?

@lidehuihxjz

Is it GPU memory or host memory that is growing? Are there any other abnormal messages?

It is host memory that grows; no other abnormal messages are reported.
With batch size 16 on a single GPU, memory grows by roughly 1 GB every ten to twenty minutes.

@GOATmessi7
Member

Is it GPU memory or host memory that is growing? Are there any other abnormal messages?

It is host memory that grows; no other abnormal messages are reported.
With batch size 16 on a single GPU, memory grows by roughly 1 GB every ten to twenty minutes.

Then set workers to 0 first and see what happens. If it still leaks, there may be a problem in the dataloader.
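
To make the "still leaking with 0 workers" test more concrete, one can log the host RSS of the training process over time; here is a minimal sketch using psutil (a hypothetical helper, not part of YOLOX):

```python
# Host-memory probe: call log_rss() once per iteration or epoch and watch
# whether RSS keeps climbing after setting the dataloader workers to 0.
import os
import psutil

_proc = psutil.Process(os.getpid())

def log_rss(tag=""):
    rss_gb = _proc.memory_info().rss / 1024 ** 3
    print(f"[mem] {tag} rss={rss_gb:.2f} GB")
```

If RSS keeps rising even with workers at 0, that points to the main process (something accumulating in the training loop) rather than to the worker processes.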

@JinYAnGHe

We suspect something is going wrong in the multi-threading of the dataloader in your environment. We are planning to add a subprocess launcher as an alternative, but for now you can check your NCCL configuration and try to resume the training with --resume

I also ran into this problem when training on the VOC dataset. I noticed that memory keeps growing, and training exits once memory is exhausted. Is there a memory leak?

Yes, I ran into the same problem.

@lidehuihxjz

Is it GPU memory or host memory that is growing? Are there any other abnormal messages?

It is host memory that grows; no other abnormal messages are reported.
With batch size 16 on a single GPU, memory grows by roughly 1 GB every ten to twenty minutes.

Then set workers to 0 first and see what happens. If it still leaks, there may be a problem in the dataloader.

With num_workers set to 0, memory still grows, just more slowly.

@FateScript
Member

This issue is solved in #216

@ladyxuxu

ladyxuxu commented Jul 6, 2022

The same error still occurs with yolox 0.3.0.
