Killed #114

Closed
zcl912 opened this issue Jul 23, 2021 · 4 comments


zcl912 commented Jul 23, 2021

Hello, when I train nano on a single GPU with python tools/train.py -f exps/default/nano.py -d 1 -b 8 --fp16 -o, the training process gets killed (at epoch 7):

2021-07-23 03:52:26 | INFO | yolox.core.trainer:245 - epoch: 7/300, iter: 12700/14786, mem: 20340Mb, iter_time: 0.679s, data_time: 0.493s, total_loss: 8.6, iou_loss: 2.5, l1_loss: 0.0, conf_loss: 3.8, cls_loss: 2.3, lr: 1.250e-03, size: 416, ETA: 11 days, 13:25:24
Killed

@GOATmessi7
Member

This may also be caused by the memory leak bug #103. We can reproduce this bug and a fix is on the way.


yonomitt commented Jul 24, 2021

I am getting this pretty consistently, too. I think it could be related to the memory leak because if I run on a single GPU, I get an "out of memory" error, but if I run on multiple GPUs, I get something similar to this. More specifically:

2021-07-23 23:35:18 | INFO     | yolox.core.trainer:237 - epoch: 7/300, iter: 480/3236, mem: 6018Mb, iter_time: 32.623s, data_time: 0.058s, total_loss: 6.2, iou_loss: 2.3, l1_loss: 0.0, conf_loss: 2.9, cls_loss: 1.0, lr: 1.250e-03, size: 448, ETA: 3 days, 15:14:04
2021-07-23 23:35:27 | INFO     | yolox.core.trainer:183 - Training of experiment is done and the best AP is 0.00
2021-07-23 23:35:27 | ERROR    | yolox.core.launch:104 - An error has been caught in function '_distributed_worker', process 'SpawnProcess-1' (142), thread 'MainThread' (140190807295808):
Traceback (most recent call last):

(and I already tried setting the eval_loader's num_workers to 0)
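For reference, the change I mean looks roughly like this. It's a minimal sketch, assuming the nano exp still inherits from yolox.exp.Exp and that data_num_workers is the attribute the train/eval dataloaders read their worker count from (please double-check against your checkout):

```python
# my_nano_debug.py -- hypothetical exp override, run with:
#   python tools/train.py -f my_nano_debug.py -d 1 -b 8 --fp16 -o
import os

from yolox.exp import Exp as MyExp


class Exp(MyExp):
    def __init__(self):
        super().__init__()
        self.exp_name = os.path.split(os.path.realpath(__file__))[1].split(".")[0]
        # 0 keeps all data loading in the main process, so per-worker
        # copy-on-write memory growth cannot accumulate (slower iterations,
        # but useful for isolating the leak).
        self.data_num_workers = 0
```

Subclassing the base Exp here is just to show the knob; for nano you would put the same line into exps/default/nano.py.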

@GOATmessi7
Member

Yes, I think we may be triggering the memory leak bug of the PyTorch dataloader described in pytorch/pytorch#13246.

I am trying some of the solutions listed in that issue, but so far the ones I've tried haven't worked.
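For reference, the mitigation most of the suggestions in that thread boil down to is keeping the per-sample annotation metadata in numpy arrays (or serialized bytes) rather than plain Python lists/dicts, so the refcount updates done by dataloader workers don't copy-on-write the whole parent heap. A rough self-contained sketch of the pattern (illustrative names only, not the actual YOLOX dataset code):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ArrayBackedDataset(Dataset):
    """Stores annotations in one numpy array instead of a list of per-image dicts."""

    def __init__(self, num_samples=10_000, max_boxes=50):
        # In a real dataset this array would be filled from the annotation file.
        self.annotations = np.zeros((num_samples, max_boxes, 5), dtype=np.float32)

    def __len__(self):
        return self.annotations.shape[0]

    def __getitem__(self, index):
        boxes = self.annotations[index]    # plain ndarray slice, no Python object churn
        image = torch.zeros(3, 416, 416)   # stand-in for the decoded image
        return image, torch.from_numpy(boxes.copy())


if __name__ == "__main__":
    loader = DataLoader(ArrayBackedDataset(), batch_size=8, num_workers=4)
    images, targets = next(iter(loader))
    print(images.shape, targets.shape)
```

This is just the generic pattern from that thread, not a drop-in fix for the YOLOX datasets.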

@yonomitt

Ok, thanks for the info. I'll look into it some, too. If I find anything that works, I'll let you know 👍
