Killed #114

Closed
zcl912 opened this issue Jul 23, 2021 · 4 comments


zcl912 commented Jul 23, 2021

Hello, when I train nano on a single GPU with python tools/train.py -f exps/default/nano.py -d 1 -b 8 --fp16 -o, the training process gets killed (at epoch 7):

2021-07-23 03:52:26 | INFO | yolox.core.trainer:245 - epoch: 7/300, iter: 12700/14786, mem: 20340Mb, iter_time: 0.679s, data_time: 0.493s, total_loss: 8.6, iou_loss: 2.5, l1_loss: 0.0, conf_loss: 3.8, cls_loss: 2.3, lr: 1.250e-03, size: 416, ETA: 11 days, 13:25:24
Killed

@GOATmessi7
Member

This may also be caused by the memory leak bug #103. We can reproduce this bug and a fix is on the way.


yonomitt commented Jul 24, 2021

I am getting this pretty consistently, too. I think it could be related to the memory leak because if I run on a single GPU, I get an "out of memory" error, but if I run on multiple GPUs, I get something similar to this. More specifically:

2021-07-23 23:35:18 | INFO     | yolox.core.trainer:237 - epoch: 7/300, iter: 480/3236, mem: 6018Mb, iter_time: 32.623s, data_time: 0.058s, total_loss: 6.2, iou_loss: 2.3, l1_loss: 0.0, conf_loss: 2.9, cls_loss: 1.0, lr: 1.250e-03, size: 448, ETA: 3 days, 15:14:04
2021-07-23 23:35:27 | INFO     | yolox.core.trainer:183 - Training of experiment is done and the best AP is 0.00
2021-07-23 23:35:27 | ERROR    | yolox.core.launch:104 - An error has been caught in function '_distributed_worker', process 'SpawnProcess-1' (142), thread 'MainThread' (140190807295808):
Traceback (most recent call last):

(and I already tried setting the eval_loader's num_workers to 0)
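For reference, the change I mean looks roughly like this. It's a minimal sketch, assuming the nano exp still inherits from yolox.exp.Exp and that data_num_workers is the attribute the train/eval dataloaders read their worker count from (please double-check against your checkout):

```python
# my_nano_debug.py -- hypothetical exp override, run with:
#   python tools/train.py -f my_nano_debug.py -d 1 -b 8 --fp16 -o
import os

from yolox.exp import Exp as MyExp


class Exp(MyExp):
    def __init__(self):
        super().__init__()
        self.exp_name = os.path.split(os.path.realpath(__file__))[1].split(".")[0]
        # 0 keeps all data loading in the main process, so per-worker
        # copy-on-write memory growth cannot accumulate (slower iterations,
        # but useful for isolating the leak).
        self.data_num_workers = 0
```

Subclassing the base Exp here is just to show the knob; for nano you would put the same line into exps/default/nano.py.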

@GOATmessi7
Member

Yes, I think we may be triggering the memory leak bug of the PyTorch dataloader described in pytorch/pytorch#13246.

I am trying some of the solutions listed in that issue, but so far the ones I've tried haven't worked.
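For reference, the mitigation most of the suggestions in that thread boil down to is keeping the per-sample annotation metadata in numpy arrays (or serialized bytes) rather than plain Python lists/dicts, so the refcount updates done by dataloader workers don't copy-on-write the whole parent heap. A rough self-contained sketch of the pattern (illustrative names only, not the actual YOLOX dataset code):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


class ArrayBackedDataset(Dataset):
    """Stores annotations in one numpy array instead of a list of per-image dicts."""

    def __init__(self, num_samples=10_000, max_boxes=50):
        # In a real dataset this array would be filled from the annotation file.
        self.annotations = np.zeros((num_samples, max_boxes, 5), dtype=np.float32)

    def __len__(self):
        return self.annotations.shape[0]

    def __getitem__(self, index):
        boxes = self.annotations[index]    # plain ndarray slice, no Python object churn
        image = torch.zeros(3, 416, 416)   # stand-in for the decoded image
        return image, torch.from_numpy(boxes.copy())


if __name__ == "__main__":
    loader = DataLoader(ArrayBackedDataset(), batch_size=8, num_workers=4)
    images, targets = next(iter(loader))
    print(images.shape, targets.shape)
```

This is just the generic pattern from that thread, not a drop-in fix for the YOLOX datasets.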

@yonomitt

Ok, thanks for the info. I'll look into it some, too. If I find anything that works, I'll let you know 👍
