
Out of memory except d0 and d1 training #459

Closed
rcg12387 opened this issue May 30, 2020 · 6 comments · Fixed by #711
Labels
P1 Priority 1 (high)

Comments

@rcg12387

rcg12387 commented May 30, 2020

Hi @mingxingtan.
Thank you for your great work. I have some trouble training the larger nets. When I try to train a bigger model, e.g. efficientdet-d3, training fails with a resource-exhausted (OOM) error. I have only trained d0 and d1 successfully; from efficientdet-d2 upward they all fail.
For example, below is my command line:

MODEL = 'efficientdet-d2'
!python main.py --mode=train_and_eval \
    --training_file_pattern=tfrecord/{file_pattern} \
    --validation_file_pattern=tfrecord/{file_pattern} \
    --val_json_file=tfrecord/json_pascal.json \
    --model_name={MODEL} \
    --model_dir=/deep/temp/model_dir/{MODEL}-finetune \
    --ckpt={MODEL} \
    --train_batch_size=8 \
    --eval_batch_size=8 --eval_samples=512 \
    --num_examples_per_epoch={images_per_epoch} --num_epochs=10 \
    --hparams="num_classes=20,moving_average_decay=0" \
    --use_tpu=False

Then it fails and outputs: "(0) Resource exhausted: OOM when allocating tensor with shape[8,48,48,112] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc. ... 0 successful operations. 0 derived errors ignored."

For efficientdet-d2, setting train_batch_size and eval_batch_size to 4 succeeds, but only for d2. Even with batch sizes of 4 or 2, d3 through d7 all fail with the same resource-exhausted error.
My GPU is a GeForce GTX 1080 Ti, which has 11 GB of GDDR5X as you know.

Is this a normal situation? If it is, how did you train all the nets? Did you train them on TPUs? Do I have to use a TPU to train the bigger nets?

Thanks in advance.

@InfiniteLife

Yes, it fails for me too (4x RTX 2080 Ti). Essentially I can only train d0. They used TPUs.

@mingxingtan
Member

Yeah, I realize this is a problem for GPUs.

A TPU also has only 16GB per core, but it can fit a batch size of 4 for D0-D7. After some study, I realized that the TPU XLA compiler automatically uses re-materialization (see https://cloud.google.com/tpu/docs/system-architecture) to reduce memory usage; it is similar to gradient checkpointing.

I am investigating some possible ways to allow training bigger models on GPUs, but it is still in progress.
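As an aside (this is not part of the repo), a similar trade of compute for memory can be done by hand on GPU in TF 2 with tf.recompute_grad. A minimal sketch, where the toy block and all shapes are illustrative assumptions:

```python
import tensorflow as tf

# Illustrative gradient-checkpointing sketch; the block and shapes are made up.
w1 = tf.Variable(tf.random.normal([1024, 4096]))
w2 = tf.Variable(tf.random.normal([4096, 1024]))

def heavy_block(x):
    # The intermediate activation `h` is what normally stays resident for backprop.
    h = tf.nn.relu(tf.matmul(x, w1))
    return tf.matmul(h, w2)

# The wrapped block drops its intermediates in the forward pass and recomputes
# them during the backward pass, similar to what XLA re-materialization does
# automatically on TPU.
heavy_block_ckpt = tf.recompute_grad(heavy_block)

x = tf.random.normal([8, 1024])
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(heavy_block_ckpt(x)))
grads = tape.gradient(loss, [w1, w2])
```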

@mingxingtan
Member

For now, you can use "precision=mixed_float16" to partially mitigate this problem (it would reduce memory usage by roughly half).

@mingxingtan
Member

Change --hparams="num_classes=20,moving_average_decay=0" to --hparams="num_classes=20,moving_average_decay=0,precision=mixed_float16".
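For context, here is a generic Keras illustration of what mixed_float16 does (TF >= 2.4 API). The repo enables it through the hparam above, so treat the exact calls below as an assumption about the mechanism rather than the repo's code path:

```python
import tensorflow as tf

# Compute and activations run in float16 (roughly halving activation memory),
# while variables stay in float32.
tf.keras.mixed_precision.set_global_policy('mixed_float16')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(4096, activation='relu', input_shape=(1024,)),
    tf.keras.layers.Dense(10),
    # Keep the final outputs in float32 for numerical stability.
    tf.keras.layers.Activation('softmax', dtype='float32'),
])

# Loss scaling guards against float16 gradient underflow.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
```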

@rcg12387
Author

rcg12387 commented Jun 1, 2020

Thank you for your reply. Your recommendation works to some extent. I have tried "precision=mixed_float16" for D3, but it is awfully slow and has not yet finished even one epoch. Anyway, I now understand that the OOM is expected for bigger models on GPU. Thanks.

@bonlime

bonlime commented Jun 18, 2020

I've been working on a PyTorch reimplementation of EffDet, and memory is also an issue there.
Using a V100 with 32 GB and mixed precision, I increased the batch size for each model until OOM occurred. The results are below:

| Model | Resolution | Max batch size |
|-------|------------|----------------|
| D0    | 512        | ~54            |
| D3    | 896        | 8              |
| D4    | 1024       | 5              |
| D5    | 1280       | 3              |
| D6    | 1280       | 2              |
| D7    | 1536       | 1              |

So if you have a V100 16 GB or a 2080 Ti, you could probably train a D6 with 1 image per GPU, but to get a decent batch size you would need at least 8 GPUs, and it still may not converge very nicely.

If it is possible to fit 2 images per GPU for D5, then it could definitely be trained on an 8-GPU machine. Smaller models can be trained without OOM pain but would still require multiple GPUs and weeks.
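(Aside, not from bonlime's repos: when only 1-2 images fit per GPU, gradient accumulation is a common way to simulate a larger effective batch, though it does not fix small-batch BatchNorm statistics and makes each optimizer step proportionally slower. A hypothetical PyTorch sketch with a stand-in model:)

```python
import torch

# Effective batch = per_gpu_batch * accum_steps (* num_gpus with DDP).
model = torch.nn.Linear(256, 4)                  # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

accum_steps = 8                                  # e.g. 2 images * 8 steps = 16
optimizer.zero_grad()
for step in range(accum_steps):
    images = torch.randn(2, 256)                 # per-GPU micro-batch of 2
    targets = torch.randint(0, 4, (2,))
    loss = loss_fn(model(images), targets) / accum_steps
    loss.backward()                              # gradients accumulate in .grad
optimizer.step()
optimizer.zero_grad()
```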

Just for reference, on my machine D3 takes ~150 minutes per COCO epoch on a single GPU.

If you're interested, the code for the model is here: https://github.com/bonlime/pytorch-tools/blob/dev/pytorch_tools/detection_models/efficientdet.py
The training code is a WIP, but a draft is here: https://github.com/bonlime/pytorch_detection
