
Out of memory except d0 and d1 training #459

Closed
rcg12387 opened this issue May 30, 2020 · 6 comments · Fixed by #711
Labels
P1 Priority 1 (high)

Comments

@rcg12387

rcg12387 commented May 30, 2020

Hi @mingxingtan.
Thank you for your great work. I have some trouble training the larger nets. When I try to train a bigger model, e.g. efficientdet-d3, training fails with a resource-exhausted (OOM) error. I have only trained d0 and d1 successfully; from efficientdet-d2 upward they all fail.
For example, below is my command line:

MODEL = 'efficientdet-d2'
!python main.py --mode=train_and_eval \
    --training_file_pattern=tfrecord/{file_pattern} \
    --validation_file_pattern=tfrecord/{file_pattern} \
    --val_json_file=tfrecord/json_pascal.json \
    --model_name={MODEL} \
    --model_dir=/deep/temp/model_dir/{MODEL}-finetune \
    --ckpt={MODEL} \
    --train_batch_size=8 \
    --eval_batch_size=8 --eval_samples=512 \
    --num_examples_per_epoch={images_per_epoch} --num_epochs=10 \
    --hparams="num_classes=20,moving_average_decay=0" \
    --use_tpu=False

Then it fails and outputs: "(0) Resource exhausted: OOM when allocating tensor with shape[8,48,48,112] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc. ... 0 successful operations. 0 derived errors ignored."

For efficientdet-d2, setting train_batch_size and eval_batch_size to 4 succeeds, but only for d2. Even with batch sizes of 4 or 2, d3 through d7 all fail with the same resource-exhausted error.
My GPU is a GeForce GTX 1080 Ti, which has 11 GB of GDDR5X as you know.

Is this a normal situation? If it is, how did you train all the nets? Did you train them on TPUs? Do I have to use a TPU to train the bigger nets?

Thanks in advance.

@InfiniteLife

Yes, it fails for me too (4x RTX 2080 Ti). Essentially I can only train d0. They used TPUs.

@mingxingtan
Member

Yeah, I realize this is a problem for GPUs.

A TPU also has only 16GB per core, but it can fit a batch size of 4 for D0-D7. After some study, I realized that the TPU XLA compiler automatically uses re-materialization (see https://cloud.google.com/tpu/docs/system-architecture) to reduce memory usage; it is similar to gradient checkpointing.

I am investigating some possible ways to allow training bigger models on GPUs, but it is still in progress.
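As an aside (this is not part of the repo), a similar trade of compute for memory can be done by hand on GPU in TF 2 with tf.recompute_grad. A minimal sketch, where the toy block and all shapes are illustrative assumptions:

```python
import tensorflow as tf

# Illustrative gradient-checkpointing sketch; the block and shapes are made up.
w1 = tf.Variable(tf.random.normal([1024, 4096]))
w2 = tf.Variable(tf.random.normal([4096, 1024]))

def heavy_block(x):
    # The intermediate activation `h` is what normally stays resident for backprop.
    h = tf.nn.relu(tf.matmul(x, w1))
    return tf.matmul(h, w2)

# The wrapped block drops its intermediates in the forward pass and recomputes
# them during the backward pass, similar to what XLA re-materialization does
# automatically on TPU.
heavy_block_ckpt = tf.recompute_grad(heavy_block)

x = tf.random.normal([8, 1024])
with tf.GradientTape() as tape:
    loss = tf.reduce_mean(tf.square(heavy_block_ckpt(x)))
grads = tape.gradient(loss, [w1, w2])
```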

@mingxingtan
Member

For now, you can use "precision=mixed_float16" to partially mitigate this problem (it would reduce memory usage by roughly half).

@mingxingtan
Member

Change --hparams="num_classes=20,moving_average_decay=0" to --hparams="num_classes=20,moving_average_decay=0,precision=mixed_float16".
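For context, here is a generic Keras illustration of what mixed_float16 does (TF >= 2.4 API). The repo enables it through the hparam above, so treat the exact calls below as an assumption about the mechanism rather than the repo's code path:

```python
import tensorflow as tf

# Compute and activations run in float16 (roughly halving activation memory),
# while variables stay in float32.
tf.keras.mixed_precision.set_global_policy('mixed_float16')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(4096, activation='relu', input_shape=(1024,)),
    tf.keras.layers.Dense(10),
    # Keep the final outputs in float32 for numerical stability.
    tf.keras.layers.Activation('softmax', dtype='float32'),
])

# Loss scaling guards against float16 gradient underflow.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(tf.keras.optimizers.Adam())
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy')
```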

@rcg12387
Author

rcg12387 commented Jun 1, 2020

Thank you for your reply. Your recommendation works to some extent. I have tried "precision=mixed_float16" for D3, but it is awfully slow and has not yet finished even one epoch. Anyway, I now understand that the OOM is expected for bigger models on GPU. Thanks.

@bonlime

bonlime commented Jun 18, 2020

I've been working on a PyTorch reimplementation of EffDet, and memory is also an issue there.
Using a V100 with 32 GB and mixed precision, I increased the batch size for each model until OOM occurred. The results are below:

| Model | Resolution | Max batch size |
|-------|------------|----------------|
| D0    | 512        | ~54            |
| D3    | 896        | 8              |
| D4    | 1024       | 5              |
| D5    | 1280       | 3              |
| D6    | 1280       | 2              |
| D7    | 1536       | 1              |

So if you have a V100 16 GB or a 2080 Ti, you could probably train a D6 with 1 image per GPU, but to get a decent batch size you would need at least 8 GPUs, and it still may not converge very nicely.

If it is possible to fit 2 images per GPU for D5, then it could definitely be trained on an 8-GPU machine. Smaller models can be trained without OOM pain but would still require multiple GPUs and weeks.
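(Aside, not from bonlime's repos: when only 1-2 images fit per GPU, gradient accumulation is a common way to simulate a larger effective batch, though it does not fix small-batch BatchNorm statistics and makes each optimizer step proportionally slower. A hypothetical PyTorch sketch with a stand-in model:)

```python
import torch

# Effective batch = per_gpu_batch * accum_steps (* num_gpus with DDP).
model = torch.nn.Linear(256, 4)                  # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

accum_steps = 8                                  # e.g. 2 images * 8 steps = 16
optimizer.zero_grad()
for step in range(accum_steps):
    images = torch.randn(2, 256)                 # per-GPU micro-batch of 2
    targets = torch.randint(0, 4, (2,))
    loss = loss_fn(model(images), targets) / accum_steps
    loss.backward()                              # gradients accumulate in .grad
optimizer.step()
optimizer.zero_grad()
```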

Just for reference, on my machine D3 takes ~150 minutes per COCO epoch on a single GPU.

If you're interested, the code for the model is here: https://github.com/bonlime/pytorch-tools/blob/dev/pytorch_tools/detection_models/efficientdet.py
The training code is a WIP, but a draft is here: https://github.com/bonlime/pytorch_detection
