Out of memory except d0 and d1 training #459
Yes, it fails for me too... 4x RTX 2080 Ti. Essentially, I can only train d0. They used TPUs.
Yeah, I realize this is a problem for GPUs. A TPU also has 16GB per core, but it can fit a batch size of 4 for D0 - D7. After some study, I realized that the TPU XLA compiler automatically uses re-materialization (see https://cloud.google.com/tpu/docs/system-architecture) to reduce memory usage; it is similar to gradient checkpointing. I am investigating possible ways to allow training bigger models on GPUs, but that is still in progress.
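For anyone curious what re-materialization looks like in user code, here is a minimal, hypothetical TensorFlow sketch (not part of this repo): tf.recompute_grad wraps a sub-network so its activations are recomputed during the backward pass instead of being stored. Behavior with captured variables can vary across TF versions, so treat this as an illustration only.

```python
import tensorflow as tf

# Minimal sketch of gradient checkpointing / re-materialization, assuming a
# generic Keras sub-network; this is NOT code from the EfficientDet repo.
class CheckpointedBlock(tf.keras.layers.Layer):
    """Wraps a block so its activations are recomputed in the backward pass."""

    def __init__(self, block, **kwargs):
        super().__init__(**kwargs)
        self.block = block  # kept as an attribute so its weights stay tracked
        # Activations inside `block` are discarded after the forward pass and
        # recomputed when gradients are needed, trading compute for memory.
        self._checkpointed = tf.recompute_grad(block)

    def call(self, inputs):
        return self._checkpointed(inputs)


# Usage example with a stand-in block.
block = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu"),
                             tf.keras.layers.Dense(64, activation="relu")])
layer = CheckpointedBlock(block)
out = layer(tf.random.normal([8, 32]))
```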
For now, you can use "precision=mixed_float16" to partially mitigate this problem (it would roughly halve memory usage).
Change --hparams="num_classes=20,moving_average_decay=0" to --hparams="num_classes=20,moving_average_decay=0,precision=mixed_float16".
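For reference, outside of this repo's hparams plumbing, the same idea can be enabled directly through the Keras mixed-precision API (TF 2.4+); the tiny model below is only an illustrative stand-in, not the EfficientDet code path.

```python
import tensorflow as tf

# Illustrative sketch of what precision=mixed_float16 turns on: computations
# run in float16 while variables stay in float32 for numerical stability.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(64,)),
    # Keep the final outputs in float32 so the loss is computed in full precision.
    tf.keras.layers.Dense(10, dtype="float32"),
])
model.compile(optimizer="adam", loss="mse")
model.fit(tf.random.normal([256, 64]), tf.random.normal([256, 10]), epochs=1)
```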
Thank you for your reply. Your recommendation works to some extent. I have tried "precision=mixed_float16" for D3, but it is awfully slow and has not yet finished even one epoch. Anyway, now I understand that the OOM is a general issue for bigger models on GPUs. Thanks.
I've been working on a PyTorch reimplementation of EffDet, and memory is also an issue there.
So if you have a V100 16GB or a 2080 Ti, you could probably train a D6 with 1 image per GPU, but to get a decent batch size you would need at least 8 GPUs, and it still may not converge very nicely. If it is possible to fit 2 images per GPU for D5, then it could definitely be trained on an 8-GPU machine. Smaller models can be trained without OOM pain but would still require multiple GPUs and weeks. Just for reference, on my machine D3 takes ~150 minutes per COCO epoch on a single GPU. If you're interested, the code for the model is here: https://github.com/bonlime/pytorch-tools/blob/dev/pytorch_tools/detection_models/efficientdet.py
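One common workaround when only 1-2 images fit per GPU is gradient accumulation, which emulates a larger effective batch at the cost of more steps. Below is a hedged, self-contained sketch; the dummy model, loss, and random data stand in for a real detector and data loader and are not taken from the linked repo.

```python
import torch
import torch.nn as nn

# Stand-ins for a real detector, optimizer, loss, and data loader.
model = nn.Linear(16, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()
loader = [(torch.randn(2, 16), torch.randn(2, 4)) for _ in range(32)]

accumulation_steps = 8  # effective batch = per-step batch (2) * 8 = 16

optimizer.zero_grad()
for step, (images, targets) in enumerate(loader):
    loss = criterion(model(images), targets)
    # Scale the loss so the accumulated gradient matches the average over
    # the effective batch.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```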
Hi @mingxingtan.
Thank you for your great work. I am having some trouble with it. When I try to train bigger nets, e.g. efficientdet-d3, training fails because of a resource exhaustion (OOM) problem. I have only trained d0 and d1 successfully; everything from efficientdet-d2 upward fails.
For example, below is my command line:
MODEL = 'efficientdet-d2'
!python main.py --mode=train_and_eval \
  --training_file_pattern=tfrecord/{file_pattern} \
  --validation_file_pattern=tfrecord/{file_pattern} \
  --val_json_file=tfrecord/json_pascal.json \
  --model_name={MODEL} \
  --model_dir=/deep/temp/model_dir/{MODEL}-finetune \
  --ckpt={MODEL} \
  --train_batch_size=8 \
  --eval_batch_size=8 --eval_samples=512 \
  --num_examples_per_epoch={images_per_epoch} --num_epochs=10 \
  --hparams="num_classes=20,moving_average_decay=0" \
  --use_tpu=False
Then it fails and outputs: "(0) Resource exhausted: OOM when allocating tensor with shape[8,48,48,112] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc. ... 0 successful operations. 0 derived errors ignored."
For efficientdet-d2, if train_batch_size and eval_batch_size are set to 4, training succeeds, but only for d2. Even when the batch sizes are set to 4 or 2 for d3~d7, they all fail with the same resource exhaustion problem.
My GPU is a GeForce GTX 1080 Ti, which, as you know, has 11GB of GDDR5X.
Is this a normal situation? If it is, how did you train all the nets? Did you train them using TPUs? Do I have to use a TPU to train the bigger nets?
Thanks in advance.