OOM for GPU training #85

Closed
b03505036 opened this issue Mar 29, 2020 · 21 comments · Fixed by #711

@b03505036 commented Mar 29, 2020

#!/bin/bash

MODEL=efficientdet-d1

# train
CUDA_VISIBLE_DEVICES="1" python main.py \
  --training_file_pattern=tfrecord/image_train* \
  --validation_file_pattern=tfrecord/image_val* \
  --mode='train_and_eval' \
  --model_name=$MODEL \
  --model_dir=$MODEL \
  --val_json_file='dataset/coco/annotations/image_val.json' \
  --hparams="use_bfloat16=false,num_classes=4" \
  --use_tpu=False \
  --train_batch_size 8
No matter whether train_batch_size is 16 or 8, OOM always occurs.
But for efficientdet-d0 everything is fine.
My device is an RTX 2080 with 11 GB.
And I'm using tensorflow-gpu 2.1.

I'm surprised that efficientdet-d1 occupies so much memory.
Is that normal?

@fsx950223 (Collaborator)

(quoting @b03505036's original post above)

Could you try tf.enable_resource_variables()?

@TomHeaven commented Apr 3, 2020

tf.enable_resource_variables

I tried that without luck. I'm also very curious about the huge GPU memory consumption. In my mind, EfficientDet is lightweight and efficient. However, I can only run efficientdet-d4 on a single Nvidia Titan V GPU by setting train_batch_size=1. Training efficientdet-d5 results in OOM.

@fsx950223 (Collaborator) commented Apr 3, 2020

(quoting @TomHeaven's reply above)

How about:

from tensorflow.core.protobuf import rewriter_config_pb2

config_proto.graph_options.rewrite_options.auto_mixed_precision = rewriter_config_pb2.RewriterConfig.ON
config_proto.graph_options.rewrite_options.memory_optimization = rewriter_config_pb2.RewriterConfig.RECOMPUTATION_HEURISTICS

@mad-fogs commented Apr 3, 2020

(quoting @TomHeaven's reply above)

Yes, I guess this implementation might not originally have been designed with existing GPU setups in mind. I have to set a batch size of 2 when training d4 on my 24 GB device. That is not really acceptable, since I can train a larger model (TridentNet R101) with a batch size of 4/8.
And, as with the TF official object detection API samples, multi-GPU training cannot currently be launched.

@TomHeaven

rewriter_config_pb2.RewriterConfig.ON

I also tried that with

from tensorflow.core.protobuf import rewriter_config_pb2

### Tom added to save gpu memory
tf.enable_resource_variables()
config = tf.ConfigProto()
config.graph_options.rewrite_options.auto_mixed_precision = rewriter_config_pb2.RewriterConfig.ON
config.graph_options.rewrite_options.memory_optimization = rewriter_config_pb2.RewriterConfig.RECOMPUTATION_HEURISTICS
###
config.gpu_options.allow_growth = True

at the top of main.py. The OOM was still there.
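
(Side note: a config like this only takes effect if it actually reaches the training session. A minimal sketch of how that might be wired up, assuming main.py builds an Estimator RunConfig; the variable names here are hypothetical:)

# Hypothetical sketch: hand the tweaked ConfigProto to the Estimator's RunConfig
# so the rewrite options apply to the training session.
run_config = tf.estimator.RunConfig(
    model_dir=FLAGS.model_dir,   # model_dir flag as in the commands above
    session_config=config)       # 'config' is the ConfigProto built above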

@fsx950223 (Collaborator)

I disabled EMA and could train a bigger model.

@TomHeaven

I disabled EMA and could train a bigger model.

What do you refer to by "ema"? Could you give us a detailed guide?

@fsx950223 (Collaborator)

I disabled EMA and could train a bigger model.

What do you refer to by "ema"? Could you give us a detailed guide?

h.moving_average_decay = 0.
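
In other words, something like the following on the command line should do it, assuming moving_average_decay can be overridden through --hparams just like num_classes in the commands above:

--hparams="use_bfloat16=false,num_classes=4,moving_average_decay=0"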

@mad-fogs commented Apr 5, 2020

I disabled EMA and could train a bigger model.

What do you refer to by "ema"? Could you give us a detailed guide?

h.moving_average_decay = 0.

24 GB GPU, d6 model, batch_size=1, with either h.moving_average_decay = 0 or h.moving_average_decay = 0.9998:
OOM error.

mingxingtan self-assigned this Apr 5, 2020
mingxingtan mentioned this issue Apr 5, 2020
mingxingtan changed the title from "OOM occur" to "OOM for GPU training" Apr 5, 2020
@mingxingtan (Member)

I could train EfficientDet-D7 with batch size 4 per core on TPUv3, where each core has 16 GB of memory. But it seems GPU training OOM is a big issue. We need more investigation into why GPU training uses so much memory.

Does anyone happen to know good memory profiling tools or instructions? Thanks!

mingxingtan pinned this issue Apr 5, 2020
@fsx950223 (Collaborator) commented Apr 14, 2020

(quoting @mingxingtan's comment above)

Here is the solution.
It seems this is the reason:

cls_losses = []
box_losses = []
for level in levels:
  # Onehot encoding for classification labels.
  cls_targets_at_level = tf.one_hot(
      labels['cls_targets_%d' % level],
      params['num_classes'])
  bs, width, height, _, _ = cls_targets_at_level.get_shape().as_list()
  cls_targets_at_level = tf.reshape(cls_targets_at_level,
                                    [bs, width, height, -1])
  box_targets_at_level = labels['box_targets_%d' % level]
  cls_loss = _classification_loss(
      cls_outputs[level],
      cls_targets_at_level,
      num_positives_sum,
      alpha=params['alpha'],
      gamma=params['gamma'])
  cls_loss = tf.reshape(cls_loss,
                        [bs, width, height, -1, params['num_classes']])
  cls_loss *= tf.cast(tf.expand_dims(
      tf.not_equal(labels['cls_targets_%d' % level], -2), -1), tf.float32)
  cls_losses.append(tf.reduce_sum(cls_loss))
  box_losses.append(
      _box_loss(
          box_outputs[level],
          box_targets_at_level,
          num_positives_sum,
          delta=params['delta']))
# Sum per level losses to total loss.
cls_loss = tf.add_n(cls_losses)
box_loss = tf.add_n(box_losses)

mingxingtan unpinned this issue Apr 15, 2020
@LucasSloan (Collaborator) commented Apr 17, 2020

I'm sorry, how do I solve this?

I added aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N to optimizer.compute_gradients() in det_model_fn.py, but I still got a warning that I was nearly out of GPU memory.

I'm using a batch size of 16 with efficientdet-d0, with the mixed precision and memory optimization flags from above on a 2080ti with 11 gigs of ram.
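
For reference, the change looks roughly like this; a sketch with hypothetical variable names, not the exact det_model_fn.py code:

# Hypothetical sketch: accumulate gradients with accumulate_n instead of add_n,
# which can reduce the peak memory used when summing gradients.
grads_and_vars = optimizer.compute_gradients(
    total_loss,
    aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)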

@fsx950223 (Collaborator)

(quoting @LucasSloan's comment above)

Decrease batch_size.

@LucasSloan (Collaborator)

(quoting my comment above)

Decrease batch_size.

With a batch size of 16, without the AggregationMethod flag, it trains, albeit with a warning along the lines of "maybe things would be faster if we had more RAM". If I add the flag, the same thing happens. Is that the expected result?

@Samjith888 (Contributor)

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ['CUDA_VISIBLE_DEVICES'] = "5"

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

Adding the above lines in main.py resolved the error. But training the larger models is much slower than training with the d0 model.
Any suggestions?

@junyongyou

Hi all, is there any progress on this problem? I am using a GeForce GTX 1080 Ti and training the models on my own dataset. I can train D4 with batch size = 1, but get OOM on D5. I have tried all the approaches mentioned here, but none of them worked.

@Samjith888 If adding those lines works but training is very slow, are you sure you are using the GPU, or did the system perhaps fall back to the CPU? I also tried your approach, but still get OOM.
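
A quick way to check whether TensorFlow actually sees a GPU at all (a small sketch for TF 2.1):

import tensorflow as tf

# Prints the GPUs TensorFlow can see; an empty list means training is falling back to the CPU.
print(tf.config.experimental.list_physical_devices('GPU'))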

@Samjith888 (Contributor)

(quoting @junyongyou's comment above)

You are right, it's using the CPU instead of the GPU.

@fsx950223 (Collaborator) commented Apr 21, 2020

Decreasing fpn_cell_repeats could solve your problem, but it also decreases performance.
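
For example, in the style of the overrides already used in this thread (assuming fpn_cell_repeats can be set the same way as the other hparams):

h.fpn_cell_repeats = 2   # fewer BiFPN repeats than the default: lower memory, lower accuracy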

@staeff777

@LucasSloan Are you getting acceptable results with your RTX card and this configuration?

@staeff777

Just for the record:

I'm trying to finetune on PASCAL VOC 2012 as described in the README, on an RTX 2080 Ti, with

  • fpn_cell_repeats=1
  • aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N
  • batch size 1

It still exceeds the 10.7 GB memory limit on D5, which is about the same limitation as without the changes.

@fitoule commented May 4, 2020

Same here: I can't train D4 on Colab even with train_batch_size=1 and moving_average_decay = 0.
What about reducing the input image size, e.g. image_size=896?
Is that nonsense? (It works without OOM then.)
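
If it helps, the same override style as earlier in the thread should apply (assuming image_size is overridable like the other hparams):

h.image_size = 896   # smaller input resolution: less memory, likely some accuracy loss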
