OOM for GPU training #85

Closed
b03505036 opened this issue Mar 29, 2020 · 21 comments · Fixed by #711

@b03505036 commented Mar 29, 2020

#!/bin/bash

MODEL=efficientdet-d1

# train
CUDA_VISIBLE_DEVICES="1" python main.py \
  --training_file_pattern=tfrecord/image_train* \
  --validation_file_pattern=tfrecord/image_val* \
  --mode='train_and_eval' \
  --model_name=$MODEL \
  --model_dir=$MODEL \
  --val_json_file='dataset/coco/annotations/image_val.json' \
  --hparams="use_bfloat16=false,num_classes=4" \
  --use_tpu=False \
  --train_batch_size 8
No matter whether train_batch_size is 16 or 8, OOM always occurs.
But for efficientdet-d0 everything is fine.
My device is an RTX 2080 with 11 GB.
And I'm using tensorflow-gpu 2.1.

I'm surprised that efficientdet-d1 occupies so much memory.
Is that normal?

@fsx950223 (Collaborator)

(quoting @b03505036's original post above)

Could you try tf.enable_resource_variables()?

@TomHeaven commented Apr 3, 2020

tf.enable_resource_variables

I tried that without luck. I'm also very curious about the huge GPU memory consumption. In my mind, EfficientDet is lightweight and efficient. However, I can only run efficientdet-d4 on a single Nvidia Titan V GPU by setting train_batch_size=1. Training efficientdet-d5 results in OOM.

@fsx950223 (Collaborator) commented Apr 3, 2020

(quoting @TomHeaven's reply above)

How about:

from tensorflow.core.protobuf import rewriter_config_pb2

config_proto.graph_options.rewrite_options.auto_mixed_precision = rewriter_config_pb2.RewriterConfig.ON
config_proto.graph_options.rewrite_options.memory_optimization = rewriter_config_pb2.RewriterConfig.RECOMPUTATION_HEURISTICS

@mad-fogs commented Apr 3, 2020

(quoting @TomHeaven's reply above)

Yes, I guess this implementation might not originally have been designed with existing GPU setups in mind. I have to set a batch size of 2 when training d4 on my 24 GB device. That is not really acceptable, since I can train a larger model (TridentNet R101) with a batch size of 4/8.
And, as with the TF official object detection API samples, multi-GPU training cannot currently be launched.

@TomHeaven

rewriter_config_pb2.RewriterConfig.ON

I also tried that with

from tensorflow.core.protobuf import rewriter_config_pb2

### Tom added to save gpu memory
tf.enable_resource_variables()
config = tf.ConfigProto()
config.graph_options.rewrite_options.auto_mixed_precision = rewriter_config_pb2.RewriterConfig.ON
config.graph_options.rewrite_options.memory_optimization = rewriter_config_pb2.RewriterConfig.RECOMPUTATION_HEURISTICS
###
config.gpu_options.allow_growth = True

at the top of main.py. The OOM was still there.
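
(Side note: a config like this only takes effect if it actually reaches the training session. A minimal sketch of how that might be wired up, assuming main.py builds an Estimator RunConfig; the variable names here are hypothetical:)

# Hypothetical sketch: hand the tweaked ConfigProto to the Estimator's RunConfig
# so the rewrite options apply to the training session.
run_config = tf.estimator.RunConfig(
    model_dir=FLAGS.model_dir,   # model_dir flag as in the commands above
    session_config=config)       # 'config' is the ConfigProto built above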

@fsx950223 (Collaborator)

I disabled EMA and could train a bigger model.

@TomHeaven

I disabled EMA and could train a bigger model.

What do you refer to by "ema"? Could you give us a detailed guide?

@fsx950223 (Collaborator)

I disabled EMA and could train a bigger model.

What do you refer to by "ema"? Could you give us a detailed guide?

h.moving_average_decay = 0.
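
In other words, something like the following on the command line should do it, assuming moving_average_decay can be overridden through --hparams just like num_classes in the commands above:

--hparams="use_bfloat16=false,num_classes=4,moving_average_decay=0"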

@mad-fogs commented Apr 5, 2020

I disabled EMA and could train a bigger model.

What do you refer to by "ema"? Could you give us a detailed guide?

h.moving_average_decay = 0.

24 GB GPU, d6 model, batch_size=1, with either h.moving_average_decay = 0 or h.moving_average_decay = 0.9998:
OOM error.

mingxingtan self-assigned this Apr 5, 2020
mingxingtan mentioned this issue Apr 5, 2020
mingxingtan changed the title from "OOM occur" to "OOM for GPU training" Apr 5, 2020
@mingxingtan (Member)

I could train EfficientDet-D7 with batch size 4 per core on TPUv3, where each core has 16 GB of memory. But it seems GPU training OOM is a big issue. We need more investigation into why GPU training uses so much memory.

Does anyone happen to know good memory profiling tools or instructions? Thanks!

mingxingtan pinned this issue Apr 5, 2020
@fsx950223 (Collaborator) commented Apr 14, 2020

(quoting @mingxingtan's comment above)

Here is the solution.
It seems this is the reason:

cls_losses = []
box_losses = []
for level in levels:
  # Onehot encoding for classification labels.
  cls_targets_at_level = tf.one_hot(
      labels['cls_targets_%d' % level],
      params['num_classes'])
  bs, width, height, _, _ = cls_targets_at_level.get_shape().as_list()
  cls_targets_at_level = tf.reshape(cls_targets_at_level,
                                    [bs, width, height, -1])
  box_targets_at_level = labels['box_targets_%d' % level]
  cls_loss = _classification_loss(
      cls_outputs[level],
      cls_targets_at_level,
      num_positives_sum,
      alpha=params['alpha'],
      gamma=params['gamma'])
  cls_loss = tf.reshape(cls_loss,
                        [bs, width, height, -1, params['num_classes']])
  cls_loss *= tf.cast(tf.expand_dims(
      tf.not_equal(labels['cls_targets_%d' % level], -2), -1), tf.float32)
  cls_losses.append(tf.reduce_sum(cls_loss))
  box_losses.append(
      _box_loss(
          box_outputs[level],
          box_targets_at_level,
          num_positives_sum,
          delta=params['delta']))
# Sum per level losses to total loss.
cls_loss = tf.add_n(cls_losses)
box_loss = tf.add_n(box_losses)

mingxingtan unpinned this issue Apr 15, 2020
@LucasSloan (Collaborator) commented Apr 17, 2020

I'm sorry, how do I solve this?

I added aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N to optimizer.compute_gradients() in det_model_fn.py, but I still got a warning that I was nearly out of GPU memory.

I'm using a batch size of 16 with efficientdet-d0, with the mixed precision and memory optimization flags from above on a 2080ti with 11 gigs of ram.
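
For reference, the change looks roughly like this; a sketch with hypothetical variable names, not the exact det_model_fn.py code:

# Hypothetical sketch: accumulate gradients with accumulate_n instead of add_n,
# which can reduce the peak memory used when summing gradients.
grads_and_vars = optimizer.compute_gradients(
    total_loss,
    aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N)
train_op = optimizer.apply_gradients(grads_and_vars, global_step=global_step)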

@fsx950223 (Collaborator)

(quoting @LucasSloan's comment above)

Decrease batch_size.

@LucasSloan (Collaborator)

(quoting my comment above)

Decrease batch_size.

With a batch size of 16, without the AggregationMethod flag, it trains, albeit with a warning along the lines of "maybe things would be faster if we had more RAM". If I add the flag, the same thing happens. Is that the expected result?

@Samjith888 (Contributor)

import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ['CUDA_VISIBLE_DEVICES'] = "5"

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

Adding the above lines in main.py resolved the error. But training the larger models is much slower than training with the d0 model.
Any suggestions?

@junyongyou

Hi all, is there any progress on this problem? I am using a GeForce GTX 1080 Ti and training the models on my own dataset. I can train D4 with batch size = 1, but get OOM on D5. I have tried all the approaches mentioned here, but none of them worked.

@Samjith888 If adding those lines works but training is very slow, are you sure you are using the GPU, or did the system perhaps fall back to the CPU? I also tried your approach, but still get OOM.
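
A quick way to check whether TensorFlow actually sees a GPU at all (a small sketch for TF 2.1):

import tensorflow as tf

# Prints the GPUs TensorFlow can see; an empty list means training is falling back to the CPU.
print(tf.config.experimental.list_physical_devices('GPU'))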

@Samjith888 (Contributor)

(quoting @junyongyou's comment above)

You are right, it's using the CPU instead of the GPU.

@fsx950223 (Collaborator) commented Apr 21, 2020

Decreasing fpn_cell_repeats could solve your problem, but it also decreases performance.
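
For example, in the style of the overrides already used in this thread (assuming fpn_cell_repeats can be set the same way as the other hparams):

h.fpn_cell_repeats = 2   # fewer BiFPN repeats than the default: lower memory, lower accuracy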

@staeff777

@LucasSloan Are you getting acceptable results with your RTX card and this configuration?

@staeff777

Just for the record:

I'm trying to finetune on PASCAL VOC 2012 as described in the README, on an RTX 2080 Ti, with

  • fpn_cell_repeats=1
  • aggregation_method=tf.AggregationMethod.EXPERIMENTAL_ACCUMULATE_N
  • batch size 1

It still exceeds the 10.7 GB memory limit on D5, which is about the same limitation as without the changes.

@fitoule commented May 4, 2020

Same here: I can't train D4 on Colab even with train_batch_size=1 and moving_average_decay = 0.
What about reducing the input image size, e.g. image_size=896?
Is that nonsense? (It works without OOM then.)
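
If it helps, the same override style as earlier in the thread should apply (assuming image_size is overridable like the other hparams):

h.image_size = 896   # smaller input resolution: less memory, likely some accuracy loss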
