Custom Training with TF 2 Object Detect API Fails #8951

Open
mm7721 opened this issue Jul 23, 2020 · 13 comments


mm7721 commented Jul 23, 2020

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [Y] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • [Y] I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • [Y] I checked to make sure that this issue has not been filed already.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/official/...

2. Describe the bug

I'm attempting to migrate from the TF1 object detect API to the TF2 object detect API. The exact model isn't available in the TF2 version (quantized SSD Mobilenet v2), so I'm using EfficientDet-d0. But I'm attempting to keep as many things the same as possible, including using the exact same tfrecord training and validation files, similar config settings, etc. And I'm starting the fine-tuning from the config + weights found in the TF zoo. Note that there are 4 classes, and the config and label_map files have been updated appropriately.

This is being run locally on a machine with 2 GPUs, and I had to use tf.config.experimental.set_memory_growth(gpu, True) to get it to run at all.
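For reference, the workaround is roughly the following, placed near the top of the training entry point before any GPUs are used (a minimal sketch; exact placement may vary):

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all up front.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)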

Two failure modes are observed:

  1. num_workers = 1: prints a long list of warnings regarding unresolved objects in the checkpoint, then exits without printing any errors. Here's the tail of the console output:

WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.moving_mean
W0723 12:41:23.119486 139840608290624 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.moving_mean
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.moving_variance
W0723 12:41:23.119561 139840608290624 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.moving_variance
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0723 12:41:23.119644 139840608290624 util.py:151] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.


  2. num_workers = 2: crashes with the following attribute error:

File "/home/user1/tf2odapi/models/research/object_detection/model_lib_v2.py", line 549, in train_loop
load_fine_tune_checkpoint(detection_model,
File "/home/user1/tf2odapi/models/research/object_detection/model_lib_v2.py", line 357, in load_fine_tune_checkpoint
strategy.run(
File "/home/user1/anaconda3/envs/tf2odapi/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 951, in run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "/home/user1/anaconda3/envs/tf2odapi/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2290, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/user1/anaconda3/envs/tf2odapi/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 743, in _call_for_each_replica
wrapped = self._cfer_fn_cache.get(fn)
AttributeError: 'CollectiveAllReduceExtended' object has no attribute '_cfer_fn_cache'

3. Steps to reproduce

Can't be reproduced exactly on your end, as it involves some local files.

4. Expected behavior

Expect training to run successfully.

5. Additional context

Include any logs that would be helpful to diagnose the problem.

6. System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Mobile device name if the issue happens on a mobile device:
  • TensorFlow installed from (source or binary): pip
  • TensorFlow version (use command below): 2.2.0
  • Python version: 3.8
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 10.1, 7.6.5
  • GPU model and memory: 2 x GeForce 2080 11GB
mm7721 added the models:official and type:bug labels on Jul 23, 2020
@attianopp

Have you made sure your pipeline.config has fine_tune_checkpoint_type: "detection"? I got the same type of errors with num_workers=1, and that resolved it for me.
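For reference, the relevant part of the train_config block in pipeline.config looks roughly like this (the checkpoint path is a placeholder for wherever you extracted the zoo model):

train_config {
  fine_tune_checkpoint: "path/to/efficientdet_d0/checkpoint/ckpt-0"
  fine_tune_checkpoint_type: "detection"
  # ... rest of train_config unchanged ...
}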


mm7721 commented Jul 23, 2020

@attianopp, thanks for the pointer - that solved failure mode #1. Not sure who maintains the zoo, but if the owners happen to see this post: could you update the .config files to have the correct default as "detection" instead of "classification"?

Failure mode #2 is still happening, so I think this thread needs to remain open.


mm7721 commented Jul 24, 2020

@attianopp, did you manage to get training + evaluation to show up on tensorboard? I'm trying to replicate the TF1 output, in which you can see all the COCO metrics (AP50, ARmax1, etc) as well as eval images with boxes + groundtruth images with boxes. But for me it seems only training is running, not eval. And after inspecting the code, I'm wondering if you have to run one process for training (FLAGS.checkpoint_dir=None) and one for eval (with FLAGS.checkpoint_dir="some directory").


attianopp commented Jul 24, 2020

[screenshot: train_results — COCO eval metrics in TensorBoard all reading 0 or -1]
Yes, I was. Exactly: you need to specify the checkpoint dir when calling model_main_tf2.py to launch the evaluation job that outputs all the COCO metrics, in a separate process, if you have enough RAM to sustain that.
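Concretely, the two launches look something like this (paths are placeholders; both point at the same model_dir so the eval job picks up new checkpoints as they are written):

# terminal 1: training
python model_main_tf2.py \
  --pipeline_config_path=path/to/pipeline.config \
  --model_dir=path/to/model_dir \
  --alsologtostderr

# terminal 2: continuous evaluation (note the extra --checkpoint_dir flag)
python model_main_tf2.py \
  --pipeline_config_path=path/to/pipeline.config \
  --model_dir=path/to/model_dir \
  --checkpoint_dir=path/to/model_dir \
  --alsologtostderr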

My metric output is all off for COCO (0/-1, see the screenshot above), but both my training and test losses are improving and below 1. I also have no bounding boxes in the "Images" tab of TensorBoard where it shows the side_by_side_eval images. I am not sure how to fix the COCO metrics being off (see the attached image); I found a post saying it had to do with training with a batch_size > 1, so I'm re-training my model.

EDIT: After re-training with batch_size=1 I got bounding boxes in the side_by_side_eval. They are incorrect, but that's likely because I barely trained the model (~1000 steps). I will update again when I have trained more (~10000 steps). SECOND UPDATE: still the same issue.

The other likely source of this issue, I think, could be the way I make the tfrecord files. The bounding boxes might be mis-specified, though I used a stock script to generate them. Have you had to troubleshoot the COCO metrics being off?

EDIT: link to #8917, which says these results are due to batch_size/learning rate. Do you know if the learning rate referenced is the learning_rate_base: 0.07999999821186066?


mm7721 commented Jul 24, 2020

I've got the two processes up and running now (I think I prefer the previous TF1 version that interleaved training and eval from a single process, and am kind of hoping they build that back in).

The COCO metrics are coming out as reasonable values (e.g. mAP50 = 0.6), and the images tab now shows the validation images with bounding boxes (both model inferences and groundtruths). So it seems to be working.

Not sure why yours is giving 0/-1 for all the values, but will let you know if I come across any clues. For reference, I trained with a batch size of 8, and ran eval with a batch size of 1.


attianopp commented Jul 24, 2020

It seems I updated my comment with the example output just as you posted. I also agree that I would rather have the single process that interleaves both, as I get OOM errors on my tiny GPU when I try to run both processes at once.

Does that image above provide any insight?

This was the main part of the code used to generate tfrecords:


from __future__ import division
from __future__ import print_function
from __future__ import absolute_import

import os
import io
import pandas as pd
import tensorflow as tf
from PIL import Image
from object_detection.utils import dataset_util
from collections import namedtuple

flags = tf.compat.v1.app.flags
flags.DEFINE_string('csv_input', '', 'Path to the CSV input')
flags.DEFINE_string('output_path', '', 'Path to output TFRecord')
flags.DEFINE_string('image_dir', '', 'Path to images')
FLAGS = flags.FLAGS

def class_text_to_int(row_label):
    # Map the class name from the CSV to the integer id used in the label map.
    if row_label == 'Custom':
        return 1
    else:
        return None

def split(df, group):
    data = namedtuple('data', ['filename', 'object'])
    gb = df.groupby(group)
    return [data(filename, gb.get_group(x)) for filename, x in zip(gb.groups.keys(), gb.groups)]


def create_tf_example(group, path):
    with tf.compat.v1.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = Image.open(encoded_jpg_io)
    width, height = image.size

    filename = group.filename.encode('utf8')
    image_format = b'jpg'
    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    classes_text = []
    classes = []

    for index, row in group.object.iterrows():
        xmins.append(row['xmin'] / width)
        xmaxs.append(row['xmax'] / width)
        ymins.append(row['ymin'] / height)
        ymaxs.append(row['ymax'] / height)
        classes_text.append(row['class'].encode('utf8'))
        classes.append(class_text_to_int(row['class']))
    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        # 'image/source_id': dataset_util.bytes_feature(filename),
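        # Note: this writes the same hard-coded source_id ('0') for every image instead of a per-image id.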
        'image/source_id': dataset_util.bytes_feature('0'.encode('utf8')),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(image_format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return tf_example


def main(_):
    writer = tf.compat.v1.python_io.TFRecordWriter(FLAGS.output_path)
    path = os.path.join(FLAGS.image_dir)
    examples = pd.read_csv(FLAGS.csv_input)
    grouped = split(examples, 'filename')
    for group in grouped:
        tf_example = create_tf_example(group, path)
        writer.write(tf_example.SerializeToString())

    writer.close()
    output_path = os.path.join(os.getcwd(), FLAGS.output_path)
    print('Successfully created the TFRecords: {}'.format(output_path))


if __name__ == '__main__':
    tf.compat.v1.app.run()
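
For reference, I run the script like this (generate_tfrecord.py is just a placeholder for whatever the file is named; adjust the paths to your CSV and image folder):

python generate_tfrecord.py --csv_input=data/train_labels.csv --image_dir=images/train --output_path=train.record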

How many steps did you specify for --num_train_steps?

I also read another post (#6273) that said the order of the coordinates in the bounding boxes had to be [xmin, ymin, xmax, ymax]. Does that apply here? Did you specify your tf.train.Example bounding boxes with the coordinates in a different order?

My "train_input_images" on tensorboard don't have bounding boxes even after training w a batch_size=1. The post eval images have bounding boxes in the eval_side_by_side output on tensorboard, but the ground truth images do not (assuming the left side image is the prediction and the right side is ground truth).

ravikyram added the models:research label and removed the models:official label on Jul 24, 2020

mm7721 commented Jul 25, 2020

Regarding your tfrecord generation code, it looks extremely similar to mine. One tiny difference is that I'm using tf2 rather than tf.compat.v1, but I doubt that makes a difference. I don't believe the order of xmin/xmax/ymin/ymax matters, as I think the reader is looking for tags rather than particular indexes (plus, my ordering is identical to yours). What I'd recommend is building a little custom reader so you can open the tfrecord and inspect its contents. That might help uncover issues. Also, have you gotten this particular tfrecord to run with the TF1 object detect API?
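Something along these lines is enough to eyeball the filenames, boxes, and class labels (a rough sketch; the record filename is a placeholder):

import tensorflow as tf

# Print the first couple of examples from a tfrecord to verify the stored features.
for raw_record in tf.data.TFRecordDataset('train.record').take(2):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    feats = example.features.feature
    print(feats['image/filename'].bytes_list.value)
    print(feats['image/object/bbox/xmin'].float_list.value)
    print(feats['image/object/bbox/ymin'].float_list.value)
    print(feats['image/object/bbox/xmax'].float_list.value)
    print(feats['image/object/bbox/ymax'].float_list.value)
    print(feats['image/object/class/text'].bytes_list.value)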

Regarding --num_train_steps, I'm not using that parameter. Instead, I specify the number of steps in pipeline.config.

Finally, regarding batch_size, I'm guessing that's a red herring. Mine works with a variety of batch sizes.


attianopp commented Jul 25, 2020

I really appreciate your detailed response. Your intuition was correct: using https://github.com/sulc/tfrecord-viewer I can see that images taken in horizontal orientation have incorrectly placed bboxes. My dataset is made of variable-size images with dimensions HxW or WxH. The record generation process I currently use seems to assume a constant image size, so the horizontal images have boxes that are offset incorrectly, and this likely wouldn't have worked with TF1 either. I am troubleshooting that right now; passing the correct per-image height/width to the tf_example in that tfrecord script doesn't seem to fix it. I'll update this when I figure it out. Thanks again for your help :)

@ramesh8v

@attianopp: My tfrecord creation script is very similar to yours. I resized all images and bboxes to 640x640. My tfrecord looks perfect. The training job is running without any errors, but images in tensorboard look weird (some images are in a different contrast and color), and there are no bboxes. Tried with a variety of batch sizes, number of training steps, and recreating tfrecord, but still, the issue persists. Please let us know if you find a solution.


attianopp commented Jul 29, 2020

@ramesh8v It was an issue with my tfrecords. After I fixed them so that I was sure all my bounding boxes were correct in the tfrecord, I got correct output for the COCO metrics, and my evaluation images on TensorBoard had, for the most part, correct predictions. The input train images don't have bounding boxes, and the contrast/color is changed; I am using faster_rcnn, and I believe it has built-in data augmentation, so that's what you're seeing. Check your evaluation results by running python model_main_tf2.py --model_dir=$model_dir --checkpoint_dir=$model_dir --sample_1_of_n_eval_examples=1 --alsologtostderr. The end of the output should look something like:
[screenshot: Capture — tail of the evaluation output showing the COCO detection metrics]

@ramesh8v

Thank you @attianopp. Yes, my eval results look good.

@attianopp

I too get the same error AttributeError: 'CollectiveAllReduceExtended' object has no attribute '_cfer_fn_cache' when trying to run with num_workers>1. I upgraded to an instance with 4 Tesla T4s on it to try to train EfficientDet-d7 with a larger distributed batch size, on a dataset I was able to produce results with successfully when num_workers=1.

However, I see utilization of all 4 GPUs in watch nvidia-smi when I set num_workers=1 with a larger batch size. It seems to distribute the batch among all GPUs, but also to replicate the model on each GPU separately, if I am reading the verbose output from TensorFlow correctly (taking up a lot of unnecessary space, I assume). Is it set to do distributed training by default?
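EDIT: from my reading of model_main_tf2.py, it looks like num_workers=1 builds a tf.distribute.MirroredStrategy over all visible GPUs (and a MultiWorkerMirroredStrategy only when num_workers > 1), which would explain the per-GPU replication I'm seeing. A quick way to confirm how many replicas it will use (a minimal sketch, not the training script itself):

import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and splits each global batch among them.
strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)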


04633435 commented Dec 2, 2020

I got the same error when I set num_workers = 2 to train my model on a machine with 2 GPUs and TensorFlow 2.2.0, with the log as follows:

Traceback (most recent call last):
  File "model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\absl\app.py", line 300, in run
    _run_main(main, args)
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\absl\app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 104, in main
    model_lib_v2.train_loop(
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\object_detection\model_lib_v2.py", line 561, in train_loop
    load_fine_tune_checkpoint(detection_model,
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\object_detection\model_lib_v2.py", line 361, in load_fine_tune_checkpoint
    strategy.run(
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 951, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 2290, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\tensorflow\python\distribute\mirrored_strategy.py", line 743, in _call_for_each_replica
    wrapped = self._cfer_fn_cache.get(fn)
AttributeError: 'CollectiveAllReduceExtended' object has no attribute '_cfer_fn_cache'

The problem was fixed after I upgraded TensorFlow from 2.2.0 to 2.3.0. The training process then ran normally, as it did in the one-GPU setup.
Hopefully this is helpful in your case.
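For anyone else hitting this, the upgrade itself was just (assuming a pip-based TensorFlow install inside the env):

pip install --upgrade tensorflow==2.3.0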

jaeyounkim added the models:research:odapi label and removed the models:research label on Jun 25, 2021