Custom Training with TF 2 Object Detect API Fails #8951

Open
mm7721 opened this issue Jul 23, 2020 · 13 comments


mm7721 commented Jul 23, 2020

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • [Y] I am using the latest TensorFlow Model Garden release and TensorFlow 2.
  • [Y] I am reporting the issue to the correct repository. (Model Garden official or research directory)
  • [Y] I checked to make sure that this issue has not been filed already.

1. The entire URL of the file you are using

https://github.com/tensorflow/models/tree/master/official/...

2. Describe the bug

I'm attempting to migrate from the TF1 object detect API to the TF2 object detect API. The exact model isn't available in the TF2 version (quantized SSD Mobilenet v2), so I'm using EfficientDet-d0. But I'm attempting to keep as many things the same as possible, including using the exact same tfrecord training and validation files, similar config settings, etc. And I'm starting the fine-tuning from the config + weights found in the TF zoo. Note that there are 4 classes, and the config and label_map files have been updated appropriately.

This is being run locally on a machine with 2 GPUs, and I had to use tf.config.experimental.set_memory_growth(gpu, True) to get it to run at all.
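For reference, the workaround is roughly the following, placed near the top of the training entry point before any GPUs are used (a minimal sketch; exact placement may vary):

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving it all up front.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)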

Two failure modes are observed:

  1. num_workers = 1: prints a long list of warnings regarding unresolved objects in the checkpoint, then exits without printing any errors. Here's the tail of the console output:

WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.moving_mean
W0723 12:41:23.119486 139840608290624 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.moving_mean
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.moving_variance
W0723 12:41:23.119561 139840608290624 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.moving_variance
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0723 12:41:23.119644 139840608290624 util.py:151] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.


  2. num_workers = 2: crashes with the following attribute error:

File "/home/user1/tf2odapi/models/research/object_detection/model_lib_v2.py", line 549, in train_loop
load_fine_tune_checkpoint(detection_model,
File "/home/user1/tf2odapi/models/research/object_detection/model_lib_v2.py", line 357, in load_fine_tune_checkpoint
strategy.run(
File "/home/user1/anaconda3/envs/tf2odapi/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 951, in run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "/home/user1/anaconda3/envs/tf2odapi/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2290, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/user1/anaconda3/envs/tf2odapi/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 743, in _call_for_each_replica
wrapped = self._cfer_fn_cache.get(fn)
AttributeError: 'CollectiveAllReduceExtended' object has no attribute '_cfer_fn_cache'

3. Steps to reproduce

Can't be reproduced exactly on your end, as it involves some local files.

4. Expected behavior

Expect training to run successfully.

5. Additional context

Include any logs that would be helpful to diagnose the problem.

6. System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Mobile device name if the issue happens on a mobile device:
  • TensorFlow installed from (source or binary): pip
  • TensorFlow version (use command below): 2.2.0
  • Python version: 3.8
  • Bazel version (if compiling from source):
  • GCC/Compiler version (if compiling from source):
  • CUDA/cuDNN version: 10.1, 7.6.5
  • GPU model and memory: 2 x GeForce 2080 11GB
mm7721 added the models:official and type:bug labels on Jul 23, 2020
@attianopp

Have you made sure your pipeline.config has fine_tune_checkpoint_type: "detection"? I got the same type of errors with num_workers=1, and that resolved it for me.
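For reference, the relevant part of the train_config block in pipeline.config looks roughly like this (the checkpoint path is a placeholder for wherever you extracted the zoo model):

train_config {
  fine_tune_checkpoint: "path/to/efficientdet_d0/checkpoint/ckpt-0"
  fine_tune_checkpoint_type: "detection"
  # ... rest of train_config unchanged ...
}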


mm7721 commented Jul 23, 2020

@attianopp, thanks for the pointer - that solved failure mode #1. Not sure who maintains the zoo, but if the owners happen to see this post: could you update the .config files to have the correct default as "detection" instead of "classification"?

Failure mode #2 is still happening, so I think this thread needs to remain open.


mm7721 commented Jul 24, 2020

@attianopp, did you manage to get training + evaluation to show up on tensorboard? I'm trying to replicate the TF1 output, in which you can see all the COCO metrics (AP50, ARmax1, etc) as well as eval images with boxes + groundtruth images with boxes. But for me it seems only training is running, not eval. And after inspecting the code, I'm wondering if you have to run one process for training (FLAGS.checkpoint_dir=None) and one for eval (with FLAGS.checkpoint_dir="some directory").


attianopp commented Jul 24, 2020

[screenshot: train_results — COCO eval metrics in TensorBoard all reading 0 or -1]
Yes, I was. Exactly: you need to specify the checkpoint dir when calling model_main_tf2.py to launch the evaluation job that outputs all the COCO metrics, in a separate process, if you have enough RAM to sustain that.
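Concretely, the two launches look something like this (paths are placeholders; both point at the same model_dir so the eval job picks up new checkpoints as they are written):

# terminal 1: training
python model_main_tf2.py \
  --pipeline_config_path=path/to/pipeline.config \
  --model_dir=path/to/model_dir \
  --alsologtostderr

# terminal 2: continuous evaluation (note the extra --checkpoint_dir flag)
python model_main_tf2.py \
  --pipeline_config_path=path/to/pipeline.config \
  --model_dir=path/to/model_dir \
  --checkpoint_dir=path/to/model_dir \
  --alsologtostderr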

My metric output is all off for COCO (0/-1, see the screenshot above), but both my training and test losses are improving and below 1. I also have no bounding boxes in the "Images" tab of TensorBoard where it shows the side_by_side_eval images. I am not sure how to fix the COCO metrics being off (see the attached image); I found a post saying it had to do with training with a batch_size > 1, so I'm re-training my model.

EDIT: After re-training with batch_size=1 I got bounding boxes in the side_by_side_eval. They are incorrect, but that's likely because I barely trained the model (~1000 steps). I will update again when I have trained more (~10000 steps). SECOND UPDATE: still the same issue.

The other likely source of this issue, I think, could be the way I make the tfrecord files. The bounding boxes might be mis-specified, though I used a stock script to generate them. Have you had to troubleshoot the COCO metrics being off?

EDIT: link to #8917, which says these results are due to batch_size/learning rate. Do you know if the learning rate referenced is the learning_rate_base: 0.07999999821186066?


mm7721 commented Jul 24, 2020

I've got the two processes up and running now (I think I prefer the previous TF1 version that interleaved training and eval from a single process, and am kind of hoping they build that back in).

The COCO metrics are coming out as reasonable values (e.g. mAP50 = 0.6), and the images tab now shows the validation images with bounding boxes (both model inferences and groundtruths). So it seems to be working.

Not sure why yours is giving 0/-1 for all the values, but will let you know if I come across any clues. For reference, I trained with a batch size of 8, and ran eval with a batch size of 1.


attianopp commented Jul 24, 2020

It seems I updated my comment with the example output just as you posted. I also agree that I would rather have the single process that interleaves both, as I get OOM errors on my tiny GPU when I try to run both processes at once.

Does that image above provide any insight?

This was the main part of the code used to generate tfrecords:


from __future__ import division
from __future__ import print_function
from __future__ import absolute_import

import os
import io
import pandas as pd
import tensorflow as tf
from PIL import Image
from object_detection.utils import dataset_util
from collections import namedtuple

flags = tf.compat.v1.app.flags
flags.DEFINE_string('csv_input', '', 'Path to the CSV input')
flags.DEFINE_string('output_path', '', 'Path to output TFRecord')
flags.DEFINE_string('image_dir', '', 'Path to images')
FLAGS = flags.FLAGS

def class_text_to_int(row_label):
    # Map the class name from the CSV to the integer id used in the label map.
    if row_label == 'Custom':
        return 1
    else:
        return None

def split(df, group):
    data = namedtuple('data', ['filename', 'object'])
    gb = df.groupby(group)
    return [data(filename, gb.get_group(x)) for filename, x in zip(gb.groups.keys(), gb.groups)]


def create_tf_example(group, path):
    with tf.compat.v1.gfile.GFile(os.path.join(path, '{}'.format(group.filename)), 'rb') as fid:
        encoded_jpg = fid.read()
    encoded_jpg_io = io.BytesIO(encoded_jpg)
    image = Image.open(encoded_jpg_io)
    width, height = image.size

    filename = group.filename.encode('utf8')
    image_format = b'jpg'
    xmins = []
    xmaxs = []
    ymins = []
    ymaxs = []
    classes_text = []
    classes = []

    for index, row in group.object.iterrows():
        xmins.append(row['xmin'] / width)
        xmaxs.append(row['xmax'] / width)
        ymins.append(row['ymin'] / height)
        ymaxs.append(row['ymax'] / height)
        classes_text.append(row['class'].encode('utf8'))
        classes.append(class_text_to_int(row['class']))
    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        # 'image/source_id': dataset_util.bytes_feature(filename),
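        # Note: this writes the same hard-coded source_id ('0') for every image instead of a per-image id.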
        'image/source_id': dataset_util.bytes_feature('0'.encode('utf8')),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(image_format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return tf_example


def main(_):
    writer = tf.compat.v1.python_io.TFRecordWriter(FLAGS.output_path)
    path = os.path.join(FLAGS.image_dir)
    examples = pd.read_csv(FLAGS.csv_input)
    grouped = split(examples, 'filename')
    for group in grouped:
        tf_example = create_tf_example(group, path)
        writer.write(tf_example.SerializeToString())

    writer.close()
    output_path = os.path.join(os.getcwd(), FLAGS.output_path)
    print('Successfully created the TFRecords: {}'.format(output_path))


if __name__ == '__main__':
    tf.compat.v1.app.run()
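
For reference, I run the script like this (generate_tfrecord.py is just a placeholder for whatever the file is named; adjust the paths to your CSV and image folder):

python generate_tfrecord.py --csv_input=data/train_labels.csv --image_dir=images/train --output_path=train.record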

How many steps did you specify for --num_train_steps?

I also read another post (#6273) that said the order of the coordinates in the bounding boxes had to be [xmin, ymin, xmax, ymax]. Does that apply here? Did you specify your tf.train.Example bounding boxes with the coordinates in a different order?

My "train_input_images" on tensorboard don't have bounding boxes even after training w a batch_size=1. The post eval images have bounding boxes in the eval_side_by_side output on tensorboard, but the ground truth images do not (assuming the left side image is the prediction and the right side is ground truth).

ravikyram added the models:research label and removed the models:official label on Jul 24, 2020

mm7721 commented Jul 25, 2020

Regarding your tfrecord generation code, it looks extremely similar to mine. One tiny difference is that I'm using tf2 rather than tf.compat.v1, but I doubt that makes a difference. I don't believe the order of xmin/xmax/ymin/ymax matters, as I think the reader is looking for tags rather than particular indexes (plus, my ordering is identical to yours). What I'd recommend is building a little custom reader so you can open the tfrecord and inspect its contents. That might help uncover issues. Also, have you gotten this particular tfrecord to run with the TF1 object detect API?
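Something along these lines is enough to eyeball the filenames, boxes, and class labels (a rough sketch; the record filename is a placeholder):

import tensorflow as tf

# Print the first couple of examples from a tfrecord to verify the stored features.
for raw_record in tf.data.TFRecordDataset('train.record').take(2):
    example = tf.train.Example()
    example.ParseFromString(raw_record.numpy())
    feats = example.features.feature
    print(feats['image/filename'].bytes_list.value)
    print(feats['image/object/bbox/xmin'].float_list.value)
    print(feats['image/object/bbox/ymin'].float_list.value)
    print(feats['image/object/bbox/xmax'].float_list.value)
    print(feats['image/object/bbox/ymax'].float_list.value)
    print(feats['image/object/class/text'].bytes_list.value)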

Regarding --num_train_steps, I'm not using that parameter. Instead, I specify the number of steps in pipeline.config.

Finally, regarding batch_size, I'm guessing that's a red herring. Mine works with a variety of batch sizes.


attianopp commented Jul 25, 2020

I really appreciate your detailed response. Your intuition was correct: using https://github.com/sulc/tfrecord-viewer I can see that images taken in horizontal orientation have incorrectly placed bboxes. My dataset is made of variable-size images with dimensions HxW or WxH. The record generation process I currently use seems to assume a constant image size, so the horizontal images have boxes that are offset incorrectly, and this likely wouldn't have worked with TF1 either. I am troubleshooting that right now; passing the correct per-image height/width to the tf_example in that tfrecord script doesn't seem to fix it. I'll update this when I figure it out. Thanks again for your help :)

@ramesh8v

@attianopp: My tfrecord creation script is very similar to yours. I resized all images and bboxes to 640x640. My tfrecord looks perfect. The training job is running without any errors, but images in tensorboard look weird (some images are in a different contrast and color), and there are no bboxes. Tried with a variety of batch sizes, number of training steps, and recreating tfrecord, but still, the issue persists. Please let us know if you find a solution.


attianopp commented Jul 29, 2020

@ramesh8v It was an issue with my tfrecords. After I fixed them so that I was sure all my bounding boxes were correct in the tfrecord, I got correct output for the COCO metrics, and my evaluation images on TensorBoard had, for the most part, correct predictions. The input train images don't have bounding boxes, and the contrast/color is changed; I am using faster_rcnn, and I believe it has built-in data augmentation, so that's what you're seeing. Check your evaluation results by running python model_main_tf2.py --model_dir=$model_dir --checkpoint_dir=$model_dir --sample_1_of_n_eval_examples=1 --alsologtostderr. The end of the output should look something like:
[screenshot: Capture — tail of the evaluation output showing the COCO detection metrics]

@ramesh8v

Thank you @attianopp. Yes, my eval results look good.

@attianopp

I too get the same error AttributeError: 'CollectiveAllReduceExtended' object has no attribute '_cfer_fn_cache' when trying to run with num_workers>1. I upgraded to an instance with 4 Tesla T4s on it to try to train EfficientDet-d7 with a larger distributed batch size, on a dataset I was able to produce results with successfully when num_workers=1.

However, I see utilization of all 4 GPUs in watch nvidia-smi when I set num_workers=1 with a larger batch size. It seems to distribute the batch among all GPUs, but also to replicate the model on each GPU separately, if I am reading the verbose output from TensorFlow correctly (taking up a lot of unnecessary space, I assume). Is it set to do distributed training by default?
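EDIT: from my reading of model_main_tf2.py, it looks like num_workers=1 builds a tf.distribute.MirroredStrategy over all visible GPUs (and a MultiWorkerMirroredStrategy only when num_workers > 1), which would explain the per-GPU replication I'm seeing. A quick way to confirm how many replicas it will use (a minimal sketch, not the training script itself):

import tensorflow as tf

# MirroredStrategy replicates the model on every visible GPU and splits each global batch among them.
strategy = tf.distribute.MirroredStrategy()
print('Replicas in sync:', strategy.num_replicas_in_sync)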


04633435 commented Dec 2, 2020

I got the same error when I set num_workers = 2 to train my model on a machine with 2 GPUs and TensorFlow 2.2.0, with the log as follows:

Traceback (most recent call last):
  File "model_main_tf2.py", line 113, in <module>
    tf.compat.v1.app.run()
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\tensorflow\python\platform\app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\absl\app.py", line 300, in run
    _run_main(main, args)
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\absl\app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "model_main_tf2.py", line 104, in main
    model_lib_v2.train_loop(
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\object_detection\model_lib_v2.py", line 561, in train_loop
    load_fine_tune_checkpoint(detection_model,
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\object_detection\model_lib_v2.py", line 361, in load_fine_tune_checkpoint
    strategy.run(
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 951, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\tensorflow\python\distribute\distribute_lib.py", line 2290, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "C:\User\Anaconda3\envs\TF\lib\site-packages\tensorflow\python\distribute\mirrored_strategy.py", line 743, in _call_for_each_replica
    wrapped = self._cfer_fn_cache.get(fn)
AttributeError: 'CollectiveAllReduceExtended' object has no attribute '_cfer_fn_cache'

The problem was fixed after I upgraded TensorFlow from 2.2.0 to 2.3.0. The training process then ran normally, as it did in the one-GPU setup.
Hopefully this is helpful in your case.
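For anyone else hitting this, the upgrade itself was just (assuming a pip-based TensorFlow install inside the env):

pip install --upgrade tensorflow==2.3.0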

jaeyounkim added the models:research:odapi label and removed the models:research label on Jun 25, 2021