Custom Training with TF 2 Object Detect API Fails #8951
Comments
Have you made sure your pipeline.config has checkpoint_type = "detection"? I got the same type of errors, and num_workers=1 resolved it for me.
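For reference, the field lives under train_config in pipeline.config (in the TF2 configs it's spelled fine_tune_checkpoint_type); a minimal sketch with a placeholder checkpoint path:

```
train_config {
  fine_tune_checkpoint: "path/to/checkpoint/ckpt-0"   # placeholder path
  fine_tune_checkpoint_type: "detection"              # zoo configs ship with "classification"
  ...
}
```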
@attianopp, thanks for the pointer - that solved failure mode #1. Not sure who maintains the zoo, but if the owners happen to see this post: could you update the .config files to have the correct default as "detection" instead of "classification"? Failure mode #2 is still happening, so I think this thread needs to remain open.
@attianopp, did you manage to get training + evaluation to show up on tensorboard? I'm trying to replicate the TF1 output, in which you can see all the COCO metrics (AP50, ARmax1, etc.) as well as eval images with boxes + groundtruth images with boxes. But for me it seems only training is running, not eval. And after inspecting the code, I'm wondering if you have to run one process for training (FLAGS.checkpoint_dir=None) and one for eval (with FLAGS.checkpoint_dir="some directory").
My COCO metric output is all off (0/-1, see image), but both my training and test losses are improving and <1. I also have no bounding boxes in the "images" tab of tensorboard where it shows the side_by_side_eval images. I'm not sure how to fix the COCO metrics being off (see attached image); I found a post saying it had to do with training with a batch_size > 1, so I'm re-training my model.

EDIT: After re-training with batch_size=1 I got bounding boxes in the side_by_side_eval. They are incorrect, but likely because I barely trained the model (~1000 steps). I will update again when I have trained more (~10000 steps).

SECOND UPDATE: still the same issue. The other likely source of this issue, I think, could be the way I make the tfrecord files. The bounding boxes might be mis-specified, but I used a stock script to generate them. Have you had to troubleshoot the COCO metrics being off?

EDIT: link to #8917 saying these results are due to batch_size/learning rate. Do you know if the learning rate referenced is the
I've got the two processes up and running now (I think I prefer the previous TF1 version that interleaved training and eval from a single process, and am kind of hoping they build that back in). The COCO metrics are coming out as reasonable values (e.g. mAP50 = 0.6), and the images tab now shows the validation images with bounding boxes (both model inferences and groundtruths). So it seems to be working. Not sure why yours is giving 0/-1 for all the values, but will let you know if I come across any clues. For reference, I trained with a batch size of 8, and ran eval with a batch size of 1.
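For reference, the two invocations I'm running look roughly like this (paths are placeholders; the flags are the ones exposed by model_main_tf2.py):

```
# training process
python model_main_tf2.py \
  --pipeline_config_path=path/to/pipeline.config \
  --model_dir=path/to/model_dir \
  --alsologtostderr

# eval process: passing --checkpoint_dir switches the script into
# evaluation mode, and it keeps evaluating the newest checkpoint it finds
python model_main_tf2.py \
  --pipeline_config_path=path/to/pipeline.config \
  --model_dir=path/to/model_dir \
  --checkpoint_dir=path/to/model_dir \
  --alsologtostderr
```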
It looks like I updated my comment with the example output just as you posted. I also agree that I'd rather have the single process that interleaves both, as I get OOM errors on my tiny GPU when I try to run both processes at once. Does the image above provide any insight? This was the main part of the code used to generate tfrecords:
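It follows the standard create_tf_example pattern from the OD API tutorials; a generic sketch of that pattern (the paths, class handling, and image loading here are illustrative rather than my exact script):

```python
import io

import tensorflow as tf
from PIL import Image
from object_detection.utils import dataset_util


def create_tf_example(image_path, boxes, class_names, class_ids):
    """boxes are (xmin, ymin, xmax, ymax) in absolute pixel coordinates."""
    with tf.io.gfile.GFile(image_path, 'rb') as f:
        encoded_jpg = f.read()
    image = Image.open(io.BytesIO(encoded_jpg))
    width, height = image.size  # per-image size, not a hard-coded constant

    # Normalize box coordinates to [0, 1] as the OD API expects.
    xmins = [b[0] / width for b in boxes]
    xmaxs = [b[2] / width for b in boxes]
    ymins = [b[1] / height for b in boxes]
    ymaxs = [b[3] / height for b in boxes]

    feature = {
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(image_path.encode('utf8')),
        'image/source_id': dataset_util.bytes_feature(image_path.encode('utf8')),
        'image/encoded': dataset_util.bytes_feature(encoded_jpg),
        'image/format': dataset_util.bytes_feature(b'jpeg'),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(
            [n.encode('utf8') for n in class_names]),
        'image/object/class/label': dataset_util.int64_list_feature(class_ids),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))
```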
How many steps did you specify for --num_train_steps? I also read another post (#6273) that said the order of the coords in the bounding boxes had to be [xmin, ymin, xmax, ymax]. Does that apply here? Did you specify your tf example bounding boxes with the coords in a different order? My "train_input_images" on tensorboard don't have bounding boxes even after training with batch_size=1. The post-eval images have bounding boxes in the eval_side_by_side output on tensorboard, but the ground-truth images do not (assuming the left-side image is the prediction and the right side is the ground truth).
Regarding your tfrecord generation code, it looks extremely similar to mine. One tiny difference is that I'm using tf2 rather than tf.compat.v1, but I doubt that makes a difference. I don't believe the order of xmin/xmax/ymin/ymax matters, as I think the reader is looking for tags rather than particular indexes (plus, my ordering is identical to yours). What I'd recommend is building a little custom reader so you can open the tfrecord and inspect its contents. That might help uncover issues. Also, have you gotten this particular tfrecord to run with the TF1 object detect API? Regarding --num_train_steps, I'm not using that parameter. Instead, I specify the number of steps in pipeline.config. Finally, regarding batch_size, I'm guessing that's a red herring. Mine works with a variety of batch sizes.
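By a "little custom reader" I mean something minimal along these lines (a sketch assuming the standard OD API feature keys and a placeholder record file name):

```python
import tensorflow as tf

# Dump the bbox fields of the first few records so you can eyeball
# whether the coordinates are normalized and placed correctly.
for raw in tf.data.TFRecordDataset('train.record').take(3):
    ex = tf.train.Example()
    ex.ParseFromString(raw.numpy())
    feat = ex.features.feature
    print('file:', feat['image/filename'].bytes_list.value)
    print('h/w :', feat['image/height'].int64_list.value,
          feat['image/width'].int64_list.value)
    print('xmin:', list(feat['image/object/bbox/xmin'].float_list.value))
    print('ymin:', list(feat['image/object/bbox/ymin'].float_list.value))
    print('xmax:', list(feat['image/object/bbox/xmax'].float_list.value))
    print('ymax:', list(feat['image/object/bbox/ymax'].float_list.value))
```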
I really appreciate your detailed response. Your intuition was correct: using https://github.com/sulc/tfrecord-viewer I can see that images taken in horizontal orientation have incorrectly placed bboxes. My dataset is made of variable-size images with dimensions HxW or WxH. The record-generation process I currently use seems to assume a constant image size, so the horizontal images have boxes that are offset incorrectly, which means this likely wouldn't work with TF1 either. I am troubleshooting that right now; passing the correct per-image height/width into the tf_example in that tf_record script doesn't seem to fix it. I'll update this when I figure it out. Thanks again for your help :)
@attianopp: My tfrecord creation script is very similar to yours. I resized all images and bboxes to 640x640, and my tfrecord looks perfect. The training job runs without any errors, but images in tensorboard look weird (some images have different contrast and color), and there are no bboxes. I tried a variety of batch sizes and numbers of training steps, and recreated the tfrecord, but the issue persists. Please let us know if you find a solution.
@ramesh8v It was an issue with my tf-records. After I fixed it so that all my bounding boxes were correct in the tf-record, I got correct output for the COCO metrics, and my evaluation images on tensorboard had, for the most part, correct predictions. The input train images don't have bounding boxes, and the contrast/color is changed; I am using faster_rcnn and I believe it has built-in data augmentation, so that's what you're seeing. Check your evaluation results by running
Thank you @attianopp. Yes, my eval results look good.
I too get the same error. However, I see utilization of the 4 GPUs on
I got the same error when I set the
The problem was fixed after I upgraded tensorflow from 2.2.0 to 2.3.0. The training process then ran normally, as it did on the one-GPU setup.
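In case it helps anyone else hitting this, the upgrade itself was just the usual pip command (version pin shown for clarity):

```
pip install --upgrade tensorflow==2.3.0
```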
Prerequisites
Please answer the following questions for yourself before submitting an issue.
1. The entire URL of the file you are using
https://github.com/tensorflow/models/tree/master/official/...
2. Describe the bug
I'm attempting to migrate from the TF1 object detection API to the TF2 object detection API. The exact model isn't available in the TF2 version (quantized SSD MobileNet v2), so I'm using EfficientDet-d0. But I'm attempting to keep as many things the same as possible, including using the exact same tfrecord training and validation files, similar config settings, etc. And I'm starting the fine-tuning from the config + weights found in the TF zoo. Note that there are 4 classes, and the config and label_map files have been updated appropriately.
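For reference, the label_map entries follow the usual pbtxt layout; the class names below are placeholders, not my actual 4 classes:

```
item {
  id: 1
  name: 'class_a'   # placeholder name
}
item {
  id: 2
  name: 'class_b'
}
# ... ids 3 and 4 follow the same pattern
```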
This is being run locally on a machine with 2 GPUs, and I had to use tf.config.experimental.set_memory_growth(gpu, True) to get it to run at all.
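For completeness, that memory-growth tweak is just the standard TF2 call applied to each visible GPU before training starts:

```python
import tensorflow as tf

# Let TensorFlow grow GPU memory on demand instead of reserving it all up front.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```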
Two failure modes are observed.

Failure mode #1: unresolved checkpoint objects (warnings when restoring the fine-tune checkpoint):
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.moving_mean
W0723 12:41:23.119486 139840608290624 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.moving_mean
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.moving_variance
W0723 12:41:23.119561 139840608290624 util.py:143] Unresolved object in checkpoint: (root).model._feature_extractor._bifpn_stage.node_input_blocks.7.0.1.1.moving_variance
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W0723 12:41:23.119644 139840608290624 util.py:151] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
File "/home/user1/tf2odapi/models/research/object_detection/model_lib_v2.py", line 549, in train_loop
load_fine_tune_checkpoint(detection_model,
File "/home/user1/tf2odapi/models/research/object_detection/model_lib_v2.py", line 357, in load_fine_tune_checkpoint
strategy.run(
File "/home/user1/anaconda3/envs/tf2odapi/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 951, in run
return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
File "/home/user1/anaconda3/envs/tf2odapi/lib/python3.8/site-packages/tensorflow/python/distribute/distribute_lib.py", line 2290, in call_for_each_replica
return self._call_for_each_replica(fn, args, kwargs)
File "/home/user1/anaconda3/envs/tf2odapi/lib/python3.8/site-packages/tensorflow/python/distribute/mirrored_strategy.py", line 743, in _call_for_each_replica
wrapped = self._cfer_fn_cache.get(fn)
AttributeError: 'CollectiveAllReduceExtended' object has no attribute '_cfer_fn_cache'
3. Steps to reproduce
Can't be reproduced exactly on your end, as it involves some local files.
4. Expected behavior
Expect training to run successfully.
5. Additional context
Include any logs that would be helpful to diagnose the problem.
6. System information