update for tf2.4 #908

fsx950223 · 2020-12-15T10:43:31Z

cc @mingxingtan
I met some issues when I train the model with keras multi gpus on tf2.4, could you check it?
I wonder it's a bug of my environment or tensorflow.

fsx950223 · 2020-12-16T08:02:26Z

efficientdet/keras/train.py

@@ -74,7 +74,7 @@
 flags.DEFINE_integer('batch_size', 64, 'training batch size')
 flags.DEFINE_integer('eval_samples', 5000, 'The number of samples for '
                     'evaluation.')
-flags.DEFINE_integer('steps_per_execution', 1000,
+flags.DEFINE_integer('steps_per_execution', 1,


Disable it in default since there are some issues in multi gpus training with uninitlized optimizer.

fsx950223 · 2020-12-21T10:19:17Z

efficientdet/dataloader.py

+        input_processor.set_scale_factors_to_output_size()
+
+      image = input_processor.resize_and_crop_image()
+      boxes, classes = input_processor.resize_and_crop_boxes()


Resize image first could double speed up pipeline.

Interesting finding!

mingxingtan · 2020-12-23T07:22:26Z

efficientdet/dataloader.py

+        input_processor.set_scale_factors_to_output_size()
+
+      image = input_processor.resize_and_crop_image()
+      boxes, classes = input_processor.resize_and_crop_boxes()


Interesting finding!

mingxingtan · 2020-12-23T07:24:43Z

efficientdet/utils.py

-    # TODO(fsx950223): use SyncBatchNorm after TF bug is fixed (incorrect nccl
-    # all_reduce). See https://github.com/tensorflow/tensorflow/issues/41980
-    return BatchNormalization
+    return SyncBatchNormalization


How about the speed of SyncBatchNormalization for multiple GPUs?

It's slower about 40% than BatchNormalization for multiple GPUs. I believe it's acceptable.

According to https://github.com/tensorflow/tensorflow/blob/9489702e35b16a40a1accf3b8b5ed557efae10c7/tensorflow/python/keras/layers/normalization_v2.py#L151.
Should I split replica_ctx.all_reduce?

I did not quite understand these comments, but I think 40% slower is fine.

fsx950223 · 2020-12-24T08:41:20Z

efficientdet/utils.py

@@ -622,13 +622,11 @@ def build_model_with_precision(pp, mm, ii, *args, **kwargs):
    inputs = tf.cast(ii, tf.bfloat16)
    with tf.tpu.bfloat16_scope():
      outputs = mm(inputs, *args, **kwargs)
-    set_precision_policy('float32')


After remove 2 lines, I could train estimator model with recompute_grad and mixed_precision.
Why set policy back to float32? Could I remove them? @mingxingtan

Yes, please feel free to remove it. it is not necessary.

update for tf2.4 (google#908)

* update for tf2.4 * fix mixed precision with recompute gradient * update README * fix multi gpus training * update README * fix LossScaleOptimizer bug * disable steps_per_execution in default * split all reduce

google-cla bot added the cla: yes CLA has been signed. label Dec 15, 2020

fsx950223 requested a review from mingxingtan December 16, 2020 08:01

fsx950223 commented Dec 16, 2020

View reviewed changes

fsx950223 commented Dec 21, 2020

View reviewed changes

mingxingtan approved these changes Dec 23, 2020

View reviewed changes

update for tf2.4

da3171b

fsx950223 force-pushed the fix12 branch from 02d9b47 to da3171b Compare December 24, 2020 02:10

fix mixed precision with recompute gradient

1cfb9f2

fsx950223 commented Dec 24, 2020

View reviewed changes

fsx950223 added 6 commits December 24, 2020 16:44

update README

750b36d

fix multi gpus training

aa70c66

update README

4dd12f6

fix LossScaleOptimizer bug

e771e73

disable steps_per_execution in default

7f2423e

split all reduce

77b627a

mingxingtan approved these changes Dec 26, 2020

View reviewed changes

fsx950223 merged commit 539ab65 into google:master Dec 27, 2020

sujitahirrao added a commit to sujitahirrao/automl that referenced this pull request Dec 27, 2020

Merge pull request #12 from google/master

6dd2aae

update for tf2.4 (google#908)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

update for tf2.4 #908

update for tf2.4 #908

fsx950223 commented Dec 15, 2020

fsx950223 Dec 16, 2020 •

edited

Loading

fsx950223 Dec 21, 2020

mingxingtan Dec 23, 2020

mingxingtan Dec 23, 2020

mingxingtan Dec 23, 2020

fsx950223 Dec 24, 2020

fsx950223 Dec 25, 2020

mingxingtan Dec 26, 2020

fsx950223 Dec 24, 2020 •

edited

Loading

mingxingtan Dec 26, 2020

update for tf2.4 #908

update for tf2.4 #908

Conversation

fsx950223 commented Dec 15, 2020

fsx950223 Dec 16, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fsx950223 Dec 24, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fsx950223 Dec 16, 2020 •

edited

Loading

fsx950223 Dec 24, 2020 •

edited

Loading