
recompute_grad does not save memory and is incompatible with graph mode #36981

Closed
BinyanHu opened this issue Feb 22, 2020 · 19 comments
Labels: comp:gpu (GPU related issues) · stale (to be closed automatically if no activity) · stat:awaiting response (awaiting response from author) · TF 2.1 (for tracking issues in the 2.1 release) · type:performance (Performance Issue)

Comments


BinyanHu commented Feb 22, 2020

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution: Linux Ubuntu 16.04 and Windows 10
  • TensorFlow installed from (source or binary): binary (pip install)
  • TensorFlow version: 2.1.0
  • Python version: 3.7
  • CUDA/cuDNN version: CUDA 10.2 + cuDNN 7.6.5 (Windows), CUDA 10.1 + cuDNN 7.6.5 + TensorRT 6 (Ubuntu)
  • GPU model and memory: GeForce GTX 1060 with Max-Q Design, 6GB (Windows) and GeForce GTX 1080 Ti, 12GB (Ubuntu)

Describe the current behavior
Wrapping Keras layers with tf.recompute_grad has no effect. I built a DenseNet model and wrapped each bn-relu-conv1x1-bn-relu-conv block with the function, but saw no GPU memory reduction on either the Windows or the Ubuntu platform. When eager mode is disabled, it throws "ValueError: Variable <tf.Variable 'batch_normalization/gamma:0' shape=(32,) dtype=float32> has None for gradient.", indicating that tf.recompute_grad blocks gradient backpropagation in graph mode.

Describe the expected behavior
The function appears to originate from OpenAI's gradient checkpointing (https://github.com/cybertronai/gradient-checkpointing) and is expected to save GPU memory during training. Recently, a TensorFlow implementation of efficient DenseNets (https://github.com/joeyearsley/efficient_densenet_tensorflow) also used this function to perform gradient checkpointing (it used tf.contrib.layers.recompute_grad in TF1 graph mode, so not exactly the same environment as this case).

Please fix the incompatibility so that the function also works in graph mode. If the function is designed to perform gradient checkpointing, please verify its effectiveness. If it is not supposed to support efficient DenseNets, please provide a correct and effective implementation.

Standalone code to reproduce the issue

import os

import tensorflow as tf
import tensorflow_datasets as tfds
from absl import app, flags
from absl.flags import FLAGS
from tensorflow import keras

flags.DEFINE_list("gpu",
                  default=None,
                  help="index of GPU")
flags.DEFINE_bool("recompute_grad",
                  default=False,
                  help="whether to recompute gradients to save GPU RAM")
flags.DEFINE_integer("batch_size",
                     default=1024,
                     help="batch size")
flags.DEFINE_bool("graph",
                  default=False,
                  help="use graph mode instead of eager mode")


def dense_lenet(inputs):
    net = keras.layers.Conv2D(32, 5, strides=2, use_bias=False, padding="SAME")(inputs)

    for _ in range(5):
        def _block(x):
            x = keras.layers.BatchNormalization()(x)
            x = keras.layers.ReLU()(x)
            x = keras.layers.Conv2D(16, 1, use_bias=False, padding="SAME")(x)
            x = keras.layers.BatchNormalization()(x)
            x = keras.layers.ReLU()(x)
            x = keras.layers.Conv2D(4, 3, use_bias=False, padding="SAME")(x)
            return x
        if FLAGS.recompute_grad:
            _block = tf.recompute_grad(_block)
        net = keras.layers.concatenate([net, _block(net)])

    net = keras.layers.BatchNormalization()(net)
    net = keras.layers.ReLU()(net)
    net = keras.layers.Conv2D(64, 1, use_bias=False, padding="SAME")(net)
    net = keras.layers.AveragePooling2D()(net)

    for _ in range(10):
        def _block(x):
            x = keras.layers.BatchNormalization()(x)
            x = keras.layers.ReLU()(x)
            x = keras.layers.Conv2D(32, 1, use_bias=False, padding="SAME")(x)
            x = keras.layers.BatchNormalization()(x)
            x = keras.layers.ReLU()(x)
            x = keras.layers.Conv2D(8, 3, use_bias=False, padding="SAME")(x)
            return x
        if FLAGS.recompute_grad:
            _block = tf.recompute_grad(_block)
        net = keras.layers.concatenate([net, _block(net)])

    net = keras.layers.BatchNormalization()(net)
    net = keras.layers.ReLU()(net)
    net = keras.layers.Conv2D(128, 1, use_bias=False, padding="SAME")(net)
    net = keras.layers.AveragePooling2D()(net)

    for _ in range(10):
        def _block(x):
            x = keras.layers.BatchNormalization()(x)
            x = keras.layers.ReLU()(x)
            x = keras.layers.Conv2D(32, 1, use_bias=False, padding="SAME")(x)
            x = keras.layers.BatchNormalization()(x)
            x = keras.layers.ReLU()(x)
            x = keras.layers.Conv2D(8, 3, use_bias=False, padding="SAME")(x)
            return x
        if FLAGS.recompute_grad:
            _block = tf.recompute_grad(_block)
        net = keras.layers.concatenate([net, _block(net)])

    net = keras.layers.BatchNormalization()(net)
    net = keras.layers.ReLU()(net)
    net = keras.layers.GlobalAveragePooling2D()(net)

    net = keras.layers.Dense(10)(net)
    net = keras.layers.Softmax()(net)

    return net


def main(_):
    if FLAGS.gpu:
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(map(str, FLAGS.gpu))
    if FLAGS.graph:
        tf.compat.v1.disable_eager_execution()
        tf.compat.v1.keras.backend.set_session(
            session=tf.compat.v1.Session(
                config=tf.compat.v1.ConfigProto(
                    gpu_options=tf.compat.v1.GPUOptions(
                        allow_growth=True
                    )
                )
            )
        )
    else:
        for gpu in tf.config.experimental.list_physical_devices('GPU'):
            tf.config.experimental.set_memory_growth(gpu, True)

    tfds.core.constants.DATA_DIR = "data"
    dataset_builder = tfds.image.FashionMNIST(version="3.*.*")
    dataset_builder.download_and_prepare()
    dataset = dataset_builder.as_dataset(
        split="train",
        shuffle_files=True,
        as_supervised=True,
    ).repeat().batch(FLAGS.batch_size)

    inputs = keras.layers.Input((28, 28, 1), batch_size=FLAGS.batch_size)
    model = keras.Model(inputs, dense_lenet(inputs))

    model.compile(
        optimizer='adam',
        # the model already ends in a Softmax layer, so the loss must not expect logits
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
        metrics=['accuracy']
    )
    model.summary()

    model.fit(
        x=dataset,
        epochs=3,
        steps_per_epoch=60000//FLAGS.batch_size,
    )


if __name__ == "__main__":
    app.run(main)
davisyoshida commented Feb 23, 2020

@BinyanHu I've got a minimal example showing that it doesn't work for memory reduction over here: #30418 (comment)
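
For a quick self-contained check, something along these lines can be used to compare peak GPU memory with and without the wrapper (a sketch; the memory-stats calls assume TF 2.5+, on TF 2.1 you would have to watch nvidia-smi instead):

import tensorflow as tf

# eight wide Dense layers: enough intermediate activations to make the
# peak memory difference visible if recomputation works
layers = [tf.keras.layers.Dense(4096, activation="relu") for _ in range(8)]

def block(x):
    for layer in layers:
        x = layer(x)
    return x

wrapped = tf.recompute_grad(block)  # swap in `block` to compare

x = tf.random.normal([256, 4096])
tf.config.experimental.reset_memory_stats("GPU:0")
with tf.GradientTape() as tape:
    tape.watch(x)
    loss = tf.reduce_sum(wrapped(x))
tape.gradient(loss, x)
print(tf.config.experimental.get_memory_info("GPU:0")["peak"])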

@davisyoshida

@BinyanHu if you're interested, I've written a simple gradient checkpointing decorator here
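
For context, the core of such a decorator can be sketched with tf.custom_gradient: throw away the forward activations and re-run the function under a fresh tape during backprop. This is illustrative, not necessarily the linked implementation:

import tensorflow as tf

def checkpointed(f):
    @tf.custom_gradient
    def wrapper(*args):
        result = f(*args)
        def grad(upstream, variables=None):
            # recompute the forward pass to rebuild the activations
            with tf.GradientTape() as tape:
                tape.watch(args)
                recomputed = f(*args)
            sources = list(args) + list(variables or [])
            grads = tape.gradient(recomputed, sources, output_gradients=upstream)
            if variables is None:
                return grads
            return grads[:len(args)], grads[len(args):]
        return result, grad
    return wrapper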

@mathemakitten

@davisyoshida Thanks for doing this! I've been looking at implementing the same thing; will test out yours.

@davisyoshida

@mathemakitten Happy to help! Do let me know if you run into any issues.


BinyanHu commented Mar 5, 2020

@davisyoshida Thank you for sharing. Will test your implementation as soon as possible!

@pawngrubber

@davisyoshida Does this work with Keras? If so, can you provide a small example of how to use it with Keras?


pidajay commented Apr 18, 2020

@Paulter I have a version working with Keras, but for sequential models only.
I have created a pull request in the TF Addons GitHub repo: tensorflow/addons#1600.
You can find an example notebook here: https://github.com/pidajay/addons/blob/grad_checkpointing_eager/docs/tutorials/training_gradient_checkpointing.ipynb
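
The sequential-only restriction makes sense: the model can be cut into flat segments and each segment recomputed on the backward pass. A hypothetical sketch of the idea (checkpointed_forward and segment_size are illustrative names, not the tensorflow/addons#1600 API):

import tensorflow as tf

def checkpointed_forward(model, x, segment_size=4):
    # apply a Sequential model in recomputed chunks of `segment_size` layers
    for i in range(0, len(model.layers), segment_size):
        segment = model.layers[i:i + segment_size]
        def run_segment(t, segment=segment):  # bind segment at definition time
            for layer in segment:
                t = layer(t)
            return t
        x = tf.recompute_grad(run_segment)(x)
    return x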

@BinyanHu

Thank you @pidajay

@pawngrubber

@pidajay thanks for the work on this.

You say this only works on sequential models? Unfortunately, my Keras model is too complex to be a sequential model, so I can't use your code. Is there a way I can use what you've written?

I don't mind manually checkpointing; in fact, it is probably preferable. I'm writing custom Keras layers for research, and it would be nice to have something that specifies that the gradient for a particular layer should be recomputed.

Unfortunately, I don't really understand the documentation for recompute_grad here: https://www.tensorflow.org/api_docs/python/tf/recompute_grad

The documentation there seems to imply that you go:

my_layer = tf.recompute_grad(keras.layers.Conv2D(...))

but this gives no memory improvements.

Any chance that I can use what you've written, even if it's in a manual way?
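
For what it's worth, the docs seem to intend wrapping a function that applies already-built layers, not a freshly constructed layer object, roughly like this (whether it actually reduces memory in TF 2.1 is exactly what this issue disputes):

import tensorflow as tf

conv = tf.keras.layers.Conv2D(64, 3, padding="same", use_bias=False)
bn = tf.keras.layers.BatchNormalization()

def block(x):
    return tf.nn.relu(bn(conv(x)))

block = tf.recompute_grad(block)  # wrap the function, not the layer

x = tf.random.normal([8, 32, 32, 3])
with tf.GradientTape() as tape:
    y = tf.reduce_sum(block(x))
grads = tape.gradient(y, conv.trainable_variables + bn.trainable_variables)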


pidajay commented Apr 29, 2020

@Paulter I have posted a small tutorial here: https://github.com/pidajay/tf2_gradient_checkpointing/blob/master/tf_recompute_grad_tutorial.ipynb
For this to work you need to replace the custom_gradient.py file (or just copy over the delta) with this version from my TF fork: https://github.com/pidajay/tensorflow/blob/fix_gradient_checkpointing/tensorflow/python/ops/custom_gradient.py
I plan to submit this fix as a PR soon, but I'm not sure if the TF folks would be interested.
Unfortunately, my example only demonstrates how to do this for a Keras sequential model in eager mode, but splitting a functional or custom model and invoking recompute_grad should work the same way. I just need to check whether the graph-mode decorator has the same bug as the eager-mode decorator (the conversation at the top of this thread says it has been fixed). Will dig into this this week and let you know. Hope this helps.
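
Splitting a custom model would look something along these lines (a sketch, assuming the patched custom_gradient.py from the fork above):

import tensorflow as tf

class CheckpointedModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        # two arbitrary segments; each one gets recomputed during backprop
        self.segment1 = tf.keras.Sequential(
            [tf.keras.layers.Dense(512, activation="relu") for _ in range(4)])
        self.segment2 = tf.keras.Sequential(
            [tf.keras.layers.Dense(512, activation="relu") for _ in range(4)])
        self.head = tf.keras.layers.Dense(10)

    def call(self, x):
        x = tf.recompute_grad(self.segment1)(x)
        x = tf.recompute_grad(self.segment2)(x)
        return self.head(x)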

@YuhuaBillChen

Any news on graph-mode models? I tried the code from @pidajay, but whenever I passed any keyword arguments such as variables to the recompute_grad function, TF raised the error "The custom_gradient decorator currently supports keywords arguments only when eager execution is enabled".

@mathemakitten

If you're looking to do gradient checkpointing in graph mode, I suggest the tf-slim implementation here, which I've extracted and successfully tested on tf-nightly in graph mode on TPU: https://github.com/google-research/tf-slim/blob/a62dc893de5e46e6f2e9ec24a74b2abce026307a/tf_slim/layers/rev_block_lib.py
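
Usage presumably mirrors the old tf.contrib.layers surface, something like the following sketch (check rev_block_lib.py for the exact decorator name and signature before relying on it):

import tensorflow.compat.v1 as tf
from tf_slim.layers import rev_block_lib

tf.disable_eager_execution()

@rev_block_lib.recompute_grad
def block(x):
    x = tf.layers.dense(x, 1024, activation=tf.nn.relu)
    return tf.layers.dense(x, 1024, activation=tf.nn.relu)

x = tf.placeholder(tf.float32, [None, 1024])
loss = tf.reduce_sum(block(x))
grads = tf.gradients(loss, tf.trainable_variables())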

@YuhuaBillChen

> If you're looking to do gradient checkpointing in graph mode, I suggest the tf-slim implementation here, which I've extracted and successfully tested on tf-nightly in graph mode on TPU: https://github.com/google-research/tf-slim/blob/a62dc893de5e46e6f2e9ec24a74b2abce026307a/tf_slim/layers/rev_block_lib.py

Thanks for the advice. I tried the extracted code from tf-slim. It did work to some degree, but in my case it only reduced memory usage by 5%. In the end I copied the Graph Editor from TensorFlow 1.15's contrib library, and with OpenAI's gradient checkpointing I got a 40% memory reduction at the cost of 48% longer training time.
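
The OpenAI setup described above boils down to a monkey patch in TF1 graph mode, roughly as follows (a sketch; it requires memory_saving_gradients.py from the cybertronai repo plus a working graph_editor, e.g. the one copied out of TF 1.15 contrib):

import tensorflow.compat.v1 as tf
import memory_saving_gradients  # from the gradient-checkpointing repo

tf.disable_eager_execution()

# route every tf.gradients call through the checkpointing version,
# which picks recomputation points automatically
tf.__dict__["gradients"] = memory_saving_gradients.gradients_memory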

@Apprisco

This still doesn't seem to work with a custom Keras model.


nyngwang commented Aug 14, 2022

@BinyanHu Did you find any workaround for gradient-checkpointing that indeed works?

@Venkat6871

Hi,

Thank you for opening this issue. Since this issue has been open for a long time, the code/debug information for this issue may not be relevant with the current state of the code base.

The TensorFlow team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration, which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings and all the debugging information that could help us investigate.

Please follow the release notes to stay up to date with the latest developments happening in the TensorFlow space.


github-actions bot commented Aug 3, 2024

This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.

github-actions bot

This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.

