`recompute_grad` does not save memory and is incompatible with graph mode (#36981)
@BinyanHu I've got a minimal example showing that it doesn't work for memory reduction over here: #30418 (comment)
@davisyoshida Thanks for doing this! I've been looking at implementing the same thing; will test out yours.
@mathemakitten Happy to help! Do let me know if you run into any issues.
@davisyoshida Thank you for sharing. Will test your implementation as soon as possible!
@davisyoshida Does this work with Keras? If so, can you provide a small example of how to use it with Keras?
@Paulter I have a version working with Keras, but sequential models only.
Thank you @pidajay
@pidajay Thanks for the work on this. You say this only works on sequential models? Unfortunately, my Keras model is too complex to be a sequential model, so I can't use your code. Is there a way I can use what you've written? I don't mind manually checkpointing; in fact, it is probably preferable. I'm writing custom Keras layers for research, and it would be nice to have something that specifies to recompute the gradient for a particular layer. Unfortunately, I don't really understand the documentation for `tf.recompute_grad`. The documentation there seems to imply that you go: … but this gives no memory improvements. Any chance that I can use what you've written, even if it's in a manual way?
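The snippet lost from the comment above presumably showed the basic per-layer wrapping pattern from the docs. A minimal sketch of that pattern (layer sizes and names are illustrative, not from the original comment):

```python
import tensorflow as tf

# Wrap a single Keras layer so that its forward activations are
# recomputed during the backward pass instead of being stored.
dense = tf.keras.layers.Dense(8, activation="relu")
dense.build((None, 16))                 # create variables before wrapping
wrapped_dense = tf.recompute_grad(dense)

x = tf.ones((4, 16))
with tf.GradientTape() as tape:
    y = wrapped_dense(x)                # forward pass through the wrapped layer
    loss = tf.reduce_sum(y)

# Gradients are taken through the recomputed forward pass.
grads = tape.gradient(loss, dense.trainable_variables)
```

As the comments in this thread report, wrapping individual layers this way runs in eager mode but does not necessarily yield the expected memory reduction.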
@Paulter I have posted a small tutorial here https://github.com/pidajay/tf2_gradient_checkpointing/blob/master/tf_recompute_grad_tutorial.ipynb
Any news for the graph mode models? I tried to use the code from @pidajay. Still, as long as I passed any keywords like …
If you're looking to do gradient checkpointing in graph mode, I suggest the implementation in tf-slim, which I've extracted and successfully tested on tf-nightly in graph mode on TPU: https://github.com/google-research/tf-slim/blob/a62dc893de5e46e6f2e9ec24a74b2abce026307a/tf_slim/layers/rev_block_lib.py
Thanks for your advice. I tried the extracted code from tf-slim. It did work to some degree, but in my case it only reduced memory usage by about 5%. Finally, I just copied the Graph Editor from TensorFlow v1.15's contrib library. With OpenAI's gradient checkpointing, I got a memory reduction of 40% at the cost of 48% longer training time.
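For readers unfamiliar with the technique being traded off here, the core idea behind OpenAI-style gradient checkpointing can be shown in a few lines of plain Python (a toy chain of scalar functions, not TensorFlow): store only every k-th activation during the forward pass, and recompute the missing ones from the nearest checkpoint during the backward pass. Memory drops, at the cost of redoing parts of the forward computation.

```python
# Toy chain y = f3(f2(f1(x))). Each entry is (function, derivative).
fns = [
    (lambda x: 2.0 * x, lambda x: 2.0),      # f1(x) = 2x
    (lambda x: x + 3.0, lambda x: 1.0),      # f2(x) = x + 3
    (lambda x: x * x,   lambda x: 2.0 * x),  # f3(x) = x^2
]

def forward(x, every=2):
    """Run the chain, storing only every `every`-th activation."""
    ckpts = {0: x}
    for i, (f, _) in enumerate(fns):
        x = f(x)
        if (i + 1) % every == 0:
            ckpts[i + 1] = x
    return x, ckpts

def backward(ckpts, every=2):
    """Chain rule from output back to input, recomputing activations
    from the nearest stored checkpoint instead of keeping them all."""
    grad = 1.0
    for i in reversed(range(len(fns))):
        start = (i // every) * every         # nearest checkpoint at or before i
        a = ckpts[start]
        for j in range(start, i):            # recompute f_start .. f_{i-1}
            a = fns[j][0](a)
        grad *= fns[i][1](a)                 # multiply in the local derivative
    return grad

y, ckpts = forward(1.0)   # y = (2*1 + 3)^2 = 25.0
g = backward(ckpts)       # dy/dx = 4*(2x + 3) = 20.0 at x = 1
```

The 40%-memory / 48%-time figure reported above is exactly this trade: fewer stored activations, more recomputation.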
This still doesn't seem to work with a custom Keras model.
@BinyanHu Did you find any workaround for gradient checkpointing that indeed works?
Hi, thank you for opening this issue. Since this issue has been open for a long time, the code/debug information for this issue may not be relevant to the current state of the code base. The TensorFlow team is constantly improving the framework by fixing bugs and adding new features. We suggest you try the latest TensorFlow version with the latest compatible hardware configuration, which could potentially resolve the issue. If you are still facing the issue, please create a new GitHub issue with your latest findings and all the debugging information that could help us investigate. Please follow the release notes to stay up to date with the latest developments happening in the TensorFlow space.
This issue is stale because it has been open for 7 days with no activity. It will be closed if no further activity occurs. Thank you.
This issue was closed because it has been inactive for 7 days since being marked as stale. Please reopen if you'd like to work on this further.
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No.
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 16.04 and Windows 10.
- TensorFlow installed from (source or binary): binary (pip install)
Describe the current behavior
Using `tf.recompute_grad` to wrap Keras layers does not take any effect. I build a DenseNet model and wrap each "bn-relu-conv1x1-bn-relu-conv" block with the function, but I have not seen any GPU memory reduction on either the Windows or Ubuntu platform. When eager mode is disabled, it throws "ValueError: Variable <tf.Variable 'batch_normalization/gamma:0' shape=(32,) dtype=float32> has `None` for gradient.", indicating that using `recompute_grad` blocks gradient backpropagation in graph mode.

Describe the expected behavior
The function appears to originate from OpenAI's gradient checkpointing (https://github.com/cybertronai/gradient-checkpointing) and is expected to save GPU memory during training. Recently, a TensorFlow implementation of efficient DenseNets (https://github.com/joeyearsley/efficient_densenet_tensorflow) also used this function to perform gradient checkpointing (they used `tf.contrib.layers.recompute_grad` in TF1 graph mode, so not exactly the same environment as our case). Please fix the incompatibility bug so that the function also works in graph mode. If the function is designed to perform gradient checkpointing, please verify its effectiveness; if it is not supposed to implement efficient DenseNets, please provide the correct and effective implementation.
Standalone code to reproduce the issue
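The reporter's snippet was not preserved in this thread. A minimal sketch of the setup described above — wrapping a bn-relu-conv block with `tf.recompute_grad` — might look like the following (layer sizes are illustrative, not the reporter's actual model):

```python
import tensorflow as tf

def make_block(filters, input_shape):
    """A bn-relu-conv block like the ones described above, wrapped so
    its activations are recomputed during the backward pass."""
    inner = tf.keras.Sequential([
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.Conv2D(filters, 3, padding="same"),
    ])
    inner.build(input_shape)            # create variables before wrapping
    return tf.recompute_grad(inner)

block = make_block(8, (None, 16, 16, 4))
x = tf.random.normal((2, 16, 16, 4))
with tf.GradientTape() as tape:
    y = block(x)
    loss = tf.reduce_sum(y)
# Per this issue: in eager mode this runs but shows no memory savings;
# with eager execution disabled it raised the ValueError quoted above.
```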