Gradient checkpointing #711
Conversation
I mean, this could help,
@NikZak Hi
Also, above is a naive example for recompute_grad; you might need to split differently.
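For anyone who has not used it, here is a minimal, hedged sketch of wrapping a block with tf.recompute_grad (the function and tensor names are illustrative, not the example referenced above):

```python
import tensorflow as tf

# Hedged sketch: the block wrapped with tf.recompute_grad has its intermediate
# activations recomputed during backprop instead of being kept in memory.
w1 = tf.random.normal([256, 256])
w2 = tf.random.normal([256, 256])

@tf.recompute_grad
def block(x, w1, w2):
  h = tf.nn.relu(tf.matmul(x, w1))
  return tf.nn.relu(tf.matmul(h, w2))

x = tf.random.normal([8, 256])
with tf.GradientTape() as tape:
  tape.watch([w1, w2])
  loss = tf.reduce_sum(block(x, w1, w2))
grads = tape.gradient(loss, [w1, w2])  # gradients computed by recomputing `block`
```

In real training the block would wrap layers holding tf.Variables, which is where the caveats about recompute_grad discussed below come in.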
Awesome! But I have the same issue as @LaurensHagendoorn. I prefer to fix the network.
@kartik4949 In principle, my algorithm does the same thing, but instead of specifying the parts explicitly I looked at the graph of EfficientNet, decided that 'Add' would be a good node at which to split the graph, and added functionality to split the graph by node name. I did not try recompute_grad because of this [thread](tensorflow/tensorflow#36981) and some other threads mentioning that recompute_grad does not help in reimplementing gradient checkpointing. I suggest adding this capability, since it allows training on smaller GPUs, and then making enhancements to it, or implementing another method like recompute_grad and replacing this functionality.
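For concreteness, a hedged sketch of the name-matching idea (the helper and its names are illustrative, not the PR's actual code; the excluded op names follow the list in the PR description):

```python
import tensorflow.compat.v1 as tf

def find_checkpoint_tensors(graph,
                            include=("Add",),
                            exclude=("L2Loss", "entropy", "FusedBatchNorm",
                                     "Switch", "dropout", "Cast")):
  """Collect tensors whose op names contain one of the `include` substrings."""
  tensors = []
  for op in graph.get_operations():
    if (any(s in op.name for s in include)
        and not any(s in op.name for s in exclude)):
      tensors.extend(op.outputs)
  return tensors

# Example usage in graph mode:
# checkpoints = find_checkpoint_tensors(tf.get_default_graph(), include=["Add"])
```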
@NikZak Sure, gradient checkpointing is worth working on, but I would also prefer what @fsx950223 suggested.
efficientdet/graph_editor/BUILD
Outdated
@@ -0,0 +1,162 @@
# Description:
# contains parts of TensorFlow that are experimental or unstable and which are not supported.
Do we need Bazel to build it?
@fsx950223 you don't need to build it for this gradient checkpointing, and I never tried. The graph editor is not fully tested as a standalone TF 2.0 library, but the functionality needed for gradient checkpointing works.
@NikZak also have you looked at this -> cybertronai/gradient-checkpointing#29
@kartik4949 this implementation of gradient checkpointing, together with the port of the graph editor, does work though.
Oh, I see!
@NikZak Fantastic work! Thanks a lot for adding this.
A high-level comment: since this CL is large, could you split it into 2 PRs:
PR1: just add graph_editor and memory_saving_gradients, without changing any existing files (This PR is large, but safe)
PR2: hook up memory_saving_gradients with existing files (this PR would be small, but with some risks)
Thank you!
@@ -0,0 +1,41 @@
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
This doesn't seem to be the right copyright. Could you use the same copyright as the existing files (Google Research)?
fixed
efficientdet/graph_editor/README.md
Outdated
The TensorFlow Graph Editor library allows for modification of an existing
tf.Graph instance in-place.

The author's github username is [purpledog](https://github.com/purpledog).
Where is the source code for this lib? What's the original copyright?
The source code is the one embedded in Tensorflow.contrib. https://github.com/tensorflow/tensorflow/tree/r1.15/tensorflow/contrib/graph_editor
In the __init__.py it says "Licensed under the Apache License, Version 2.0" (same as TensorFlow), which means we can copy and modify it.
Apache License 2.0: a permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.
Permissions: commercial use, modification, distribution, patent use, private use.
Limitations: trademark use, liability, warranty.
@NikZak @LaurensHagendoorn @fsx950223 @kartik4949 For CNNs, memory usage is mostly dominated by activations rather than parameters. EfficientDets use more memory for a few reasons:
It is great to see that gradient checkpointing can significantly reduce memory without increasing training time much! Well done, @NikZak
@mingxingtan makes sense, got it now.
@mingxingtan thanks for the comments! I will rectify this on Monday.
Hi NikZak, I have tried your method with clip_gradients_norm: 5.0. https://colab.research.google.com/drive/1wQBc2ukZ4gU9PryPOQKmUUuqyRPW7tJa?usp=sharing I find that many places in this EfficientDet official repo are written with the default COCO class count of 90 (100); if the number of classes is more than 100, it may cause problems, such as the problem in the eval module.
I needed to change the initialization of ap_perclasses: then it worked. So I don't know whether they have a similar cause (number of classes more than 100).
@williamhyin
I saw cls_loss = 10.0, det_loss = 10.02 for the first few hundred steps, which is the same as with gradient_checkpointing: False. So shall I wait longer? Does it start to converge earlier with gradient_checkpointing set to False? At what step does it start to converge with gradient_checkpointing: True, and at what step with gradient_checkpointing: False?
You should wait more than 9000 steps... With gradient_checkpointing: false, it starts converging from the beginning.
@williamhyin thanks. Same here: cls_loss = 10.0, det_loss = 10.03. Could you create two identical colabs whose only difference is the dfg.yaml file, with gradient_checkpointing set to true in one and false in the other? Then run them one by one and share the results. At the moment I do not see a difference from the beginning.
https://colab.research.google.com/drive/1wQBc2ukZ4gU9PryPOQKmUUuqyRPW7tJa?usp=sharing
Is this compatible with XLA? XLA gives massive speedups on efficientdet (2-3x) and I'd hate to lose that.
Hi LucasSloan, thank you for your great suggestion! It works. I have set use_xla to true. It returned a warning,
which made me think XLA was not enabled. Before:
After:
I'm confused about whether XLA is actually turned on, and if not, how I should set it. Thanks
@LucasSloan In short, XLA and gradient checkpointing work together like a charm and give you a bit of both worlds: reduced memory consumption and increased speed. Switching on XLA and gradient checkpointing together uses a little more memory than pure gradient checkpointing without XLA, but significantly less memory than pure XLA without gradient checkpointing. I will probably provide more detailed stats next week.
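For anyone unsure how to switch XLA on outside this repo's --use_xla flag, a hedged sketch of the standard TF 2.x switches (jit_compile requires TF 2.5+; older versions call it experimental_compile):

```python
import tensorflow as tf

# Enable XLA auto-clustering globally for the program.
tf.config.optimizer.set_jit(True)

# Or compile a single function with XLA explicitly.
@tf.function(jit_compile=True)
def scaled_sum(x):
  return tf.reduce_sum(x * 2.0)

print(scaled_sum(tf.ones([4, 4])))
```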
Looks great! Thanks @NikZak
efficientdet/README.md
Outdated
| D4(640) [h5](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco640/efficientdet-d4-640.h5), [ckpt](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco640/efficientdet-d4-640.tar.gz) | 45.7 | 21.7ms |
| D5(640) [h5](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco640/efficientdet-d5-640.h5), [ckpt](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco640/efficientdet-d5-640.tar.gz) | 46.6 | 26.6ms |
| D6(640) [h5](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco640/efficientdet-d6-640.h5), [ckpt](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco640/efficientdet-d6-640.tar.gz) | 47.9 | 33.8ms |
| D2(640) [ckpt](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco640/efficientdet-d2-640.tar.gz) | 41.7 | 14.8ms |
Are these changes intended?
If set to True, the strings defined by gradient_checkpointing_list (["Add"] by default) are searched for in the tensor names, and any tensors that match a string from the list are kept as checkpoints. When this option is used, the standard tensorflow.python.ops.gradients method is replaced with a custom method.

Testing shows that:
* On the d4 network with a batch size of 1 (mixed precision enabled) it uses only 1/3.2 of the memory, with roughly 32% slower computation
Nice document!
efficientdet/det_model_fn.py
Outdated
from third_party.grad_checkpoint \
    import memory_saving_gradients  # pylint: disable=g-import-not-at-top
from tensorflow.python.ops \
    import gradients  # pylint: disable=g-import-not-at-top
These imports can probably fit into a single line each (try to avoid the backslash continuations).
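Presumably the single-line form the reviewer has in mind is simply:

```python
from third_party.grad_checkpoint import memory_saving_gradients  # pylint: disable=g-import-not-at-top
from tensorflow.python.ops import gradients  # pylint: disable=g-import-not-at-top
```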
efficientdet/det_model_fn.py
Outdated
if params["nvgpu_logging"]:
  try:
    from third_party.tools import nvgpu  # pylint: disable=g-import-not-at-top
    from functools import reduce  # pylint: disable=g-import-not-at-top
just import functools, and use functools.reduce
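Concretely, the suggestion amounts to something like the sketch below (standalone for illustration; in det_model_fn.py the import would stay inside the guarded block shown in the diff):

```python
import functools

def get_nested_value(d, path):
  # Walk a nested dict by a list of keys, e.g.
  # get_nested_value({"gpu": {"memory": {"used": 123}}}, ["gpu", "memory", "used"]) -> 123
  return functools.reduce(dict.get, path, d)
```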
efficientdet/det_model_fn.py
Outdated
from functools import reduce  # pylint: disable=g-import-not-at-top

def get_nested_value(d, path):
  return reduce(dict.get, path, d)
Can we move most of the code to nvgpu, so this file can be clean? thanks.
For example: nvgpu_gpu_info and commonsize and formatter_log can be moved to nvgpu.
@@ -161,6 +161,8 @@ def as_dict(self):
      else:
        config_dict[k] = copy.deepcopy(v)
    return config_dict


  # pylint: enable=protected-access
Maybe you can move "# pylint: enable=protected-access" right after return (with same indent), to avoid too many empty lines.
@@ -281,6 +283,13 @@ def default_detection_configs():
  h.dataset_type = None
  h.positives_momentum = None

  # Reduces memory during training
  h.gradient_checkpointing = False
  h.gradient_checkpointing_list = ["Add"]
Could you add a comment to explain what values can be used other than "Add"?
Thanks for adding more details. Could you explain a little bit more: what's the impact of this list?
If I use ["Add"], does it mean it would automatically checkpoint all "Add" operation?
If I use ['Add', 'Sigmoid'], does it mean it would automatically checkpoint all 'Add' and 'Sigmoid' ops?
If so, what are the pros and cons of adding more ops, and why is the default 'Add'?
Sorry if these questions annoy you, but I am hoping to make it clear as this is a greatly useful feature. Thanks!
efficientdet/main.py
Outdated
@@ -117,6 +116,38 @@
    'run in a separate process for train and eval and memory will be cleared.'
    'Drawback: need to kill 2 processes if trainining needs to be interrupted.')

flags.DEFINE_bool(
    'gradient_checkpointing', False,
You don't need to define flags here since they are already in hparams.
Just some minor comment. Overall looks good. Thanks!
def commonsize(inp):
  """Convert all to MiB."""
How about a more informative name such as 'input_size'? Similarly, you can rename 'inp_' to 'converted_size' or 'output_size'.
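For illustration, one plausible shape of the renamed helper (a hedged sketch assuming nvidia-smi style size strings; the PR's actual implementation may differ):

```python
def commonsize(input_size):
  """Hedged sketch: convert a size string such as '512 KiB' or '2 GiB' to MiB."""
  units = {"KiB": 1.0 / 1024, "MiB": 1.0, "GiB": 1024.0}
  value, unit = input_size.split()
  converted_size = float(value) * units[unit]
  return converted_size
```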
* add option each_epoch_in_separate_process
* typos in description
* comments wording
* h.each_epoch_in_separate_process = True in default
* renamed option to run_epoch_in_child_process to avoid confusion
* flags.run_epoch_in_child_process also set to True in default
* h.run_epoch_in_child_process = True : don't need this config
* replaced lambda function with functools.partial to get read of pylint warning
* gradient checkpointing
* gradient checkpointing
* gradient checkpointing
* remove .ropeproject
* description enhancement
* description cleanup
* gradient checkpoint libraries
* deleted graph edtor and gradient checkpointing libraris from this branch
* log message
* remove BUILD
* added back to master
* logging
* graph_editor and gradient checkpointing libs
* deleted: graph_editor/BUILD
* readme
* readme
* Copyright of gradient checkpointing
* redo
* redo
* third_party linted
* README
* README
* merge conflict typo
* merge conflict typo
* renaming
* no log level reset
* no log level reset
* logging of step per epoch is no longer correct in the latest train_and_eval mode
* add a bit of verbosity to avoid frustration during graph rebuld
* readme
* readme
* less user discretion
* replaced third party nvgpu with intenal module
* replaced third party nvgpu with intenal module
* replaced third party nvgpu with intenal module
* comments added
* carve out toposort and include it here
* refactor toposort based on this repo reqs
* checkout third party
* minor typo
* cleanup
* cleanup, comments
@mingxingtan
@kartik4949
Closes #85, closes #368, closes #459, closes #737
Depends on #716
This is a slightly augmented version of the gradient checkpointing algorithm for the EfficientDet network. It also includes a port of the graph editor from tensorflow.contrib 1.15.
This is an experimental option. It helps to save GPU memory while training.
As input, you need to provide a list of strings that indicates which layers of the network should be kept as checkpoints.
When this option is used, the standard tensorflow.python.ops.gradients method is replaced with a custom method. The parameters you use are important, and this requires further optimization. It takes time to reassemble the computation graph with the new checkpoints, and this operation is not multi-threaded at the moment; this could be improved. The graph reassembly only happens once per epoch, at the beginning of the training epoch. Another possible improvement is caching the graph between epochs (which may not be straightforward, given that every epoch runs in a separate process).
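A hedged sketch of how that replacement is typically wired up (the import paths match this PR's diff; the monkey-patch pattern follows the cybertronai/gradient-checkpointing README linked earlier, and the exact hookup in this PR may differ):

```python
# Assumes the modules added in this PR are importable; illustrative only.
from third_party.grad_checkpoint import memory_saving_gradients
from tensorflow.python.ops import gradients


def _checkpointed_gradients(ys, xs, grad_ys=None, **kwargs):
  # Delegate to the memory-saving implementation. An explicit list of
  # checkpoint tensors (e.g. those whose names contain "Add") could also be
  # passed through here.
  return memory_saving_gradients.gradients(ys, xs, grad_ys, **kwargs)


# Swap out the standard gradients function so optimizer.compute_gradients()
# transparently picks up the memory-saving version.
gradients.__dict__["gradients"] = _checkpointed_gradients
```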
You have to provide a list of strings. These strings are searched for in the tensor names, and only the tensors that match are kept as checkpoints.
['L2Loss', 'entropy', 'FusedBatchNorm', 'Switch', 'dropout', 'Cast'] layers are always removed.
gradient_checkpointing_list: ["Add"] (keep only tensors with Add in the name) is an option that has been tested and works reasonably well.
There were also some logging improvements added, in particular memory logging for Nvidia GPUs (disabled by default). At the moment this covers a single GPU only, as I don't have a multi-GPU machine to test on.
I suggest adding this option as it does not break the main process flow. The memory improvement on GPU is very substantial and could be further optimized, either by choosing the right parameters or by changing the algorithm.