Gradient checkpointing libraries #716
Conversation
efficientdet/graph_editor/BUILD (Outdated)
@@ -0,0 +1,162 @@
# Description:
#   Contains parts of TensorFlow that are experimental or unstable and which are not supported.
Maybe we could remove BUILD files?
Sure, removed
My biggest advice is to avoid importing TensorFlow internal modules, which would make the code much easier to read and maintain. (Internally, importing TensorFlow internal modules is prohibited except in very special cases.)
I realized the original graph_edit is a package of tf.contrib, which is an internal TF library, so it might be fine in this case. I tried to run the tests under third_party/graph_edit, but they all failed. Could you help fix those tests? Thanks!
The tests are running now. There were two special cases, testing tf.cond and tf.while_loop in a copy of the graph, that did not pass. They are not needed for the purposes of efficientdet. Please run the tests as described in the FAQ to avoid import errors; if you want to run them with pytest, make sure the efficientdet folder is on PYTHONPATH.
@NikZak
It's stable. OK, got it.
import time
import sys
from toposort import toposort
import numpy as np
Is it possible to replace this toposort with a function? (We don't have this library internally).
Ok. Done
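For reference, a dependency-free replacement for the toposort package's layered interface might look like the sketch below (Kahn-style batching; this is an illustration of the approach, not the exact code merged in this PR):

```python
def toposort(deps):
    """Yield sets of items whose dependencies have all been emitted.

    `deps` maps each item to the set of items it depends on, e.g.
    {2: {1}, 3: {1, 2}} yields {1}, then {2}, then {3}.
    """
    deps = {k: set(v) for k, v in deps.items()}
    # Ensure every item mentioned as a dependency has its own entry.
    missing = {d for v in deps.values() for d in v} - set(deps)
    deps.update({d: set() for d in missing})
    while deps:
        ready = {k for k, v in deps.items() if not v}
        if not ready:
            raise ValueError("cyclic dependency detected")
        yield ready
        deps = {k: v - ready for k, v in deps.items() if k not in ready}
```

Batching "ready" nodes into sets (rather than yielding one node at a time) preserves the grouping that the original library exposes.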
Why are we vendoring these dependencies? It looks like the answer is that there were a bunch of tweaks to the underlying libraries?
@LucasSloan Main changes:
bwd_inputs = [t for op in bwd_ops for t in op.inputs]
# list of tensors in the forward graph that are inputs to the bwd graph
ts_filtered = list(set(bwd_inputs).intersection(ts_all))
debug_print("Using tensors %s", ts_filtered)
What's the difference between debug_print and logging.debug?
Maybe we should use absl.logging.
Could you add some test cases for memory_saving_gradients? Thanks.
Hi @fsx950223
1) Regarding debug_print, you can refer to the function's docstring:
"""Like logger.log, but also replaces all TensorFlow ops/tensors with their
names. Sensitive to value of DEBUG_LOGGING, see enable_debug/disable_debug
Usage:
debug_print("see tensors %s for %s", tensorlist, [1,2,3])
"""
If you want to log the tensors themselves, debug_print is a good option (though it can be quite wordy). If no tensors are involved, standard logging.debug will do.
2) Thanks for the suggestion. What does that library do? I do not want to add a dependency on another third-party library, since Google's internal rules might then require vendoring it into this repo as well, followed by more refactoring, and so on. It could be a never-ending cycle.
3) Regarding testing: more test coverage is always good. I will probably add some tests at a later stage (perhaps as part of a larger integration test for this repo, since gradient checkpointing is a major switch), and I may also add some optimizations to the algorithm. However, this pull request is stable and on its way to being merged, so unless there is a bug I would rather not make further changes.
We always use absl.logging in the repo.
Thanks. As it is not a bug, I will rectify this after merging this pull request.
@NikZak @fsx950223 @LucasSloan
Fix some format issues after #716
* add option each_epoch_in_separate_process
* typos in description
* comments wording
* h.each_epoch_in_separate_process = True in default
* renamed option to run_epoch_in_child_process to avoid confusion
* flags.run_epoch_in_child_process also set to True in default
* h.run_epoch_in_child_process = True : don't need this config
* replaced lambda function with functools.partial to get rid of pylint warning
* gradient checkpointing
* gradient checkpointing
* gradient checkpointing
* remove .ropeproject
* description enhancement
* description cleanup
* gradient checkpoint libraries
* deleted graph editor and gradient checkpointing libraries from this branch
* log message
* remove BUILD
* added back to master
* logging
* graph_editor and gradient checkpointing libs
* deleted: graph_editor/BUILD
* readme
* readme
* Copyright of gradient checkpointing
* removed files again
* redo (repeated six times)
* third_party linted
* README restore
* renaming
* no log level reset
* merge with upstream
* tests rectified
* add a bit of verbosity to avoid frustration during graph rebuild
* logging added
* replaced third-party nvgpu with internal module
* comments added
* carve out toposort and include it here
* refactor toposort based on this repo's requirements
These are the libraries for #711.
This PR does not change any existing files.
It contains graph_editor and memory_saving_gradients.py, which are used in #711.
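For context on what memory_saving_gradients does: gradient checkpointing trades compute for memory by storing only some forward activations and recomputing the rest during the backward pass. The idea can be illustrated on a toy chain of scalar functions (purely illustrative; the library itself performs this via TF graph surgery, not like this):

```python
def chain_grad_checkpointed(fns, dfns, x, every=2):
    """Backprop through a chain of scalar functions, storing only every
    `every`-th activation and recomputing the rest on the backward pass.
    fns[i] maps activation i to activation i+1; dfns[i] is its derivative.
    """
    # Forward pass: keep only the checkpointed activations.
    checkpoints = {0: x}
    a = x
    for i, f in enumerate(fns):
        a = f(a)
        if (i + 1) % every == 0:
            checkpoints[i + 1] = a
    out = a
    # Backward pass: recompute each needed activation from the
    # nearest earlier checkpoint, then apply the chain rule.
    grad = 1.0
    for i in reversed(range(len(fns))):
        start = (i // every) * every
        a = checkpoints[start]
        for j in range(start, i):
            a = fns[j](a)  # recompute activation i from the checkpoint
        grad *= dfns[i](a)
    return out, grad
```

With `every=2` this stores roughly half the activations; larger strides save more memory at the cost of more recomputation, which is the same trade-off the vendored library exposes for TF graphs.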