
Gradient checkpointing libraries #716

Merged
merged 55 commits into google:master from gradient_checkpoint_libs on Sep 14, 2020

Conversation

NikZak
Collaborator

@NikZak NikZak commented Aug 31, 2020

These are the libraries for #711.

This PR does not change any existing files.

It contains graph_editor and memory_saving_gradients.py, which are used in #711.

@google-cla google-cla bot added the cla: yes CLA has been signed. label Aug 31, 2020
@NikZak NikZak changed the title gradient checkpoint libraries Gradient Checkpointing libraries Aug 31, 2020
@NikZak NikZak mentioned this pull request Aug 31, 2020
@@ -0,0 +1,162 @@
# Description:
# contains parts of TensorFlow that are experimental or unstable and which are not supported.

Collaborator

Maybe we could remove BUILD files?

Collaborator Author

Sure, removed

@NikZak NikZak changed the title Gradient Checkpointing libraries Gradient checkpointing libraries Aug 31, 2020
Member

@mingxingtan mingxingtan left a comment


My biggest advice is to avoid importing TensorFlow internal modules, which would make the code much easier to read and maintain.

(Internally, importing TensorFlow internal modules is prohibited except in very special cases.)
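
To make the advice concrete, here is a hedged illustration of the pattern being discouraged; the module paths are examples, not necessarily the imports used in this PR:

# Discouraged: reaching into TensorFlow's private implementation modules,
# which carry no stability guarantees across releases.
# from tensorflow.python.framework import ops as tf_ops
# graph = tf_ops.get_default_graph()

# Preferred: stay on the public API surface.
import tensorflow.compat.v1 as tf

graph = tf.get_default_graph()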

@mingxingtan
Member

mingxingtan commented Sep 7, 2020

I realized the original graph_edit is a package in tf.contrib, which is an internal TF library, so it might be fine in this case.

I tried to run the tests under third_party/graph_edit, but they all failed. Could you help fix those tests? Thanks!

@NikZak
Collaborator Author

NikZak commented Sep 8, 2020

The tests are running now. There were two special cases, testing tf.cond and tf.while_loop in a copy of the graph, that did not pass; these are not needed for the purposes of EfficientDet.

Please run the tests as described in the FAQ to avoid import errors:
https://github.com/google/automl/blob/eb74c6739382e9444817d2ad97c4582dbe9a9020/efficientdet/g3doc/faq.md#22-how-can-i-run-all-tests

If you want to run with pytest, make sure the efficientdet folder is on PYTHONPATH (a sketch follows).
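
For illustration, one hedged way to satisfy that when invoking pytest programmatically; the paths below are assumptions about a typical checkout layout, not commands taken from the repo:

import os
import sys

# Make the efficientdet package importable before tests are collected
# (assumes the working directory is the automl repository root).
sys.path.insert(0, os.path.abspath("efficientdet"))

import pytest

# Hypothetical test target: the graph_edit tests mentioned above.
sys.exit(pytest.main(["efficientdet/third_party/graph_edit"]))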

@mingxingtan
Member

@NikZak
Is this PR stable now? If so, I will try to import it, but you need to freeze this PR first (don't submit new commits to this branch).

@NikZak
Collaborator Author

NikZak commented Sep 11, 2020

It's stable. OK, got it.

import time
import sys
from toposort import toposort
import numpy as np
Member

Is it possible to replace this toposort with a function? (We don't have this library internally).

Collaborator Author

OK, done.
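
For context, a minimal sketch of the kind of standalone replacement being requested, assuming the same input convention as the pypi toposort library (a dict mapping each node to the set of nodes it depends on); this is an illustration, not necessarily the exact function that landed in the PR:

def toposort(deps):
  """Yield sets of nodes in dependency order (levels of a topological sort).

  deps: dict mapping each node to the set of nodes it depends on,
  mirroring the input format of the pypi toposort library.
  """
  deps = {node: set(d) for node, d in deps.items()}
  # Include nodes that appear only as dependencies of other nodes.
  extra = set().union(*deps.values()) - set(deps) if deps else set()
  deps.update({node: set() for node in extra})
  while deps:
    # Nodes with no remaining dependencies form the next level.
    ready = {node for node, d in deps.items() if not d}
    if not ready:
      raise ValueError("Cyclic dependency among: %r" % list(deps))
    yield ready
    deps = {node: d - ready for node, d in deps.items() if node not in ready}

For example, list(toposort({2: {1}, 3: {1, 2}})) yields [{1}, {2}, {3}].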

@NikZak NikZak marked this pull request as draft September 11, 2020 05:42
@NikZak NikZak marked this pull request as ready for review September 11, 2020 05:52
@LucasSloan
Collaborator

Why are we vendoring these dependencies? It looks like the answer is that there were a bunch of tweaks to the underlying libraries?

@NikZak
Collaborator Author

NikZak commented Sep 11, 2020

@LucasSloan
None of these libraries originally work with tf 2.x.

Main changes:

  1. Graph editing: minimal changes to the syntax, refactoring, and some tweaks to the tests.
  2. Memory gradients: the algorithm for picking the gradient-checkpointing nodes is substantially changed; the algorithm for gluing the graph back together is left as is. Refactored. (A usage sketch follows this list.)
  3. Nvgpu: a completely independent algorithm that shares its name with a pypi repo.
  4. Toposort: no changes, just refactoring.
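
To orient readers, a hedged sketch of how a drop-in replacement for tf.gradients like memory_saving_gradients is typically called; the signature and the 'memory' checkpoint mode are assumptions carried over from the upstream OpenAI library, not a confirmed description of this PR's API:

import tensorflow.compat.v1 as tf

import memory_saving_gradients  # the vendored module from this PR

tf.disable_eager_execution()

# A tiny stand-in model: one linear layer and an L2 loss.
x = tf.placeholder(tf.float32, [None, 4])
w = tf.get_variable("w", [4, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# Called in place of tf.gradients; 'memory' asks the library to pick
# checkpoint nodes automatically (mode name assumed from upstream).
grads = memory_saving_gradients.gradients(loss, [w], checkpoints='memory')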

bwd_inputs = [t for op in bwd_ops for t in op.inputs]
# Tensors in the forward graph that are inputs to the backward graph.
ts_filtered = list(set(bwd_inputs).intersection(ts_all))
debug_print("Using tensors %s", ts_filtered)
Collaborator

What's the difference between debug_print and logging.debug?
Maybe we should use absl.logging.
Could you add some test cases for memory_saving_gradients? Thanks.

Collaborator Author

@NikZak NikZak Sep 12, 2020

Hi @fsx950223

  1. Regarding debug_print, you can refer to the function's docstring (a sketch of such a helper follows this comment):
  """Like logger.log, but also replaces all TensorFlow ops/tensors with their
  names. Sensitive to value of DEBUG_LOGGING, see enable_debug/disable_debug
  Usage:
    debug_print("see tensors %s for %s", tensorlist, [1,2,3])
  """

  If you want to log the tensors themselves, debug_print is a good option (though it may be quite wordy); if there are no tensors, standard logging.debug will do.
  2. Thanks for the suggestion. What does this library do? I do not want to add a dependency on any other third-party library, as Google's internal rules may then require vendoring it into this repo, which would mean more refactoring, and so on; it could be a never-ending cycle.
  3. Regarding testing: more test coverage is always good. I will probably add some tests at a later stage (maybe as part of a bigger integration test for this repo, since gradient checkpointing is a major switch), and I will probably also add some optimizations to the algorithm. However, this pull request is stable and on its way to being merged, so unless there is a bug I don't want to make any changes.
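
For reference, a minimal sketch of a helper matching that docstring, assuming simple name substitution and a module-level DEBUG_LOGGING flag; it illustrates the idea rather than reproducing the PR's exact code:

import logging

DEBUG_LOGGING = False  # toggled by enable_debug()/disable_debug() in the PR

def _format_ops(arg):
  """Replace TF ops/tensors (anything with a .name attribute) with their names."""
  if hasattr(arg, "name"):
    return arg.name
  if isinstance(arg, (list, tuple, set)):
    return [_format_ops(a) for a in arg]
  return arg

def debug_print(message, *args):
  """Like logger.log, but replaces TF ops/tensors in args with their names."""
  if DEBUG_LOGGING:
    logging.info(message, *[_format_ops(a) for a in args])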

Collaborator

@fsx950223 fsx950223 Sep 12, 2020

We always use absl.logging in the repo.
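
For reference, the repo-style equivalent of the call in the diff above would look like this (ts_filtered is stubbed with example names):

from absl import logging

ts_filtered = ["conv1/Relu:0", "conv2/Relu:0"]  # example tensor names
logging.debug("Using tensors %s", ts_filtered)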

Collaborator Author

Thanks. As it is not a bug, I will rectify this after this pull request is merged.

@mingxingtan
Member

@NikZak @fsx950223 @LucasSloan
I am merging this PR now, followed by another PR to fix some format issues.

@mingxingtan mingxingtan merged commit 3dab05e into google:master Sep 14, 2020
mingxingtan added a commit that referenced this pull request Sep 14, 2020
Fix some format issues after #716
@NikZak NikZak deleted the gradient_checkpoint_libs branch September 20, 2020 15:37
glenvorel pushed a commit to glenvorel/automl that referenced this pull request Apr 14, 2021
* add option each_epoch_in_separate_process

* typos in description

* comments wording

* h.each_epoch_in_separate_process = True by default

* renamed option to run_epoch_in_child_process to avoid confusion

* flags.run_epoch_in_child_process also set to True by default

* h.run_epoch_in_child_process = True : don't need this config

* replaced lambda function with functools.partial to get rid of pylint warning

* gradient checkpointing

* gradient checkpointing

* gradient checkpointing

* remove .ropeproject

* description enhancement

* description cleanup

* gradient checkpoint libraries

* deleted graph editor and gradient checkpointing libraries from this branch

* log message

* remove BUILD

* added back to master

* logging

* graph_editor and gradient checkpointing libs

* deleted:    graph_editor/BUILD

* readme

* readme

* Copyright of gradient checkpointing

* removed files again

* redo

* redo

* redo

* redo

* redo

* redo

* third_party linted

* README restore

* renaming

* no log level reset

* merge with upstream

* tests rectified

* add a bit of verbosity to avoid frustration during graph rebuild

* logging added

* replaced third party nvgpu with internal module

* comments added

* carve out toposort and include it here

* refactor toposort based on this repo reqs
glenvorel pushed a commit to glenvorel/automl that referenced this pull request Apr 14, 2021