
Gradient checkpointing libraries #716

Merged
merged 55 commits into google:master from gradient_checkpoint_libs on Sep 14, 2020

Conversation

NikZak
Collaborator

@NikZak NikZak commented Aug 31, 2020

These are the libraries for #711.

This PR does not change any existing files.

It contains graph_editor and memory_saving_gradients.py, which are used in #711.

@google-cla google-cla bot added the cla: yes CLA has been signed. label Aug 31, 2020
@NikZak NikZak changed the title gradient checkpoint libraries Gradient Checkpointing libraries Aug 31, 2020
@NikZak NikZak mentioned this pull request Aug 31, 2020
@@ -0,0 +1,162 @@
# Description:
# contains parts of TensorFlow that are experimental or unstable and which are not supported.

Collaborator

Maybe we could remove BUILD files?

Collaborator Author

Sure, removed

@NikZak NikZak changed the title Gradient Checkpointing libraries Gradient checkpointing libraries Aug 31, 2020
Member

@mingxingtan mingxingtan left a comment


My biggest advice is to avoid importing TensorFlow internal modules, which would make the code much easier to read and maintain.

(Internally, importing TensorFlow internal modules is prohibited except in very special cases.)
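
To make the advice concrete, here is a hedged illustration of the pattern being discouraged; the module paths are examples, not necessarily the imports used in this PR:

# Discouraged: reaching into TensorFlow's private implementation modules,
# which carry no stability guarantees across releases.
# from tensorflow.python.framework import ops as tf_ops
# graph = tf_ops.get_default_graph()

# Preferred: stay on the public API surface.
import tensorflow.compat.v1 as tf

graph = tf.get_default_graph()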

@mingxingtan
Member

mingxingtan commented Sep 7, 2020

I realized the original graph_edit is a package in tf.contrib, which is an internal TF library, so it might be fine in this case.

I tried to run the tests under third_party/graph_edit, but they all failed. Could you help fix those tests? Thanks!

@NikZak
Collaborator Author

NikZak commented Sep 8, 2020

The tests are running now. There were two special cases, testing tf.cond and tf.while_loop in a copy of the graph, that did not pass; these are not needed for the purposes of EfficientDet.

Please run the tests as described in the FAQ to avoid import errors:
https://github.com/google/automl/blob/eb74c6739382e9444817d2ad97c4582dbe9a9020/efficientdet/g3doc/faq.md#22-how-can-i-run-all-tests

If you want to run with pytest, make sure the efficientdet folder is on PYTHONPATH (a sketch follows).
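
For illustration, one hedged way to satisfy that when invoking pytest programmatically; the paths below are assumptions about a typical checkout layout, not commands taken from the repo:

import os
import sys

# Make the efficientdet package importable before tests are collected
# (assumes the working directory is the automl repository root).
sys.path.insert(0, os.path.abspath("efficientdet"))

import pytest

# Hypothetical test target: the graph_edit tests mentioned above.
sys.exit(pytest.main(["efficientdet/third_party/graph_edit"]))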

@mingxingtan
Member

@NikZak
Is this PR stable now? If so, I will try to import it, but you need to freeze this PR first (don't submit new commits to this branch).

@NikZak
Collaborator Author

NikZak commented Sep 11, 2020

It's stable. OK, got it.

import time
import sys
from toposort import toposort
import numpy as np
Member

Is it possible to replace this toposort with a function? (We don't have this library internally).

Collaborator Author

OK, done.
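
For context, a minimal sketch of the kind of standalone replacement being requested, assuming the same input convention as the pypi toposort library (a dict mapping each node to the set of nodes it depends on); this is an illustration, not necessarily the exact function that landed in the PR:

def toposort(deps):
  """Yield sets of nodes in dependency order (levels of a topological sort).

  deps: dict mapping each node to the set of nodes it depends on,
  mirroring the input format of the pypi toposort library.
  """
  deps = {node: set(d) for node, d in deps.items()}
  # Include nodes that appear only as dependencies of other nodes.
  extra = set().union(*deps.values()) - set(deps) if deps else set()
  deps.update({node: set() for node in extra})
  while deps:
    # Nodes with no remaining dependencies form the next level.
    ready = {node for node, d in deps.items() if not d}
    if not ready:
      raise ValueError("Cyclic dependency among: %r" % list(deps))
    yield ready
    deps = {node: d - ready for node, d in deps.items() if node not in ready}

For example, list(toposort({2: {1}, 3: {1, 2}})) yields [{1}, {2}, {3}].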

@NikZak NikZak marked this pull request as draft September 11, 2020 05:42
@NikZak NikZak marked this pull request as ready for review September 11, 2020 05:52
@LucasSloan
Collaborator

Why are we vendoring these dependencies? It looks like the answer is that there were a bunch of tweaks to the underlying libraries?

@NikZak
Collaborator Author

NikZak commented Sep 11, 2020

@LucasSloan
None of these libraries originally work with tf 2.x.

Main changes:

  1. Graph editing: minimal changes to the syntax, refactoring, and some tweaks to the tests.
  2. Memory gradients: the algorithm for picking the gradient-checkpointing nodes is substantially changed; the algorithm for gluing the graph back together is left as is. Refactored. (A usage sketch follows this list.)
  3. Nvgpu: a completely independent algorithm that shares its name with a pypi repo.
  4. Toposort: no changes, just refactoring.
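
To orient readers, a hedged sketch of how a drop-in replacement for tf.gradients like memory_saving_gradients is typically called; the signature and the 'memory' checkpoint mode are assumptions carried over from the upstream OpenAI library, not a confirmed description of this PR's API:

import tensorflow.compat.v1 as tf

import memory_saving_gradients  # the vendored module from this PR

tf.disable_eager_execution()

# A tiny stand-in model: one linear layer and an L2 loss.
x = tf.placeholder(tf.float32, [None, 4])
w = tf.get_variable("w", [4, 1])
loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# Called in place of tf.gradients; 'memory' asks the library to pick
# checkpoint nodes automatically (mode name assumed from upstream).
grads = memory_saving_gradients.gradients(loss, [w], checkpoints='memory')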

bwd_inputs = [t for op in bwd_ops for t in op.inputs]
# Tensors in the forward graph that are inputs to the backward graph.
ts_filtered = list(set(bwd_inputs).intersection(ts_all))
debug_print("Using tensors %s", ts_filtered)
Collaborator

What's the difference between debug_print and logging.debug?
Maybe we should use absl.logging.
Could you add some test cases for memory_saving_gradients? Thanks.

Collaborator Author

@NikZak NikZak Sep 12, 2020

Hi @fsx950223

  1. Regarding debug_print, you can refer to the function's docstring (a sketch of such a helper follows this comment):
  """Like logger.log, but also replaces all TensorFlow ops/tensors with their
  names. Sensitive to value of DEBUG_LOGGING, see enable_debug/disable_debug
  Usage:
    debug_print("see tensors %s for %s", tensorlist, [1,2,3])
  """

  If you want to log the tensors themselves, debug_print is a good option (though it may be quite wordy); if there are no tensors, standard logging.debug will do.
  2. Thanks for the suggestion. What does this library do? I do not want to add a dependency on any other third-party library, as Google's internal rules may then require vendoring it into this repo, which would mean more refactoring, and so on; it could be a never-ending cycle.
  3. Regarding testing: more test coverage is always good. I will probably add some tests at a later stage (maybe as part of a bigger integration test for this repo, since gradient checkpointing is a major switch), and I will probably also add some optimizations to the algorithm. However, this pull request is stable and on its way to being merged, so unless there is a bug I don't want to make any changes.
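
For reference, a minimal sketch of a helper matching that docstring, assuming simple name substitution and a module-level DEBUG_LOGGING flag; it illustrates the idea rather than reproducing the PR's exact code:

import logging

DEBUG_LOGGING = False  # toggled by enable_debug()/disable_debug() in the PR

def _format_ops(arg):
  """Replace TF ops/tensors (anything with a .name attribute) with their names."""
  if hasattr(arg, "name"):
    return arg.name
  if isinstance(arg, (list, tuple, set)):
    return [_format_ops(a) for a in arg]
  return arg

def debug_print(message, *args):
  """Like logger.log, but replaces TF ops/tensors in args with their names."""
  if DEBUG_LOGGING:
    logging.info(message, *[_format_ops(a) for a in args])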

Collaborator

@fsx950223 fsx950223 Sep 12, 2020

We always use absl.logging in the repo.
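
For reference, the repo-style equivalent of the call in the diff above would look like this (ts_filtered is stubbed with example names):

from absl import logging

ts_filtered = ["conv1/Relu:0", "conv2/Relu:0"]  # example tensor names
logging.debug("Using tensors %s", ts_filtered)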

Collaborator Author

Thanks. As it is not a bug, I will rectify this after this pull request is merged.

@mingxingtan
Member

@NikZak @fsx950223 @LucasSloan
I am merging this PR now, followed by another PR to fix some format issues.

@mingxingtan mingxingtan merged commit 3dab05e into google:master Sep 14, 2020
mingxingtan added a commit that referenced this pull request Sep 14, 2020
Fix some format issues after #716
@NikZak NikZak deleted the gradient_checkpoint_libs branch September 20, 2020 15:37
glenvorel pushed a commit to glenvorel/automl that referenced this pull request Apr 14, 2021
* add option each_epoch_in_separate_process

* typos in description

* comments wording

* h.each_epoch_in_separate_process = True by default

* renamed option to run_epoch_in_child_process to avoid confusion

* flags.run_epoch_in_child_process also set to True by default

* h.run_epoch_in_child_process = True : don't need this config

* replaced lambda function with functools.partial to get rid of pylint warning

* gradient checkpointing

* gradient checkpointing

* gradient checkpointing

* remove .ropeproject

* description enhancement

* description cleanup

* gradient checkpoint libraries

* deleted graph editor and gradient checkpointing libraries from this branch

* log message

* remove BUILD

* added back to master

* logging

* graph_editor and gradient checkpointing libs

* deleted:    graph_editor/BUILD

* readme

* readme

* Copyright of gradient checkpointing

* removed files again

* redo

* redo

* redo

* redo

* redo

* redo

* third_party linted

* README restore

* renaming

* no log level reset

* merge with upstream

* tests rectified

* add a bit of verbosity to avoid frustration during graph rebuild

* logging added

* replaced third party nvgpu with internal module

* comments added

* carve out toposort and include it here

* refactor toposort based on this repo reqs
glenvorel pushed a commit to glenvorel/automl that referenced this pull request Apr 14, 2021