

Gradient checkpointing #711

Merged: 78 commits merged into google:master on Sep 21, 2020

Conversation

NikZak
Collaborator

@NikZak NikZak commented Aug 28, 2020

@mingxingtan
@kartik4949

Closes #85, closes #368, closes #459, closes #737
Depends on #716

This is a slightly augmented version of the gradient checkpointing algorithm for the EfficientDet network. It also includes a port of the graph editor from tensorflow.contrib 1.15.

This is an experimental option. It helps to save GPU memory while training.

As input, you need to provide a list of strings indicating which layers of the network to use as checkpoints.

When this option is used, the standard tensorflow.python.ops.gradients method is replaced with a custom method. The parameters you choose matter, and this requires further optimization. Reassembling the computation graph with new checkpoints takes time, and the operation is not multi-threaded at the moment, which could be improved. The graph reassembly only happens once per epoch, at the beginning of the training epoch. Another possible improvement is caching the graph between epochs (which may not be straightforward, given that every epoch runs in a separate process).
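For reference, such a swap is typically done along the lines of the classic memory_saving_gradients monkey-patch; the sketch below is illustrative only (the helper name patch_tf_gradients and the way checkpoint tensors are passed in are my assumptions, not this PR's exact code):

from tensorflow.python.ops import gradients as tf_gradients_lib
from third_party.grad_checkpoint import memory_saving_gradients

def patch_tf_gradients(checkpoint_tensors):
  """Route tf.gradients calls through the memory-saving implementation."""
  def _grads(ys, xs, grad_ys=None, **kwargs):
    # `checkpoints` follows the original cybertronai signature (a list of
    # tensors to keep, or a strategy string); assumed unchanged in this port.
    return memory_saving_gradients.gradients(
        ys, xs, grad_ys, checkpoints=checkpoint_tensors, **kwargs)
  # Training code that calls tensorflow.python.ops.gradients then picks up
  # the replacement without further changes.
  tf_gradients_lib.__dict__["gradients"] = _grads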

You provide a list of strings. These strings are searched for in the tensor names, and only tensors whose names match are kept as checkpoints.

Tensors whose names contain any of ['L2Loss', 'entropy', 'FusedBatchNorm', 'Switch', 'dropout', 'Cast'] are always removed.

gradient_checkpointing: ["Add"] (keep only tensors with "Add" in the name) is an option that has been tested and works reasonably well (a sketch of the name matching follows the list below):

  1. For a d4 network with a batch size of 1 (mixed precision enabled), it uses only 1/3.2 of the memory, at the cost of roughly 32% slower computation.

  2. It also allows training a d6 network with a batch size of 2 (mixed precision enabled) on an 11 GB GPU, which is impossible without this option.
    

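As referenced above, a minimal sketch of how checkpoint tensors could be selected by name (illustrative only; the function and list names below are not this PR's actual code):

import tensorflow.compat.v1 as tf

ALWAYS_REMOVED = ['L2Loss', 'entropy', 'FusedBatchNorm', 'Switch', 'dropout', 'Cast']

def select_checkpoints(graph, include_patterns):
  """Return output tensors of ops whose names match an include pattern."""
  checkpoints = []
  for op in graph.get_operations():
    if any(p in op.name for p in ALWAYS_REMOVED):
      continue
    if any(p in op.name for p in include_patterns):
      checkpoints.extend(op.outputs)
  return checkpoints

# Keep only 'Add' nodes, mirroring gradient_checkpointing: ["Add"].
checkpoints = select_checkpoints(tf.get_default_graph(), ['Add'])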
Some logging improvements were also added, in particular memory logging for NVIDIA GPUs (disabled by default; single GPU only for now, as I don't have a multi-GPU machine to test multiple GPUs).

I suggest adding this option, as it does not break the main process flow. The GPU memory improvement is very substantial and could be further optimized, either by providing the right parameters or by changing the algorithm.

@google-cla google-cla bot added the cla: yes CLA has been signed. label Aug 28, 2020
@NikZak NikZak changed the title Gradient checkpoint Gradient checkpointing Aug 28, 2020
@ghost

ghost commented Aug 28, 2020

I mean, this could help,
But according to the paper, Retinanet has 34M parameters, and I can train that network just fine on my 1080, however, with all the same settings, even a -D3 with 12M parameters will not fit? Doesn't that indicate that something is going wrong somewhere?

@kartik4949
Collaborator

kartik4949 commented Aug 28, 2020

@NikZak Hi,
good work!
Have you tried wrapping sub-modules of the network in tf.recompute_grad()?
If that didn't work, you can split the network into parts that each comprise a big chunk of computation
and wrap them in tf.recompute_grad():

split_sub_model1 = tf.recompute_grad(split_sub_model1)
split_sub_model2 = tf.recompute_grad(split_sub_model2)
model = tf.keras.Sequential([split_sub_model1, split_sub_model2])

The above is a naive example of recompute_grad; you might need to split differently (see the runnable sketch below).
Can you try gradient checkpointing with this?
Thanks
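For completeness, a runnable sketch of that pattern (the layer sizes and data below are made-up placeholders; whether tf.recompute_grad actually saves memory in this setting is exactly what is discussed further down):

import tensorflow as tf

block1 = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                              tf.keras.layers.Dense(256, activation="relu")])
block2 = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu"),
                              tf.keras.layers.Dense(1)])
block1.build((None, 64))
block2.build((None, 256))

# The wrapped callables recompute their activations during the backward pass
# instead of storing them.
block1_ckpt = tf.recompute_grad(block1)
block2_ckpt = tf.recompute_grad(block2)

x = tf.random.normal([8, 64])
y = tf.random.normal([8, 1])
optimizer = tf.keras.optimizers.SGD(0.01)

with tf.GradientTape() as tape:
  loss = tf.reduce_mean(tf.square(block2_ckpt(block1_ckpt(x)) - y))

variables = block1.trainable_variables + block2.trainable_variables
optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))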

@fsx950223
Collaborator

Awesome! But I have the same issue as @LaurensHagendoorn: I would prefer to fix the network itself.
I also wonder why TensorFlow's memory optimizer doesn't work; according to the documentation, it has similar behavior.

@NikZak
Collaborator Author

NikZak commented Aug 28, 2020

@kartik4949 In principle, my algorithm does the same thing, but instead of pointing out the parts explicitly, I looked at the EfficientNet graph, decided that 'Add' would be a good node at which to split it, and added functionality to split the graph by node name.

I did not try recompute_grad because of this thread (tensorflow/tensorflow#36981) and some other threads mentioning that recompute_grad does not help in reimplementing gradient checkpointing.

I suggest adding this capability, since it allows training on smaller GPUs, and then either enhancing it or replacing it later with another method such as recompute_grad.

@kartik4949
Collaborator

@NikZak Sure, gradient checkpointing is worth working on, but I would also prefer what @fsx950223 suggested,
i.e. to work on fixing the network. In the meantime we can add gradient checkpointing if it really helps with memory savings.

@@ -0,0 +1,162 @@
# Description:
# contains parts of TensorFlow that are experimental or unstable and which are not supported.
Collaborator

Do we need Bazel to build it?

Collaborator Author

@fsx950223 You don't need to build it for this gradient checkpointing, and I never tried. The graph editor is not fully tested as a standalone TF 2.0 library, but the functionality needed for gradient checkpointing works.

@kartik4949
Collaborator

kartik4949 commented Aug 28, 2020

@NikZak Also, have you looked at this -> cybertronai/gradient-checkpointing#29?
It seems like it doesn't work above TF 1.15;
looking at that thread, they started using recompute_grad instead of memory-saving gradients, at least for TF 2.x.

@NikZak
Collaborator Author

NikZak commented Aug 28, 2020

@kartik4949 This implementation of gradient checkpointing, together with the port of the graph editor, does work though.

@kartik4949
Collaborator

@kartik4949 This implementation of gradient checkpointing, together with the port of the graph editor, does work though.

Oh, I see!

Member

@mingxingtan mingxingtan left a comment

@NikZak Fantastic work! Thanks a lot for adding this.

A high-level comment: since this CL is large, could you split it into 2 PRs:

PR1: just add graph_editor and memory_saving_gradients, without changing any existing files (This PR is large, but safe)
PR2: hook up memory_saving_gradients with existing files (this PR would be small, but with some risks)

Thank you!

@@ -0,0 +1,41 @@
# Copyright 2015 The TensorFlow Authors. All Rights Reserved.
#
Member

This doesn't seem to be the right copyright. Could you use the same copyright as the existing files (Google Research)?

Collaborator Author

fixed

The TensorFlow Graph Editor library allows for modification of an existing
tf.Graph instance in-place.

The author's github username is [purpledog](https://github.com/purpledog).
Member

Where is the source code for this lib? What's the original copyright?

Collaborator Author

@NikZak NikZak Aug 29, 2020

The source code is the one embedded in tensorflow.contrib: https://github.com/tensorflow/tensorflow/tree/r1.15/tensorflow/contrib/graph_editor

In its __init__.py it says "Licensed under the Apache License, Version 2.0" (the same as TensorFlow), which means we can copy and modify it.

Apache License 2.0
A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.

Permissions: commercial use, modification, distribution, patent use, private use.
Limitations: trademark use, liability, warranty.

@mingxingtan
Member

I mean, this could help,
But according to the paper, Retinanet has 34M parameters, and I can train that network just fine on my 1080, however, with all the same settings, even a -D3 with 12M parameters will not fit? Doesn't that indicate that something is going wrong somewhere?

@NikZak @LaurensHagendoorn @fsx950223 @kartik4949

For CNNs, memory usage is mostly dominated by activations rather than parameters. EfficientDets use more memory for a few reasons:

  1. Large input resolution: because resolution is one of the scaling dimensions, our resolution tends to be higher, which significantly increases activations (although there is no parameter increase).

  2. Large internal activations in the backbone: our backbone uses a relatively large expansion ratio (6), causing large expanded activations.

  3. Deep BiFPN: RetinaNet uses a single top-down FPN, while our BiFPN has multiple top-down and bottom-up paths, which leads to much more intermediate memory usage during training.

It is great to see that gradient checkpointing can significantly reduce memory without increasing training time much! Well done, @NikZak
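A back-of-the-envelope illustration of the activations-vs-parameters point (all shapes and sizes below are made up for illustration, not EfficientDet's actual ones):

def feature_map_mib(h, w, c, batch=1, bytes_per_elem=2):  # fp16 activations
  return batch * h * w * c * bytes_per_elem / 2**20

params_mib = 12e6 * 4 / 2**20                    # ~12M fp32 parameters -> ~46 MiB
one_map_mib = feature_map_mib(896, 896, 64 * 6)  # one expanded feature map at high resolution -> ~588 MiB
print(f"params: {params_mib:.0f} MiB, one expanded activation map: {one_map_mib:.0f} MiB")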

@kartik4949
Collaborator

I mean, this could help,
But according to the paper, Retinanet has 34M parameters, and I can train that network just fine on my 1080, however, with all the same settings, even a -D3 with 12M parameters will not fit? Doesn't that indicate that something is going wrong somewhere?

@NikZak @LaurensHagendoorn @fsx950223 @kartik4949

For CNNs, memory usage is mostly dominated by activations rather than parameters. EfficientDets use more memory for a few reasons:

  1. Large input resolution: because resolution is one of the scaling dimensions, our resolution tends to be higher, which significantly increases activations (although there is no parameter increase).
  2. Large internal activations in the backbone: our backbone uses a relatively large expansion ratio (6), causing large expanded activations.
  3. Deep BiFPN: RetinaNet uses a single top-down FPN, while our BiFPN has multiple top-down and bottom-up paths, which leads to much more intermediate memory usage during training.

It is great to see that gradient checkpointing can significantly reduce memory without increasing training time much! Well done, @NikZak

@mingxingtan Makes sense, got it now.

@NikZak
Collaborator Author

NikZak commented Aug 29, 2020

@mingxingtan Thanks for the comments! I will rectify this on Monday.

@williamhyin

williamhyin commented Sep 10, 2020

@williamhyin
Did you pull the new code from master? Could you provide an example in a Colab environment?
Could you try the following option?

clip_gradients_norm: 5.0

Hi NikZak,

I have tried your method with clip_gradients_norm: 5.0,
but the result is the same. I have prepared a subset of the total dataset for you.
The following Colab notebook gives you direct access to the GitHub repo and datasets:

https://colab.research.google.com/drive/1wQBc2ukZ4gU9PryPOQKmUUuqyRPW7tJa?usp=sharing

I find that many places in this official EfficientDet repo are written assuming the default COCO class count of 90 (100); if the number of classes is more than 100, it may cause problems, such as this one in the eval module:

File "/home/automl/efficientdet/coco_metric.py", line 132, in result
    self.metric_values = self.evaluate()

  File "/home/automl/efficientdet/coco_metric.py", line 125, in evaluate
    ap_perclass[c] = ap_c

I needed to change the initialization of ap_perclass
from
ap_perclass = [0] * 100  # assuming at most 100 classes.
to
ap_perclass = [0] * 201  # assuming at most 201 classes for dfg.

Then it worked.

So I don't know whether they have a similar cause (more than 100 classes),
because in the previous vehicle open dataset (6 classes) performance comparison I did not see the same situation.

@NikZak

@NikZak
Collaborator Author

NikZak commented Sep 10, 2020

@williamhyin
Thanks a lot for your Colab example! Great job!
I briefly ran your Colab with

gradient_checkpointing: True

and I saw cls_loss = 10.0, det_loss = 10.02 for the first few hundred steps.
Then I briefly ran your Colab with

#gradient_checkpointing: True

which is the same as gradient_checkpointing: False,
and I also saw cls_loss = 10.0, det_loss = 10.03 for the first few hundred steps.

So shall I wait longer? Does it start to converge earlier with gradient_checkpointing set to False? At what step does it start to converge with gradient_checkpointing: True, and at what step does it start to converge with gradient_checkpointing: False?

@williamhyin

williamhyin commented Sep 10, 2020

@williamhyin
Thanks a lot for your Colab example! Great job!
I briefly ran your Colab with

gradient_checkpointing: True

and I saw cls_loss = 10.0, det_loss = 10.02 for the first few hundred steps.
Then I briefly ran your Colab with

#gradient_checkpointing: True

which is the same as gradient_checkpointing: False,
and I also saw cls_loss = 10.0, det_loss = 10.03 for the first few hundred steps.

So shall I wait longer? Does it start to converge earlier with gradient_checkpointing set to False? At what step does it start to converge with gradient_checkpointing: True, and at what step does it start to converge with gradient_checkpointing: False?

You should wait more than 9000 steps... With gradient_checkpointing: false, it starts to converge from the beginning.
(screenshot of the training loss attached)

@NikZak
Collaborator Author

NikZak commented Sep 10, 2020

@williamhyin Thanks.
As I said, I tried your Colab with gradient_checkpointing: false and
it does not start to converge from the beginning either.

Same cls_loss = 10.0, det_loss = 10.03.

Could you then create two identical Colabs whose only difference is gradient_checkpointing set to true or false in the dfg.yaml file? Then run them one by one and share the results.

At the moment I do not see a difference from the beginning.

@williamhyin

williamhyin commented Sep 10, 2020

@williamhyin Thanks.
As I said, I tried your Colab with gradient_checkpointing: false and
it does not start to converge from the beginning either.

Same cls_loss = 10.0, det_loss = 10.03.

Could you then create two identical Colabs whose only difference is gradient_checkpointing set to true or false in the dfg.yaml file? Then run them one by one and share the results.

At the moment I do not see a difference from the beginning.

  1. gradient_checkpointing: false, batch_size=2
     starts converging from step 2600
     https://colab.research.google.com/drive/14HeYSkRC_ObcnLOuuxPkDzP_8kRTruqL?usp=sharing
INFO:tensorflow:loss = 8.370411, step = 2600 (73.714 sec)
I0910 07:04:32.894324 140577495857024 basic_session_run_hooks.py:260] loss = 8.370411, step = 2600 (73.714 sec)
INFO:tensorflow:box_loss = 0.0021931177, cls_loss = 8.11686, det_loss = 8.226517, step = 2600 (73.714 sec)
I0910 07:04:32.894533 140577495857024 basic_session_run_hooks.py:260] box_loss = 0.0021931177, cls_loss = 8.11686, det_loss = 8.226517, step = 2600 (73.714 sec)
INFO:tensorflow:GPU memory used: 8805 MiB = 58.4% of total GPU memory: 15079 MiB
  2. gradient_checkpointing: true, batch_size=4
     starts converging from step 10200

https://colab.research.google.com/drive/1wQBc2ukZ4gU9PryPOQKmUUuqyRPW7tJa?usp=sharing

I0910 11:17:06.202978 140103810455360 basic_session_run_hooks.py:702] global_step/sec: 1.02813
INFO:tensorflow:loss = 3.6903973, step = 10200 (97.264 sec)
I0910 11:17:06.204520 140103810455360 basic_session_run_hooks.py:260] loss = 3.6903973, step = 10200 (97.264 sec)
INFO:tensorflow:box_loss = 0.0013172865, cls_loss = 3.4850454, det_loss = 3.5509098, step = 10200 (97.263 sec)
I0910 11:17:06.204860 140103810455360 basic_session_run_hooks.py:260] box_loss = 0.0013172865, cls_loss = 3.4850454, det_loss = 3.5509098, step = 10200 (97.263 sec)
INFO:tensorflow:memory total = 11016.0, memory used = 10819.0, memory used % = 98.21169 (97.263 sec)

@LucasSloan
Collaborator

Is this compatible with XLA? XLA gives massive speedups on efficientdet (2-3x) and I'd hate to lose that.

@williamhyin

Is this compatible with XLA? XLA gives massive speedups on efficientdet (2-3x) and I'd hate to lose that.

Hi LucasSloan,

Thank you for your great suggestion! It works,
but I am confused about the use_xla option.

I set use_xla to true, and it returned a warning:

2020-09-12 09:51:48.487690: I tensorflow/compiler/jit/xla_compilation_cache.cc:314] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.

2020-09-12 09:51:51.446572: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1641] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.

This made me think XLA was not enabled,
but the per-step time actually decreased from 29 s to 17 s.

Before:

I0911 21:27:43.641861 140484951627584 basic_session_run_hooks.py:260] box_loss = 0.00021900874, cls_loss = 0.22539198, det_loss = 0.23634242, step = 130300 (29.326 sec)
INFO:tensorflow:GPU memory used: 9897 MiB = 89.8% of total GPU memory: 11016 MiB

After:

I0912 10:00:21.039118 139902292236096 basic_session_run_hooks.py:260] box_loss = 0.0008860264, cls_loss = 0.1465769, det_loss = 0.19087821, step = 280500 (16.903 sec)
INFO:tensorflow:GPU memory used: 10134 MiB = 92.0% of total GPU memory: 11016 MiB
I0912 10:00:21.039428 139902292236096 basic_session_run_hooks.py:254] GPU memory used: 10134 MiB = 92.0% of total GPU memory: 11016 MiB

I'm confused about whether XLA is turned on or not, and if not, how I should set it.
Hope to receive your reply soon.
@LucasSloan

Thanks

@NikZak
Collaborator Author

NikZak commented Sep 12, 2020

@LucasSloan
Thanks a lot for the suggestion.

In short, XLA and gradient checkpointing work together like a charm and provide a bit of both worlds: reduced memory consumption and increased speed.

Switching on XLA and gradient checkpointing uses a little more memory than pure gradient checkpointing without XLA, but significantly less memory than pure XLA without gradient checkpointing.

I will probably provide more detailed stats next week.

Member

@mingxingtan mingxingtan left a comment

Looks great! Thanks @NikZak

| D4(640) [h5](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco640/efficientdet-d4-640.h5), [ckpt](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco640/efficientdet-d4-640.tar.gz) | 45.7 | 21.7ms |
| D5(640 [h5](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco640/efficientdet-d5-640.h5), [ckpt](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco640/efficientdet-d5-640.tar.gz) | 46.6 | 26.6ms |
| D6(640) [h5](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco640/efficientdet-d6-640.h5), [ckpt](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco640/efficientdet-d6-640.tar.gz) | 47.9 | 33.8ms |
| D2(640) [ckpt](https://storage.googleapis.com/cloud-tpu-checkpoints/efficientdet/coco640/efficientdet-d2-640.tar.gz) | 41.7 | 14.8ms |
Member

Are these changes intended?

If set to True, strings defined by gradient_checkpointing_list (["Add"] by default) are searched in the tensors names and any tensors that match a string from the list are kept as checkpoints. When this option is used the standard tensorflow.python.ops.gradients method is being replaced with a custom method.

Testing shows that:
* On d4 network with batch-size of 1 (mixed precision enabled) it takes only 1/3.2 of memory with roughly 32% slower computation
Member

Nice document!

from third_party.grad_checkpoint \
import memory_saving_gradients # pylint: disable=g-import-not-at-top
from tensorflow.python.ops \
import gradients # pylint: disable=g-import-not-at-top
Member

These imports can probably fit into a single line each (try to avoid "\" line continuations).
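i.e., something like:

from third_party.grad_checkpoint import memory_saving_gradients  # pylint: disable=g-import-not-at-top
from tensorflow.python.ops import gradients  # pylint: disable=g-import-not-at-top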

if params["nvgpu_logging"]:
try:
from third_party.tools import nvgpu # pylint: disable=g-import-not-at-top
from functools import reduce # pylint: disable=g-import-not-at-top
Member

just import functools, and use functools.reduce
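i.e., roughly (applied to the get_nested_value helper shown in the next hunk):

import functools  # pylint: disable=g-import-not-at-top

def get_nested_value(d, path):
  return functools.reduce(dict.get, path, d)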

from functools import reduce # pylint: disable=g-import-not-at-top

def get_nested_value(d, path):
return reduce(dict.get, path, d)
Member

Can we move most of this code to nvgpu, so this file can stay clean? Thanks.

For example, nvgpu_gpu_info, commonsize, and formatter_log can be moved to nvgpu.

@@ -161,6 +161,8 @@ def as_dict(self):
else:
config_dict[k] = copy.deepcopy(v)
return config_dict


# pylint: enable=protected-access

Member

Maybe you can move "# pylint: enable=protected-access" right after return (with same indent), to avoid too many empty lines.
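i.e., something like (sketch of the suggested placement):

    else:
      config_dict[k] = copy.deepcopy(v)
    return config_dict
    # pylint: enable=protected-access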

@@ -281,6 +283,13 @@ def default_detection_configs():
h.dataset_type = None
h.positives_momentum = None

# Reduces memory during training
h.gradient_checkpointing = False
h.gradient_checkpointing_list = ["Add"]
Member

Could you add a comment to explain what values can be used other than "Add"?

Member

Thanks for adding more details. Could you explain a little bit more: what's the impact of this list?

If I use ["Add"], does it mean it would automatically checkpoint all "Add" operations?
If I use ['Add', 'Sigmoid'], does it mean it would automatically checkpoint all 'Add' and 'Sigmoid' ops?

If so, what are the pros and cons of adding more ops, and why is the default 'Add'?

Sorry if these questions annoy you, but I am hoping to make this clear, as it is a greatly useful feature. Thanks!

@@ -117,6 +116,38 @@
'run in a separate process for train and eval and memory will be cleared.'
'Drawback: need to kill 2 processes if trainining needs to be interrupted.')

flags.DEFINE_bool(
'gradient_checkpointing', False,
Member

You don't need to define flags here since they are already in hparams.

Member

@mingxingtan mingxingtan left a comment

Just some minor comment. Overall looks good. Thanks!




def commonsize(inp):
"""Convert all to MiB."""
Member

How about a more informative name such as 'input_size'? Similarly, you could rename 'inp_' to 'converted_size' or 'output_size'.

@mingxingtan mingxingtan merged commit 6ab70e1 into google:master Sep 21, 2020
@NikZak NikZak deleted the gradient_checkpoint branch September 21, 2020 02:41
glenvorel pushed a commit to glenvorel/automl that referenced this pull request Apr 14, 2021
* add option each_epoch_in_separate_process

* typos in description

* comments wording

* h.each_epoch_in_separate_process = True in default

* renamed option to run_epoch_in_child_process to avoid confusion

* flags.run_epoch_in_child_process also set to True in default

* h.run_epoch_in_child_process = True : don't need this config

* replaced lambda function with functools.partial to get read of pylint warning

* gradient checkpointing

* gradient checkpointing

* gradient checkpointing

* remove .ropeproject

* description enhancement

* description cleanup

* gradient checkpoint libraries

* deleted graph edtor and gradient checkpointing libraris from this branch

* log message

* remove BUILD

* added back to master

* logging

* graph_editor and gradient checkpointing libs

* deleted:    graph_editor/BUILD

* readme

* readme

* Copyright of gradient checkpointing

* redo

* redo

* third_party linted

* README

* README

* merge conflict typo

* merge conflict typo

* renaming

* no log level reset

* no log level reset

* logging of step per epoch is no longer correct in the latest train_and_eval mode

* add a bit of verbosity to avoid frustration during graph rebuld

* readme

* readme

* less user discretion

* replaced third party nvgpu with intenal module

* replaced third party nvgpu with intenal module

* replaced third party nvgpu with intenal module

* comments added

* carve out toposort and include it here

* refactor toposort based on this repo reqs

* checkout third party

* minor typo

* cleanup

* cleanup, comments