
Removed distributed training support #1988
Merged · 1 commit · Apr 1, 2019

Conversation

tilmankamp (Contributor) commented:

This PR also adds support for automatically checkpointing the epoch model with the best dev loss so far.
@reuben The main activity is in the training loop and I tried to keep the changes there, apart from trivial removals (coordinator, config/FLAGS etc.). The feeder in particular is kept as it is. So rebasing your big data PR will be hard for the training part, but it should still be possible.
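For readers following along, the best-dev-loss checkpointing idea can be sketched roughly as follows. This is illustrative only, not the PR's actual code; `run_training_epoch`, `compute_dev_loss` and `best_ckpt_path` are hypothetical stand-ins for what the real training loop provides:

```python
import tensorflow as tf

def train_with_best_checkpoint(session, num_epochs, best_ckpt_path,
                               run_training_epoch, compute_dev_loss):
    """Sketch: keep a separate checkpoint for the epoch with the best
    dev loss seen so far."""
    best_saver = tf.train.Saver(max_to_keep=1)  # keep only the best model
    best_dev_loss = float('inf')
    for epoch in range(num_epochs):
        run_training_epoch(session)
        dev_loss = compute_dev_loss(session)
        if dev_loss < best_dev_loss:
            # Dev loss improved: overwrite the previous best checkpoint.
            best_dev_loss = dev_loss
            best_saver.save(session, best_ckpt_path, global_step=epoch)
```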

tilmankamp requested a review from reuben on March 28, 2019 at 18:07.
reuben (Contributor) commented on Mar 28, 2019:

I won’t have time or enough internet access to review this until Monday, but could you make sure you remove all references in READMEs and docs? Maybe nix the docs/ folder altogether? It’s severely out of date at this point.

reuben (Contributor) left a review:

Overall this looks great, just a couple of comments. It's really good to clean up some of this code!

```diff
@@ -388,9 +384,11 @@ def train(server=None):
     train_set = DataSet(train_data,
                         FLAGS.train_batch_size,
                         limit=FLAGS.limit_train,
-                        next_index=lambda i: coord.get_next_index('train'))
+                        next_index=train_index.inc)
```
reuben (Contributor) commented:

nit: `next_index=lambda i: train_index += 1` and then we can avoid defining the `SampleIndex` class and just make this variable an integer.

tilmankamp (Author) replied:

This is what I first tried. In Python, `x += 1` is a statement, not an expression that yields a number, and there is no non-hacky way to put multiple statements into a lambda. If you go for an inline function instead, you have to declare `train_index` as `global` (or `nonlocal`), since in Python you cannot rebind a variable from an enclosing scope without such a declaration. You'd have to do all of this twice, for train and dev, and it would get more verbose than the existing (cleaner) solution. This is Python...
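A minimal sketch of the kind of counter class being discussed here (illustrative; the PR's actual `SampleIndex` may differ):

```python
class SampleIndex:
    """Mutable integer counter, working around `x += 1` being a
    statement (not an expression) in Python."""

    def __init__(self, index=0):
        self.index = index

    def inc(self, old_index=None):
        # Matches the next_index callback signature: the previous index
        # passed by the caller is ignored, the counter is bumped and its
        # new value returned.
        self.index += 1
        return self.index

# Usage, mirroring the diff above:
train_index = SampleIndex()
# DataSet(..., next_index=train_index.inc)
```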

reuben (Contributor) replied:

Ah, that's unfortunate. Alright!

DeepSpeech.py (resolved)
DeepSpeech.py (outdated)

```diff
-current_epoch = coord._epoch-1
+# Checkpointing
+epoch_saver = tf.train.Saver(max_to_keep=FLAGS.max_to_keep)
```
reuben (Contributor) commented:

Previously we saved checkpoints every 10 minutes so that we could recover from a crash without losing too much training time. This now only saves at the end of each epoch. Saving checkpoints at epoch end is nice to have, but I think saving every 10 minutes is a good safety feature to keep.

tilmankamp (Author) replied:

Will address this...
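For reference, a sketch of what time-based checkpointing could look like in TF 1.x, independent of the removed coordinator. Names like `checkpoint_path` are illustrative, and the `train_op` here is a placeholder for the real training op:

```python
import time
import tensorflow as tf

checkpoint_path = '/tmp/model.ckpt'       # illustrative path
save_interval_secs = 600                  # the 10-minute interval above

global_step = tf.train.get_or_create_global_step()
train_op = tf.assign_add(global_step, 1)  # stand-in for the real train op

saver = tf.train.Saver(max_to_keep=5)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    last_save = time.time()
    for _ in range(10000):
        session.run(train_op)
        # Save whenever the interval has elapsed, regardless of epoch.
        if time.time() - last_save >= save_interval_secs:
            saver.save(session, checkpoint_path, global_step=global_step)
            last_save = time.time()
```

Alternatively, `tf.train.MonitoredTrainingSession` offers this out of the box via its `save_checkpoint_secs` argument.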

```python
test()

if FLAGS.export_dir:
    export()
```
reuben (Contributor) commented:

This change is just lovely :D

reuben (Contributor) commented on Apr 1, 2019:

The test failures were just version checks failing due to missing tags in your fork; merging.

lock bot commented on May 1, 2019:

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators on May 1, 2019.