
Removed distributed training support #1988
Merged · 1 commit · Apr 1, 2019

Conversation

tilmankamp (Contributor) commented:

This PR also adds support for automatically checkpointing the epoch model with the best dev loss so far.
@reuben The main activity is in the training loop and I tried to keep the changes there, apart from trivial removals (coordinator, config/FLAGS etc.). The feeder in particular is kept as it is. So rebasing your big data PR will be hard for the training part, but it should still be possible.
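For readers following along, the best-dev-loss checkpointing idea can be sketched roughly as follows. This is illustrative only, not the PR's actual code; `run_training_epoch`, `compute_dev_loss` and `best_ckpt_path` are hypothetical stand-ins for what the real training loop provides:

```python
import tensorflow as tf

def train_with_best_checkpoint(session, num_epochs, best_ckpt_path,
                               run_training_epoch, compute_dev_loss):
    """Sketch: keep a separate checkpoint for the epoch with the best
    dev loss seen so far."""
    best_saver = tf.train.Saver(max_to_keep=1)  # keep only the best model
    best_dev_loss = float('inf')
    for epoch in range(num_epochs):
        run_training_epoch(session)
        dev_loss = compute_dev_loss(session)
        if dev_loss < best_dev_loss:
            # Dev loss improved: overwrite the previous best checkpoint.
            best_dev_loss = dev_loss
            best_saver.save(session, best_ckpt_path, global_step=epoch)
```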

tilmankamp requested a review from reuben on March 28, 2019 at 18:07.
reuben (Contributor) commented on Mar 28, 2019:

I won’t have time or enough internet access to review this until Monday, but could you make sure you remove all references in READMEs and docs? Maybe nix the docs/ folder altogether? It’s severely out of date at this point.

reuben (Contributor) left a review:

Overall this looks great, just a couple of comments. It's really good to clean up some of this code!

```diff
@@ -388,9 +384,11 @@ def train(server=None):
     train_set = DataSet(train_data,
                         FLAGS.train_batch_size,
                         limit=FLAGS.limit_train,
-                        next_index=lambda i: coord.get_next_index('train'))
+                        next_index=train_index.inc)
```
reuben (Contributor) commented:

nit: `next_index=lambda i: train_index += 1` and then we can avoid defining the `SampleIndex` class and just make this variable an integer.

tilmankamp (Author) replied:

This is what I first tried. In Python, `x += 1` is a statement, not an expression that yields a number, and there is no non-hacky way to put multiple statements into a lambda. If you go for an inline function instead, you have to declare `train_index` as `global` (or `nonlocal`), since in Python you cannot rebind a variable from an enclosing scope without such a declaration. You'd have to do all of this twice, for train and dev, and it would get more verbose than the existing (cleaner) solution. This is Python...
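A minimal sketch of the kind of counter class being discussed here (illustrative; the PR's actual `SampleIndex` may differ):

```python
class SampleIndex:
    """Mutable integer counter, working around `x += 1` being a
    statement (not an expression) in Python."""

    def __init__(self, index=0):
        self.index = index

    def inc(self, old_index=None):
        # Matches the next_index callback signature: the previous index
        # passed by the caller is ignored, the counter is bumped and its
        # new value returned.
        self.index += 1
        return self.index

# Usage, mirroring the diff above:
train_index = SampleIndex()
# DataSet(..., next_index=train_index.inc)
```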

reuben (Contributor) replied:

Ah, that's unfortunate. Alright!

DeepSpeech.py (resolved)
DeepSpeech.py (outdated)

```diff
-current_epoch = coord._epoch-1
+# Checkpointing
+epoch_saver = tf.train.Saver(max_to_keep=FLAGS.max_to_keep)
```
reuben (Contributor) commented:

Previously we saved checkpoints every 10 minutes so that we could recover from a crash without losing too much training time. This now only saves at the end of each epoch. Saving checkpoints at epoch end is nice to have, but I think saving every 10 minutes is a good safety feature to keep.

tilmankamp (Author) replied:

Will address this...
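For reference, a sketch of what time-based checkpointing could look like in TF 1.x, independent of the removed coordinator. Names like `checkpoint_path` are illustrative, and the `train_op` here is a placeholder for the real training op:

```python
import time
import tensorflow as tf

checkpoint_path = '/tmp/model.ckpt'       # illustrative path
save_interval_secs = 600                  # the 10-minute interval above

global_step = tf.train.get_or_create_global_step()
train_op = tf.assign_add(global_step, 1)  # stand-in for the real train op

saver = tf.train.Saver(max_to_keep=5)

with tf.Session() as session:
    session.run(tf.global_variables_initializer())
    last_save = time.time()
    for _ in range(10000):
        session.run(train_op)
        # Save whenever the interval has elapsed, regardless of epoch.
        if time.time() - last_save >= save_interval_secs:
            saver.save(session, checkpoint_path, global_step=global_step)
            last_save = time.time()
```

Alternatively, `tf.train.MonitoredTrainingSession` offers this out of the box via its `save_checkpoint_secs` argument.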

```python
test()

if FLAGS.export_dir:
    export()
```
reuben (Contributor) commented:

This change is just lovely :D

reuben (Contributor) commented on Apr 1, 2019:

The test failures were just version checks failing due to missing tags in your fork; merging.

lock bot commented on May 1, 2019:

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators on May 1, 2019.