
LSTM node #729

Merged: 45 commits, Jul 24, 2017
Conversation

@msperber (Collaborator) commented Jul 21, 2017

This attempts to reduce the per-timestep LSTM memory consumption to a minimum by avoiding the creation of intermediate nodes. The memory allocated is now 6*hidden_dim per timestep for the forward and backward passes respectively: the hidden state, the cell state, and the result of the 4 matrix multiplies + nonlinearities. Ideally we wouldn't have to save the latter in the backward pass, but because of the complex LSTM dependencies I don't see a good way to accomplish that without hardcoding the complete loop over the sequence, which would make things very inflexible.



Implemented are the vanilla LSTM, dropout, and weight noise; dropout and weight noise require no additional memory. Preliminary experiments indicate a smaller overall memory footprint at slightly slower speed.
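To see why dropout and weight noise need no extra persistent memory, here is a minimal numpy sketch (illustrative only, not DyNet's implementation): the inverted dropout mask and the Gaussian noise are generated and applied on the fly inside the step.

```python
import numpy as np

# Illustration only; names and shapes are hypothetical.
rng = np.random.default_rng(0)
hidden_dim = 8

# inverted dropout: mask entries are 0 or 1/(1-p), preserving expectations
p = 0.5
dropout_mask = (rng.random(hidden_dim) >= p) / (1.0 - p)

# weight noise: perturb W on the fly; nothing beyond W itself is stored
W = np.ones((hidden_dim, hidden_dim), dtype=np.float32)
noise_std = 0.1
W_noisy = W + rng.normal(0.0, noise_std, W.shape).astype(np.float32)

x = np.ones(hidden_dim, dtype=np.float32)
out = W_noisy @ (dropout_mask * x)
```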



The above is achieved via 3 new DyNet nodes:

gates_t = vanilla_lstm_gates(x_t, h_tm1, Wx, Wh, b, dropout_mask_x, dropout_mask_h, weightnoise_std)

c_t = vanilla_lstm_c(c_tm1, gates_t)

h_t = vanilla_lstm_h(c_t, gates_t)
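In terms of the standard vanilla LSTM equations, the three nodes decompose a timestep roughly as follows. This is a numpy sketch of the semantics only (gate order i, f, o, g, as mentioned later in this thread); the actual nodes are C++ implementations, and dropout masks / weight noise are omitted here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def vanilla_lstm_gates(x_t, h_tm1, Wx, Wh, b, hidden_dim):
    # one fused affine transform, then nonlinearities: output is 4*hidden_dim
    a = Wx @ x_t + Wh @ h_tm1 + b
    a[:3 * hidden_dim] = sigmoid(a[:3 * hidden_dim])  # i, f, o
    a[3 * hidden_dim:] = np.tanh(a[3 * hidden_dim:])  # g
    return a

def vanilla_lstm_c(c_tm1, gates, hidden_dim):
    i, f = gates[:hidden_dim], gates[hidden_dim:2 * hidden_dim]
    g = gates[3 * hidden_dim:]
    return f * c_tm1 + i * g  # c_t = f . c_{t-1} + i . g

def vanilla_lstm_h(c_t, gates, hidden_dim):
    o = gates[2 * hidden_dim:3 * hidden_dim]
    return o * np.tanh(c_t)  # h_t = o . tanh(c_t)

# one timestep
H, X = 4, 3
rng = np.random.default_rng(0)
Wx = rng.normal(size=(4 * H, X))
Wh = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)
x_t, h_tm1, c_tm1 = rng.normal(size=X), np.zeros(H), np.zeros(H)

gates_t = vanilla_lstm_gates(x_t, h_tm1, Wx, Wh, b, H)
c_t = vanilla_lstm_c(c_tm1, gates_t, H)
h_t = vanilla_lstm_h(c_t, gates_t, H)
```

Note that only `gates_t` (4*hidden_dim), `c_t`, and `h_t` (hidden_dim each) need to persist per timestep, which is where the 6*hidden_dim figure comes from.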

@jcyk (Contributor) commented Jul 22, 2017

Great! Excited about it!
@msperber What is the memory allocation in the current version?

@msperber (Collaborator, Author)

If I counted correctly, it should currently be at least 18*hidden_dim, and 20*hidden_dim if using dropout.

// non-linearities
fx.tbvec().slice(indices_i, sizes_3).device(*dev.edevice) = fx.tbvec().slice(indices_i, sizes_3).unaryExpr(scalar_logistic_sigmoid_op<float>());
A collaborator commented on the line above:
Very minor comment: I think you can do .sigmoid() here

@pmichel31415 (Collaborator)
Thanks, Matthias!

I have a quick question: any guess as to why it is slower? Is this fixable?

@msperber (Collaborator, Author)

Hmm, I don't see a particular reason why it should be slower, so yeah, I assume it should just be a matter of doing some more profiling / code-level optimization. If anything, we could probably make it slightly faster because we have more opportunity for batching things together, such as the 3 sigmoids.

@msperber (Collaborator, Author)

Actually, now that you made me think about it, Paul: a likely reason is that the i, f, o, g parts are no longer separated in memory, so the pointwise operations stride over the memory. If that slows things down, it should be easy to fix by copying them to separate regions of scratch memory (the same as the currently implemented LSTM does, except that we release the memory afterwards).
In any case, I guess we can save this for a future commit, as I won't have time to look into it for the next couple of weeks, and the speed differences are not very big anyway.
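The striding point can be illustrated outside DyNet. Assuming a column-major layout (as in Eigen), when i, f, o, g for a batch live in one 4*hidden_dim-row allocation, a per-gate slice is a non-contiguous view, while copying it to scratch memory restores contiguity. A numpy sketch:

```python
import numpy as np

hidden_dim, batch = 4, 3
# one column-major allocation holding [i; f; o; g] for every batch element
gates = np.zeros((4 * hidden_dim, batch), order='F', dtype=np.float32)

# a per-gate view strides over the buffer: consecutive columns of the
# i-block sit 4*hidden_dim floats apart, not hidden_dim
i_view = gates[:hidden_dim, :]
print(i_view.flags['F_CONTIGUOUS'])     # False: pointwise ops must stride

# copying the block to scratch memory makes it contiguous again
i_scratch = np.asfortranarray(i_view)
print(i_scratch.flags['F_CONTIGUOUS'])  # True
```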

@neubig (Contributor) commented Jul 24, 2017

Thanks, this is great! I'm going to merge this for now, and we can take a look at the remaining speed issues in the future.

@neubig neubig merged commit c2fefdf into clab:master Jul 24, 2017
shuheik added a commit to shuheik/dynet that referenced this pull request Sep 14, 2017
neubig pushed a commit that referenced this pull request Sep 14, 2017
* fixed file paths

* Apply changes in #547 and #729 to Scala bindings