LSTM node #729
Conversation
Moved LSTM node to its own source files
Fix several compile errors and extra scratch memory
Great! Excited about it!
If I counted correctly, it should currently be at least 18*hidden_dim, and 20*hidden_dim if using dropout.
// non-linearities
fx.tbvec().slice(indices_i, sizes_3).device(*dev.edevice) = fx.tbvec().slice(indices_i, sizes_3).unaryExpr(scalar_logistic_sigmoid_op<float>());
Very minor comment: I think you can do .sigmoid() here.
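For context, the suggested .sigmoid() is Eigen's built-in elementwise shorthand for the same logistic function the custom functor computes via unaryExpr. A minimal plain-C++ sketch of that equivalence (illustrative helper names, not dynet's or Eigen's actual code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Custom unary functor, mirroring scalar_logistic_sigmoid_op in the diff.
struct scalar_logistic_sigmoid_op {
  float operator()(float x) const { return 1.0f / (1.0f + std::exp(-x)); }
};

// Elementwise application, standing in for tensor.unaryExpr(op).
std::vector<float> unary_expr(const std::vector<float>& v,
                              const scalar_logistic_sigmoid_op& op) {
  std::vector<float> out(v.size());
  for (size_t i = 0; i < v.size(); ++i) out[i] = op(v[i]);
  return out;
}

// Direct shorthand, standing in for tensor.sigmoid().
std::vector<float> sigmoid(const std::vector<float>& v) {
  std::vector<float> out(v.size());
  for (size_t i = 0; i < v.size(); ++i)
    out[i] = 1.0f / (1.0f + std::exp(-v[i]));
  return out;
}
```

Both spellings compute the same values; the built-in form is just shorter and lets Eigen pick a vectorized implementation.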
Thanks Matthias! I have a quick question: any guess as to why it is slower? Is this fixable?
Hmm, I don't see a particular reason why it should be slower, so yeah I assume that should be just a matter of doing some more profiling / code-level optimization. If anything we could probably make it slightly faster because we have more opportunity for batching things together, such as the 3 sigmoids.
Actually, now that you made me think about it, Paul: a likely reason is that the i,f,o,g parts are no longer separated in memory, and so the pointwise operations are performed by striding over the memory. If that slows things down, it should be easy to fix by copying them to separate regions of scratch memory (same as the currently implemented LSTM is doing, except we release the memory afterwards).
Thanks, this is great! I'm going to merge this for now, and we can take a look at the remaining speed issues in the future.
This attempts to reduce the per-timestep LSTM memory consumption to a minimum by avoiding the creation of intermediate nodes. The memory allocated is now 6*hidden_dim for the forward and backward passes respectively: hidden state, cell state, and the result of the 4 matrix multiplies + nonlinearities. In an ideal case we wouldn’t have to save the latter in the backward pass, but due to the complex LSTM dependencies I don’t see a good way to accomplish that without hardcoding the complete loop over the sequence, which would make things very inflexible.
Implemented are the vanilla LSTM, dropout, and weight noise. Dropout and weight noise require no additional memory. Preliminary experiments indicate a smaller overall memory footprint at slightly slower speed.
The above is achieved via 3 new dynet nodes:
gates_t = vanilla_lstm_gates(x_t, h_tm1, Wx, Wh, b, dropout_mask_x, dropout_mask_h, weightnoise_std)
c_t = vanilla_lstm_c(c_tm1, gates_t)
h_t = vanilla_lstm_h(c_t, gates_t)