Weight Sharing #546
Conversation
Does this have any relation to the Composite layer with routing capabilities?
@bhack Not that I can see, although I could have missed it since I've only taken a cursory glance at the Composite layer. Caffe already understands DAG models by inserting split layers at forks where a top blob is the bottom blob of more than one layer. With shared weights, one can do the same convolutions, inner products, or whatever on different inputs by defining the layer with the same `param` name.

That said, it would be nice to have a shorthand in our model definitions for this kind of structure instead of redefining the shared layer over and over with different bottoms and tops. A multiresolution model is a good example: the same convolutions should be done at every level of the pyramid, and it would be more concise not to write this down exhaustively.

The Composite layer raises another point: Caffe could execute layers at the same topological depth simultaneously. At the moment execution is totally serial.
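As a concrete sketch of the multiresolution case: the same convolution can be applied to two pyramid levels by repeating the layer definition with a shared `param` name. Layer and blob names below are illustrative, and the `param` field follows this PR's proposed syntax:

```protobuf
layers {
  name: "conv_level0"
  type: CONVOLUTION
  bottom: "pyramid_level0"
  top: "conv_level0"
  param: "shared_pyramid_conv"   # same param name => shared filters
  convolution_param { num_output: 16 kernel_size: 3 }
}
layers {
  name: "conv_level1"
  type: CONVOLUTION
  bottom: "pyramid_level1"
  top: "conv_level1"
  param: "shared_pyramid_conv"   # shares filters with conv_level0
  convolution_param { num_output: 16 kernel_size: 3 }
}
```

The shorthand discussed above would let one write this layer once and apply it to every pyramid level.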
I made the changes I suggested at #500 (comment). @jeffdonahue please review -- if you don't like my follow-up renaming commits, feel free to drop them. This is otherwise ready for merge IMHO.
Yes, with shared weights I think we cover almost everything. I agree with you that the Composite layer, besides parallel execution, also allows a simpler network notation in YAML, which could probably be adopted in some form in the Caffe protobuf.
Cool, thanks for the rebase and name cleanup!
LOG(FATAL) << "Unknown caffe mode: " << Caffe::mode();
  }
}
// Now, update the owned parameters. |
@shelhamer, If I understood correctly, you add up all the diffs into the owner's diff and then update the parameters accordingly, right? Does that imply that if the layer owning this blob does not contribute to the loss, the accumulated diffs from the other layers that use this blob but don't own it will not be used?
We either must change the ownership from the first layer that mentions the param to the first layer that mentions the param and participates in the loss, or we need to fix `layer_need_backward_[layer_id]` accordingly. Is my concern valid?
@ashafaei Was this fixed in the current version?
@ashafaei @abhi2610 This was never actually an issue, although it was worth raising since you need to know how the `Net`, `Solver`, and `Blob`s cooperate. The loss / backward logic only decides whether backward is computed for a layer or not. `Net::Update()` is always called by the solver; all of the shared weight params will accumulate their diff with the owner in the loop at line 478, and then `Blob::Update()` will always be called for every weight owner, as in the loop at line 501. This does bring up why `Blob::Update()` is unconditional when it could be skipped, but that's another matter. Thanks @jeffdonahue for discussion.
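The accumulate-then-update flow described above can be sketched in a few lines of Python. This is an illustrative model of the logic, not Caffe's actual C++ code; the `Param` class and `net_update` function are hypothetical:

```python
# Sketch of the shared-weight update: every non-owning shared parameter
# accumulates its diff into its owner's diff (cf. the loop at line 478),
# then only owners apply an update (cf. the loop at line 501).

class Param:
    def __init__(self, owner=None):
        self.data = 0.0
        self.diff = 0.0
        self.owner = owner  # None means this blob owns its weights

def net_update(params, lr=1.0):
    # Accumulate shared diffs into owners.
    for p in params:
        if p.owner is not None:
            p.owner.diff += p.diff
    # Update owners only; in real Caffe the sharers point at the same
    # data memory, which we mimic here by copying afterwards.
    for p in params:
        if p.owner is None:
            p.data -= lr * p.diff
    for p in params:
        if p.owner is not None:
            p.data = p.owner.data

owner = Param()
sharer = Param(owner=owner)
owner.diff, sharer.diff = 0.5, 0.25
net_update([owner, sharer], lr=1.0)
print(owner.data)   # -0.75: both diffs applied once to the shared weight
print(sharer.data)  # -0.75: sharer sees the same data as the owner
```

Note that the owner's update runs unconditionally, which mirrors the point above that `Blob::Update()` is always called even when it could be skipped.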
I see that this pull request adds weight sharing, which recurrent nets will need; however, Caffe takes fixed protocol buffer net descriptors. What would be the best way to implement a mechanism to allow dynamic repetition of sets of layers for input sequences? For instance, if I want to train a recurrent net for part of speech tagging, then for a given sentence with n words I'll have a corresponding set of inputs and outputs.

If the recurrent part has several layers that are repeated for each input word, such as a dropout layer, a concatenation layer, an inner product layer for the recurrent output, a rectified linear layer for the recurrent output, an inner product layer for the classification output, and a softmax layer for the classification output, you can group these into virtual layers with their own inputs and outputs. Then, for a given sentence, you could chain together several of these modules made of several layers to feed the recurrent state from one to the next. The repeated configuration could be described by a protocol buffer. Finally, the entire thing could be seen as one giant virtual layer with a very wide input and a very wide output: the concatenation of the initial recurrent state followed by all the word vectors, and the concatenation of all the tag probabilities followed by the final recurrent state.

This example only works for sequence classification. Recurring over a tree structure would need a different approach. I've drawn up what I have in mind here:

Another issue is that Caffe seems to send training and test data through the layers as matrices of examples rather than individually, for performance reasons. So, with sentences of varying lengths it would probably be best to group them by word count and send these minibatches through appropriately instantiated recurrent nets.

Also, does Caffe already have any support for converting text to word vectors for training data?
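One way to get the dynamic repetition asked about above is to generate the unrolled net descriptor programmatically before handing it to Caffe. The sketch below is a hypothetical illustration (the layer/blob names and the dictionary shape are invented, not a Caffe API): it unrolls one repeated "virtual layer" per timestep, with every timestep naming the same shared `param`.

```python
# Sketch: unroll a repeated recurrent module into per-timestep layer
# definitions, producing a fixed description for a sentence of n words.
# All timesteps share weights by naming the same `param`.

def unroll(n_steps):
    layers = []
    prev_state = "state0"  # initial recurrent state blob
    for t in range(n_steps):
        state = f"state{t + 1}"
        layers.append({
            "name": f"recurrent{t}",
            "bottom": [f"word{t}", prev_state],   # word vector + prior state
            "top": [f"tags{t}", state],           # tag probs + next state
            "param": "shared_recurrent_weights",  # weight sharing across steps
        })
        prev_state = state
    return layers

net = unroll(3)
print(len(net))           # 3 repeated modules
print(net[1]["bottom"])   # ['word1', 'state1']
```

Sentences of equal length would share one unrolled descriptor, which fits the batching-by-word-count idea mentioned above.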
Weight Sharing
[Original PR notes copied from #500. This PR replaces it.]
This adds the ability to share parameters between layers, which has a number of applications, the canonical one perhaps being recurrent neural network (RNN) training.
To share weights between two or more layers with parameters (currently just `InnerProductLayer`s and `ConvolutionLayer`s), specify the same `param` for all of these layers. (You can also name the biases with a second `param`, as in the `blobs_lr` and `weight_decay` parameters.) You can see a very simple example of this in `src/caffe/test/test_net.cpp`: see the unit test named `InitDiffDataSharedWeightsNet`. This means layers innerproduct1 and innerproduct2 are sharing the same set of weights as they've both specified `param: 'sharedweights'`. And in this case they also take the same bottom blob (`data`), so their outputs, top blobs `innerproduct1` and `innerproduct2`, should be identical (so this is not actually something you'd ever want to do; I do it there just for testing purposes).

Note that in this case we specify only one blob name because we've set `bias_term: false`; if we didn't have `bias_term: false` we'd need to specify two `param`s, but probably the second one should be empty unless we actually want to share biases. (Specifying the empty string as a `param` is equivalent to not specifying a `param` in my implementation.)

The entire implementation is in `Net::Init`, `Net::AppendParam`, and `Net::Update`. `Init` figures out which layer will actually "own" the shared param (the first one to list its `param`), and `Update` adds the non-owned layers' computed diffs into the diff of the owner blob, then only actually performs updates on owned blobs. Memory-wise, all shared blobs actually point to the same memory location for the parameter's data, but they still have separately allocated diff blobs, as the logic to handle learning rate, weight decay, etc. is still handled by the `Solver` (which is blissfully unaware that parameters can be shared).

Open to hearing feedback on the interface, implementation, etc. I'm not sure I'm happy with `param` as the name of the field; I think it would be less ambiguous to use `param_name` or something, but that would be inconsistent with the other per-parameter field `blobs_lr` (and actually, to be consistent with that it should be `blobs_name`, but I strongly prefer the singular here).
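A minimal prototxt sketch of the sharing described above, modeled on (but not copied verbatim from) the `InitDiffDataSharedWeightsNet` test; the `num_output` value is illustrative:

```protobuf
layers {
  name: "innerproduct1"
  type: INNER_PRODUCT
  bottom: "data"
  top: "innerproduct1"
  param: "sharedweights"   # same param name => shared weights
  inner_product_param {
    num_output: 10
    bias_term: false       # only one param needed, no bias blob
  }
}
layers {
  name: "innerproduct2"
  type: INNER_PRODUCT
  bottom: "data"
  top: "innerproduct2"
  param: "sharedweights"   # shares weights with innerproduct1
  inner_product_param {
    num_output: 10
    bias_term: false
  }
}
```

Since both layers read the same bottom blob here, their tops are identical, which is only useful for testing; in practice the shared layers would consume different bottoms.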