Noted that this scaling is in a different place than the scaling described in "Attention is all you need".
When rereading the paper, I noticed this hidden in the Section 3.4 (Embeddings and Softmax) text:
"In the embedding layers, we multiply those weights by √dmodel."
tutorials/beginner_source/translation_transformer.py, line 135 (at commit 5e772fa):
src = self.embedding(src) * math.sqrt(self.d_model)
Shouldn't this be
src = self.embedding(src) / math.sqrt(self.d_model)
At least, that is the impression I got when reading the "Attention is all you need" paper.
Or is there some new research finding that multiplying is better?
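For context, here is a minimal sketch of what the multiplicative scaling looks like as a standalone embedding module. The `TokenEmbedding` name, the `emb_size` argument, and the usage lines are illustrative assumptions, not necessarily the tutorial's exact code:

```python
import math
import torch
from torch import nn


class TokenEmbedding(nn.Module):
    """Embedding layer that scales its output by sqrt(d_model).

    Sketch of the multiplicative scaling from Section 3.4 of
    "Attention is all you need": "In the embedding layers, we
    multiply those weights by sqrt(d_model)."
    """

    def __init__(self, vocab_size: int, emb_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Multiply (not divide) by sqrt(d_model), as quoted above.
        return self.embedding(tokens) * math.sqrt(self.emb_size)


# Hypothetical usage:
# emb = TokenEmbedding(vocab_size=10000, emb_size=512)
# src = emb(torch.tensor([[1, 2, 3]]))  # shape: (1, 3, 512)
```

Note that this is separate from the 1/√dk scaling inside scaled dot-product attention; the paper applies that division in the attention scores, not in the embedding layers.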
cc @sekyondaMeta @svekars @kit1980 @subramen @albanD