
Transformer tutorial multiplying with sqrt(d_model) #2849

Closed
RogerJL opened this issue Apr 27, 2024 · 3 comments
Comments

RogerJL commented Apr 27, 2024

return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

src = self.embedding(src) * math.sqrt(self.d_model)

shouldn't this be

src = self.embedding(src) / math.sqrt(self.d_model)

At least, that was the impression I got when reading the "Attention Is All You Need" paper.
Or is there some new research finding that multiplying is better?
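
For context, a minimal sketch of the kind of embedding module the quoted lines come from (the class and argument names here are illustrative, not necessarily the tutorial's exact code):

```python
import math
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Token embedding that scales its output, as in the quoted tutorial lines."""
    def __init__(self, vocab_size: int, emb_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Embedding lookup, then scale by sqrt(d_model) -- this is the factor in question.
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)

emb = TokenEmbedding(vocab_size=1000, emb_size=512)
out = emb(torch.randint(0, 1000, (2, 7)))  # shape: (2, 7, 512)
```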

cc @sekyondaMeta @svekars @kit1980 @subramen @albanD

@svekars added the intro and core (Tutorials of any level of difficulty related to the core pytorch functionality) labels on May 15, 2024
@sekyondaMeta added the easy and docathon-h1-2024 labels and removed the intro and core labels on Jun 4, 2024
@IvanLauLinTiong

/assigntome


kit1980 commented Jun 10, 2024

Looking online, there are some discussions, and it looks like * is actually correct:
https://stackoverflow.com/questions/56930821/why-does-embedding-vector-multiplied-by-a-constant-in-transformer-model

I'm closing this; if someone more knowledgeable wants to correct me, please re-open.

@kit1980 kit1980 closed this as completed Jun 10, 2024

RogerJL commented Jun 11, 2024

I noted that this scaling sits in a different place than the scaling I had in mind from "Attention Is All You Need" (the 1/√d_k inside attention). When rereading the paper, I noticed this, hidden in the Section 3.4 "Embeddings and Softmax" text:

"In the embedding layers, we multiply those weights by √dmodel."

Closing is correct!
