Gemma 2: 9b and 27b versions #1545
Conversation
Nice summary. I think this touches all the main points. The others (knowledge distillation for the small models; tied embeddings) would not affect the architecture; it's more of a pretraining method. So yeah, looks great! Many thanks for taking this on!
Cool! We can also add that to the existing Mistral/Mixtral models then 😊
I believe only Mistral v0.1 is supported.
I think you are right.
Nice!
Gemma 2 9b/9b-it now has initial support (with a lot of "scaffolding"). Generation returns plausible results, but chat does a couple of strange things:
Anyway, there is a lot of work that needs to be done (besides what I've mentioned above) before I can open this PR for review:
This is a great PR. It's crazy that you pulled this off. Really awesome.
Co-authored-by: Sebastian Raschka <mail@sebastianraschka.com>
One more thing. Due to time constraints, I didn't test the Gemma v2 27b version. @rasbt, could you do this?
> @rasbt could you do this?

Yes, I am happy to do this. The other thing is I will also generate config files for the smaller models.
Yep, let's merge.
Awesome, this is great! Thanks for this amazing PR!
Hi there 👋
Fixes #1535
Google released the latest and greatest Gemma model: v2.
This time it comes in three sizes:
Based on the technical report and the official implementation, here are the main changes that I've spotted:
- Head size: the code previously assumed `n_embd/n_head`, but in the case of Gemma, `head_size` might not be equal to `n_embd/n_head`.
- Logit soft-capping. Since `flash attention` doesn't support soft-capping, it needs to be disabled if not in training mode.
- Post-normalization added around each sub-layer. Before: `norm` -> `attn` -> `residual` -> `norm` -> `MLP` -> `residual`. Now: `norm` -> `attn` -> `norm` -> `residual` -> `norm` -> `MLP` -> `norm` -> `residual`.
- `9b` and `27b` use grouped-query attention. In Gemma v1, `7b` had regular multi-head attention, while the `2b` variant had multi-query attention (a single key-value pair is shared across all query heads).
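For anyone skimming, the soft-capping mentioned above can be sketched in a few lines (a minimal pure-Python sketch, not the actual implementation; the cap values in the comment are the ones I recall being reported for Gemma 2, so treat them as an assumption):

```python
import math

def softcap(value: float, cap: float) -> float:
    """Soft-capping as described for Gemma 2: smoothly squashes
    the input into the open interval (-cap, cap) via tanh."""
    return cap * math.tanh(value / cap)

# The Gemma 2 report reportedly uses cap=50.0 for attention logits
# and cap=30.0 for the final output logits.
```

Because it is applied inside the attention score computation, a fused flash-attention kernel that has no hook for this extra tanh can't reproduce it, which is why it has to be disabled on that code path.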
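The new normalization ordering can be illustrated with a schematic block forward pass (a hand-written sketch under my reading of the list above; the `attn`, `mlp`, and norm arguments are stand-ins for the real modules, not litgpt's actual signatures):

```python
def gemma2_block(x, attn, mlp, norms):
    """Schematic Gemma 2 block: each sub-layer is wrapped in both a
    pre-norm and a post-norm before the residual addition."""
    norm1, norm2, norm3, norm4 = norms
    x = x + norm2(attn(norm1(x)))  # norm -> attn -> norm -> residual
    x = x + norm4(mlp(norm3(x)))   # norm -> MLP -> norm -> residual
    return x
```

The difference from the previous (pre-norm only) layout is the extra `norm2`/`norm4` applied to each sub-layer's output before it is added back to the residual stream.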
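One way to see the MHA/MQA/GQA distinction from the last bullet is the query-to-KV-head mapping (a minimal sketch; the `n_head`/`n_query_groups` naming follows what I believe litgpt's config uses, so treat the names as an assumption):

```python
def kv_head_for_query(q_head: int, n_head: int, n_query_groups: int) -> int:
    """Index of the key/value head that serves a given query head.

    n_query_groups == n_head     -> multi-head attention (Gemma v1 7b)
    n_query_groups == 1          -> multi-query attention (Gemma v1 2b)
    1 < n_query_groups < n_head  -> grouped-query attention (Gemma 2 9b/27b)
    """
    return q_head // (n_head // n_query_groups)
```

With grouped-query attention, consecutive query heads share one key/value head, shrinking the KV cache by a factor of `n_head / n_query_groups`.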