Gemma 2: 9b and 27b versions #1545

Merged · 59 commits merged into main on Jul 22, 2024

Conversation

@Andrei-Aksionov (Collaborator) commented Jul 2, 2024

Hi there 👋

Fixes #1535

Google released the latest and greatest Gemma model - v2.
This time it comes in three sizes:

  • 2b (not yet released)
  • 9b
  • 27b

Based on the technical report and the official implementation, here are the main changes I've spotted (a rough sketch of a few of them follows the list):

  1. The embeddings scaler needs to be cast down before it is applied.
  2. One needs to be careful with the attention-scores scaler: it's not head_size but n_embd/n_head, and in Gemma's case head_size might not be equal to n_embd/n_head.
  3. Logit soft-capping for the attention scores and for the final logits. Soft-capping is needed mainly for training (it looks more important for the larger model) and not so much for inference. Since flash attention doesn't support soft-capping, it has to be disabled when not in training mode.
  4. Sliding window attention is used instead of a global window on every odd-indexed layer, with the window being half the full context size.
  5. RMSNorm does the downcasting right at the end, which was the behavior before I added support for Gemma v1.
  6. The transformer block now has two more normalization layers: right after the attention layer (before the residual connection) and right after the MLP (also before the residual connection). Previously we had norm -> attn -> residual -> norm -> MLP -> residual. Now it's: norm -> attn -> norm -> residual -> norm -> MLP -> norm -> residual.
  7. Both 9b and 27b use grouped-query attention. In Gemma v1, the 7b variant had regular multi-head attention, while the 2b variant had multi-query attention (a single key-value pair shared across all query heads).
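
A minimal PyTorch sketch of points 2, 3, and 6, based only on the description above. The names (softcap, attention_scores, BlockSketch, attn_softcap, norm_cls) and the default cap value are illustrative assumptions, not the identifiers or values used in this PR:

```python
import math
import torch
import torch.nn as nn


def softcap(x: torch.Tensor, cap: float) -> torch.Tensor:
    # Soft-capping squashes values into (-cap, cap) instead of hard clipping them.
    return cap * torch.tanh(x / cap)


def attention_scores(q: torch.Tensor, k: torch.Tensor, n_embd: int, n_head: int,
                     attn_softcap: float | None = 50.0) -> torch.Tensor:
    # Point 2: the scale is derived from n_embd / n_head, which for Gemma 2
    # is not necessarily equal to head_size.
    scale = 1.0 / math.sqrt(n_embd / n_head)
    scores = (q @ k.transpose(-2, -1)) * scale
    # Point 3: soft-cap the attention logits; flash attention can't do this,
    # so the fused kernel has to be skipped whenever capping is enabled.
    if attn_softcap is not None:
        scores = softcap(scores, attn_softcap)
    return scores  # the final lm_head logits are capped the same way


class BlockSketch(nn.Module):
    # Point 6: norm -> attn -> norm -> residual, then norm -> MLP -> norm -> residual.
    def __init__(self, attn: nn.Module, mlp: nn.Module, norm_cls, n_embd: int) -> None:
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.norm_1, self.post_attn_norm = norm_cls(n_embd), norm_cls(n_embd)
        self.norm_2, self.post_mlp_norm = norm_cls(n_embd), norm_cls(n_embd)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.post_attn_norm(self.attn(self.norm_1(x)))
        x = x + self.post_mlp_norm(self.mlp(self.norm_2(x)))
        return x
```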

@rasbt (Collaborator) commented Jul 2, 2024

Nice summary. I think this touches all the main points. The others (knowledge distillation for the smaller models; tied embeddings) don't affect the architecture; they're more about the pretraining method. So yeah, looks great! Many thanks for taking this on!

@rasbt (Collaborator) commented Jul 5, 2024

@Andrei-Aksionov
Sliding window attention (an ugly one, but hey, it works)

Cool! We can also add that to the existing Mistral/Mixtral models then 😊

@Andrei-Aksionov (Collaborator, Author)

Cool! We can also add that to the existing Mistral/Mixtral models then

I believe only Mistral v0.1 used sliding window attention; the subsequent models from Mistral.ai don't use it.
But after this PR is merged, adding SWA would be just a matter of an additional line in a config.

@rasbt (Collaborator) commented Jul 5, 2024

I believe only Mistral v0.1 used sliding window attention; the subsequent models from Mistral.ai don't use it.

I think you are right.

But after this PR is merged, adding SWA would be just a matter of an additional line in a config.

Nice!

@Andrei-Aksionov (Collaborator, Author) commented Jul 6, 2024

Gemma 2 9b/9b-it now has initial support (with a lot of “scaffolding”).

Generation returns plausible results, but chat does a couple of strange things:

  1. OOM. I don't understand why, since the regular generate script consumes ~20 GB.
    Update: it's not a Gemma-specific problem (Chat consumes more VRAM than Generate #1558), so it's not a blocker.
  2. I had to use quantization (bnb.nf4), and the model was very restrictive: it often didn't want to respond and instead asked to rephrase the question. I know that Llama 3, because of its very long training, has saturated the bf16 dtype up to the very last digit, so quantization affects it more than other models. Maybe we have the same thing here (thanks to a “teacher”)? 🤷
    Update #1: if I use the generate script with quantization, I get proper output. Something else is broken in the chat script, besides the higher memory consumption.
    Update #2: the KV-cache needs to be changed to support sliding window attention. The chat script pre-allocates too much memory (up to model.max_seq_length), so a layer with a sliding window ends up with a wrongly sized KV-cache (see the sketch after this list).
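
As a rough illustration of update #2 (the function name, arguments, and shape numbers here are hypothetical, not the actual litgpt KV-cache code), a sliding-window layer only ever needs to cache as many positions as the window holds:

```python
import torch


def kv_cache_shape(batch_size: int, n_kv_heads: int, head_size: int,
                   max_seq_length: int, sliding_window_size: int | None) -> tuple[int, int, int, int]:
    # A globally-attending layer may need up to max_seq_length cached positions,
    # but a sliding-window layer only attends within the window, so pre-allocating
    # max_seq_length for it both wastes memory and mismatches the layer's
    # expectation of how its cache is laid out.
    cache_len = max_seq_length if sliding_window_size is None else min(max_seq_length, sliding_window_size)
    return (batch_size, n_kv_heads, cache_len, head_size)


# Example: a global layer vs. a sliding-window layer (window of 4096 tokens)
k_global = torch.zeros(kv_cache_shape(1, 8, 256, 8192, None))   # (1, 8, 8192, 256)
k_local = torch.zeros(kv_cache_shape(1, 8, 256, 8192, 4096))    # (1, 8, 4096, 256)
```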

Anyway, there is a lot of work that needs to be done (besides what I've mentioned above) before I can open this PR for a review:

  • 1. Only final_softcapping affects tests. Need to make tests fail if attention_logit_softcapping is messed up.
  • 2. Deal with all TODOs. The code works, but is very ugly and non-performant.
  • 3. Use the torch profiler to make sure that there are no shady device syncs happening in the background.
  • 4. Add code and a test for LitGPT --> HF format conversion.
  • 5. Does torch.compile work with soft-capping? If not, a clear error message needs to be printed.
  • 6. Do a short training as a sanity-check.
  • 7. Figure out what to do with CausalSelfAttention from adapter.py. Tests for the adapter don't fail because of item 1.
  • 8. Add support for 27b variant.

@rasbt (Collaborator) left a review comment

This is a great PR. It's crazy that you pulled this off. Really awesome.

Andrei-Aksionov and others added 9 commits on July 19, 2024 (several co-authored by Sebastian Raschka <mail@sebastianraschka.com>).

@Andrei-Aksionov (Collaborator, Author) commented Jul 19, 2024

One more thing. Due to time constraints, I didn't test the Gemma v2 27b version.
Tests are running fine, but it would be nice to check the generated output.

@rasbt could you do this?

@rasbt (Collaborator) commented Jul 19, 2024

One more thing. Due to time constraints, I didn't test the Gemma v2 27b version.
Tests are running fine, but it would be nice to check the generated output.

@rasbt could you do this?

Yes, I am happy to do this. I will also generate config files for the smaller models.

@rasbt (Collaborator) commented Jul 19, 2024

Works great!

[screenshot: generation output]

@rasbt (Collaborator) commented Jul 22, 2024

Based on the config file run, the train and val loss look great. The MMLU score is surprisingly low, though. There's nothing wrong with the finetuned model, and it works fine during chat:

[screenshot: chat session with the finetuned model]

(Not 100% sure, but maybe the MMLU scores in the README were created with --num_fewshot greater than 1.)

Anyway, everything else seems to be fine, so it's good to merge now, right?

@Andrei-Aksionov (Collaborator, Author)

Yep, let's merge.

@rasbt (Collaborator) commented Jul 22, 2024

Awesome, this is great! Thanks for this amazing PR!

@rasbt merged commit 916a84c into main on Jul 22, 2024 (9 checks passed) and deleted the gemma_2 branch.