Replies: 8 comments 4 replies
-
Very nice analysis. Note: for cross-model comparisons, where the training data differs, using a single test set can be very misleading. It's OK to compare between models with the same training data, but LLaMA-2 was trained on a "different" training set. E.g., just adding a little more wiki data can significantly shift the wikitext perplexity scores, so there is value in having multiple test sets evaluated.
that is VERY nice to see 🥳
this is somewhat concerning and important to know. 👀
see my note above, but it might still be a good indicator. 😄
-
My outsider observation related to this: every new paper that comes out claims better results than anything published before. They make the claims using multiple scores, so it looks really convincing. Until reality hits... Anyhow, when it comes to comparing LLaMA-1 with LLaMA-2: same architecture, with LLaMA-2 trained on the same data plus 40% extra data. What can we expect? Some improvement, but not really a dramatic improvement. Do we see this in the perplexity score? Yes, we do. LLaMA-2 was trained with a larger context size, so we expect bigger differences for larger contexts. Do we see this? Yes, we do. On the other hand, what is it that we learn from 7B LLaMA-1 having a HellaSwag result of 62.6 vs 61.4 for 7B LLaMA-2? Does it mean 7B LLaMA-1 is actually better than 7B LLaMA-2?
-
Very interesting charts!
I agree. Basic perplexity testing like you are doing here has much greater value than HellaSwag when going down to this level of detail. HellaSwag cannot properly measure different context sizes, since it scores sentences one by one. HellaSwag should be a better choice when comparing models with different architectures/tokenizers that sit at different absolute perplexity levels, since it makes use of both lower-is-better and higher-is-better probability comparisons.
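In case it helps to make that distinction concrete, here is a minimal Python sketch of what "perplexity at context size n_ctx" means (a simplification, not llama.cpp's actual perplexity code, and the log-probabilities are made up): the text is scored in windows of n_ctx tokens, so the number can react to how much context the model is given, whereas a HellaSwag-style score is tied to individual sentences.

```python
import math

def chunked_perplexity(token_logprobs, n_ctx):
    """Perplexity = exp(average negative log-likelihood) over the text,
    evaluated in non-overlapping windows of n_ctx tokens. Only the second
    half of each window is scored, so every scored token sees at least
    n_ctx/2 tokens of context."""
    nll, count = 0.0, 0
    for start in range(0, len(token_logprobs) - n_ctx + 1, n_ctx):
        window = token_logprobs[start:start + n_ctx]
        for lp in window[n_ctx // 2:]:
            nll -= lp
            count += 1
    return math.exp(nll / count)

# Made-up per-token log-probabilities standing in for real model output.
# With a real model, the log-probs (and hence the PPL) improve as n_ctx grows.
fake_logprobs = [-1.7] * 4096
print(chunked_perplexity(fake_logprobs, n_ctx=512))   # exp(1.7) ~ 5.47
print(chunked_perplexity(fake_logprobs, n_ctx=2048))  # same here, since the input is constant
```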
-
The PPLs of the LLaMA-1 7B Q6_K quantized model are 5.9110, 5.4351 and 5.2856 for context sizes 512, 1024 and 2048. The current llama.cpp code has an inconsistent rms_norm_eps for LLaMA-2, hence the estimated PPLs are higher than they could be (see #2373). For LLaMA-2 7B, the actual PPLs are 5.8120, 5.2923 and 5.1320.
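To illustrate why a normalization epsilon changes the numbers at all, here is a tiny Python sketch of RMSNorm (a simplified stand-in for the ggml implementation, and the two eps values are merely illustrative of the kind discussed in #2373): eps sits inside the normalization denominator, so evaluating with a different eps than the model was trained with slightly perturbs every layer's output, and over 32+ layers that adds up to a measurably different perplexity.

```python
import numpy as np

def rms_norm(x, weight, eps):
    # RMSNorm: x / sqrt(mean(x^2) + eps) * weight; eps is the rms_norm_eps.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4096)).astype(np.float32)
w = np.ones(4096, dtype=np.float32)

# Same activations, two different eps values (illustrative only):
out_a = rms_norm(x, w, eps=1e-6)
out_b = rms_norm(x, w, eps=1e-5)
print(np.max(np.abs(out_a - out_b)))  # small per layer, but it compounds
```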
-
OK, based on @xx205's finding, here is a new plot comparing 7B perplexities. The red LLaMA-2 curve is from Figure 1 above; the blue LLaMA-2 curve is the result of changing rms_norm_eps.
-
Here are some more updates. It turns out that using … The graphs show perplexity as a function of context size for the 7B and 13B LLaMA models, computed using different values of the …
-
@ikawrakow Would you, by any chance, be able to share the code you used to calculate perplexities and draw these charts? I need to compare two llama2-13B-Q6 models (one fine-tuned and one raw). Having an example of working code would make my life tremendously easier.
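Not the code behind the figures above, but in case a starting point helps: a minimal matplotlib sketch for comparing two models' perplexities, with the final PPL numbers taken from separate runs of the llama.cpp perplexity example. All model names and values below are placeholders.

```python
import matplotlib.pyplot as plt

# Placeholder final PPL values, one per context size, read off from
# separate perplexity runs for each model.
ctx_sizes = [512, 1024, 2048, 4096]
ppl = {
    "llama2-13B-Q6 (raw)":        [5.42, 5.01, 4.83, 4.71],
    "llama2-13B-Q6 (fine-tuned)": [5.55, 5.12, 4.90, 4.76],
}

for name, values in ppl.items():
    plt.plot(ctx_sizes, values, marker="o", label=name)

plt.xscale("log", base=2)  # context sizes double, so a log-2 x-axis spaces them evenly
plt.xlabel("context size (tokens)")
plt.ylabel("perplexity")
plt.legend()
plt.tight_layout()
plt.savefig("ppl_comparison.png")
```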
-
Hi.
-
Here are some results of LLaMA-2 perplexities for the 7B and 13B models. Before someone tells me that I should have used HellaSwag(ish) scores instead: a) many of the results were computed before HellaSwag became a thing here, and b) I basically don't see anything in HellaSwag scores that I don't already see in perplexity scores. So, here we go.
- Perplexity was evaluated beyond the respective training context using `--rope-freq-base`. At 8k tokens the difference in perplexity between LLaMA-1 and LLaMA-2 is very significant (see Figures 1 and 2).
- `Q4_1` perplexity is higher than `Q4_0`. This also affects k-quants: `Q4_K`, being "type-1" quantization like `Q4_1`, has a higher perplexity than `Q4_0` and even `Q3_K_L`. See Figure 3.
- For `Q4_0/1` and `Q5_0/1`, see Figure 4.
- LLaMA-2 `Q4_K_S` perplexity is lower than the `fp16` perplexity of LLaMA-1.

Figure 1 Perplexity as a function of context size for the LLaMA-1 (black) and LLaMA-2 (red) 7B models. Results were computed using Q6_K quantization and the --rope-freq-base option for extending beyond the respective training context size
Figure 2 Perplexity as a function of context size for the LLaMA-1 (black) and LLaMA-2 (red) 13B models. Results were computed using Q6_K quantization and the --rope-freq-base option for extending beyond the respective training context size
Figure 3 Perplexity as a function of model size for 7B LLaMA-2 using different quantizations.
Figure 4 Perplexity as a function of model size for 13B LLaMA-2 using different quantizations.
Figure 5 Comparison between 13B LLaMA-1 and LLaMA-2 perplexity for a context size of 512 using different quantizations.
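For readers wondering what `--rope-freq-base` actually changes: below is a rough Python sketch of the RoPE rotation angles (simplified, and the larger base value is made up purely for illustration). Raising the frequency base lowers all the rotation frequencies, so, roughly speaking, positions beyond the training context produce angles comparable to those the model saw during training, which is what allows evaluating perplexity out to 8k tokens in Figures 1 and 2.

```python
import numpy as np

def rope_angles(pos, head_dim=128, freq_base=10000.0):
    # RoPE rotation angles for a single position: pos * freq_base^(-2i/d)
    # for each of the d/2 frequency pairs of a head of dimension d.
    i = np.arange(head_dim // 2)
    return pos * freq_base ** (-2.0 * i / head_dim)

# Lowest-frequency angle at position 8192 with the default base versus a
# larger, made-up base: the larger base keeps the angle much smaller.
print(rope_angles(8192)[-1])
print(rope_angles(8192, freq_base=57200.0)[-1])
```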