Replies: 8 comments 4 replies
-
Very nice analysis. Note: for cross-model comparisons, where the training data differs, using a single test set can be very misleading. It's OK to compare between models with the same training data, but LLaMA-2 was trained on a "different" training set. E.g., just adding a little more wiki data can significantly shift the wikitext perplexity scores, so there is value in having multiple test sets evaluated.
that is VERY nice to see 🥳
this is somewhat concerning and important to know. 👀
see my note above, but it might still be a good indicator. 😄
-
My outsider observation related to this: every new paper that comes out claims better results than anything published before. They make the claims using multiple scores, so it looks really convincing. Until reality hits... Anyhow, when it comes to comparing LLaMA-1 with LLaMA-2: same architecture, with LLaMA-2 trained on the same data plus 40% extra data. What can we expect? Some improvement, but not really a dramatic improvement. Do we see this in the perplexity score? Yes, we do. LLaMA-2 was trained with a larger context size, so we expect bigger differences for larger contexts. Do we see this? Yes, we do. On the other hand, what is it that we learn from 7B LLaMA-1 having a HellaSwag result of 62.6 vs 61.4 for 7B LLaMA-2? Does it mean 7B LLaMA-1 is actually better than 7B LLaMA-2?
-
Very interesting charts!
I agree. Basic perplexity testing like you are doing here has much greater value than HellaSwag when going down to this level of detail. HellaSwag cannot properly measure different context sizes, since it scores sentences one by one. HellaSwag should be a better choice when comparing models with different architectures/tokenizers that sit at different absolute perplexity levels, since it makes use of both lower-is-better and higher-is-better probability comparisons.
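In case it helps to make that distinction concrete, here is a minimal Python sketch of what "perplexity at context size n_ctx" means (a simplification, not llama.cpp's actual perplexity code, and the log-probabilities are made up): the text is scored in windows of n_ctx tokens, so the number can react to how much context the model is given, whereas a HellaSwag-style score is tied to individual sentences.

```python
import math

def chunked_perplexity(token_logprobs, n_ctx):
    """Perplexity = exp(average negative log-likelihood) over the text,
    evaluated in non-overlapping windows of n_ctx tokens. Only the second
    half of each window is scored, so every scored token sees at least
    n_ctx/2 tokens of context."""
    nll, count = 0.0, 0
    for start in range(0, len(token_logprobs) - n_ctx + 1, n_ctx):
        window = token_logprobs[start:start + n_ctx]
        for lp in window[n_ctx // 2:]:
            nll -= lp
            count += 1
    return math.exp(nll / count)

# Made-up per-token log-probabilities standing in for real model output.
# With a real model, the log-probs (and hence the PPL) improve as n_ctx grows.
fake_logprobs = [-1.7] * 4096
print(chunked_perplexity(fake_logprobs, n_ctx=512))   # exp(1.7) ~ 5.47
print(chunked_perplexity(fake_logprobs, n_ctx=2048))  # same here, since the input is constant
```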
-
The PPLs of the LLaMA-1 7B Q6_K quantized model are 5.9110, 5.4351 and 5.2856 for context sizes 512, 1024 and 2048. The current llama.cpp code has an inconsistent rms_norm_eps for LLaMA-2, hence the estimated PPLs are higher than they could be (see #2373). For LLaMA-2 7B, the actual PPLs are 5.8120, 5.2923 and 5.1320.
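To illustrate why a normalization epsilon changes the numbers at all, here is a tiny Python sketch of RMSNorm (a simplified stand-in for the ggml implementation, and the two eps values are merely illustrative of the kind discussed in #2373): eps sits inside the normalization denominator, so evaluating with a different eps than the model was trained with slightly perturbs every layer's output, and over 32+ layers that adds up to a measurably different perplexity.

```python
import numpy as np

def rms_norm(x, weight, eps):
    # RMSNorm: x / sqrt(mean(x^2) + eps) * weight; eps is the rms_norm_eps.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4096)).astype(np.float32)
w = np.ones(4096, dtype=np.float32)

# Same activations, two different eps values (illustrative only):
out_a = rms_norm(x, w, eps=1e-6)
out_b = rms_norm(x, w, eps=1e-5)
print(np.max(np.abs(out_a - out_b)))  # small per layer, but it compounds
```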
-
OK, based on @xx205's finding, here is a new plot comparing 7B perplexities. The red LLaMA-2 curve is from Figure 1 above; the blue LLaMA-2 curve is the result of changing rms_norm_eps.
-
Here are some more updates. It turns out that using … The graphs show perplexity as a function of context size for the 7B and 13B LLaMA models, computed using different values of the …
-
@ikawrakow Would you, by any chance, be able to share the code you used to calculate perplexities and draw these charts? I need to compare two llama2-13B-Q6 models (one fine-tuned and one raw). Having an example of working code would make my life tremendously easier.
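Not the code behind the figures above, but in case a starting point helps: a minimal matplotlib sketch for comparing two models' perplexities, with the final PPL numbers taken from separate runs of the llama.cpp perplexity example. All model names and values below are placeholders.

```python
import matplotlib.pyplot as plt

# Placeholder final PPL values, one per context size, read off from
# separate perplexity runs for each model.
ctx_sizes = [512, 1024, 2048, 4096]
ppl = {
    "llama2-13B-Q6 (raw)":        [5.42, 5.01, 4.83, 4.71],
    "llama2-13B-Q6 (fine-tuned)": [5.55, 5.12, 4.90, 4.76],
}

for name, values in ppl.items():
    plt.plot(ctx_sizes, values, marker="o", label=name)

plt.xscale("log", base=2)  # context sizes double, so a log-2 x-axis spaces them evenly
plt.xlabel("context size (tokens)")
plt.ylabel("perplexity")
plt.legend()
plt.tight_layout()
plt.savefig("ppl_comparison.png")
```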
-
Hi.
-
Here are some results of LLaMA-2 perplexities for the 7B and 13B models. Before someone tells me that I should have used HellaSwag(ish) scores instead: a) many of the results were computed before HellaSwag became a thing here, and b) I basically don't see anything in HellaSwag scores that I don't already see in perplexity scores. So, here we go.
- Perplexity was evaluated beyond the respective training context using `--rope-freq-base`. At 8k tokens the difference in perplexity between LLaMA-1 and LLaMA-2 is very significant (see Figures 1 and 2).
- `Q4_1` perplexity is higher than `Q4_0`. This also affects k-quants: `Q4_K`, being "type-1" quantization like `Q4_1`, has a higher perplexity than `Q4_0` and even `Q3_K_L`. See Figure 3.
- For `Q4_0/1` and `Q5_0/1`, see Figure 4.
- LLaMA-2 `Q4_K_S` perplexity is lower than the `fp16` perplexity of LLaMA-1.

Figure 1 Perplexity as a function of context size for the LLaMA-1 (black) and LLaMA-2 (red) 7B models. Results were computed using Q6_K quantization and the --rope-freq-base option for extending beyond the respective training context size
Figure 2 Perplexity as a function of context size for the LLaMA-1 (black) and LLaMA-2 (red) 13B models. Results were computed using Q6_K quantization and the --rope-freq-base option for extending beyond the respective training context size
Figure 3 Perplexity as a function of model size for 7B LLaMA-2 using different quantizations.
Figure 4 Perplexity as a function of model size for 13B LLaMA-2 using different quantizations.
Figure 5 Comparison between 13B LLaMA-1 and LLaMA-2 perplexity for a context size of 512 using different quantizations.
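For readers wondering what `--rope-freq-base` actually changes: below is a rough Python sketch of the RoPE rotation angles (simplified, and the larger base value is made up purely for illustration). Raising the frequency base lowers all the rotation frequencies, so, roughly speaking, positions beyond the training context produce angles comparable to those the model saw during training, which is what allows evaluating perplexity out to 8k tokens in Figures 1 and 2.

```python
import numpy as np

def rope_angles(pos, head_dim=128, freq_base=10000.0):
    # RoPE rotation angles for a single position: pos * freq_base^(-2i/d)
    # for each of the d/2 frequency pairs of a head of dimension d.
    i = np.arange(head_dim // 2)
    return pos * freq_base ** (-2.0 * i / head_dim)

# Lowest-frequency angle at position 8192 with the default base versus a
# larger, made-up base: the larger base keeps the angle much smaller.
print(rope_angles(8192)[-1])
print(rope_angles(8192, freq_base=57200.0)[-1])
```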