
perplexity mismatch with GPTQ #1877

Closed

JianbangZ opened this issue Jun 15, 2023 · 10 comments
@JianbangZ

GPTQ is reporting 5.68 PPL on wikitext2 for the FP16 baseline, yet llama.cpp reports 5.9.
What explains the mismatch?

@TheBloke (Contributor) commented Jun 15, 2023

Slightly different algorithms, and probably also slightly different preparation of the dataset.

A while ago I made a perplexity calculation that 100% replicated llama.cpp's method, but for use with GPTQ and pytorch models. You can find that code here: AutoGPTQ/AutoGPTQ#70

I planned to do a perplexity comparison project with it comparing permutations of GPTQ with llama.cpp quant formats, but I still haven't finished it. But the code is still available.

One thing I found was that with wikitext, I had to slightly manipulate the dataset in order to get results that matched llama.cpp's. That's because llama.cpp loads the text straight into memory, with no processing. But in Python, when loading the dataset using the Hugging Face datasets library, it splits it into rows and some characters get stripped.

In these lines of code https://github.com/PanQiWei/AutoGPTQ/pull/70/files#diff-9724e9bf653714443b2205985e5412245d4472dc7e5367ce1531568104b663adR21-R23 I adjust the output of the wikitext dataset so that the text will exactly match the text that llama.cpp loads in its perplexity tool.

I didn't check if I needed to do the same with C4, as I primarily tested with wikitext.

But if you run the above code with wikitext, it will give you an apples-to-apples comparison with llama.cpp.
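For illustration, here is a minimal sketch of the idea in Python. The exact join/strip adjustments needed to byte-match llama.cpp's input are what the linked PR lines implement; the "\n\n" join below is only an illustrative guess, not TheBloke's actual code.

```python
from datasets import load_dataset

# HF splits wiki.test.raw into per-line rows; rebuild one contiguous string so it
# can be tokenized as a single stream, the way llama.cpp's perplexity tool does.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")

# The "\n\n" separator is an assumption for illustration; the precise adjustments
# needed to exactly reproduce llama.cpp's input are in the PR lines linked above.
text = "\n\n".join(test["text"])

with open("wiki.test.reconstructed.raw", "w", encoding="utf-8") as f:
    f.write(text)
```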

@ggerganov (Owner)

It's a bit unfortunate that llama.cpp ended up using a "non-standard" perplexity evaluation.
Wondering what would be the implications of updating the method to match the "standard" one.

@JianbangZ (Author)

> It's a bit unfortunate that llama.cpp ended up using a "non-standard" perplexity evaluation. Wondering what would be the implications of updating the method to match the "standard" one.

It would make sense to switch to a "standard" method, whatever we call it, since that is what people are generally using. There is also some discussion around the MMLU benchmark here: https://twitter.com/Tim_Dettmers/status/1666913630523367429
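For reference, the commonly used "standard" evaluation (the one the GPTQ papers and repositories report, as I understand it) tokenizes the whole test set as one stream, cuts it into non-overlapping 2048-token windows, and averages the token-level negative log-likelihood. A hedged sketch, with a placeholder model id and `text` holding the full wikitext-2 test text as reconstructed above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"   # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
).eval()

ids = tok(text, return_tensors="pt").input_ids   # `text` = full wikitext-2 test set
seqlen = 2048
nlls = []
for i in range(0, ids.shape[1] - seqlen + 1, seqlen):   # non-overlapping windows
    chunk = ids[:, i : i + seqlen].to(model.device)
    with torch.no_grad():
        # passing labels makes transformers compute the shifted cross-entropy loss
        loss = model(chunk, labels=chunk).loss
    nlls.append(loss.float() * (seqlen - 1))             # loss is a per-token mean

ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * (seqlen - 1)))
print(f"perplexity: {ppl.item():.4f}")
```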

@JianbangZ (Author)

> Slightly different algorithms, and probably also slightly different preparation of the dataset. […] But if you run the above code with wikitext, it will give you an apples-to-apples comparison with llama.cpp.

Thanks for the insights.
Did you happen to re-evaluate the GPTQ-4b-128g perplexity with your code?

@TheBloke (Contributor)

Yeah:

  • Llama 7B 4bit 128g no act-order: 6.3850
  • Llama 7B 4bit 128g with act-order: 6.0653
  • Llama 13B 4bit 128g no act-order: 5.3370
  • Llama 13B 4bit 128g with act-order: 5.3319

Note that although the group_size + act-order case gives an improved perplexity, this config is not actually used by most people. The reason is that with most current GPTQ implementations, using group_size and act-order together will significantly lower performance. So 128g without act-order is what most users are actually using when they use a 7B or 13B GPTQ.

Spreadsheet of everything I analysed is here: https://docs.google.com/spreadsheets/d/1ugN8EGlT-7rSYMBAD4dcq6TCtuL_XS1gSuOhkNA7abs/edit?usp=sharing

It's not finished. I also did quantisations for 30B and 65B, and many other GPTQ permutations like 3bit, different damp_percents (an advanced GPTQ parameter), and more.

Then I got busy with other things and never finished it off! I really should.
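For anyone reproducing these permutations, they map roughly onto AutoGPTQ's quantization config like this; the parameter names below (bits, group_size, desc_act, damp_percent) are how I understand AutoGPTQ exposes them, so treat the exact API as an assumption:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,             # the 3-bit / 4-bit permutations
    group_size=128,     # the "128g" in the model names; -1 means no grouping
    desc_act=True,      # "act-order"; False for the no-act-order variants
    damp_percent=0.01,  # the damp_percent parameter mentioned above
)

model = AutoGPTQForCausalLM.from_pretrained("huggyllama/llama-7b", quantize_config)
# model.quantize(calibration_examples)  # then quantize on e.g. wikitext2/C4 samples
```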

@JianbangZ (Author) commented Jun 16, 2023 via email

@ggerganov (Owner)

@TheBloke or anyone else

Do you know what "with act-order" means?
From what I understand, it means sorting the "activations" before quantizing them. But the trouble I have is that I don't know what "activations" refers to here. Is it some of the tensors of the input model (e.g. w1, w2), or something else?

@casper-hansen commented Jul 3, 2023

> @TheBloke or anyone else
>
> Do you know what "with act-order" means? From what I understand, it means sorting the "activations" before quantizing them. But the trouble I have is that I don't know what "activations" refers to here. Is it some of the tensors of the input model (e.g. w1, w2), or something else?

The reordering in GPTQ refers to reordering the weights so that they are quantized in an order that minimizes quantization error, which helps avoid performance degradation. It seems to be mostly a heuristic found by experiment.
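To make that concrete (this is my reading of the reference GPTQ implementation, not something established in this thread): the "activations" are the layer inputs X collected from the calibration data, and act-order means processing the weight columns in descending order of their activation energy (the diagonal of X^T X), then undoing the permutation afterwards. A minimal sketch:

```python
import torch

def act_order_permutation(X: torch.Tensor) -> torch.Tensor:
    """X: calibration layer inputs, shape (n_samples, in_features)."""
    h_diag = (X * X).sum(dim=0)   # per-column activation energy, diag(X^T X)
    return torch.argsort(h_diag, descending=True)

# Usage sketch: quantize W's columns in this order, then scatter them back.
# perm = act_order_permutation(X)
# W_q = torch.empty_like(W)
# W_q[:, perm] = quantize_columns(W[:, perm])   # quantize_columns is hypothetical
```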

I also implemented llama.cpp-style perplexity in a PR in the AutoGPTQ repository so that we can compare more officially, but it seems the maintainer has gone inactive.

AutoGPTQ/AutoGPTQ#166

@oobabooga (Contributor)

An alternative way to compare llama.cpp/AutoGPTQ perplexities would be to create a "llamacpp_HF" wrapper that turns llama.cpp into a transformers model, allowing e.g. this code to be used for the evaluation. A similar wrapper was done by @Larryvrh for ExLlama here. I briefly tried the same for llama.cpp but had no luck.
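A rough sketch of what such a wrapper could look like, using llama-cpp-python; the `logits_all`, `tokenize`, `eval`, and `scores` names are how I recall that binding's API, so treat them as assumptions rather than a verified interface:

```python
import numpy as np
from llama_cpp import Llama

class LlamaCppLogits:
    """Expose per-token logits from a GGML model so a transformers-style
    perplexity script can consume them."""

    def __init__(self, model_path: str, n_ctx: int = 2048):
        self.llm = Llama(model_path=model_path, n_ctx=n_ctx, logits_all=True)

    def logits_for(self, text: str):
        tokens = self.llm.tokenize(text.encode("utf-8"))
        self.llm.reset()
        self.llm.eval(tokens)
        # scores[i] is assumed to hold the logits predicted after token i
        return tokens, np.asarray(self.llm.scores[: len(tokens)])

# From (tokens, logits) one can compute per-token NLL and hence perplexity under
# whichever windowing convention (llama.cpp's or the "standard" one) is compared.
```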

The data that I could find was:

In the first table, we see the following for llama-13b:

Model                 Perplexity
AutoGPTQ 4bit-128g    5.3370
llama.cpp q4_1        5.3607

In the second table, we see that the +ppl for q4_1 is 0.1065, and 0.0459 for q4_K_M. Subtracting the q4_1 delta from its measured perplexity and adding the q4_K_M delta gives 5.3607 − 0.1065 + 0.0459 ≈ 5.30, so the table would expand to:

Model                 Perplexity
llama.cpp q4_K_M      5.30 (estimated)
AutoGPTQ 4bit-128g    5.3370
llama.cpp q4_1        5.3607

So llama.cpp would perform better than AutoGPTQ. It would be interesting to have this data for all possible quantizations and sizes, including in particular llama-65b with q3_K_M and q4_K_M quantizations, since those seem to be the state of the art for llama-65b inference on consumer GPUs.


This issue was closed because it has been inactive for 14 days since being marked as stale.
