
raise Exception (the loss increases to NaN) when quantizing DeepSeek-V2-Chat with the new version of AutoAWQ at sub-iteration (18/60) #535

Open
BinFuPKU opened this issue Jul 9, 2024 · 4 comments

Comments

BinFuPKU commented Jul 9, 2024

At the start, it runs well!

It raises an exception when the quantization process reaches 30% (quantizing DeepSeek-V2-Chat with AutoAWQ takes about 1~2 hours).

The loss increases to NaN in this sub-iteration (18/60), which is abnormal.

```
AWQ:  30%|███       | 18/60 [4:07:51<10:14:05, 877.27s/it]
...
Computing Loss (loss: nan):   0%|          | 0/1 [00:00<?, ?it/s]
Grid Search (Best: -1):  90%|█████████ | 18/20 [00:19<00:02, 1.02s/it]
Grid Search (Best: -1):  95%|█████████▌| 19/20 [00:19<00:01, 1.02s/it]
Computing Loss:   0%|          | 0/1 [00:00<?, ?it/s]
Computing Loss (loss: nan):   0%|          | 0/1 [00:00<?, ?it/s]
Grid Search (Best: -1):  95%|█████████▌| 19/20 [00:20<00:01, 1.02s/it]
Grid Search (Best: -1): 100%|██████████| 20/20 [00:20<00:00, 1.02s/it]
AWQ:  30%|███       | 18/60 [4:08:27<9:39:44, 828.20s/it]
Traceback (most recent call last):
  File "/home/xiaoi/dq/fubin/alignment/Quantization.py", line 14, in <module>
    model.quantize(tokenizer, quant_config=quant_config)
  File "/opt/nlp/anaconda3/envs/moe_new_2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/xiaoi/dq/download/AutoAWQ-main/awq/models/base.py", line 230, in quantize
    self.quantizer.quantize()
  File "/home/xiaoi/dq/download/AutoAWQ-main/awq/quantize/quantizer.py", line 166, in quantize
    scales_list = [
  File "/home/xiaoi/dq/download/AutoAWQ-main/awq/quantize/quantizer.py", line 167, in <listcomp>
    self._search_best_scale(self.modules[i], **layer)
  File "/opt/nlp/anaconda3/envs/moe_new_2/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/xiaoi/dq/download/AutoAWQ-main/awq/quantize/quantizer.py", line 332, in _search_best_scale
    best_scales = self._compute_best_scale(
  File "/home/xiaoi/dq/download/AutoAWQ-main/awq/quantize/quantizer.py", line 413, in _compute_best_scale
    raise Exception
Exception
```

Maybe there is a bug in the `AwqQuantizer` class?
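For context, the `raise Exception` at the end of the traceback fires when the grid search over scale ratios never finds a valid candidate: every candidate loss is NaN, so the best index stays at its initial -1 (matching the `Grid Search (Best: -1)` lines in the log). Below is a simplified, hypothetical sketch of that failure mode; the function name and loop shape are illustrative, not AutoAWQ's actual code:

```python
import math

def search_best_ratio(losses):
    """Pick the grid index with the lowest finite loss.

    Mirrors the failure mode in the log above: if every candidate
    loss is NaN, no index is ever selected and best_ratio stays -1.
    """
    best_ratio = -1
    best_loss = float("inf")
    for i, loss in enumerate(losses):
        if math.isnan(loss):  # NaN never compares less-than, so skip it
            continue
        if loss < best_loss:
            best_loss = loss
            best_ratio = i
    if best_ratio == -1:
        # Same terminal state as the traceback: no usable scale was found
        raise Exception("no valid scale found: all candidate losses were NaN")
    return best_ratio
```

With at least one finite candidate this returns normally, e.g. `search_best_ratio([float("nan"), 0.5, 0.3])` returns 2; with all-NaN losses it raises, which is the state the quantizer hit at sub-iteration 18/60.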

@casper-hansen (Owner)

Hi @BinFuPKU, thanks for raising the issue. I will need to investigate further to find the cause, but I can see it will not be easy to debug since the model is so large. Do you have any smaller models in which you have observed NaN loss values?

@WanBenLe (Contributor)

For large-parameter LLMs, increasing the length of individual texts in the calibration dataset can help avoid this problem, but it is still recommended to adjust the calibration data to fit the model.
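One way to follow that advice is to pre-filter the calibration corpus so only sufficiently long samples are used. The sketch below is illustrative: the `min_words` threshold and the sample corpus are made up, and whether your installed AutoAWQ version accepts a custom calibration list (e.g. via a `calib_data` argument to `model.quantize`) should be checked against its signature:

```python
def filter_calib_texts(texts, min_words=64):
    """Keep only calibration samples with at least `min_words`
    whitespace-separated tokens; very short samples give the quantizer
    little signal and (per the comment above) can contribute to NaN losses.
    """
    return [t for t in texts if len(t.split()) >= min_words]

# Hypothetical corpus: a mix of short and long samples
corpus = ["hi there", "lorem ipsum " * 100]
long_only = filter_calib_texts(corpus, min_words=64)
```

If your AutoAWQ version supports it, the filtered list can then be passed to the quantize call from the traceback, e.g. `model.quantize(tokenizer, quant_config=quant_config, calib_data=long_only)`.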

@Kk1984up

I got the same issue when quantizing qwen2-7b-chat using AutoAWQ version 0.2.3. How can I fix it?

@casper-hansen (Owner)

@Kk1984up try upgrading to the newest version
