Add LoRA support to HQQ Quantization #1618

Merged: 39 commits into huggingface:main on May 3, 2024

Conversation

fahadh4ilyas
Contributor

No description provided.

@fahadh4ilyas fahadh4ilyas mentioned this pull request Apr 3, 2024
@BenjaminBossan
Member

Thanks a lot for opening this PR. Would you be so kind as to add some tests? It should be enough to copy and paste some existing tests and slightly adjust them for HQQ; see e.g. this one.

@fahadh4ilyas
Contributor Author

@BenjaminBossan is this okay? Sorry if my code is kinda a mess.

@BenjaminBossan
Member

is this okay? Sorry if my code is kinda a mess.

Looks great, thanks, not messy at all -- or at least, not messier than the rest of the PEFT code ;-)

I'll give this a test tomorrow and will be back with a proper review.

@BenjaminBossan (Member) left a comment

Thanks a lot @fahadh4ilyas for this contribution. It already looks quite good and feature-complete, and from my testing it also seems to generally work. There are still a few corners that need improvement. Please check my comments. On top of that, I have some more general comments:

Code style

Could you please always run make style on your changes before committing?

DoRA

Thanks for also taking DoRA into consideration. However, currently, it fails because DoRA searches for self.get_base_layer().weight. We can add a check there to search for W_q instead. Below is the diff that worked for me:

modified   src/peft/tuners/lora/layer.py
@@ -183,9 +183,11 @@ class LoraLayer(BaseTunerLayer):
         lora_A = self.lora_A[adapter_name]
         lora_B = self.lora_B[adapter_name]
         scaling = self.scaling[adapter_name]
-        with gather_params_ctx(self.get_base_layer()):
-            weight = self.get_base_layer().weight
-            quant_state = getattr(self.get_base_layer(), "state", None)
+        base_layer = self.get_base_layer()
+        with gather_params_ctx(base_layer):
+            # W_q is for HQQ
+            weight = base_layer.W_q if hasattr(base_layer, "W_q") else base_layer.weight
+            quant_state = getattr(base_layer, "state", None)
             weight = dequantize_bnb_weight(weight, state=quant_state)  # no-op if not bnb
             if weight.data.ndim == 4:  # For handling LoRAs applied to Conv2Ds.
                 lora_weight = torch.mm(lora_B.weight.flatten(start_dim=1), lora_A.weight.flatten(start_dim=1))
@@ -212,7 +214,9 @@ class LoraLayer(BaseTunerLayer):
         """
         lora_weight = lora_B.weight @ lora_A.weight
         magnitude = self.lora_magnitude_vector[active_adapter]
-        weight = self.get_base_layer().weight
+        base_layer = self.get_base_layer()
+        # W_q is for HQQ
+        weight = base_layer.W_q if hasattr(base_layer, "W_q") else base_layer.weight
         quant_state = getattr(self.get_base_layer(), "state", None)
         weight = dequantize_bnb_weight(weight, state=quant_state)  # no-op if not bnb
         weight = weight.to(x.dtype)

Additionally, right now we only dequantize bnb weights for DoRA, but this needs to be extended to also handle HQQ weights; check the function dequantize_bnb_weight in utils/integrations.py. Of course, if we add HQQ there, the function should also be renamed to something more generic. This new function could then also be used inside the unmerge method.
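
As a rough, hypothetical sketch (not the code in this PR), such a generic helper could dispatch on the layer type. The HQQ branch assumes hqq's HQQLinear exposes a dequantize() method, which may differ between hqq versions, and the helper name is only a placeholder:

# hypothetical helper; dequantize_bnb_weight is the existing bnb helper in utils/integrations.py
import torch
from torch import nn

from peft.utils.integrations import dequantize_bnb_weight


def dequantize_module_weight(module: nn.Module) -> torch.Tensor:
    """Return the dequantized weight of a (possibly quantized) linear layer."""
    if hasattr(module, "W_q"):
        # HQQ layer: assumes HQQLinear provides a dequantize() method
        return module.dequantize()

    # regular or bitsandbytes layer: no-op for plain weights, dequantizes bnb params
    quant_state = getattr(module, "state", None)
    return dequantize_bnb_weight(module.weight, state=quant_state)

Call sites in layer.py (including unmerge) could then use this helper instead of dequantize_bnb_weight directly.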

Calculating the number of params

Do we need to update get_nb_trainable_parameters? I don't know how HQQ packs the weights, but I get very different results when I print this for a normal model vs an HQQ model. For instance, using the unit test you added:

>>> # LoRA normal model
>>> model.print_trainable_parameters()
trainable params: 2,252,800 || all params: 1,102,301,184 || trainable%: 0.20437245579516677
>>> # HQQ LoRA model
>>> model.print_trainable_parameters()
trainable params: 2,252,800 || all params: 617,859,072 || trainable%: 0.36461389046335796
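
If this should be corrected, one illustrative option, purely a sketch, would be to special-case packed HQQ layers when summing parameter counts. It assumes hqq's HQQLinear keeps the original weight shape under meta["shape"], which may not be the actual attribute name:

# hypothetical parameter counter that unpacks HQQ shapes; not part of this PR
import math

from torch import nn


def count_all_parameters(model: nn.Module) -> int:
    total = 0
    for module in model.modules():
        if hasattr(module, "W_q"):
            # HQQ packs several low-bit values per stored element, so W_q.numel()
            # undercounts; use the original (unpacked) shape from the metadata
            total += math.prod(module.meta["shape"])
            if getattr(module, "bias", None) is not None:
                total += module.bias.numel()
        else:
            # count only parameters defined directly on this module to avoid
            # double counting across the module hierarchy
            total += sum(p.numel() for p in module.parameters(recurse=False))
    return total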

Docs

Let's add an entry to the docs here.

Docker

To ensure that HQQ is run in our GPU tests, we need to install it in the corresponding Dockerfile. Could you please add it here? For me, for some reason, my PyTorch version was downgraded when I installed HQQ, but maybe that's just an issue with my personal environment. Did you observe the same thing?

Additional testing

I wrote an additional test to check for merging/unmerging/unloading/saving+loading. Could you please add it?

    def test_hqq_lora_model_outputs(self):
        # check that the outputs generated by HQQ with LoRA are similar to those without HQQ
        from hqq.engine.hf import HQQModelForCausalLM
        from hqq.core.quantize import BaseQuantizeConfig

        device = 'cuda'
        compute_dtype = torch.float16

        # first load the model without HQQ
        model = AutoModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            device_map=device,
            torch_dtype=compute_dtype,
        )
        config = LoraConfig(
            target_modules=["q_proj", "v_proj"],
            task_type="CAUSAL_LM",
            init_lora_weights=False,
        )
        torch.manual_seed(0)
        model = get_peft_model(model, config).eval()
        inputs = self.tokenizer("The meaning of unit tests is", return_tensors="pt").to(model.device)

        with torch.inference_mode():
            output_normal = model(**inputs).logits
        assert torch.isfinite(output_normal).all()

        del model
        gc.collect()
        torch.cuda.empty_cache()

        # now load with HQQ
        model = HQQModelForCausalLM.from_pretrained(
            self.causal_lm_model_id,
            device_map=device,
            torch_dtype=compute_dtype,
        )
        quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
        model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype, device=device)
        torch.manual_seed(0)
        model = get_peft_model(model, config).eval()
        with torch.inference_mode():
            output_hqq = model(**inputs).logits

        # check that outputs of HQQ are highly correlated; there are outliers, so don't check for equality
        cc_matrix = torch.corrcoef(torch.stack((output_normal.flatten(), output_hqq.flatten())))
        assert cc_matrix.min() > 0.97

        # check that outputs are the same after merging
        model.merge_adapter()
        with torch.inference_mode():
            output_merged = model(**inputs).logits
        cc_matrix = torch.corrcoef(torch.stack((output_normal.flatten(), output_merged.flatten())))
        assert cc_matrix.min() > 0.97

        # check outputs are the same after unmerging
        model.unmerge_adapter()
        with torch.inference_mode():
            output_unmerged = model(**inputs).logits
        cc_matrix = torch.corrcoef(torch.stack((output_normal.flatten(), output_unmerged.flatten())))
        assert cc_matrix.min() > 0.97

        # check that the results are the same after saving and loading
        with tempfile.TemporaryDirectory() as tmp_dir:
            model.save_pretrained(tmp_dir)
            del model
            gc.collect()
            torch.cuda.empty_cache()

            model = HQQModelForCausalLM.from_pretrained(
                self.causal_lm_model_id,
                device_map=device,
                torch_dtype=compute_dtype,
            )
            quant_config = BaseQuantizeConfig(nbits=4, group_size=64)
            model.quantize_model(quant_config=quant_config, compute_dtype=compute_dtype, device=device)
            model = PeftModel.from_pretrained(model, tmp_dir)
            with torch.inference_mode():
                output_loaded = model(**inputs).logits

            # for loading, we expect high precision, so check for equality and not just correlation
            atol, rtol = 1e-6, 1e-6
            assert torch.allclose(output_hqq, output_loaded, atol=atol, rtol=rtol)

        # check that outputs are the same after merge_and_unload
        model = model.merge_and_unload()
        with torch.inference_mode():
            output_merged_unloaded = model(**inputs).logits
        cc_matrix = torch.corrcoef(torch.stack((output_normal.flatten(), output_merged_unloaded.flatten())))
        assert cc_matrix.min() > 0.97

Quantization error

While working on the additional tests, I plotted the LoRA outputs of the normal model vs the HQQ model used in the test. They seem to be highly correlated, but there are big deviations for individual data points:
[plot: normal-model outputs vs HQQ-model outputs]

>>> torch.testing.assert_close(output_normal, output_hqq)
*** AssertionError: Tensor-likes are not close!
Mismatched elements: 223632 / 224000 (99.8%)
Greatest absolute difference: 3.56396484375 at index (0, 3, 1141) (up to 1e-05 allowed)
Greatest relative difference: 41549.37890625 at index (0, 0, 15916) (up to 1.3e-06 allowed)

Probably this is fine; I just wanted to check whether this is in line with expectations.

(Inline review comments on src/peft/tuners/lora/hqq.py and tests/test_gpu_examples.py; since resolved.)
@fahadh4ilyas
Contributor Author

@BenjaminBossan I'm done fixing it. Could you please test it again?

@fahadh4ilyas
Contributor Author

Quantization error

While working on the additional tests, I plotted the LoRA outputs of the normal model vs the HQQ model used in the test. They seem to be highly correlated, but there are big deviations for individual data points. [...] Probably this is fine; I just wanted to check whether this is in line with expectations.

I was wondering: the LoRA that you used for testing, was it trained with the base model or with the HQQ-quantized model? If the latter, do other quantization methods also suffer from this kind of error?

@BenjaminBossan (Member) left a comment

Thanks for the additional changes. This already looks pretty good; there are only a few small things left, please check my comments.

Furthermore, could you please ensure that make style was run on your code? Also make sure that the ruff version is 0.2.2.

Finally, it would be great to have an entry in the docs for this new quantization method.

(Inline review comments on src/peft/tuners/lora/hqq.py, tests/test_gpu_examples.py, and src/peft/tuners/lora/layer.py; since resolved.)
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@BenjaminBossan
Member

@fahadh4ilyas Could you please run ruff again? We just need an extra new line before dispatch_hqq.

@fahadh4ilyas
Contributor Author

@BenjaminBossan it's done. I think I have to get used to running make style on every commit.

@BenjaminBossan
Member

I think I have to get used to running make style on every commit.

You may want to set up a pre-commit hook as explained here.
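
For illustration, a minimal .git/hooks/pre-commit script could look like the sketch below. This is only an assumption about one possible setup; the contributing guide linked above may recommend something different (e.g. the pre-commit framework). The file needs to be made executable (chmod +x .git/hooks/pre-commit).

#!/usr/bin/env python
# hypothetical .git/hooks/pre-commit hook: run the repo's style target and abort
# the commit if it fails or if it modified any tracked files, so the reformatted
# files can be reviewed and staged before committing again
import subprocess
import sys

if subprocess.run(["make", "style"]).returncode != 0:
    sys.exit(1)

# git diff --quiet exits with 1 if there are unstaged changes in tracked files
sys.exit(subprocess.run(["git", "diff", "--quiet"]).returncode)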

@BenjaminBossan (Member) left a comment

Thanks a lot for adding HQQ. From my point of view, this looks good to be merged. Waiting for @younesbelkada's review.

@younesbelkada (Contributor) left a comment

Nice work! It looks very clean overall! I just have minor comments:
1- Can you update the tests so that they leverage AutoModelForCausalLM once huggingface/transformers#29637 is merged (see the sketch after this list)?
2- Can you also add a few lines in the documentation mentioning HQQ?
3- Could you add hqq to the Docker image we use for testing: https://github.com/huggingface/peft/blob/main/docker/peft-gpu/Dockerfile
Thanks a lot!
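
For reference, once that transformers PR is in, loading the HQQ-quantized base model could look roughly like the following. This is only a sketch: it assumes transformers exposes an HqqConfig accepting nbits and group_size and passed via quantization_config, and the model id is a placeholder.

# hypothetical sketch: load an HQQ-quantized model via transformers instead of
# hqq.engine.hf.HQQModelForCausalLM (relies on huggingface/transformers#29637)
import torch
from transformers import AutoModelForCausalLM, HqqConfig

from peft import LoraConfig, get_peft_model

quant_config = HqqConfig(nbits=4, group_size=64)  # mirrors BaseQuantizeConfig in the test above
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-125m",  # placeholder model id
    device_map="cuda",
    torch_dtype=torch.float16,
    quantization_config=quant_config,
)

lora_config = LoraConfig(target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()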

(Inline review comments on src/peft/tuners/lora/layer.py and src/peft/import_utils.py; since resolved.)
@thincal

thincal commented Apr 23, 2024

Any progress so far? Amazing work on the HQQ support; we'd like to give this feature a try.

@BenjaminBossan
Member

Now that huggingface/transformers#29637 has been merged, let's not forget about this PR. There are some merge conflicts due to the EETQ PR, but they should be easy to fix.

@fahadh4ilyas
Contributor Author

@BenjaminBossan and @younesbelkada please check the current update. Is it okay if the test checking the compatibility of transformers with HQQ first?

@younesbelkada
Contributor

Thanks so much @fahadh4ilyas!
Can you elaborate? What exactly do you mean by

Is it okay if the test checking the compatibility of transformers with HQQ first?

@younesbelkada (Contributor) left a comment

Great work, thank you! I just left one comment about the tests.
I also made huggingface/transformers#30632, which we can merge after this PR gets merged.

(Inline review comments on tests/test_gpu_examples.py and src/peft/tuners/lora/layer.py; since resolved.)
@mobicham

mobicham commented May 3, 2024

Awesome work @fahadh4ilyas 🙏!

@younesbelkada (Contributor) left a comment

Looks good, thanks! One last comment left: can you make sure the tests you designed pass with transformers built from main?

(Inline review comment on tests/test_gpu_examples.py; since resolved.)
@fahadh4ilyas
Contributor Author

Looks good, thanks! One last comment left: can you make sure the tests you designed pass with transformers built from main?

Sorry for my stupid mistake. Here is the revision.

@younesbelkada (Contributor) left a comment

Great work, thanks!

@younesbelkada younesbelkada merged commit 02ae6bc into huggingface:main May 3, 2024
16 checks passed