About the dtype of trainable params #1249
Comments
Could you please provide the training code and the full error? |
I used a modified version of https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_clm.py. Here is the training code: https://gist.github.com/hiyouga/361bc114960672115446050857895dbb. The major differences are the model loading dtype (L448) and the LoRA adapters (L462):

```python
torch_dtype=torch.float16
```

```python
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, lora_config)
```

Diff against the original script: https://www.diffchecker.com/auUsf6ZO/

We ran this script by:

Full error:
|
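As a quick check (a minimal sketch, not part of the original report), the dtype of the trainable parameters can be inspected right after `get_peft_model`, assuming `model` is the PEFT model built as in the snippet above:

```python
# Sketch: list the trainable parameters and their dtypes, assuming `model`
# is the PEFT model created in the snippet above.
for name, param in model.named_parameters():
    if param.requires_grad:
        print(name, param.dtype)
```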
Thanks for providing the example. I could reproduce the error, but it was not related to PEFT. Using the normal script, without PEFT, and only with |
Thanks for replying. This problem is also not related to the Trainer. Generally, the trainable params should be in float32 in order to perform mixed precision training. The default dtype of the PEFT adapters remains float16 if the base model was loaded in float16, so we cannot directly use these adapters in fp16 training (but we can use them in bf16 training). See peft/src/peft/tuners/lora/layer.py, lines 96 to 102 at commit 21c304f.

Coincidentally, if we use [...] (peft/src/peft/tuners/lora/layer.py, lines 99 to 103 at commit 21c304f)
|
Yes, you are correct. What I meant is that when using
Yes, good point, this function is not only useful for QLoRA, even though the name might suggest so. |
I thought the trainer could not handle the model dtype for LoRA training. By default, the model is loaded in 32-bit precision, which consumes a large amount of GPU memory (e.g. a 7B model requires 28GB):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("llama2-7b", low_cpu_mem_usage=True, device_map="cuda")
print(model.dtype)
# torch.float32
```

However, during PEFT training we want the model to be loaded directly in 16-bit or lower precision. In this case it is necessary to explicitly specify the precision type in order to save GPU memory (e.g. a 7B model requires 16GB).

Reproduction

Training code: https://gist.github.com/hiyouga/5b139f3d4d41a6cc49382c9e79e177ea

Without torch_dtype:

```
CUDA_VISIBLE_DEVICES=0 python run_clm.py --model_name_or_path llama2-7b --low_cpu_mem_usage True --train_file wikipedia.json --block_size 128 --output_dir test --do_train --per_device_train_batch_size 1 --fp16
```

GPU memory used: 33GB

With torch_dtype=float16:

```
CUDA_VISIBLE_DEVICES=0 python run_clm.py --model_name_or_path llama2-7b --low_cpu_mem_usage True --train_file wikipedia.json --block_size 128 --output_dir test --do_train --per_device_train_batch_size 1 --fp16 --torch_dtype float16
```

GPU memory used: 17GB |
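For contrast, a minimal sketch of loading the same model directly in float16 (assuming the local `llama2-7b` path used above):

```python
import torch
from transformers import AutoModelForCausalLM

# Explicitly request float16 weights at load time; this roughly halves the
# memory needed compared with the default float32 load shown above.
model = AutoModelForCausalLM.from_pretrained(
    "llama2-7b", torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="cuda"
)
print(model.dtype)
# torch.float16
```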
I was able to do float16 finetuning with peft==0.6.2. Are there any dependency changes that might lead to this issue? |
Yes, there is. In PEFT 0.6.2, the LoRA weights were created in float32 even when the base model was loaded in float16: https://github.com/huggingface/peft/blob/v0.6.2/src/peft/tuners/lora/layer.py#L63-L89 (peft/src/peft/tuners/lora/layer.py, lines 82 to 89 at commit 32357c2).

In PEFT 0.7.0, the adapter weights instead follow the dtype of the base layer: https://github.com/huggingface/peft/blob/v0.7.0/src/peft/tuners/lora/layer.py#L74-L103 (peft/src/peft/tuners/lora/layer.py, lines 96 to 103 at commit 2665f80).
|
@hiyouga Thanks for the explanation! So the float16 finetuning capability in peft 0.6.2 is actually a bug that got fixed in 0.7.0 🥲 |
A related issue: #1090 |
Could you try whether applying this to the PEFT model works after loading the model in 16 bit:

```python
def cast_lora_to_float(model):
    for name, mod in model.named_modules():
        if ("lora_" in name) and hasattr(mod, "weight"):
            mod.weight.data = mod.weight.data.float()
        if ("lora_" in name) and hasattr(mod, "bias") and (mod.bias is not None):
            mod.bias.data = mod.bias.data.float()
```
|
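A hypothetical way to verify the effect of this helper (assuming `model` is the float16-loaded PEFT model from earlier in the thread):

```python
cast_lora_to_float(model)
# The LoRA weights should now be float32; parameters added via modules_to_save
# are not touched by this version (see the extended helper further down).
print({p.dtype for p in model.parameters() if p.requires_grad})
```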
@BenjaminBossan It works when |
Yes, I forgot about it:

```python
from torch import nn

def cast_lora_to_float(model):
    for name, mod in model.named_modules():
        if ("lora_" in name) and hasattr(mod, "weight"):
            mod.weight.data = mod.weight.data.float()
        if ("lora_" in name) and hasattr(mod, "bias") and (mod.bias is not None):
            mod.bias.data = mod.bias.data.float()
        if ("modules_to_save" in name) and isinstance(mod, nn.Linear):
            mod.weight.data = mod.weight.data.float()
            if mod.bias is not None:
                mod.bias.data = mod.bias.data.float()
```
|
@BenjaminBossan It gives
Because I use |
Ah yes, please add an |
I prefer to use:

```python
for param in filter(lambda p: p.requires_grad, model.parameters()):
    param.data = param.data.to(torch.float32)
```
|
That should work too. It is a bit more coarse-grained, so there could be model architectures where this modifies data that it shouldn't, but for most cases it should be fine. |
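Putting the pieces of this thread together, here is a hedged end-to-end sketch of the workaround (reusing the LoraConfig and the local `llama2-7b` path from earlier as assumptions):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Load the base model in float16 to keep memory usage low.
model = AutoModelForCausalLM.from_pretrained(
    "llama2-7b", torch_dtype=torch.float16, low_cpu_mem_usage=True
)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
model = get_peft_model(model, lora_config)

# Upcast only the trainable parameters (LoRA weights and the modules_to_save
# copies) to float32 so that fp16 mixed precision training can unscale the
# gradients; the frozen base weights stay in float16.
for param in filter(lambda p: p.requires_grad, model.parameters()):
    param.data = param.data.to(torch.float32)
```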
A similar discussion here: huggingface/transformers#28142 |
Some users ran into the issue of trying to use a model loaded in float16 with mixed precision, e.g. these issues: #341, #1249. This PR documents a workaround to solve the issue. I also added tests that demonstrate the issue, as well as the workaround.

Notes

This is not strictly a PEFT issue, but more a general error when using AMP with float16. Still, since PEFT users encounter this sometimes, it is useful to document it. When we discussed this issue in the past, I think we concluded that it's not as straightforward as PEFT automatically casting the weights to float32, though I cannot remember anymore what the drawbacks were. In any case, should we ever add an automatic solution for this in PEFT, the added test should fail, which alerts us to the fact that we need to update the documentation.
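To illustrate that this is a general AMP limitation rather than something PEFT-specific, here is a minimal sketch (assuming a CUDA device) that reproduces the same error without PEFT or the Trainer:

```python
import torch

# A tiny float16 model: all trainable parameters are fp16.
model = torch.nn.Linear(8, 8).cuda().half()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(4, 8, device="cuda", dtype=torch.float16)
loss = model(x).float().mean()

scaler.scale(loss).backward()
# Unscaling fp16 gradients is not supported, so this raises:
# ValueError: Attempting to unscale FP16 gradients.
scaler.step(optimizer)
```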
System Info
peft 0.7.0
transformers 4.34.0
torch 2.0.1
Who can help?
@pacman100 @younesbelkada @sayakpaul
Information
Tasks
examples folder
Reproduction
Expected behavior
If we load the model in half precision and use fp16 mixed precision training, it throws "ValueError: Attempting to unscale FP16 gradients."
Should we manually cast the trainable params to float32?