
On the issue of Continuous Fine-tuning #82

Closed
Gary2018X opened this issue May 22, 2024 · 20 comments

@Gary2018X

Thanks for your work.
I would like to know which works better: continuous fine-tuning, or fine-tuning on multiple instruction sets at once?

@Gary2018X
Author

I tried it, but I ran into an error while merging the models:

Traceback (most recent call last):
  File "/Bunny/script/merge_lora_weights.py", line 26, in <module>
    merge_lora(args)
  File "/Bunny/script/merge_lora_weights.py", line 10, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
  File "/Bunny/bunny/model/builder.py", line 58, in load_pretrained_model
    model = BunnyQwen2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained,
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3447, in from_pretrained
    no_split_modules = model._get_no_split_modules(device_map)
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1769, in _get_no_split_modules
    raise ValueError(
ValueError: SiglipVisionModel does not support `device_map='auto'`. To implement support, the model class needs to implement the `_no_split_modules` attribute.

How should I solve this?
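
For reference (this isn't discussed further in the thread), the error message itself suggests a possible workaround: the vision model class needs a _no_split_modules attribute before device_map='auto' can shard it, or you can avoid device_map='auto' entirely. A minimal sketch, assuming the SiglipVisionModel being loaded is the Hugging Face SigLIP implementation (Bunny may import its own bundled copy, in which case that class would need patching instead):

# Hedged sketch, not from this thread. Assumption: the SiglipVisionModel in use is the
# Hugging Face one and SiglipEncoderLayer is the block that must not be split.
from transformers import SiglipVisionModel

# Tell the device-map machinery which block must stay on a single device.
SiglipVisionModel._no_split_modules = ["SiglipEncoderLayer"]

# Alternative: skip sharding entirely, e.g. by not passing device_map="auto"
# (or by passing device_map=None) to from_pretrained.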

@Isaachhh
Collaborator

What merging command did you use?

@Gary2018X
Author

python script/merge_lora_weights.py \
    --model-path ./checkpoints-qwen1.5-1.8b/bunny-lora-qwen1.5-1.8b \
    --model-base ./models/Qwen1.5-1.8B \
    --model-type qwen1.5-1.8b \
    --save-model-path ./models/model

@Gary2018X
Author

model config

{
  "_name_or_path": "./models/Qwen1.5-1.8B",
  "architectures": [
    "BunnyQwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
    "auto_map": {
    "AutoConfig": "configuration_bunny_qwen2.BunnyQwen2Config",
    "AutoModelForCausalLM": "modeling_bunny_qwen2.BunnyQwen2ForCausalLM"
  },
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "freeze_mm_mlp_adapter": false,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "image_aspect_ratio": "pad",
  "initializer_range": 0.02,
  "intermediate_size": 5504,
  "max_position_embeddings": 32768,
  "max_window_layers": 21,
  "mm_hidden_size": 1152,
  "mm_projector_lr": 2e-05,
  "mm_projector_type": "mlp2x_gelu",
  "mm_vision_tower": "./models/siglip-so400m-patch14-384",
  "model_type": "bunny-qwen2",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "num_key_value_heads": 16,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "tokenizer_model_max_length": 2048,
  "tokenizer_padding_side": "right",
  "torch_dtype": "float16",
  "transformers_version": "4.39.1",
  "tune_mm_mlp_adapter": false,
  "use_cache": true,
  "use_mm_proj": true,
  "use_sliding_window": false,
  "continuous_training":true,
  "vocab_size": 151646
}

train.sh

#!/bin/bash

MODEL_TYPE=qwen1.5-1.8b

PRETRAIN_DIR=bunny-$MODEL_TYPE-pretrain
OUTPUT_DIR=bunny-lora-ct-$MODEL_TYPE

mkdir -p ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR

deepspeed bunny/train/train.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./script/deepspeed/zero3.json \
    --model_name_or_path ./models/merged_model \
    --model_type $MODEL_TYPE \
    --version bunny \
    --data_path ./data/Bunny.json \
    --image_folder ./data/image \
    --vision_tower ./models/siglip-so400m-patch14-384 \
    --mm_projector_type mlp2x_gelu \
    --image_aspect_ratio pad \
    --group_by_modality_length False \
    --bf16 True \
    --output_dir ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to none | tee 2>&1 ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR/log.txt

@Gary2018X
Author

When I updated Transformers to the latest version, I got a new error:

Traceback (most recent call last):
  File "./Bunny/script/merge_lora_weights.py", line 26, in <module>
    merge_lora(args)
  File "./Bunny/script/merge_lora_weights.py", line 10, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
  File "./Bunny/bunny/model/builder.py", line 58, in load_pretrained_model
    model = BunnyQwen2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained,
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4214, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 887, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([151936, 2048]) in "weight" (which has shape torch.Size([151646, 2048])), this look incorrect.

@Gary2018X
Author

I figured out why the error occurred.
When merging the model after training is completed, the base model (--model-base) should be specified as ./models/merged_model instead of Qwen1.5-1.8B.

@Isaachhh
Collaborator

Great!

And I realized that when evaluating the final (continuously trained) model, continuous_training should be set to false.
Please pay attention to 28e761d.
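
A minimal sketch (not the actual evaluation code) of flipping that flag in the merged model's config.json before evaluation; the path below assumes the --save-model-path used in the merge command above:

# Hedged sketch: set continuous_training to false in the merged model's config before
# evaluating. The path is an assumption based on the merge command earlier in the thread.
import json

cfg_path = "./models/model/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["continuous_training"] = False

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)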

@Gary2018X
Author

Ok, thank you very much for your patient answer

@Gary2018X
Author

Thanks for your work. I would like to know which works better: continuous fine-tuning, or fine-tuning on multiple instruction sets at once?

Is there any answer to this question?

@Gary2018X
Author

I conducted an experiment:
First, I used dataset A and prompt A to continuously fine-tune Bunny v1.0-2B-zh and obtained model A.
The result is 0.5% worse than direct instruction fine-tuning, which is acceptable to me.

Then I used dataset B and prompt B to continuously fine-tune model A and obtained model B.
Next, I used model B to evaluate tasks A and B separately. The result on task B is 5 points worse than direct fine-tuning, and the ability on task A is basically lost.

This is also worse than directly fine-tuning on multiple instruction sets at once:
when fine-tuning on tasks A and B together, task A shows no difference compared to fine-tuning on A alone, while task B is 10% worse than fine-tuning on B alone.

Is there any trick you can suggest? Or is my approach not appropriate?

@basteran

When I updated Transformers to the latest version, I got a new error:

Traceback (most recent call last):
  File "./Bunny/script/merge_lora_weights.py", line 26, in <module>
    merge_lora(args)
  File "./Bunny/script/merge_lora_weights.py", line 10, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
  File "./Bunny/bunny/model/builder.py", line 58, in load_pretrained_model
    model = BunnyQwen2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained,
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4214, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 887, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([151936, 2048]) in "weight" (which has shape torch.Size([151646, 2048])), this look incorrect.

Hi, I have the same issue, but with a different size because of the pad_token_id:
ValueError: Trying to set a tensor of shape torch.Size([128257, 4096]) in "weight" (which has shape torch.Size([128256, 4096])), this look incorrect.

How did you solve it?

@Isaachhh
Collaborator

@basteran Does it relate to the eos_token_id of llama-3? #75

@basteran

No, it is related to a different implementation that introduces the pad_token_id, as here.

I managed to train the model with LoRA, and now I want to merge the adapters back, but I get the error above.

@Isaachhh
Collaborator

Isaachhh commented Jun 2, 2024

@basteran We didn't try to expand the vocabulary, so we may not be able to help you with that.

@basteran

basteran commented Jun 3, 2024

@basteran We didn't try to expand the vocabulary, so we may not be able to help you with that.

What do you mean you didn't try to expand the vocabulary? I see these lines in your code. Aren't you adding the new pad_token_id to the vocabulary and overwriting the old one if it is not defined? Am I missing something?

Thanks for the help!

@Isaachhh
Collaborator

Isaachhh commented Jun 3, 2024

@basteran
What you mentioned is the code for evaluation, not training.
During training, we just use 128001 <|end_of_text|> as the padding token, as here. But LLaVA++ seems to define a new token <pad> as the padding token here. So the vocabulary size of Bunny should be 128256, while that of LLaVA++ should be 128257 (128256 + <pad>).
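
To illustrate the difference (a rough sketch, not code from either repository), assuming the stock meta-llama/Meta-Llama-3-8B tokenizer:

# Rough illustration only; assumes the stock Llama-3 tokenizer from Hugging Face.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Bunny-style: reuse an existing token as padding; the vocabulary stays at 128256.
tok.pad_token = "<|end_of_text|>"   # already id 128001 in the vocabulary
print(len(tok))                     # 128256

# LLaVA++-style: add a brand-new <pad> token; the vocabulary grows to 128257, so the
# embedding matrix must be resized and the extra row saved with the weights.
tok.add_special_tokens({"pad_token": "<pad>"})
print(len(tok))                     # 128257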

@basteran

basteran commented Jun 3, 2024

OK, I got it. So you set the pad_token_id only at "run" time, but you don't save it in the vocabulary.

Thank you very much for the help! Now I understand what's going on. I am considering switching to your Bunny repository instead of LLaVA++ 😄

@Isaachhh
Collaborator

Isaachhh commented Jun 3, 2024

@basteran
Well, we may need to distinguish between a token and a token_id.

When training and running, Bunny uses an existing token, <|end_of_text|>, as the padding token. I'm not sure whether I can "save it": since <|end_of_text|> is already 128001 in the vocabulary, could I define a new token 128257 that is also <|end_of_text|>?

So I just pick an existing token to serve as the padding token, without modifying the tokenizer much.

@Isaachhh
Collaborator

Isaachhh commented Jul 6, 2024

I conducted an experiment: First, I used dataset A and prompt A to continuously fine-tune Bunny v1.0-2B-zh and obtained model A. The result is 0.5% worse than direct instruction fine-tuning, which is acceptable to me.

Then I used dataset B and prompt B to continuously fine-tune model A and obtained model B. Next, I used model B to evaluate tasks A and B separately. The result on task B is 5 points worse than direct fine-tuning, and the ability on task A is basically lost.

This is also worse than directly fine-tuning on multiple instruction sets at once: when fine-tuning on tasks A and B together, task A shows no difference compared to fine-tuning on A alone, while task B is 10% worse than fine-tuning on B alone.

Is there any trick you can suggest? Or is my approach not appropriate?

@Gary2018X

Different kinds of data, and the fraction of each, interact in complex and comprehensive ways, so it's hard to give a simple principle. The performance may depend on the knowledge area of each kind of data and on how the kinds conflict or cooperate. Whether the vision tower is unfrozen, and the hyper-parameters, may also matter.

From my own perspective, fine-tuning on multiple instruction sets at once (e.g. Bunny-695K + your own data) may be better.
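
For example, a rough sketch of preparing such a mixed dataset (file names are placeholders; this assumes both files are plain JSON lists in the format --data_path expects):

# Rough sketch with placeholder file names: concatenate Bunny-695K with your own
# instruction data into a single file, assuming both are plain JSON lists of samples.
import json

with open("./data/bunny_695k.json") as f:
    mixed = json.load(f)
with open("./data/my_task_data.json") as f:
    mixed += json.load(f)

with open("./data/Bunny.json", "w") as f:   # the --data_path used in train.sh above
    json.dump(mixed, f, ensure_ascii=False, indent=2)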

@Isaachhh
Collaborator

Closing the issue for now since there's no further discussion. Feel free to reopen it if there are any other questions.
