
On the issue of Continuous Fine-tuning #82

Closed
Gary2018X opened this issue May 22, 2024 · 20 comments

@Gary2018X

Thanks for your work.
I would like to know which works better: continuous fine-tuning, or fine-tuning on multiple instruction sets at once?

@Gary2018X
Author

I tried it, but I ran into an error while merging the models:

Traceback (most recent call last):
  File "/Bunny/script/merge_lora_weights.py", line 26, in <module>
    merge_lora(args)
  File "/Bunny/script/merge_lora_weights.py", line 10, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
  File "/Bunny/bunny/model/builder.py", line 58, in load_pretrained_model
    model = BunnyQwen2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained,
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3447, in from_pretrained
    no_split_modules = model._get_no_split_modules(device_map)
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1769, in _get_no_split_modules
    raise ValueError(
ValueError: SiglipVisionModel does not support `device_map='auto'`. To implement support, the model class needs to implement the `_no_split_modules` attribute.

How should I solve this?
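
For reference (this isn't discussed further in the thread), the error message itself suggests a possible workaround: the vision model class needs a _no_split_modules attribute before device_map='auto' can shard it, or you can avoid device_map='auto' entirely. A minimal sketch, assuming the SiglipVisionModel being loaded is the Hugging Face SigLIP implementation (Bunny may import its own bundled copy, in which case that class would need patching instead):

# Hedged sketch, not from this thread. Assumption: the SiglipVisionModel in use is the
# Hugging Face one and SiglipEncoderLayer is the block that must not be split.
from transformers import SiglipVisionModel

# Tell the device-map machinery which block must stay on a single device.
SiglipVisionModel._no_split_modules = ["SiglipEncoderLayer"]

# Alternative: skip sharding entirely, e.g. by not passing device_map="auto"
# (or by passing device_map=None) to from_pretrained.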

@Isaachhh
Collaborator

What merging command did you use?

@Gary2018X
Author

python script/merge_lora_weights.py \
    --model-path ./checkpoints-qwen1.5-1.8b/bunny-lora-qwen1.5-1.8b \
    --model-base ./models/Qwen1.5-1.8B \
    --model-type qwen1.5-1.8b \
    --save-model-path ./models/model

@Gary2018X
Author

model config

{
  "_name_or_path": "./models/Qwen1.5-1.8B",
  "architectures": [
    "BunnyQwen2ForCausalLM"
  ],
  "attention_dropout": 0.0,
    "auto_map": {
    "AutoConfig": "configuration_bunny_qwen2.BunnyQwen2Config",
    "AutoModelForCausalLM": "modeling_bunny_qwen2.BunnyQwen2ForCausalLM"
  },
  "bos_token_id": 151643,
  "eos_token_id": 151643,
  "freeze_mm_mlp_adapter": false,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "image_aspect_ratio": "pad",
  "initializer_range": 0.02,
  "intermediate_size": 5504,
  "max_position_embeddings": 32768,
  "max_window_layers": 21,
  "mm_hidden_size": 1152,
  "mm_projector_lr": 2e-05,
  "mm_projector_type": "mlp2x_gelu",
  "mm_vision_tower": "./models/siglip-so400m-patch14-384",
  "model_type": "bunny-qwen2",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "num_key_value_heads": 16,
  "rms_norm_eps": 1e-06,
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "tokenizer_model_max_length": 2048,
  "tokenizer_padding_side": "right",
  "torch_dtype": "float16",
  "transformers_version": "4.39.1",
  "tune_mm_mlp_adapter": false,
  "use_cache": true,
  "use_mm_proj": true,
  "use_sliding_window": false,
  "continuous_training":true,
  "vocab_size": 151646
}

train.sh

#!/bin/bash

MODEL_TYPE=qwen1.5-1.8b

PRETRAIN_DIR=bunny-$MODEL_TYPE-pretrain
OUTPUT_DIR=bunny-lora-ct-$MODEL_TYPE

mkdir -p ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR

deepspeed bunny/train/train.py \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --deepspeed ./script/deepspeed/zero3.json \
    --model_name_or_path ./models/merged_model \
    --model_type $MODEL_TYPE \
    --version bunny \
    --data_path ./data/Bunny.json \
    --image_folder ./data/image \
    --vision_tower ./models/siglip-so400m-patch14-384 \
    --mm_projector_type mlp2x_gelu \
    --image_aspect_ratio pad \
    --group_by_modality_length False \
    --bf16 True \
    --output_dir ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 500 \
    --save_total_limit 1 \
    --learning_rate 2e-4 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to none | tee 2>&1 ./checkpoints-$MODEL_TYPE/$OUTPUT_DIR/log.txt

@Gary2018X
Author

When I updated Transformers to the latest version, I got a new error:

Traceback (most recent call last):
  File "./Bunny/script/merge_lora_weights.py", line 26, in <module>
    merge_lora(args)
  File "./Bunny/script/merge_lora_weights.py", line 10, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
  File "./Bunny/bunny/model/builder.py", line 58, in load_pretrained_model
    model = BunnyQwen2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained,
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4214, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 887, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([151936, 2048]) in "weight" (which has shape torch.Size([151646, 2048])), this look incorrect.

@Gary2018X
Author

I figured out why the error occurred.
When merging the model after training is completed, the base model (--model-base) should be specified as ./models/merged_model instead of Qwen1.5-1.8B.

@Isaachhh
Collaborator

Great!

And I realized that when evaluating the final (continuously trained) model, continuous_training should be set to false.
Please pay attention to 28e761d.
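
A minimal sketch (not the actual evaluation code) of flipping that flag in the merged model's config.json before evaluation; the path below assumes the --save-model-path used in the merge command above:

# Hedged sketch: set continuous_training to false in the merged model's config before
# evaluating. The path is an assumption based on the merge command earlier in the thread.
import json

cfg_path = "./models/model/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)

cfg["continuous_training"] = False

with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)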

@Gary2018X
Author

Ok, thank you very much for your patient answer

@Gary2018X
Author

Thanks for your work. I would like to know which works better: continuous fine-tuning, or fine-tuning on multiple instruction sets at once?

Is there any answer to this question?

@Gary2018X
Author

I conducted an experiment:
First, I used dataset A and prompt A to continuously fine-tune Bunny v1.0-2B-zh and obtained model A.
The result is 0.5% worse than direct instruction fine-tuning, which is acceptable to me.

Then I used dataset B and prompt B to continuously fine-tune model A and obtained model B.
Next, I used model B to evaluate tasks A and B separately. The result on task B is 5 points worse than direct fine-tuning, and the ability on task A is basically lost.

This is also worse than directly fine-tuning on multiple instruction sets at once:
when fine-tuning on tasks A and B together, task A shows no difference compared to fine-tuning on A alone, while task B is 10% worse than fine-tuning on B alone.

Is there any trick you can suggest? Or is my approach not appropriate?

@basteran

When I updated Transformers to the latest version, I got a new error:

Traceback (most recent call last):
  File "./Bunny/script/merge_lora_weights.py", line 26, in <module>
    merge_lora(args)
  File "./Bunny/script/merge_lora_weights.py", line 10, in merge_lora
    tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, args.model_base, model_name,
  File "./Bunny/bunny/model/builder.py", line 58, in load_pretrained_model
    model = BunnyQwen2ForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained,
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 4214, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/lib/python3.9/site-packages/transformers/modeling_utils.py", line 887, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 348, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([151936, 2048]) in "weight" (which has shape torch.Size([151646, 2048])), this look incorrect.

Hi, I have the same issue, but with a different size because of the pad_token_id:
ValueError: Trying to set a tensor of shape torch.Size([128257, 4096]) in "weight" (which has shape torch.Size([128256, 4096])), this look incorrect.

How did you solve it?

@Isaachhh
Collaborator

@basteran Does it relate to the eos_token_id of llama-3? #75

@basteran

No, it is related to a different implementation that introduces the pad_token_id, as here.

I managed to train the model with LoRA, and now I want to merge the adapters back, but I get the error above.

@Isaachhh
Collaborator

Isaachhh commented Jun 2, 2024

@basteran We didn't try to expand the vocabulary, so we may not be able to help you with that.

@basteran

basteran commented Jun 3, 2024

@basteran We didn't try to expand the vocabulary, so we may not be able to help you with that.

What do you mean you didn't try to expand the vocabulary? I see these lines in your code. Aren't you adding the new pad_token_id to the vocabulary and overwriting the old one if it is not defined? Am I missing something?

Thanks for the help!

@Isaachhh
Collaborator

Isaachhh commented Jun 3, 2024

@basteran
What you mentioned is the code for evaluation, not training.
During training, we just use 128001 <|end_of_text|> as the padding token, as here. But LLaVA++ seems to define a new token <pad> as the padding token here. So the vocabulary size of Bunny should be 128256, while that of LLaVA++ should be 128257 (128256 + <pad>).
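
To illustrate the difference (a rough sketch, not code from either repository), assuming the stock meta-llama/Meta-Llama-3-8B tokenizer:

# Rough illustration only; assumes the stock Llama-3 tokenizer from Hugging Face.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

# Bunny-style: reuse an existing token as padding; the vocabulary stays at 128256.
tok.pad_token = "<|end_of_text|>"   # already id 128001 in the vocabulary
print(len(tok))                     # 128256

# LLaVA++-style: add a brand-new <pad> token; the vocabulary grows to 128257, so the
# embedding matrix must be resized and the extra row saved with the weights.
tok.add_special_tokens({"pad_token": "<pad>"})
print(len(tok))                     # 128257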

@basteran

basteran commented Jun 3, 2024

OK, I got it. So you set the pad_token_id only at "run" time, but you don't save it in the vocabulary.

Thank you very much for the help! Now I understand what's going on. I am considering switching to your Bunny repository instead of LLaVA++ 😄

@Isaachhh
Collaborator

Isaachhh commented Jun 3, 2024

@basteran
Well, we may need to distinguish between a token and a token_id.

When training and running, Bunny uses an existing token, <|end_of_text|>, as the padding token. I'm not sure whether I can "save it": since <|end_of_text|> is already 128001 in the vocabulary, could I define a new token 128257 that is also <|end_of_text|>?

So I just pick an existing token to serve as the padding token, without modifying the tokenizer much.

@Isaachhh
Collaborator

Isaachhh commented Jul 6, 2024

I conducted an experiment: First, I used dataset A and prompt A to continuously fine-tune Bunny v1.0-2B-zh and obtained model A. The result is 0.5% worse than direct instruction fine-tuning, which is acceptable to me.

Then I used dataset B and prompt B to continuously fine-tune model A and obtained model B. Next, I used model B to evaluate tasks A and B separately. The result on task B is 5 points worse than direct fine-tuning, and the ability on task A is basically lost.

This is also worse than directly fine-tuning on multiple instruction sets at once: when fine-tuning on tasks A and B together, task A shows no difference compared to fine-tuning on A alone, while task B is 10% worse than fine-tuning on B alone.

Is there any trick you can suggest? Or is my approach not appropriate?

@Gary2018X

Different kinds of data, and the fraction of each, interact in complex and comprehensive ways, so it's hard to give a simple principle. The performance may depend on the knowledge area of each kind of data and on how the kinds conflict or cooperate. Whether the vision tower is unfrozen, and the hyper-parameters, may also matter.

From my own perspective, fine-tuning on multiple instruction sets at once (e.g. Bunny-695K + your own data) may be better.
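
For example, a rough sketch of preparing such a mixed dataset (file names are placeholders; this assumes both files are plain JSON lists in the format --data_path expects):

# Rough sketch with placeholder file names: concatenate Bunny-695K with your own
# instruction data into a single file, assuming both are plain JSON lists of samples.
import json

with open("./data/bunny_695k.json") as f:
    mixed = json.load(f)
with open("./data/my_task_data.json") as f:
    mixed += json.load(f)

with open("./data/Bunny.json", "w") as f:   # the --data_path used in train.sh above
    json.dump(mixed, f, ensure_ascii=False, indent=2)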

@Isaachhh
Collaborator

Closing the issue for now since there's no further discussion. Feel free to reopen it if there are any other questions.
