RuntimeError: Error(s) in loading state_dict for MistralForCausalLM (Deepspeed Zero 3) #933

Open
RicardoDominguez opened this issue Dec 10, 2023 · 13 comments
Labels: bug (Something isn't working)

@RicardoDominguez (Contributor) commented Dec 10, 2023

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I fine-tune a Mistral model with the default zero3.json and the configs shown below.

Training finishes without error. Afterwards, I expect to be able to load the fine-tuned model using

 model = transformers.AutoModelForCausalLM.from_pretrained('test')

My accelerate config is

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: deepspeed/zero3.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Current behaviour

 model = transformers.AutoModelForCausalLM.from_pretrained('test')

yields the error

You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model initialized on CPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Traceback (most recent call last):
  File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 159, in <module>
    tokenizer, model = load_tokenizer_model(args.model_dir, use_flash_attention_2=True)
  File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 21, in load_tokenizer_model
    model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3756, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
	size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

and

  model = transformers.AutoModelForCausalLM.from_pretrained('test',
                                                            device_map='auto',
                                                            torch_dtype=torch.bfloat16,
                                                            trust_remote_code=True,
                                                            low_cpu_mem_usage=True)

yields the error

Traceback (most recent call last):
  File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 159, in <module>
    tokenizer, model = load_tokenizer_model(args.model_dir, use_flash_attention_2=True)
  File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 21, in load_tokenizer_model
    model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3695, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 741, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 285, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([0]) in "weight" (which has shape torch.Size([32002, 4096])), this look incorrect.

Steps to reproduce

accelerate launch -m axolotl.cli.train mistral_config.yml  --deepspeed deepspeed/zero3.json

and thereafter

 model = transformers.AutoModelForCausalLM.from_pretrained('test')

Config yaml

base_model: model_dir/mistral-7b-v0.1/
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
    - path: dset_dir/slim-orca/slim-orca.jsonl
      type: sharegpt
      ds_type: json
      conversation: chatml
      
dataset_prepared_path: prep-datasets/
val_set_size: 0
output_dir: test/
sequence_len: 8192 
sample_packing: true
pad_to_sequence_len: true

wandb_project: orca
wandb_entity:
wandb_watch:
wandb_run_id: mistral-slimorca
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 6
num_epochs: 4
optimizer: adamw_torch_fused
adam_beta1: 0.9
adam_beta2: 0.95
adam_epsilon: 0.00001
max_grad_norm: 1.0 # gradient clipping max norm
lr_scheduler: cosine
learning_rate: 0.00002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
eval_steps: 0
eval_table_size:
eval_table_max_new_tokens:
save_steps: 0.9999
debug:
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "<|im_end|>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

Possible solution

Seems related to #705 and #709
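
One possible workaround (a sketch under assumptions, not a confirmed fix): rebuild the full weights from the partitioned DeepSpeed checkpoint instead of the zero-sized tensors in the saved safetensors, using the zero_to_fp32 utilities that ship with DeepSpeed. This assumes a DeepSpeed checkpoint directory (one containing a global_step* subfolder) was saved under the output dir; the paths below are placeholders.

# Sketch only: consolidate the ZeRO-3 partitions into a full state_dict and
# re-save a checkpoint that from_pretrained can load directly.
import torch
import transformers
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "test/checkpoint-last"  # placeholder: a dir containing global_step*
output_dir = "test"                      # the axolotl output_dir with config.json

# Gather the partitioned fp32 weights into a single CPU state_dict.
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)

# Rebuild the model skeleton from the saved config and load the gathered weights.
config = transformers.AutoConfig.from_pretrained(output_dir)
model = transformers.AutoModelForCausalLM.from_config(config)
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing:", missing, "unexpected:", unexpected)

# Write a consolidated checkpoint next to the original output.
model.to(torch.bfloat16).save_pretrained("test-consolidated")

DeepSpeed also writes a standalone zero_to_fp32.py script into its checkpoint directories that performs the same consolidation from the command line.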

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main/3e3229e2d99bb509784ac72e6589f8a8e406247f

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
RicardoDominguez added the bug (Something isn't working) label on Dec 10, 2023
@winglian (Collaborator)

Are you using a model from a checkpoint folder or the output folder?

@RicardoDominguez (Contributor, Author) commented Dec 17, 2023

From the output folder

  File "<stdin>", line 1, in <module>
  File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3480, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3931, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
	size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

@RicardoDominguez (Contributor, Author)

I can confirm that I only experience this issue when using Zero3, and Zero 2 works fine.

@maxidl commented Jan 13, 2024

I can confirm that I only experience this issue when using Zero3, and Zero 2 works fine.

I just ran into the same error, can confirm switching from zero3 to zero2 "solved" the issue.

@mgoulao commented Feb 1, 2024

Using transformers @ git+https://github.com/huggingface/transformers.git@3cefac1d974db5e2825a0cb2b842883a628be7a0 seems to work.

@winglian (Collaborator) commented Feb 1, 2024

Using transformers @ git+https://github.com/huggingface/transformers.git@3cefac1d974db5e2825a0cb2b842883a628be7a0 seems to work.

@mgoulao is this a transformers regression then? That particular commit works with zero3?

@mgoulao commented Feb 2, 2024

Yes, it does work with ZeRO-3; however, you will get this problem: #1035

@luijait commented Feb 2, 2024

I had the same error; that transformers commit fixes it, but now I get this one:

new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 813, in _load_state_dict_into_meta_model
set_module_quantized_tensor_to_device(model, param_name, param_device, value=param)
File "/usr/local/lib/python3.10/dist-packages/transformers/integrations/bitsandbytes.py", line 128, in set_module_quantized_tensor_to_device
new_value = value.to(device)
NotImplementedError: Cannot copy out of meta tensor; no data!

@tcapelle (Contributor) commented Mar 9, 2024

I can confirm the same error when finetuning Mistral with chatml format and deepspeed3.

loading model
Traceback (most recent call last):
  File "/home/ubuntu/llm_recipes/scripts/push2hub.py", line 33, in <module>
    model = AutoModelForCausalLM.from_pretrained(config.model_path, torch_dtype=getattr(torch, config.torch_dtype))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3977, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
        size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
        You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

@maxidl commented Mar 9, 2024

(Quoting @tcapelle's comment above, including the identical traceback.)

The post is old; I think there is no solution: you simply cannot use QLoRA + DeepSpeed ZeRO-3. Fortunately, there is now a quite good alternative that has recently been implemented in Axolotl, which involves FSDP (full shard + QLoRA). Link

The solution I found most viable was to use a non-quantized LoRA with DeepSpeed ZeRO-3.

Apart from that, I believe that as of today there is no way to load QLoRA adapters with DeepSpeed Stage 3.

I hope I'm wrong, but all the final answers I found on the internet were basically these.

This issue is about full finetune, no lora involved.

@tcapelle (Contributor)

I am doing a full fine-tune, no QLoRA.

@0-hero (Contributor) commented Mar 13, 2024

+1 Zero3_bf16 + Full-finetune

RuntimeError: Error(s) in loading state_dict for MistralModel:
	size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32006, 4096]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

EDIT - Can confirm zero2 works

@JCRPaquin commented Apr 28, 2024

I encountered this too, although mine was with Llama 3 + ZeRO-3. The model safetensors were being written out as shards, but there was also a model.safetensors file that HF seems to load by default, even though it is not included in the index.json. Once I (re)moved the model.safetensors file, the model loaded successfully.
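
If that workaround applies, here is a minimal sketch of automating it (the output path is a placeholder, not from this issue): move the stray model.safetensors aside whenever a sharded index is also present, so that from_pretrained falls back to the shards listed in model.safetensors.index.json.

# Sketch only: rename a stray single-file model.safetensors when a sharded
# checkpoint (index + model-0000*-of-*.safetensors) exists in the same folder.
from pathlib import Path

output_dir = Path("test")  # placeholder output directory
stray = output_dir / "model.safetensors"
index = output_dir / "model.safetensors.index.json"

if stray.exists() and index.exists():
    stray.rename(output_dir / "model.safetensors.bak")
    print("Moved model.safetensors aside; loading will use the sharded files.")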
