RuntimeError: Error(s) in loading state_dict for MistralForCausalLM (Deepspeed Zero 3) #933

Open
RicardoDominguez opened this issue Dec 10, 2023 · 13 comments
Labels: bug (Something isn't working)

@RicardoDominguez (Contributor) commented Dec 10, 2023

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I fine-tune a Mistral model with the default zero3.json and the configs shown below.

Training finishes without error. Afterwards, I expect to be able to load the fine-tuned model using

 model = transformers.AutoModelForCausalLM.from_pretrained('test')

My accelerate config is

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_config_file: deepspeed/zero3.json
  zero3_init_flag: false
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Current behaviour

 model = transformers.AutoModelForCausalLM.from_pretrained('test')

yields the error

You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model initialized on CPU. Make sure to move the model to GPU after initializing it on CPU with `model.to('cuda')`.
Traceback (most recent call last):
  File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 159, in <module>
    tokenizer, model = load_tokenizer_model(args.model_dir, use_flash_attention_2=True)
  File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 21, in load_tokenizer_model
    model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3756, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
	size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

and

  model = transformers.AutoModelForCausalLM.from_pretrained('test',
                                                            device_map='auto',
                                                            torch_dtype=torch.bfloat16,
                                                            trust_remote_code=True,
                                                            low_cpu_mem_usage=True)

yields the error

Traceback (most recent call last):
  File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 159, in <module>
    tokenizer, model = load_tokenizer_model(args.model_dir, use_flash_attention_2=True)
  File "/lustre/home/rolmedo/lllm/longcontext/evaluate_winning_single.py", line 21, in load_tokenizer_model
    model = transformers.AutoModelForCausalLM.from_pretrained(model_name,
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
    return model_class.from_pretrained(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3695, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/transformers/modeling_utils.py", line 741, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/home/rolmedo/miniconda3/envs/tf34/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 285, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([0]) in "weight" (which has shape torch.Size([32002, 4096])), this look incorrect.

Steps to reproduce

accelerate launch -m axolotl.cli.train mistral_config.yml  --deepspeed deepspeed/zero3.json

and thereafter

 model = transformers.AutoModelForCausalLM.from_pretrained('test')

Config yaml

base_model: model_dir/mistral-7b-v0.1/
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
    - path: dset_dir/slim-orca/slim-orca.jsonl
      type: sharegpt
      ds_type: json
      conversation: chatml
      
dataset_prepared_path: prep-datasets/
val_set_size: 0
output_dir: test/
sequence_len: 8192 
sample_packing: true
pad_to_sequence_len: true

wandb_project: orca
wandb_entity:
wandb_watch:
wandb_run_id: mistral-slimorca
wandb_log_model:

gradient_accumulation_steps: 1
micro_batch_size: 6
num_epochs: 4
optimizer: adamw_torch_fused
adam_beta1: 0.9
adam_beta2: 0.95
adam_epsilon: 0.00001
max_grad_norm: 1.0 # gradient clipping max norm
lr_scheduler: cosine
learning_rate: 0.00002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 100
eval_steps: 0
eval_table_size:
eval_table_max_new_tokens:
save_steps: 0.9999
debug:
deepspeed:
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "<|im_end|>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

Possible solution

Seems related to #705 and #709
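
One possible workaround (a sketch under assumptions, not a confirmed fix): rebuild the full weights from the partitioned DeepSpeed checkpoint instead of the zero-sized tensors in the saved safetensors, using the zero_to_fp32 utilities that ship with DeepSpeed. This assumes a DeepSpeed checkpoint directory (one containing a global_step* subfolder) was saved under the output dir; the paths below are placeholders.

# Sketch only: consolidate the ZeRO-3 partitions into a full state_dict and
# re-save a checkpoint that from_pretrained can load directly.
import torch
import transformers
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

checkpoint_dir = "test/checkpoint-last"  # placeholder: a dir containing global_step*
output_dir = "test"                      # the axolotl output_dir with config.json

# Gather the partitioned fp32 weights into a single CPU state_dict.
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)

# Rebuild the model skeleton from the saved config and load the gathered weights.
config = transformers.AutoConfig.from_pretrained(output_dir)
model = transformers.AutoModelForCausalLM.from_config(config)
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing:", missing, "unexpected:", unexpected)

# Write a consolidated checkpoint next to the original output.
model.to(torch.bfloat16).save_pretrained("test-consolidated")

DeepSpeed also writes a standalone zero_to_fp32.py script into its checkpoint directories that performs the same consolidation from the command line.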

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main/3e3229e2d99bb509784ac72e6589f8a8e406247f

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
RicardoDominguez added the bug (Something isn't working) label on Dec 10, 2023
@winglian (Collaborator)

Are you using a model from a checkpoint folder or the output folder?

@RicardoDominguez (Contributor, Author) commented Dec 17, 2023

From the output folder

  File "<stdin>", line 1, in <module>
  File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 566, in from_pretrained
    return model_class.from_pretrained(
  File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3480, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/lustre/home/rolmedo/axo/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3931, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
	size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

@RicardoDominguez (Contributor, Author)

I can confirm that I only experience this issue when using Zero3, and Zero 2 works fine.

@maxidl commented Jan 13, 2024

I can confirm that I only experience this issue when using Zero3, and Zero 2 works fine.

I just ran into the same error, can confirm switching from zero3 to zero2 "solved" the issue.

@mgoulao commented Feb 1, 2024

Using transformers @ git+https://github.com/huggingface/transformers.git@3cefac1d974db5e2825a0cb2b842883a628be7a0 seems to work.

@winglian (Collaborator) commented Feb 1, 2024

Using transformers @ git+https://github.com/huggingface/transformers.git@3cefac1d974db5e2825a0cb2b842883a628be7a0 seems to work.

@mgoulao is this a transformers regression then? That particular commit works with zero3?

@mgoulao commented Feb 2, 2024

Yes, it does work with ZeRO-3; however, you will get this problem: #1035

@luijait commented Feb 2, 2024

I had the same error; that transformers commit fixes it, but now I get this one:

new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 813, in _load_state_dict_into_meta_model
set_module_quantized_tensor_to_device(model, param_name, param_device, value=param)
File "/usr/local/lib/python3.10/dist-packages/transformers/integrations/bitsandbytes.py", line 128, in set_module_quantized_tensor_to_device
new_value = value.to(device)
NotImplementedError: Cannot copy out of meta tensor; no data!

@tcapelle (Contributor) commented Mar 9, 2024

I can confirm the same error when finetuning Mistral with chatml format and deepspeed3.

loading model
Traceback (most recent call last):
  File "/home/ubuntu/llm_recipes/scripts/push2hub.py", line 33, in <module>
    model = AutoModelForCausalLM.from_pretrained(config.model_path, torch_dtype=getattr(torch, config.torch_dtype))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/miniforge3/envs/pt/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3977, in _load_pretrained_model
    raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
RuntimeError: Error(s) in loading state_dict for MistralForCausalLM:
        size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32002, 4096]).
        You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

@maxidl commented Mar 9, 2024

(Quoting @tcapelle's comment above, including the identical traceback.)

The post is old; I think there is no solution: you simply cannot use QLoRA + DeepSpeed ZeRO-3. Fortunately, there is now a quite good alternative that has recently been implemented in Axolotl, which involves FSDP (full shard + QLoRA). Link

The solution I found most viable was to use a non-quantized LoRA with DeepSpeed ZeRO-3.

Apart from that, I believe that as of today there is no way to load QLoRA adapters with DeepSpeed Stage 3.

I hope I'm wrong, but all the final answers I found on the internet were basically these.

This issue is about full finetune, no lora involved.

@tcapelle (Contributor)

I am doing a full fine-tune, no QLoRA.

@0-hero (Contributor) commented Mar 13, 2024

+1 Zero3_bf16 + Full-finetune

RuntimeError: Error(s) in loading state_dict for MistralModel:
	size mismatch for model.embed_tokens.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([32006, 4096]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.

EDIT - Can confirm zero2 works

@JCRPaquin commented Apr 28, 2024

I encountered this too, although mine was with Llama 3 + ZeRO-3. The model safetensors were being written out as shards, but there was also a model.safetensors file that HF seems to load by default, even though it is not included in the index.json. Once I (re)moved the model.safetensors file, the model loaded successfully.
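
If that workaround applies, here is a minimal sketch of automating it (the output path is a placeholder, not from this issue): move the stray model.safetensors aside whenever a sharded index is also present, so that from_pretrained falls back to the shards listed in model.safetensors.index.json.

# Sketch only: rename a stray single-file model.safetensors when a sharded
# checkpoint (index + model-0000*-of-*.safetensors) exists in the same folder.
from pathlib import Path

output_dir = Path("test")  # placeholder output directory
stray = output_dir / "model.safetensors"
index = output_dir / "model.safetensors.index.json"

if stray.exists() and index.exists():
    stray.rename(output_dir / "model.safetensors.bak")
    print("Moved model.safetensors aside; loading will use the sharded files.")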
