Training With DeepSpeed Zero3 Does Not Save The Whole Model #705

Closed

tokestermw opened this issue Oct 9, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@tokestermw
Contributor

tokestermw commented Oct 9, 2023

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Loading the model trained with DeepSpeed Zero3 will load the entire model weights.

Current behaviour

Only part of the model weights exists in pytorch_model.bin; many tensors are saved with shape torch.Size([0]).

from transformers import pipeline

# errors here
pipeline('text-generation', '...')

The following error occurs:

File ~/venv3.8/lib/python3.8/site-packages/transformers/modeling_utils.py:3756, in PreTrainedModel._load_pretrained_model(cls, model, state_dict, loaded_keys, resolved_archive_file, pretrained_model_name_or_path, ignore_mismatched_sizes, sharded_metadata, _fast_init, low_cpu_mem_usage, device_map, offload_folder, offload_state_dict, dtype, is_quantized, keep_in_fp32_modules)
   3752     if "size mismatch" in error_msg:
   3753         error_msg += (
   3754             "\n\tYou may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method."
   3755         )
-> 3756     raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
   3758 if is_quantized:
   3759     unexpected_keys = [elem for elem in unexpected_keys if "SCB" not in elem]

RuntimeError: Error(s) in loading state_dict for GPT2LMHeadModel:
...
        size mismatch for transformer.h.33.mlp.c_fc.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 5120]).
        size mismatch for transformer.h.33.mlp.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 1280]).
        size mismatch for transformer.h.34.attn.c_attn.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 3840]).
        size mismatch for transformer.h.34.attn.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 1280]).
        size mismatch for transformer.h.34.mlp.c_fc.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 5120]).
        size mismatch for transformer.h.34.mlp.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 1280]).
        size mismatch for transformer.h.35.attn.c_attn.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 3840]).
        size mismatch for transformer.h.35.attn.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 1280]).
        size mismatch for transformer.h.35.mlp.c_fc.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 5120]).
        size mismatch for transformer.h.35.mlp.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 1280]).
        size mismatch for lm_head.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([50257, 1280]).
        You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
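
A quick way to confirm the symptom is to inspect the saved checkpoint directly. The snippet below is a minimal diagnostic sketch, assuming the output directory from the repro is `here` and `pytorch_model.bin` is a plain state dict; under ZeRO-3 without weight gathering, many entries show up as empty placeholder tensors.

import torch

# Diagnostic sketch: count how many tensors in the saved state dict are
# empty placeholders (shape torch.Size([0])). Assumes output dir "here".
state_dict = torch.load("here/pytorch_model.bin", map_location="cpu")
empty = [name for name, tensor in state_dict.items() if tensor.numel() == 0]
print(f"{len(empty)} of {len(state_dict)} tensors are empty, e.g. {empty[:3]}")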

Steps to reproduce

Train

accelerate launch -m axolotl.cli.train config.yaml --output_dir here

Note that in the DeepSpeed config, `stage3_gather_16bit_weights_on_model_save` is set to true.
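
For reference, a small sanity check of that flag (a sketch; it assumes the path from the config yaml below, `axolotl/deepspeed/zero3.json`, is readable from the working directory):

import json

# Sketch: verify the DeepSpeed ZeRO-3 config requests weight gathering on save.
with open("axolotl/deepspeed/zero3.json") as f:
    ds_config = json.load(f)

assert ds_config["zero_optimization"]["stage"] == 3
assert ds_config["zero_optimization"]["stage3_gather_16bit_weights_on_model_save"] is True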

Load

from transformers import pipeline

# should load without issue
_ = pipeline('text-generation', 'here')

Config yaml

base_model: gpt2
base_model_config: gpt2
load_in_8bit: false
load_in_4bit: false
strict: false
push_dataset_to_hub:
datasets:
  - path: wikitext
    name: wikitext-2-v1
    type: completion
    train_on_split: test
dataset_prepared_path:
val_set_size: 0.01
adapter:
lora_model_dir:
sequence_len: 1024
max_packed_sequence_len:
lora_r:
lora_alpha:
lora_dropout:
lora_target_modules:
lora_target_linear:
lora_fan_in_fan_out:
wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_run_id: wikitext-test-1
wandb_log_model:
output_dir: ./wikitext-test-1
gradient_accumulation_steps: 16
micro_batch_size: 6
eval_batch_size:
num_epochs: 1
optimizer: paged_adamw_8bit
torchdistx_path:
lr_scheduler: linear
learning_rate: 0.0001
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: true
gradient_checkpointing: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10
eval_steps: 500
save_steps:
debug:
deepspeed: axolotl/deepspeed/zero3.json
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"

Possible solution

There is a relevant doc in accelerate on saving and loading under DeepSpeed:

https://github.com/huggingface/accelerate/blob/5ae611118057232f441055f7ef9ba0b0f2b8d533/docs/source/usage_guides/deepspeed.md#saving-and-loading
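
Concretely, that guide gathers the full state dict through accelerate before calling save_pretrained. A minimal sketch of that saving pattern (not axolotl's actual code; the model, output dir, and training loop here are placeholders):

from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Sketch of the save pattern from the accelerate DeepSpeed guide: gather the
# full ZeRO-3 weights before writing the checkpoint.
accelerator = Accelerator()
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = accelerator.prepare(model)

# ... training loop elided ...

unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "here",                                        # output dir from the repro
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),  # gathers the full ZeRO-3 state
)

DeepSpeed also typically drops a zero_to_fp32.py helper script next to its checkpoint shards, which can consolidate a full fp32 state dict offline if the gathered save was missed.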

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main/77c84e02fd1a7eef25cccc5b8104178d980851c7

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@tokestermw tokestermw added the bug Something isn't working label Oct 9, 2023
@seungduk-yanolja
Contributor

Hmm, actually I experienced the same issue before.

@tokestermw
Contributor Author

ok i think this should work: #709

joey00072 added a commit to joey00072/axolotl that referenced this issue Oct 16, 2023
@winglian
Collaborator

resolved w #709
