Training With DeepSpeed Zero3 Does Not Save The Whole Model #705

Closed

tokestermw opened this issue Oct 9, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@tokestermw
Contributor

tokestermw commented Oct 9, 2023

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Loading the model trained with DeepSpeed Zero3 will load the entire model weights.

Current behaviour

Only part of the model weights exists in pytorch_model.bin; many tensors are saved with shape torch.Size([0]).

from transformers import pipeline

# errors here
pipeline('text-generation', '...')

The following error occurs:

File ~/venv3.8/lib/python3.8/site-packages/transformers/modeling_utils.py:3756, in PreTrainedModel._load_pretrained_model(cls, model, state_dict, loaded_keys, resolved_archive_file, pretrained_model_name_or_path, ignore_mismatched_sizes, sharded_metadata, _fast_init, low_cpu_mem_usage, device_map, offload_folder, offload_state_dict, dtype, is_quantized, keep_in_fp32_modules)
   3752     if "size mismatch" in error_msg:
   3753         error_msg += (
   3754             "\n\tYou may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method."
   3755         )
-> 3756     raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
   3758 if is_quantized:
   3759     unexpected_keys = [elem for elem in unexpected_keys if "SCB" not in elem]

RuntimeError: Error(s) in loading state_dict for GPT2LMHeadModel:
...
        size mismatch for transformer.h.33.mlp.c_fc.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 5120]).
        size mismatch for transformer.h.33.mlp.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 1280]).
        size mismatch for transformer.h.34.attn.c_attn.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 3840]).
        size mismatch for transformer.h.34.attn.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 1280]).
        size mismatch for transformer.h.34.mlp.c_fc.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 5120]).
        size mismatch for transformer.h.34.mlp.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 1280]).
        size mismatch for transformer.h.35.attn.c_attn.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 3840]).
        size mismatch for transformer.h.35.attn.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 1280]).
        size mismatch for transformer.h.35.mlp.c_fc.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 5120]).
        size mismatch for transformer.h.35.mlp.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 1280]).
        size mismatch for lm_head.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([50257, 1280]).
        You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
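
A quick way to confirm the symptom is to inspect the saved checkpoint directly. The snippet below is a minimal diagnostic sketch, assuming the output directory from the repro is `here` and `pytorch_model.bin` is a plain state dict; under ZeRO-3 without weight gathering, many entries show up as empty placeholder tensors.

import torch

# Diagnostic sketch: count how many tensors in the saved state dict are
# empty placeholders (shape torch.Size([0])). Assumes output dir "here".
state_dict = torch.load("here/pytorch_model.bin", map_location="cpu")
empty = [name for name, tensor in state_dict.items() if tensor.numel() == 0]
print(f"{len(empty)} of {len(state_dict)} tensors are empty, e.g. {empty[:3]}")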

Steps to reproduce

Train

accelerate launch -m axolotl.cli.train config.yaml --output_dir here

Note that in the DeepSpeed config, `stage3_gather_16bit_weights_on_model_save` is set to true.
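
For reference, a small sanity check of that flag (a sketch; it assumes the path from the config yaml below, `axolotl/deepspeed/zero3.json`, is readable from the working directory):

import json

# Sketch: verify the DeepSpeed ZeRO-3 config requests weight gathering on save.
with open("axolotl/deepspeed/zero3.json") as f:
    ds_config = json.load(f)

assert ds_config["zero_optimization"]["stage"] == 3
assert ds_config["zero_optimization"]["stage3_gather_16bit_weights_on_model_save"] is True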

Load

from transformers import pipeline

# should load without issue
_ = pipeline('text-generation', 'here')

Config yaml

base_model: gpt2
base_model_config: gpt2
load_in_8bit: false
load_in_4bit: false
strict: false
push_dataset_to_hub:
datasets:
  - path: wikitext
    name: wikitext-2-v1
    type: completion
    train_on_split: test
dataset_prepared_path:
val_set_size: 0.01
adapter:
lora_model_dir:
sequence_len: 1024
max_packed_sequence_len:
lora_r:
lora_alpha:
lora_dropout:
lora_target_modules:
lora_target_linear:
lora_fan_in_fan_out:
wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_run_id: wikitext-test-1
wandb_log_model:
output_dir: ./wikitext-test-1
gradient_accumulation_steps: 16
micro_batch_size: 6
eval_batch_size:
num_epochs: 1
optimizer: paged_adamw_8bit
torchdistx_path:
lr_scheduler: linear
learning_rate: 0.0001
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: true
gradient_checkpointing: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10
eval_steps: 500
save_steps:
debug:
deepspeed: axolotl/deepspeed/zero3.json
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"

Possible solution

There is a relevant doc in accelerate on saving and loading under DeepSpeed:

https://github.com/huggingface/accelerate/blob/5ae611118057232f441055f7ef9ba0b0f2b8d533/docs/source/usage_guides/deepspeed.md#saving-and-loading
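
Concretely, that guide gathers the full state dict through accelerate before calling save_pretrained. A minimal sketch of that saving pattern (not axolotl's actual code; the model, output dir, and training loop here are placeholders):

from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Sketch of the save pattern from the accelerate DeepSpeed guide: gather the
# full ZeRO-3 weights before writing the checkpoint.
accelerator = Accelerator()
model = AutoModelForCausalLM.from_pretrained("gpt2")
model = accelerator.prepare(model)

# ... training loop elided ...

unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    "here",                                        # output dir from the repro
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
    state_dict=accelerator.get_state_dict(model),  # gathers the full ZeRO-3 state
)

DeepSpeed also typically drops a zero_to_fp32.py helper script next to its checkpoint shards, which can consolidate a full fp32 state dict offline if the gathered save was missed.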

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main/77c84e02fd1a7eef25cccc5b8104178d980851c7

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@tokestermw tokestermw added the bug Something isn't working label Oct 9, 2023
@seungduk-yanolja
Contributor

Hmm, actually I experienced the same issue before.

@tokestermw
Contributor Author

ok i think this should work: #709

joey00072 added a commit to joey00072/axolotl that referenced this issue Oct 16, 2023
@winglian
Collaborator

resolved w #709
