Expected Behavior

Loading a model trained with DeepSpeed Zero3 should load the entire model weights.
Current behaviour

Only part of the model weights exist in `pytorch_model.bin`. Attempting to load the model:

```python
from transformers import pipeline

pipeline('text-generation', '...')  # errors here
```
The following error occurs:

```
File ~/venv3.8/lib/python3.8/site-packages/transformers/modeling_utils.py:3756, in PreTrainedModel._load_pretrained_model(cls, model, state_dict, loaded_keys, resolved_archive_file, pretrained_model_name_or_path, ignore_mismatched_sizes, sharded_metadata, _fast_init, low_cpu_mem_usage, device_map, offload_folder, offload_state_dict, dtype, is_quantized, keep_in_fp32_modules)
   3752 if "size mismatch" in error_msg:
   3753     error_msg += (
   3754         "\n\tYou may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method."
   3755     )
-> 3756 raise RuntimeError(f"Error(s) in loading state_dict for {model.__class__.__name__}:\n\t{error_msg}")
   3758 if is_quantized:
   3759     unexpected_keys = [elem for elem in unexpected_keys if "SCB" not in elem]

RuntimeError: Error(s) in loading state_dict for GPT2LMHeadModel:
	...
	size mismatch for transformer.h.33.mlp.c_fc.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 5120]).
	size mismatch for transformer.h.33.mlp.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 1280]).
	size mismatch for transformer.h.34.attn.c_attn.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 3840]).
	size mismatch for transformer.h.34.attn.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 1280]).
	size mismatch for transformer.h.34.mlp.c_fc.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 5120]).
	size mismatch for transformer.h.34.mlp.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 1280]).
	size mismatch for transformer.h.35.attn.c_attn.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 3840]).
	size mismatch for transformer.h.35.attn.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 1280]).
	size mismatch for transformer.h.35.mlp.c_fc.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([1280, 5120]).
	size mismatch for transformer.h.35.mlp.c_proj.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 1280]).
	size mismatch for lm_head.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([50257, 1280]).
	You may consider adding `ignore_mismatched_sizes=True` in the model `from_pretrained` method.
```

Note that every mismatched checkpoint tensor has shape `torch.Size([0])`: the saved `pytorch_model.bin` contains ZeRO-3 placeholder parameters that were never gathered before saving.
Steps to reproduce

Train:

```shell
accelerate launch -m axolotl.cli.train config.yaml --output_dir here
```
Note that in the DeepSpeed config, we set `stage3_gather_16bit_weights_on_model_save` to `true`.
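For reference, a minimal sketch of what a ZeRO-3 config with this flag would look like (the actual `axolotl/deepspeed/zero3.json` in the repo may set additional fields):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto"
}
```

With this flag set, DeepSpeed is expected to gather the partitioned 16-bit weights before the checkpoint is written, so the saved state dict should contain full tensors rather than empty placeholders.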
Load:

```python
from transformers import pipeline

_ = pipeline('text-generation', 'here')  # should load without issue
```
Config yaml

```yaml
base_model: gpt2
base_model_config: gpt2
load_in_8bit: false
load_in_4bit: false
strict: false
push_dataset_to_hub:
datasets:
  - path: wikitext
    name: wikitext-2-v1
    type: completion
    train_on_split: test
dataset_prepared_path:
val_set_size: 0.01
adapter:
lora_model_dir:
sequence_len: 1024
max_packed_sequence_len:
lora_r:
lora_alpha:
lora_dropout:
lora_target_modules:
lora_target_linear:
lora_fan_in_fan_out:
wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_run_id: wikitext-test-1
wandb_log_model:
output_dir: ./wikitext-test-1
gradient_accumulation_steps: 16
micro_batch_size: 6
eval_batch_size:
num_epochs: 1
optimizer: paged_adamw_8bit
torchdistx_path:
lr_scheduler: linear
learning_rate: 0.0001
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: true
gradient_checkpointing: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10
eval_steps: 500
save_steps:
debug:
deepspeed: axolotl/deepspeed/zero3.json
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"
```
Possible solution

There is a doc in `accelerate` on saving and loading with DeepSpeed:
https://github.com/huggingface/accelerate/blob/5ae611118057232f441055f7ef9ba0b0f2b8d533/docs/source/usage_guides/deepspeed.md#saving-and-loading
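As a workaround, a partitioned ZeRO-3 checkpoint can also be consolidated offline: DeepSpeed writes a `zero_to_fp32.py` script into the checkpoint directory alongside the `global_step*/` shard folders, which gathers the shards into a single full fp32 state dict. A sketch, assuming the run saved a checkpoint under the output dir (`here/checkpoint-500` is a hypothetical name for illustration; use whatever checkpoint folder the run actually produced):

```shell
# zero_to_fp32.py is generated by DeepSpeed next to the ZeRO shard folders.
cd here/checkpoint-500
python zero_to_fp32.py . pytorch_model.bin
```

This does not fix the save path itself, but it recovers usable full weights from an already-written partitioned checkpoint.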
Python Version

3.10

axolotl branch-commit

main/77c84e02fd1a7eef25cccc5b8104178d980851c7
Hmm actually I experienced the same issue before.
ok i think this should work: #709
Commit 83c7710 (train.py DeepSpeed Zero3 save fixes) references issue axolotl-ai-cloud#705.
Resolved with #709.