Fix DeepSpeed Zero 3 Saving #709

Merged 3 commits into main on Oct 19, 2023

Conversation

@tokestermw (Contributor) commented on Oct 9, 2023

Related Issue

#705

Fix

Follow accelerate's recommendation and save with stage3_gather_16bit_weights_on_model_save enabled, so that ZeRO-3's sharded parameters are gathered into a full 16-bit state dict at save time.
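For context, a minimal sketch of the save pattern accelerate recommends for ZeRO-3 (not axolotl's actual code; the helper name and surrounding objects are placeholders, and it only produces full weights when stage3_gather_16bit_weights_on_model_save is true in the DeepSpeed config):

from accelerate import Accelerator

def save_zero3_model(accelerator: Accelerator, model, output_dir: str):
    # hypothetical helper; `model` is the prepared transformers model
    unwrapped = accelerator.unwrap_model(model)
    unwrapped.save_pretrained(
        output_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
        # gathers the ZeRO-3 sharded parameters into a full 16-bit state dict
        # when stage3_gather_16bit_weights_on_model_save is enabled
        state_dict=accelerator.get_state_dict(model),
    )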

Test

Config file

base_model: gpt2
base_model_config: gpt2
load_in_8bit: false
load_in_4bit: false
strict: false
push_dataset_to_hub:
datasets:
  - path: wikitext
    name: wikitext-2-v1
    type: completion
    train_on_split: test
dataset_prepared_path:
val_set_size: 0.01
adapter:
lora_model_dir:
sequence_len: 1024
max_packed_sequence_len:
lora_r:
lora_alpha:
lora_dropout:
lora_target_modules:
lora_target_linear:
lora_fan_in_fan_out:
wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_run_id: wikitext-test-1
wandb_log_model:
output_dir: ./wikitext-test-1
gradient_accumulation_steps: 16
micro_batch_size: 6
eval_batch_size:
num_epochs: 1
optimizer: paged_adamw_8bit
torchdistx_path:
lr_scheduler: linear
learning_rate: 0.0001
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: true
gradient_checkpointing: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10
eval_steps: 500
save_steps:
debug:
deepspeed: axolotl/deepspeed/zero3.json
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"

Run

accelerate launch -m axolotl.cli.train config.yaml

@@ -134,6 +134,22 @@ def terminate_handler(_, __, model):
    # only save on rank 0, otherwise it corrupts output on multi-GPU when multiple processes attempt to write the same file
    if cfg.fsdp:
        trainer.save_model(cfg.output_dir)
    elif cfg.deepspeed:
Collaborator:
Does this apply to all zero* stages, or just zero3 as listed in the PR title?

Contributor Author:

This error only occurs with zero3. The code change will run for all zero* stages, but it should work correctly for all of them.

If we want a smaller change, I think something like this should work:
https://github.com/lm-sys/FastChat/pull/1457/files#diff-82b734e9eda6b4bac9a28b1056d4e0e0676f904e43cede16d7aa6e2d1da3e61bR155

Collaborator:

Perhaps elif cfg.deepspeed and trainer.hf_deepspeed_config_orig.is_zero3(): just to minimize the blast radius of this change?

@tokestermw (Contributor Author):
Hi @winglian, sorry for the delay. This is fixed and tested.

I am using is_deepspeed_zero3_enabled, which is used in both FastChat and accelerate.
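For reference, a minimal sketch of the guard (not the exact train.py code; the import path is an assumption, and cfg/trainer are the objects from the diff above):

from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

if cfg.fsdp:
    trainer.save_model(cfg.output_dir)
elif cfg.deepspeed and is_deepspeed_zero3_enabled():
    # ZeRO-3 shards parameters across ranks; with
    # stage3_gather_16bit_weights_on_model_save enabled, save_model()
    # gathers a full 16-bit state dict on rank 0 before writing it to disk
    trainer.save_model(cfg.output_dir)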

After training

accelerate launch -m axolotl.cli.train config.yaml

We can load the model and use it normally.

from transformers import pipeline

model = pipeline('text-generation', '...')
model('hello')

@winglian merged commit e4d1585 into axolotl-ai-cloud:main on Oct 19, 2023
4 checks passed
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
* Update train.py

* add zero3 check

* chore: lint

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
@hahmad2008:

@RicardoDominguez @winglian @tokestermw
For TinyLlama, with a full finetune, the final model is not saved to the output directory.

Config

accelerate-config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: deepspeed/zero3.json
  zero3_init_flag: true
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
downcast_bf16: 'no'
dynamo_backend: 'NO'
command_file: null
commands: null
fsdp_config: {}
tpu_name: null
tpu_zone: null
use_cpu: false

config.yaml

base_model: TinyLlama/TinyLlama-1.1B-step-50K-105b
base_model_config: TinyLlama/TinyLlama-1.1B-step-50K-105b
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
tokenizer_legacy: false
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: train.jsonl
    type: completion
    field: text
dataset_prepared_path: prepared-dataset
val_set_size: 0.08
output_dir: model-finetuned

adapter: 
lora_model_dir:

sequence_len: 512
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
eval_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: false

warmup_steps: 10
evals_per_epoch: 1
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

deepspeed/zero3.json

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "fp16": {
    "enabled": "auto",
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Final Model

ls -lh model-finetuned

total 936K
-rw-r--r-- 1 root root 3.4K Jan 28 13:22 README.md
drwxr-xr-x 3 root root 4.0K Jan 28 13:22 checkpoint-4130
-rw-r--r-- 1 root root  697 Jan 28 13:22 config.json
-rw-r--r-- 1 root root  145 Jan 28 13:22 generation_config.json
-rw-r--r-- 1 root root 415K Jan 28 13:22 pytorch_model.bin
drwxr-xr-x 4 root root 4.0K Jan 28 10:54 runs
-rw-r--r-- 1 root root  437 Jan 28 10:54 special_tokens_map.json
-rw-r--r-- 1 root root 489K Jan 28 10:54 tokenizer.model
-rw-r--r-- 1 root root 1012 Jan 28 10:54 tokenizer_config.json

ls -lh model-finetuned/checkpoint-4130/

total 4.1G
-rw-r--r-- 1 root root  697 Jan 28 13:19 config.json
-rw-r--r-- 1 root root  145 Jan 28 13:19 generation_config.json
drwxr-xr-x 2 root root 4.0K Jan 28 13:20 global_step4130
-rw-r--r-- 1 root root   15 Jan 28 13:22 latest
-rw-r--r-- 1 root root 4.1G Jan 28 13:20 model.safetensors
-rw-r--r-- 1 root root  16K Jan 28 13:22 rng_state_0.pth
-rw-r--r-- 1 root root  16K Jan 28 13:22 rng_state_1.pth
-rw-r--r-- 1 root root  627 Jan 28 13:22 scheduler.pt
-rw-r--r-- 1 root root 489K Jan 28 13:22 trainer_state.json
-rw-r--r-- 1 root root 6.4K Jan 28 13:20 training_args.bin
-rwxr--r-- 1 root root  24K Jan 28 13:22 zero_to_fp32.py

Also, fp16 is not applied here: the checkpoint is 4 GB, not 2 GB.
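For what it's worth, the zero_to_fp32.py script in the listing above is DeepSpeed's own tool for consolidating a sharded ZeRO checkpoint; a rough sketch of its Python API, assuming the checkpoint layout shown above (the paths are taken from that listing, not verified):

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# consolidate the ZeRO shards under global_step4130/ into a full fp32 state dict
state_dict = get_fp32_state_dict_from_zero_checkpoint("model-finetuned/checkpoint-4130")
torch.save(state_dict, "model-finetuned/pytorch_model.bin")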
