Fix DeepSpeed Zero 3 Saving #709

Merged 3 commits into main on Oct 19, 2023

Conversation

@tokestermw (Contributor) commented on Oct 9, 2023

Related Issue

#705

Fix

Follow accelerate's recommendation and save with stage3_gather_16bit_weights_on_model_save enabled, so that ZeRO-3's sharded parameters are gathered into a full 16-bit state dict at save time.
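For context, a minimal sketch of the save pattern accelerate recommends for ZeRO-3 (not axolotl's actual code; the helper name and surrounding objects are placeholders, and it only produces full weights when stage3_gather_16bit_weights_on_model_save is true in the DeepSpeed config):

from accelerate import Accelerator

def save_zero3_model(accelerator: Accelerator, model, output_dir: str):
    # hypothetical helper; `model` is the prepared transformers model
    unwrapped = accelerator.unwrap_model(model)
    unwrapped.save_pretrained(
        output_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
        # gathers the ZeRO-3 sharded parameters into a full 16-bit state dict
        # when stage3_gather_16bit_weights_on_model_save is enabled
        state_dict=accelerator.get_state_dict(model),
    )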

Test

Config file

base_model: gpt2
base_model_config: gpt2
load_in_8bit: false
load_in_4bit: false
strict: false
push_dataset_to_hub:
datasets:
  - path: wikitext
    name: wikitext-2-v1
    type: completion
    train_on_split: test
dataset_prepared_path:
val_set_size: 0.01
adapter:
lora_model_dir:
sequence_len: 1024
max_packed_sequence_len:
lora_r:
lora_alpha:
lora_dropout:
lora_target_modules:
lora_target_linear:
lora_fan_in_fan_out:
wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_run_id: wikitext-test-1
wandb_log_model:
output_dir: ./wikitext-test-1
gradient_accumulation_steps: 16
micro_batch_size: 6
eval_batch_size:
num_epochs: 1
optimizer: paged_adamw_8bit
torchdistx_path:
lr_scheduler: linear
learning_rate: 0.0001
train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: true
gradient_checkpointing: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention: true
flash_attention:
gptq_groupsize:
gptq_model_v1:
warmup_steps: 10
eval_steps: 500
save_steps:
debug:
deepspeed: axolotl/deepspeed/zero3.json
weight_decay: 0.1
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|endoftext|>"

Run

accelerate launch -m axolotl.cli.train config.yaml

@@ -134,6 +134,22 @@ def terminate_handler(_, __, model):
    # only save on rank 0, otherwise it corrupts output on multi-GPU when multiple processes attempt to write the same file
    if cfg.fsdp:
        trainer.save_model(cfg.output_dir)
    elif cfg.deepspeed:
Collaborator:
Does this apply to all zero* stages, or just zero3 as listed in the PR title?

Contributor Author:

This error only occurs with zero3. The code change will run for all zero* stages, but it should work correctly for all of them.

If we want a smaller change, I think something like this should work:
https://github.com/lm-sys/FastChat/pull/1457/files#diff-82b734e9eda6b4bac9a28b1056d4e0e0676f904e43cede16d7aa6e2d1da3e61bR155

Collaborator:

Perhaps elif cfg.deepspeed and trainer.hf_deepspeed_config_orig.is_zero3(): just to minimize the blast radius of this change?

@tokestermw (Contributor Author):
Hi @winglian, sorry for the delay. This is fixed and tested.

I am using is_deepspeed_zero3_enabled, which is used in both FastChat and accelerate.
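For reference, a minimal sketch of the guard (not the exact train.py code; the import path is an assumption, and cfg/trainer are the objects from the diff above):

from transformers.integrations.deepspeed import is_deepspeed_zero3_enabled

if cfg.fsdp:
    trainer.save_model(cfg.output_dir)
elif cfg.deepspeed and is_deepspeed_zero3_enabled():
    # ZeRO-3 shards parameters across ranks; with
    # stage3_gather_16bit_weights_on_model_save enabled, save_model()
    # gathers a full 16-bit state dict on rank 0 before writing it to disk
    trainer.save_model(cfg.output_dir)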

After training

accelerate launch -m axolotl.cli.train config.yaml

We can load the model and use it normally.

from transformers import pipeline

model = pipeline('text-generation', '...')
model('hello')

@winglian merged commit e4d1585 into axolotl-ai-cloud:main on Oct 19, 2023
4 checks passed
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023
* Update train.py

* add zero3 check

* chore: lint

---------

Co-authored-by: Wing Lian <wing.lian@gmail.com>
@hahmad2008:

@RicardoDominguez @winglian @tokestermw
For TinyLlama, with a full finetune, the final model is not saved to the output directory.

Config

accelerate-config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: deepspeed/zero3.json
  zero3_init_flag: true
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
downcast_bf16: 'no'
dynamo_backend: 'NO'
command_file: null
commands: null
fsdp_config: {}
tpu_name: null
tpu_zone: null
use_cpu: false

config.yaml

base_model: TinyLlama/TinyLlama-1.1B-step-50K-105b
base_model_config: TinyLlama/TinyLlama-1.1B-step-50K-105b
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
tokenizer_legacy: false
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: train.jsonl
    type: completion
    field: text
dataset_prepared_path: prepared-dataset
val_set_size: 0.08
output_dir: model-finetuned

adapter: 
lora_model_dir:

sequence_len: 512
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
eval_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: false

warmup_steps: 10
evals_per_epoch: 1
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

deepspeed/zero3.json

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "fp16": {
    "enabled": "auto",
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Final Model

ls -lh model-finetuned

total 936K
-rw-r--r-- 1 root root 3.4K Jan 28 13:22 README.md
drwxr-xr-x 3 root root 4.0K Jan 28 13:22 checkpoint-4130
-rw-r--r-- 1 root root  697 Jan 28 13:22 config.json
-rw-r--r-- 1 root root  145 Jan 28 13:22 generation_config.json
-rw-r--r-- 1 root root 415K Jan 28 13:22 pytorch_model.bin
drwxr-xr-x 4 root root 4.0K Jan 28 10:54 runs
-rw-r--r-- 1 root root  437 Jan 28 10:54 special_tokens_map.json
-rw-r--r-- 1 root root 489K Jan 28 10:54 tokenizer.model
-rw-r--r-- 1 root root 1012 Jan 28 10:54 tokenizer_config.json

ls -lh model-finetuned/checkpoint-4130/

total 4.1G
-rw-r--r-- 1 root root  697 Jan 28 13:19 config.json
-rw-r--r-- 1 root root  145 Jan 28 13:19 generation_config.json
drwxr-xr-x 2 root root 4.0K Jan 28 13:20 global_step4130
-rw-r--r-- 1 root root   15 Jan 28 13:22 latest
-rw-r--r-- 1 root root 4.1G Jan 28 13:20 model.safetensors
-rw-r--r-- 1 root root  16K Jan 28 13:22 rng_state_0.pth
-rw-r--r-- 1 root root  16K Jan 28 13:22 rng_state_1.pth
-rw-r--r-- 1 root root  627 Jan 28 13:22 scheduler.pt
-rw-r--r-- 1 root root 489K Jan 28 13:22 trainer_state.json
-rw-r--r-- 1 root root 6.4K Jan 28 13:20 training_args.bin
-rwxr--r-- 1 root root  24K Jan 28 13:22 zero_to_fp32.py

Also, fp16 is not applied here: the checkpoint is 4 GB, not 2 GB.
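For what it's worth, the zero_to_fp32.py script in the listing above is DeepSpeed's own tool for consolidating a sharded ZeRO checkpoint; a rough sketch of its Python API, assuming the checkpoint layout shown above (the paths are taken from that listing, not verified):

import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# consolidate the ZeRO shards under global_step4130/ into a full fp32 state dict
state_dict = get_fp32_state_dict_from_zero_checkpoint("model-finetuned/checkpoint-4130")
torch.save(state_dict, "model-finetuned/pytorch_model.bin")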
