Model is not saved for full finetune with Deepspeed Zero3 #1223

Open
6 of 8 tasks
hahmad2008 opened this issue Jan 28, 2024 · 3 comments
Labels: bug (Something isn't working)

Comments

@hahmad2008

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

With a full finetune, I expect a model of roughly 2 GB in the output directory; however, the output directory is only about 1 MB.
The checkpoint should be about 2 GB since fp16 is enabled.

Current behaviour

For a full finetune of TinyLlama, the final model weights are not saved in the output directory.

Config

accelerate-config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: deepspeed/zero3.json
  zero3_init_flag: true
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
downcast_bf16: 'no'
dynamo_backend: 'NO'
command_file: null
commands: null
fsdp_config: {}
tpu_name: null
tpu_zone: null
use_cpu: false

config.yaml

base_model: TinyLlama/TinyLlama-1.1B-step-50K-105b
base_model_config: TinyLlama/TinyLlama-1.1B-step-50K-105b
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
tokenizer_legacy: false
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: train.jsonl
    type: completion
    field: text
dataset_prepared_path: prepared-dataset
val_set_size: 0.08
output_dir: model-finetuned

adapter: 
lora_model_dir:

sequence_len: 512
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
eval_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: false

warmup_steps: 10
evals_per_epoch: 1
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

deepspeed/zero3.json

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "fp16": {
    "enabled": "auto",
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Final Model

ls -lh model-finetuned

total 936K
-rw-r--r-- 1 root root 3.4K Jan 28 13:22 README.md
drwxr-xr-x 3 root root 4.0K Jan 28 13:22 checkpoint-4130
-rw-r--r-- 1 root root  697 Jan 28 13:22 config.json
-rw-r--r-- 1 root root  145 Jan 28 13:22 generation_config.json
-rw-r--r-- 1 root root 415K Jan 28 13:22 pytorch_model.bin
drwxr-xr-x 4 root root 4.0K Jan 28 10:54 runs
-rw-r--r-- 1 root root  437 Jan 28 10:54 special_tokens_map.json
-rw-r--r-- 1 root root 489K Jan 28 10:54 tokenizer.model
-rw-r--r-- 1 root root 1012 Jan 28 10:54 tokenizer_config.json
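
The 415K pytorch_model.bin at the top level is far too small to hold a 1.1B-parameter model. A quick diagnostic, purely illustrative and not part of axolotl, is to inspect what that file actually contains; if ZeRO-3 saved partitioned parameters without gathering them first, most tensors would show up as empty or tiny placeholders:

# Illustrative check of the tiny top-level pytorch_model.bin (path from the listing above).
# If ZeRO-3 saved partitioned parameters without gathering them, most tensors here
# will have empty or near-zero shapes instead of the real weight shapes.
import torch

sd = torch.load("model-finetuned/pytorch_model.bin", map_location="cpu")
print(f"tensors: {len(sd)}")
for name, t in list(sd.items())[:5]:
    print(name, tuple(t.shape), t.dtype)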

ls -lh model-finetuned/checkpoint-4130/

total 4.1G
-rw-r--r-- 1 root root  697 Jan 28 13:19 config.json
-rw-r--r-- 1 root root  145 Jan 28 13:19 generation_config.json
drwxr-xr-x 2 root root 4.0K Jan 28 13:20 global_step4130
-rw-r--r-- 1 root root   15 Jan 28 13:22 latest
-rw-r--r-- 1 root root 4.1G Jan 28 13:20 model.safetensors
-rw-r--r-- 1 root root  16K Jan 28 13:22 rng_state_0.pth
-rw-r--r-- 1 root root  16K Jan 28 13:22 rng_state_1.pth
-rw-r--r-- 1 root root  627 Jan 28 13:22 scheduler.pt
-rw-r--r-- 1 root root 489K Jan 28 13:22 trainer_state.json
-rw-r--r-- 1 root root 6.4K Jan 28 13:20 training_args.bin
-rwxr--r-- 1 root root  24K Jan 28 13:22 zero_to_fp32.py

Also, fp16 does not appear to be applied here: the checkpoint is 4 GB, not 2 GB.
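
For scale, 1.1B parameters come to roughly 2.2 GB in fp16 and 4.4 GB in fp32, so the 4.1G model.safetensors inside the checkpoint looks consistent with fp32 weights rather than fp16. DeepSpeed also drops zero_to_fp32.py into the checkpoint folder and exposes the same consolidation logic as a Python helper; a minimal sketch, assuming the sharded ZeRO state under global_step4130 is intact (the output filename is an arbitrary choice):

# Sketch: rebuild a single fp32 state dict from the sharded ZeRO-3 checkpoint.
# Uses DeepSpeed's zero_to_fp32 helper; the output path is illustrative.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("model-finetuned/checkpoint-4130")
torch.save(state_dict, "model-finetuned/pytorch_model_fp32.bin")  # ~4.4 GB for 1.1B params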

Steps to reproduce

Run a full finetune of TinyLlama with the command below and the configs shown above; the final model is not saved in the output directory.

Command

accelerate launch --config_file accelerate-config.yaml scripts/finetune.py axolotl/config.yaml

Config

The accelerate-config.yaml, config.yaml, and deepspeed/zero3.json are the same as shown above, and the resulting model-finetuned output matches the Final Model listings above.

Config yaml

Same as the config.yaml shown above.

Possible solution

I checked this issue; however, the latest branch doesn't solve the problem.

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
hahmad2008 added the bug label on Jan 28, 2024
@NanoCode012 (Collaborator)

Was there any stack trace or error? Did you run out of space? Did the run quit abruptly?

@hahmad2008 (Author)

@NanoCode012 Not at all.

@antonpolishko commented Jun 14, 2024

Also fp16 is not applied here, the checkpoint is 4G! not 2G.

Not sure how axolotl should behave, but in our experience the optimizer state (gradients) can be saved as part of the weights file, ballooning its size. We always do an import-then-save with weights_only=True for the weights we actually keep.

As for the model folder being small, we ran into a similar issue and had to launch the inference playground from the checkpoint folder instead of the model folder, like so:

python -m axolotl.cli.inference config.yaml --base_model='model-finetuned/checkpoint-4130/' --gradio
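
A rough sketch of the load-then-re-save-only-the-weights idea from the first paragraph above; the paths, dtype, and use of transformers here are illustrative assumptions, not the commenter's exact code:

# Sketch: load the consolidated checkpoint and re-save just the model weights in fp16.
# Assumes checkpoint-4130 holds full weights, as in the listing earlier in this issue.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "model-finetuned/checkpoint-4130", torch_dtype=torch.float16
)
model.save_pretrained("model-finetuned/consolidated-fp16")  # ~2 GB for a 1.1B model

The resulting folder can then be passed to the inference command above via --base_model.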
