Model is not saved for full finetune with Deepspeed Zero3 #1223

Open
6 of 8 tasks
hahmad2008 opened this issue Jan 28, 2024 · 3 comments
Labels: bug (Something isn't working)

Comments

@hahmad2008

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

With a full finetune, I expect a model of roughly 2 GB in the output directory; however, the output directory is only about 1 MB.
The checkpoint should be about 2 GB since fp16 is enabled.

Current behaviour

For a full finetune of TinyLlama, the final model weights are not saved in the output directory.

Config

accelerate-config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  deepspeed_config_file: deepspeed/zero3.json
  zero3_init_flag: true
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
downcast_bf16: 'no'
dynamo_backend: 'NO'
command_file: null
commands: null
fsdp_config: {}
tpu_name: null
tpu_zone: null
use_cpu: false

config.yaml

base_model: TinyLlama/TinyLlama-1.1B-step-50K-105b
base_model_config: TinyLlama/TinyLlama-1.1B-step-50K-105b
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
tokenizer_legacy: false
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: train.jsonl
    type: completion
    field: text
dataset_prepared_path: prepared-dataset
val_set_size: 0.08
output_dir: model-finetuned

adapter: 
lora_model_dir:

sequence_len: 512
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:
wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 1
eval_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: false

warmup_steps: 10
evals_per_epoch: 1
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"

deepspeed/zero3.json

{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 0,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 0,
    "stage3_max_reuse_distance": 0,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "bf16": {
    "enabled": "auto"
  },
  "fp16": {
    "enabled": "auto",
    "auto_cast": false,
    "loss_scale": 0,
    "initial_scale_power": 32,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Final Model

ls -lh model-finetuned

total 936K
-rw-r--r-- 1 root root 3.4K Jan 28 13:22 README.md
drwxr-xr-x 3 root root 4.0K Jan 28 13:22 checkpoint-4130
-rw-r--r-- 1 root root  697 Jan 28 13:22 config.json
-rw-r--r-- 1 root root  145 Jan 28 13:22 generation_config.json
-rw-r--r-- 1 root root 415K Jan 28 13:22 pytorch_model.bin
drwxr-xr-x 4 root root 4.0K Jan 28 10:54 runs
-rw-r--r-- 1 root root  437 Jan 28 10:54 special_tokens_map.json
-rw-r--r-- 1 root root 489K Jan 28 10:54 tokenizer.model
-rw-r--r-- 1 root root 1012 Jan 28 10:54 tokenizer_config.json
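
The 415K pytorch_model.bin at the top level is far too small to hold a 1.1B-parameter model. A quick diagnostic, purely illustrative and not part of axolotl, is to inspect what that file actually contains; if ZeRO-3 saved partitioned parameters without gathering them first, most tensors would show up as empty or tiny placeholders:

# Illustrative check of the tiny top-level pytorch_model.bin (path from the listing above).
# If ZeRO-3 saved partitioned parameters without gathering them, most tensors here
# will have empty or near-zero shapes instead of the real weight shapes.
import torch

sd = torch.load("model-finetuned/pytorch_model.bin", map_location="cpu")
print(f"tensors: {len(sd)}")
for name, t in list(sd.items())[:5]:
    print(name, tuple(t.shape), t.dtype)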

ls -lh model-finetuned/checkpoint-4130/

total 4.1G
-rw-r--r-- 1 root root  697 Jan 28 13:19 config.json
-rw-r--r-- 1 root root  145 Jan 28 13:19 generation_config.json
drwxr-xr-x 2 root root 4.0K Jan 28 13:20 global_step4130
-rw-r--r-- 1 root root   15 Jan 28 13:22 latest
-rw-r--r-- 1 root root 4.1G Jan 28 13:20 model.safetensors
-rw-r--r-- 1 root root  16K Jan 28 13:22 rng_state_0.pth
-rw-r--r-- 1 root root  16K Jan 28 13:22 rng_state_1.pth
-rw-r--r-- 1 root root  627 Jan 28 13:22 scheduler.pt
-rw-r--r-- 1 root root 489K Jan 28 13:22 trainer_state.json
-rw-r--r-- 1 root root 6.4K Jan 28 13:20 training_args.bin
-rwxr--r-- 1 root root  24K Jan 28 13:22 zero_to_fp32.py

Also, fp16 does not appear to be applied here: the checkpoint is 4 GB, not 2 GB.
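
For scale, 1.1B parameters come to roughly 2.2 GB in fp16 and 4.4 GB in fp32, so the 4.1G model.safetensors inside the checkpoint looks consistent with fp32 weights rather than fp16. DeepSpeed also drops zero_to_fp32.py into the checkpoint folder and exposes the same consolidation logic as a Python helper; a minimal sketch, assuming the sharded ZeRO state under global_step4130 is intact (the output filename is an arbitrary choice):

# Sketch: rebuild a single fp32 state dict from the sharded ZeRO-3 checkpoint.
# Uses DeepSpeed's zero_to_fp32 helper; the output path is illustrative.
import torch
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint("model-finetuned/checkpoint-4130")
torch.save(state_dict, "model-finetuned/pytorch_model_fp32.bin")  # ~4.4 GB for 1.1B params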

Steps to reproduce

Run a full finetune of TinyLlama with the command below and the configs shown above; the final model is not saved in the output directory.

Command

accelerate launch --config_file accelerate-config.yaml scripts/finetune.py axolotl/config.yaml

Config

The accelerate-config.yaml, config.yaml, and deepspeed/zero3.json are the same as shown above, and the resulting model-finetuned output matches the Final Model listings above.

Config yaml

Same as the config.yaml shown above.

Possible solution

I checked this issue; however, the latest branch doesn't solve the problem.

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
hahmad2008 added the bug label on Jan 28, 2024
@NanoCode012 (Collaborator)

Was there any stack trace or error? Did you run out of space? Did the run quit abruptly?

@hahmad2008 (Author)

@NanoCode012 Not at all.

@antonpolishko commented Jun 14, 2024

Also fp16 is not applied here, the checkpoint is 4G! not 2G.

Not sure how axolotl should behave, but in our experience the optimizer state (gradients) can be saved as part of the weights file, ballooning its size. We always do an import-then-save with weights_only=True for the weights we actually keep.

As for the model folder being small, we ran into a similar issue and had to launch the inference playground from the checkpoint folder instead of the model folder, like so:

python -m axolotl.cli.inference config.yaml --base_model='model-finetuned/checkpoint-4130/' --gradio
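
A rough sketch of the load-then-re-save-only-the-weights idea from the first paragraph above; the paths, dtype, and use of transformers here are illustrative assumptions, not the commenter's exact code:

# Sketch: load the consolidated checkpoint and re-save just the model weights in fp16.
# Assumes checkpoint-4130 holds full weights, as in the listing earlier in this issue.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "model-finetuned/checkpoint-4130", torch_dtype=torch.float16
)
model.save_pretrained("model-finetuned/consolidated-fp16")  # ~2 GB for a 1.1B model

The resulting folder can then be passed to the inference command above via --base_model.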
