Can't resume from checkpoint for multi-node fine-tuning #884

Open
hahmad2008 opened this issue Nov 21, 2023 · 1 comment

Labels
bug Something isn't working

Comments

hahmad2008 commented Nov 21, 2023

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I have two nodes and want to fine-tune across both of them. Fine-tuning on the two nodes works perfectly, but when I try to resume from a checkpoint, the worker node can't find a valid checkpoint at the selected path.

node-1: master (1GPU)
node-2: worker (1GPU)
NOTE: No shared storage between the nodes

I ran the fine-tune successfully using these configurations on both nodes:

node-1: accelerate-config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: MASTER-IP
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

node-2: accelerate-config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 1
main_process_ip: MASTER-IP
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
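
As a hypothetical sanity check (not something from the run above): launching a tiny script like the one below with these two accelerate configs should print process_index=0 on the master and process_index=1 on the worker, with num_processes=2 on both, confirming that the two machines join the same process group.

# sanity_check.py — hypothetical helper, run on each node with:
#   accelerate launch --config_file <that node's accelerate config> sanity_check.py
from accelerate import Accelerator

accelerator = Accelerator()
print(f"process_index={accelerator.process_index} num_processes={accelerator.num_processes}")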

The fine-tune-config.yaml:

base_model: openlm-research/open_llama_3b_v2
base_model_config: openlm-research/open_llama_3b_v2
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
tokenizer_legacy: false
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: test/data.json
    type: sharegpt
dataset_prepared_path: test/prepared-dataset
val_set_size: 0.02
output_dir: test/model

adapter: qlora
lora_model_dir:

sequence_len: 128
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 60
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: true
early_stopping_patience:
auto_resume_from_checkpoint: true
local_rank:
logging_steps: 1
xformers_attention:
flash_attention:

warmup_steps: 2
eval_steps: 10
eval_table_size:
save_steps: 5
debug:
deepspeed:
weight_decay: 0.0
fsdp: null
fsdp_config: null
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: null
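
A back-of-the-envelope note on the save cadence (my arithmetic, not anything from the run): with micro_batch_size: 2, gradient_accumulation_steps: 4 and num_processes: 2, each optimizer step consumes 16 samples, and save_steps: 5 is why checkpoints appear at steps 5, 10, ..., 60.

# Effective batch size and save cadence implied by the config above
# (assumption: the usual meaning of these knobs, i.e. effective = micro * accumulation * world size)
micro_batch_size = 2
gradient_accumulation_steps = 4
num_processes = 2
save_steps = 5

effective_batch = micro_batch_size * gradient_accumulation_steps * num_processes
print(effective_batch)               # 16 samples per optimizer step
print(effective_batch * save_steps)  # 80 samples between checkpoints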

On master node-1, every checkpoint is saved with the model inside it, but on worker node-2 the checkpoints are essentially empty:

  • Master node-1

node-1:~# ls test/model/
README.md adapter_config.json adapter_model.bin checkpoint-45 checkpoint-50 checkpoint-55 checkpoint-60 config.json special_tokens_map.json tokenizer.model tokenizer_config.json

node-1:~# ls test/model/checkpoint-45/
README.md adapter_config.json adapter_model.bin optimizer.pt rng_state_0.pth scheduler.pt trainer_state.json training_args.bin

  • Worker node-2

node-2:~# ls test/model/
README.md adapter_model.bin checkpoint-15 checkpoint-25 checkpoint-35 checkpoint-45 checkpoint-50 checkpoint-60 special_tokens_map.json tokenizer_config.json
adapter_config.json checkpoint-10 checkpoint-20 checkpoint-30 checkpoint-40 checkpoint-5 checkpoint-55 config.json tokenizer.model

Inside checkpoint-45:

node-2:~# ls test/model/checkpoint-45/
rng_state_1.pth
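
As far as I can tell, this matches how transformers' Trainer saves checkpoints in distributed runs: only the main process writes the model, optimizer, scheduler and trainer state, while every rank writes its own RNG state. A rough sketch of that behavior (my assumption, not axolotl's or transformers' actual code):

import os
import torch

def save_checkpoint_sketch(accelerator, model, optimizer, output_dir):
    # `accelerator` is an accelerate.Accelerator
    os.makedirs(output_dir, exist_ok=True)
    if accelerator.is_main_process:
        # only the main process writes the weights and optimizer state
        # (in the real run: adapter_model.bin, optimizer.pt, scheduler.pt, trainer_state.json, ...)
        torch.save(model.state_dict(), os.path.join(output_dir, "pytorch_model.bin"))
        torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
    # every rank writes its own RNG state, which is why node-2's checkpoint-45
    # contains only rng_state_1.pth
    torch.save(torch.get_rng_state(), os.path.join(output_dir, f"rng_state_{accelerator.process_index}.pth"))

With no shared storage, each node keeps only what its own rank wrote, so the worker never ends up with a complete checkpoint to resume from.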

Resume from checkpoint on multiple nodes

To resume the previous fine-tune, I reuse the same configuration everywhere, except that in fine-tune-config.yaml I add resume_from_checkpoint:

fine-tune-config.yaml

resume_from_checkpoint: test/model/checkpoint-45

I then get this error on the worker:

[2023-11-21 14:25:52,690] [INFO] [axolotl.train.train:54] [PID:65] [RANK:0] loading model and (optionally) peft_config...
[2023-11-21 14:26:03,872] [INFO] [axolotl.load_model:410] [PID:65] [RANK:0] GPU memory usage after model load: 1.967GB (+0.105GB cache, +0.610GB misc)
[2023-11-21 14:26:03,876] [INFO] [axolotl.load_model:427] [PID:65] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training
[2023-11-21 14:26:03,880] [INFO] [axolotl.load_model:438] [PID:65] [RANK:0] converting modules to torch.float16 for flash attention
[2023-11-21 14:26:03,883] [INFO] [axolotl.load_lora:547] [PID:65] [RANK:0] found linear modules: ['o_proj', 'down_proj', 'k_proj', 'up_proj', 'v_proj', 'gate_proj', 'q_proj']
trainable params: 50,851,840 || all params: 3,477,325,440 || trainable%: 1.4623836876194136
[2023-11-21 14:26:04,610] [INFO] [axolotl.load_model:474] [PID:65] [RANK:0] GPU memory usage after adapters: 2.178GB (+0.771GB cache, +0.610GB misc)
[2023-11-21 14:26:05,014] [INFO] [axolotl.train.train:82] [PID:65] [RANK:0] Pre-saving adapter config to test/model
[2023-11-21 14:26:05,017] [INFO] [axolotl.train.train:106] [PID:65] [RANK:0] Starting trainer...
/opt/conda/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
Traceback (most recent call last):
File "/axolotl/scripts/finetune.py", line 54, in
fire.Fire(do_cli)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/tools/axolotl/scripts/finetune.py", line 50, in do_cli
train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
File "/tools/axolotl/src/axolotl/train.py", line 116, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1531, in train
self._load_from_checkpoint(resume_from_checkpoint)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2064, in _load_from_checkpoint
raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
ValueError: Can't find a valid checkpoint at test/model/checkpoint-45

And this error on the master:

[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.

The expected behavior is to resume successfully from the checkpoint in multi-node fine-tuning.
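
The ValueError comes from transformers' Trainer, which only treats a checkpoint directory as valid if it finds model (or adapter) weights inside it. A small hypothetical check along these lines, run on each node, reproduces the difference: it passes on node-1 and fails on node-2, whose checkpoint-45 contains only rng_state_1.pth.

# Hypothetical per-node check (not part of axolotl), approximating what the
# Trainer looks for before it will resume: some weight file inside the directory.
import os

WEIGHT_FILES = (
    "pytorch_model.bin",
    "model.safetensors",
    "adapter_model.bin",
    "adapter_model.safetensors",
)

def looks_resumable(ckpt_dir):
    return os.path.isdir(ckpt_dir) and any(
        os.path.exists(os.path.join(ckpt_dir, name)) for name in WEIGHT_FILES
    )

print(looks_resumable("test/model/checkpoint-45"))  # True on node-1, False on node-2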

Current behaviour

As described above: on resume, the worker fails with "ValueError: Can't find a valid checkpoint at test/model/checkpoint-45", and the master then goes down with NCCL errors ("Some NCCL operations have failed or timed out").

Steps to reproduce

1. Fine-tune on two nodes (one GPU each, no shared storage between them) with the accelerate configs and fine-tune-config.yaml shown above. Training works and both nodes write checkpoints to test/model, but only the master's checkpoints contain the model, optimizer and trainer state; the worker's checkpoints contain only its RNG state.
2. Add resume_from_checkpoint: test/model/checkpoint-45 to fine-tune-config.yaml (see the Config yaml section below) and relaunch on both nodes with the same accelerate configs.
3. The worker fails with "ValueError: Can't find a valid checkpoint at test/model/checkpoint-45", and the master then fails with NCCL errors, as shown above.

Config yaml

node-1: accelerate.config

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: MASTER-IP
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

node-2: accelerate-config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 1
main_process_ip: MASTER-IP
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

the fine-tune-config.yaml

base_model: openlm-research/open_llama_3b_v2
base_model_config: openlm-research/open_llama_3b_v2
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
tokenizer_legacy: false
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:

  • path: test/data.json
    type: sharegpt
    dataset_prepared_path: test/prepared-dataset
    val_set_size: 0.02
    output_dir: test/model

adapter: qlora
lora_model_dir:

sequence_len: 128
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 60
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: true
early_stopping_patience:
auto_resume_from_checkpoint: true
resume_from_checkpoint: test/model/checkpoint-45
local_rank:
logging_steps: 1
xformers_attention:
flash_attention:

warmup_steps: 2
eval_steps: 10
eval_table_size:
save_steps: 5
debug:
deepspeed:
weight_decay: 0.0
fsdp: null
fsdp_config: null
special_tokens:
bos_token: ""
eos_token: "
"
unk_token: ""
tokens: null

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

python3.10

axolotl branch-commit

a045db0

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
hahmad2008 added the bug label Nov 21, 2023
casper-hansen (Collaborator)

Please update your axolotl version, as this was fixed after the commit you are using. #795 fixed this.
