Can't resume from checkpoint for multi-node fine-tuning #884

Open
hahmad2008 opened this issue Nov 21, 2023 · 1 comment

Labels
bug Something isn't working

Comments

hahmad2008 commented Nov 21, 2023

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I have two nodes and want to fine-tune across both of them. Fine-tuning on the two nodes works perfectly, but when I try to resume from a checkpoint, the worker node can't find a valid checkpoint at the selected path.

node-1: master (1GPU)
node-2: worker (1GPU)
NOTE: No shared storage between the nodes

I ran the fine-tune successfully using these configurations on both nodes:

node-1: accelerate-config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: MASTER-IP
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

node-2: accelerate-config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 1
main_process_ip: MASTER-IP
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
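
As a hypothetical sanity check (not something from the run above): launching a tiny script like the one below with these two accelerate configs should print process_index=0 on the master and process_index=1 on the worker, with num_processes=2 on both, confirming that the two machines join the same process group.

# sanity_check.py — hypothetical helper, run on each node with:
#   accelerate launch --config_file <that node's accelerate config> sanity_check.py
from accelerate import Accelerator

accelerator = Accelerator()
print(f"process_index={accelerator.process_index} num_processes={accelerator.num_processes}")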

The fine-tune-config.yaml:

base_model: openlm-research/open_llama_3b_v2
base_model_config: openlm-research/open_llama_3b_v2
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
tokenizer_legacy: false
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
  - path: test/data.json
    type: sharegpt
dataset_prepared_path: test/prepared-dataset
val_set_size: 0.02
output_dir: test/model

adapter: qlora
lora_model_dir:

sequence_len: 128
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 60
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: true
early_stopping_patience:
auto_resume_from_checkpoint: true
local_rank:
logging_steps: 1
xformers_attention:
flash_attention:

warmup_steps: 2
eval_steps: 10
eval_table_size:
save_steps: 5
debug:
deepspeed:
weight_decay: 0.0
fsdp: null
fsdp_config: null
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: null
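
A back-of-the-envelope note on the save cadence (my arithmetic, not anything from the run): with micro_batch_size: 2, gradient_accumulation_steps: 4 and num_processes: 2, each optimizer step consumes 16 samples, and save_steps: 5 is why checkpoints appear at steps 5, 10, ..., 60.

# Effective batch size and save cadence implied by the config above
# (assumption: the usual meaning of these knobs, i.e. effective = micro * accumulation * world size)
micro_batch_size = 2
gradient_accumulation_steps = 4
num_processes = 2
save_steps = 5

effective_batch = micro_batch_size * gradient_accumulation_steps * num_processes
print(effective_batch)               # 16 samples per optimizer step
print(effective_batch * save_steps)  # 80 samples between checkpoints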

On master node-1, every checkpoint is saved with the model inside it, but on worker node-2 the checkpoints are essentially empty:

  • Master node-1

node-1:~# ls test/model/
README.md adapter_config.json adapter_model.bin checkpoint-45 checkpoint-50 checkpoint-55 checkpoint-60 config.json special_tokens_map.json tokenizer.model tokenizer_config.json

node-1:~# ls test/model/checkpoint-45/
README.md adapter_config.json adapter_model.bin optimizer.pt rng_state_0.pth scheduler.pt trainer_state.json training_args.bin

  • Worker node-2

node-2:~# ls test/model/
README.md adapter_model.bin checkpoint-15 checkpoint-25 checkpoint-35 checkpoint-45 checkpoint-50 checkpoint-60 special_tokens_map.json tokenizer_config.json
adapter_config.json checkpoint-10 checkpoint-20 checkpoint-30 checkpoint-40 checkpoint-5 checkpoint-55 config.json tokenizer.model

Inside checkpoint-45:

node-2:~# ls test/model/checkpoint-45/
rng_state_1.pth
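
As far as I can tell, this matches how transformers' Trainer saves checkpoints in distributed runs: only the main process writes the model, optimizer, scheduler and trainer state, while every rank writes its own RNG state. A rough sketch of that behavior (my assumption, not axolotl's or transformers' actual code):

import os
import torch

def save_checkpoint_sketch(accelerator, model, optimizer, output_dir):
    # `accelerator` is an accelerate.Accelerator
    os.makedirs(output_dir, exist_ok=True)
    if accelerator.is_main_process:
        # only the main process writes the weights and optimizer state
        # (in the real run: adapter_model.bin, optimizer.pt, scheduler.pt, trainer_state.json, ...)
        torch.save(model.state_dict(), os.path.join(output_dir, "pytorch_model.bin"))
        torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
    # every rank writes its own RNG state, which is why node-2's checkpoint-45
    # contains only rng_state_1.pth
    torch.save(torch.get_rng_state(), os.path.join(output_dir, f"rng_state_{accelerator.process_index}.pth"))

With no shared storage, each node keeps only what its own rank wrote, so the worker never ends up with a complete checkpoint to resume from.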

Resume from checkpoint on multiple nodes

To resume the previous fine-tune, I reuse the same configuration everywhere, except that in fine-tune-config.yaml I add resume_from_checkpoint:

fine-tune-config.yaml

resume_from_checkpoint: test/model/checkpoint-45

I then get this error on the worker:

[2023-11-21 14:25:52,690] [INFO] [axolotl.train.train:54] [PID:65] [RANK:0] loading model and (optionally) peft_config...
[2023-11-21 14:26:03,872] [INFO] [axolotl.load_model:410] [PID:65] [RANK:0] GPU memory usage after model load: 1.967GB (+0.105GB cache, +0.610GB misc)
[2023-11-21 14:26:03,876] [INFO] [axolotl.load_model:427] [PID:65] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training
[2023-11-21 14:26:03,880] [INFO] [axolotl.load_model:438] [PID:65] [RANK:0] converting modules to torch.float16 for flash attention
[2023-11-21 14:26:03,883] [INFO] [axolotl.load_lora:547] [PID:65] [RANK:0] found linear modules: ['o_proj', 'down_proj', 'k_proj', 'up_proj', 'v_proj', 'gate_proj', 'q_proj']
trainable params: 50,851,840 || all params: 3,477,325,440 || trainable%: 1.4623836876194136
[2023-11-21 14:26:04,610] [INFO] [axolotl.load_model:474] [PID:65] [RANK:0] GPU memory usage after adapters: 2.178GB (+0.771GB cache, +0.610GB misc)
[2023-11-21 14:26:05,014] [INFO] [axolotl.train.train:82] [PID:65] [RANK:0] Pre-saving adapter config to test/model
[2023-11-21 14:26:05,017] [INFO] [axolotl.train.train:106] [PID:65] [RANK:0] Starting trainer...
/opt/conda/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
Traceback (most recent call last):
File "/axolotl/scripts/finetune.py", line 54, in
fire.Fire(do_cli)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/tools/axolotl/scripts/finetune.py", line 50, in do_cli
train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
File "/tools/axolotl/src/axolotl/train.py", line 116, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1531, in train
self._load_from_checkpoint(resume_from_checkpoint)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2064, in _load_from_checkpoint
raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
ValueError: Can't find a valid checkpoint at test/model/checkpoint-45

And this error on the master:

[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.

The expected behavior is to resume successfully from the checkpoint in multi-node fine-tuning.
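
The ValueError comes from transformers' Trainer, which only treats a checkpoint directory as valid if it finds model (or adapter) weights inside it. A small hypothetical check along these lines, run on each node, reproduces the difference: it passes on node-1 and fails on node-2, whose checkpoint-45 contains only rng_state_1.pth.

# Hypothetical per-node check (not part of axolotl), approximating what the
# Trainer looks for before it will resume: some weight file inside the directory.
import os

WEIGHT_FILES = (
    "pytorch_model.bin",
    "model.safetensors",
    "adapter_model.bin",
    "adapter_model.safetensors",
)

def looks_resumable(ckpt_dir):
    return os.path.isdir(ckpt_dir) and any(
        os.path.exists(os.path.join(ckpt_dir, name)) for name in WEIGHT_FILES
    )

print(looks_resumable("test/model/checkpoint-45"))  # True on node-1, False on node-2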

Current behaviour

As described above: on resume, the worker fails with "ValueError: Can't find a valid checkpoint at test/model/checkpoint-45", and the master then goes down with NCCL errors ("Some NCCL operations have failed or timed out").

Steps to reproduce

1. Fine-tune on two nodes (one GPU each, no shared storage between them) with the accelerate configs and fine-tune-config.yaml shown above. Training works and both nodes write checkpoints to test/model, but only the master's checkpoints contain the model, optimizer and trainer state; the worker's checkpoints contain only its RNG state.
2. Add resume_from_checkpoint: test/model/checkpoint-45 to fine-tune-config.yaml (see the Config yaml section below) and relaunch on both nodes with the same accelerate configs.
3. The worker fails with "ValueError: Can't find a valid checkpoint at test/model/checkpoint-45", and the master then fails with NCCL errors, as shown above.

Config yaml

node-1: accelerate.config

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0
main_process_ip: MASTER-IP
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

node-2: accelerate-config.yaml

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 1
main_process_ip: MASTER-IP
main_process_port: 12345
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

the fine-tune-config.yaml

base_model: openlm-research/open_llama_3b_v2
base_model_config: openlm-research/open_llama_3b_v2
model_type: LlamaForCausalLM
tokenizer_type: LlamaTokenizer
tokenizer_legacy: false
is_llama_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:

  • path: test/data.json
    type: sharegpt
    dataset_prepared_path: test/prepared-dataset
    val_set_size: 0.02
    output_dir: test/model

adapter: qlora
lora_model_dir:

sequence_len: 128
sample_packing: false
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

wandb_project:
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 60
optimizer: paged_adamw_32bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: false
fp16: true
tf32: false

gradient_checkpointing: true
early_stopping_patience:
auto_resume_from_checkpoint: true
resume_from_checkpoint: test/model/checkpoint-45
local_rank:
logging_steps: 1
xformers_attention:
flash_attention:

warmup_steps: 2
eval_steps: 10
eval_table_size:
save_steps: 5
debug:
deepspeed:
weight_decay: 0.0
fsdp: null
fsdp_config: null
special_tokens:
bos_token: ""
eos_token: "
"
unk_token: ""
tokens: null

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

python3.10

axolotl branch-commit

a045db0

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
hahmad2008 added the bug label Nov 21, 2023
casper-hansen (Collaborator)

Please update your axolotl version, as this was fixed after the commit you are using. #795 fixed this.
