Please check that this issue hasn't been reported before.
I searched previous Bug Reports and didn't find any similar reports.
Expected Behavior
I have two nodes and I want to fine-tune on them. The initial fine-tune across the two nodes works perfectly, but when I try to resume from a checkpoint, the worker node cannot find a valid checkpoint at the selected path.
node-1: master (1GPU)
node-2: worker (1GPU)
NOTE: No shared storage between the nodes
I ran fine-tuning successfully with these configurations (listed under Steps to reproduce and Config yaml below) on both nodes. On the master (node-1) every checkpoint is saved with the model inside it, but on the worker (node-2) the checkpoint directory contains only the RNG state:
node-2:~# ls test/model/checkpoint-45/
rng_state_1.pth
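Judging from the traceback further down, the ValueError comes from the Trainer finding no model or adapter weight files in the checkpoint directory, and on the worker only the per-rank RNG state was ever written. A quick local check along these lines (the file names are the common transformers/PEFT defaults, not taken from this run) makes the difference between the two nodes visible:

```python
# Rough sanity check for an HF Trainer checkpoint directory before resuming.
# The file names below are the common transformers/PEFT defaults (an assumption,
# not taken from this run); sharded or DeepSpeed checkpoints use other names.
import os
import sys

WEIGHT_FILES = (
    "pytorch_model.bin",          # full-model weights
    "model.safetensors",          # full-model weights (safetensors)
    "adapter_model.bin",          # PEFT/LoRA adapter weights
    "adapter_model.safetensors",  # PEFT/LoRA adapter weights (safetensors)
)

def looks_resumable(ckpt_dir: str) -> bool:
    """Return True if the directory contains at least one weight/adapter file."""
    if not os.path.isdir(ckpt_dir):
        return False
    files = set(os.listdir(ckpt_dir))
    return any(name in files for name in WEIGHT_FILES)

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "test/model/checkpoint-45"
    print(f"{path}: looks resumable = {looks_resumable(path)}")
```

On node-2 this reports False for test/model/checkpoint-45, since the directory only holds rng_state_1.pth.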
Resume from checkpoint on multiple nodes
To resume the previous fine-tuning, I reuse the same configuration everywhere except fine-tune-config.yaml, where I add resume_from_checkpoint:
fine-tune-config.yaml
resume_from_checkpoint: test/model/checkpoint-45
I get this error on the worker:
[2023-11-21 14:25:52,690] [INFO] [axolotl.train.train:54] [PID:65] [RANK:0] loading model and (optionally) peft_config...
[2023-11-21 14:26:03,872] [INFO] [axolotl.load_model:410] [PID:65] [RANK:0] GPU memory usage after model load: 1.967GB (+0.105GB cache, +0.610GB misc)
[2023-11-21 14:26:03,876] [INFO] [axolotl.load_model:427] [PID:65] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training
[2023-11-21 14:26:03,880] [INFO] [axolotl.load_model:438] [PID:65] [RANK:0] converting modules to torch.float16 for flash attention
[2023-11-21 14:26:03,883] [INFO] [axolotl.load_lora:547] [PID:65] [RANK:0] found linear modules: ['o_proj', 'down_proj', 'k_proj', 'up_proj', 'v_proj', 'gate_proj', 'q_proj']
trainable params: 50,851,840 || all params: 3,477,325,440 || trainable%: 1.4623836876194136
[2023-11-21 14:26:04,610] [INFO] [axolotl.load_model:474] [PID:65] [RANK:0] GPU memory usage after adapters: 2.178GB (+0.771GB cache, +0.610GB misc)
[2023-11-21 14:26:05,014] [INFO] [axolotl.train.train:82] [PID:65] [RANK:0] Pre-saving adapter config to test/model
[2023-11-21 14:26:05,017] [INFO] [axolotl.train.train:106] [PID:65] [RANK:0] Starting trainer...
/opt/conda/lib/python3.10/site-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
Traceback (most recent call last):
File "/axolotl/scripts/finetune.py", line 54, in
fire.Fire(do_cli)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/conda/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/tools/axolotl/scripts/finetune.py", line 50, in do_cli
train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
File "/tools/axolotl/src/axolotl/train.py", line 116, in train
trainer.train(resume_from_checkpoint=resume_from_checkpoint)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 1531, in train
self._load_from_checkpoint(resume_from_checkpoint)
File "/opt/conda/lib/python3.10/site-packages/transformers/trainer.py", line 2064, in _load_from_checkpoint
raise ValueError(f"Can't find a valid checkpoint at {resume_from_checkpoint}")
ValueError: Can't find a valid checkpoint at test/model/checkpoint-45
and this error on the master:
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
Expected behavior: resuming from the checkpoint succeeds in multi-node fine-tuning.
Current behaviour
The resume fails: the worker raises ValueError: Can't find a valid checkpoint at test/model/checkpoint-45 (full traceback above), and the master then goes down with the NCCL timeout errors shown above.
Steps to reproduce
Fine-tune on two nodes, then try to resume from a checkpoint. The initial fine-tune works perfectly; only the resume fails.
node-1: master (1GPU)
node-2: worker (1GPU)
NOTE: No shared storage between the nodes
Run the fine-tune successfully using these configurations on both nodes:
node-1: accelerate.config
node-2: accelerate-config.yaml
the fine-tune-config.yaml
On the master (node-1) all checkpoints are saved with the model inside them, but on the worker (node-2) the checkpoint is empty apart from rng_state_1.pth (see the listing above).
Resume from checkpoint on multiple nodes
To resume the previous fine-tuning, use the same configuration everywhere except fine-tune-config.yaml, where resume_from_checkpoint: test/model/checkpoint-45 is added (as shown above).
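Since there is no shared storage, resume_from_checkpoint presumably has to resolve to a complete checkpoint on every node, so resuming would require first staging the master's checkpoint onto the worker. A rough sketch of that staging step (the hostname, user, and paths are placeholders; this is an assumption, not something tried here):

```python
# Hypothetical staging step (not from this report): copy the master's checkpoint
# to the worker so resume_from_checkpoint resolves on both nodes.
# "user@worker-host" and the paths are placeholders.
import subprocess

CHECKPOINT = "test/model/checkpoint-45"
WORKER = "user@worker-host"

def stage_checkpoint(checkpoint: str = CHECKPOINT, worker: str = WORKER) -> None:
    # rsync over SSH; the trailing "/" copies the directory contents into the
    # same relative path on the worker (the parent directory must already exist).
    subprocess.run(
        ["rsync", "-a", f"{checkpoint}/", f"{worker}:{checkpoint}/"],
        check=True,
    )

if __name__ == "__main__":
    stage_checkpoint()
```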
Config yaml
node-1: accelerate.config
node-2: accelerate-config.yaml
the fine-tune-config.yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
python3.10
axolotl branch-commit
a045db0
Acknowledgements