0%| | 0/33 [00:00<?, ?it/s]
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/xxx/llm_projects/LLaMA-Factory/src/train.py", line 28, in <module>
[rank1]: main()
[rank1]: File "/home/xxx/llm_projects/LLaMA-Factory/src/train.py", line 19, in main
[rank1]: run_exp()
[rank1]: File "/home/xxx/llm_projects/LLaMA-Factory/src/llamafactory/train/tuner.py", line 56, in run_exp
[rank1]: run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
[rank1]: File "/home/xxx/llm_projects/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 79, in run_dpo
[rank1]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
[rank1]: return inner_training_loop(
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/trainer.py", line 2268, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs)
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/trainer.py", line 3307, in training_step
[rank1]: loss = self.compute_loss(model, inputs)
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1257, in compute_loss
[rank1]: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank1]: File "/home/xxx/llm_projects/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 228, in get_batch_loss_metrics
[rank1]: reference_chosen_logps, reference_rejected_logps = self.compute_reference_log_probs(model, batch)
[rank1]: File "/home/xxx/llm_projects/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 206, in compute_reference_log_probs
[rank1]: reference_chosen_logps, reference_rejected_logps, *_ = self.concatenated_forward(ref_model, batch)
[rank1]: File "/home/xxx/llm_projects/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 177, in concatenated_forward
[rank1]: all_logits: "torch.Tensor" = model(**batch, return_dict=True, use_cache=False).logits.to(torch.float32)
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
[rank1]: return model_forward(*args, **kwargs)
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
[rank1]: return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
[rank1]: return func(*args, **kwargs)
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1174, in forward
[rank1]: outputs = self.model(
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 931, in forward
[rank1]: inputs_embeds = self.embed_tokens(input_ids)
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank1]: return forward_call(*args, **kwargs)
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 163, in forward
[rank1]: return F.embedding(
[rank1]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/functional.py", line 2264, in embedding
[rank1]: return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank1]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/xxx/llm_projects/LLaMA-Factory/src/train.py", line 28, in <module>
[rank0]: main()
[rank0]: File "/home/xxx/llm_projects/LLaMA-Factory/src/train.py", line 19, in main
[rank0]: run_exp()
[rank0]: File "/home/xxx/llm_projects/LLaMA-Factory/src/llamafactory/train/tuner.py", line 56, in run_exp
[rank0]: run_dpo(model_args, data_args, training_args, finetuning_args, callbacks)
[rank0]: File "/home/xxx/llm_projects/LLaMA-Factory/src/llamafactory/train/dpo/workflow.py", line 79, in run_dpo
[rank0]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/trainer.py", line 1932, in train
[rank0]: return inner_training_loop(
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/trainer.py", line 2268, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs)
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/trainer.py", line 3307, in training_step
[rank0]: loss = self.compute_loss(model, inputs)
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/trl/trainer/dpo_trainer.py", line 1257, in compute_loss
[rank0]: loss, metrics = self.get_batch_loss_metrics(model, inputs, train_eval="train")
[rank0]: File "/home/xxx/llm_projects/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 228, in get_batch_loss_metrics
[rank0]: reference_chosen_logps, reference_rejected_logps = self.compute_reference_log_probs(model, batch)
[rank0]: File "/home/xxx/llm_projects/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 206, in compute_reference_log_probs
[rank0]: reference_chosen_logps, reference_rejected_logps, *_ = self.concatenated_forward(ref_model, batch)
[rank0]: File "/home/xxx/llm_projects/LLaMA-Factory/src/llamafactory/train/dpo/trainer.py", line 177, in concatenated_forward
[rank0]: all_logits: "torch.Tensor" = model(**batch, return_dict=True, use_cache=False).logits.to(torch.float32)
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 822, in forward
[rank0]: return model_forward(*args, **kwargs)
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/accelerate/utils/operations.py", line 810, in __call__
[rank0]: return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 16, in decorate_autocast
[rank0]: return func(*args, **kwargs)
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1174, in forward
[rank0]: outputs = self.model(
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 931, in forward
[rank0]: inputs_embeds = self.embed_tokens(input_ids)
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 163, in forward
[rank0]: return F.embedding(
[rank0]: File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/nn/functional.py", line 2264, in embedding
[rank0]: return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
0%| | 0/33 [00:28<?, ?it/s]
W0628 16:52:43.518000 139824488286016 torch/distributed/elastic/multiprocessing/api.py:851] Sending process 11966 closing signal SIGTERM
E0628 16:52:47.337000 139824488286016 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 11965) of binary: /home/xxx/anaconda3/envs/llama-factory/bin/python
Traceback (most recent call last):
File "/home/xxx/anaconda3/envs/llama-factory/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1084, in launch_command
multi_gpu_launcher(args)
File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/accelerate/commands/launch.py", line 733, in multi_gpu_launcher
distrib_run.run(args)
File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
elastic_launch(
File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/xxx/anaconda3/envs/llama-factory/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-28_16:52:43
host : vision14.local
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 11965)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Reminder
System Info
pass
Reproduction
Error message:
[rank1]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:1! (when checking argument for argument index in method wrapper_CUDA__index_select)
[rank0]: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)
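For context, this device mismatch is not specific to LLaMA-Factory: F.embedding raises exactly this error whenever the embedding weight is still on the CPU while the input ids have already been moved to a CUDA device, which is what appears to happen to the DPO reference model here. A minimal sketch that reproduces the same message (illustrative only; it assumes a CUDA device is available, and the embedding sizes are arbitrary):

import torch
import torch.nn as nn

# Embedding weights remain on the CPU (analogous to a reference model that was
# never moved or sharded onto the GPU), while the batch is already on cuda:0.
embed = nn.Embedding(num_embeddings=32000, embedding_dim=4096)
input_ids = torch.tensor([[1, 2, 3]], device="cuda:0")

# Raises: RuntimeError: Expected all tensors to be on the same device,
# but found at least two devices, cpu and cuda:0! (when checking argument
# for argument index in method wrapper_CUDA__index_select)
embed(input_ids)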
Expected behavior
No response
Others
llama3_full_dpo_fsdp.yaml