ValueError: Expected a cuda device, but got: cpu when using Deepspeed zero3 #1705

Open

l3utterfly opened this issue Jun 13, 2024 · 0 comments

Labels: bug (Something isn't working)

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Works with DeepSpeed ZeRO-3 out of the box

Current behaviour

Got this error:

[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire                                                                                       [45/1935]
[rank5]:     component, remaining_args = _CallAndUpdateTrace(
[rank5]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank5]:     component = fn(*varargs, **kwargs)
[rank5]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/src/axolotl/src/axolotl/cli/train.py", line 38, in do_cli
[rank5]:     return do_train(parsed_cfg, parsed_cli_args)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/src/axolotl/src/axolotl/cli/train.py", line 66, in do_train
[rank5]:     return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/src/axolotl/src/axolotl/train.py", line 170, in train
[rank5]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1885, in train
[rank5]:     return inner_training_loop(
[rank5]:            ^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
[rank5]:     tr_loss_step = self.training_step(model, inputs)
[rank5]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 3250, in training_step
[rank5]:     self.accelerator.backward(loss)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2117, in backward
[rank5]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
[rank5]:     self.engine.step()
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2169, in step
[rank5]:     self._take_model_step(lr_kwargs)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
[rank5]:     self.optimizer.step()
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank5]:     ret_val = func(*args, **kwargs)
[rank5]:               ^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2050, in step
[rank5]:     self._optimizer_step(sub_group_id)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 939, in _optimizer_step
[rank5]:     cpu_loss = self.optimizer.step()
[rank5]:                ^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/src/pytorch/torch/optim/lr_scheduler.py", line 129, in wrapper
[rank5]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/src/pytorch/torch/optim/optimizer.py", line 483, in wrapper
[rank5]:     out = func(*args, **kwargs)
[rank5]:           ^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/src/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
[rank5]:     return func(*args, **kwargs)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/optim/optimizer.py", line 287, in step
[rank5]:     self.update_step(group, p, gindex, pindex)
[rank5]:   File "/home/layla/src/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
[rank5]:     return func(*args, **kwargs)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/optim/optimizer.py", line 542, in update_step
[rank5]:     F.optimizer_update_8bit_blockwise(
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/functional.py", line 1770, in optimizer_update_8bit_blockwise
[rank5]:     prev_device = pre_call(g.device)
[rank5]:                   ^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/functional.py", line 459, in pre_call
[rank5]:     torch.cuda.set_device(device)
[rank5]:   File "/home/layla/src/pytorch/torch/cuda/__init__.py", line 414, in set_device
[rank5]:     device = _get_device_index(device)
[rank5]:              ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/src/pytorch/torch/cuda/_utils.py", line 34, in _get_device_index
[rank5]:     raise ValueError(f"Expected a cuda device, but got: {device}")
[rank5]: ValueError: Expected a cuda device, but got: cpu
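
The last frames show bitsandbytes' pre_call() passing the gradient's device straight to torch.cuda.set_device(), which only accepts CUDA devices. A minimal sketch of that failure mode, assuming a CPU-resident gradient tensor such as ZeRO-3 CPU offload would produce (illustrative only, not the actual axolotl/bitsandbytes code path):

import torch

# A gradient tensor living on the CPU, as it would under optimizer CPU offload.
g = torch.zeros(4)

try:
    # bitsandbytes' pre_call() effectively does this with g.device:
    torch.cuda.set_device(g.device)
except ValueError as err:
    print(err)  # ValueError: Expected a cuda device, but got: cpu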

Steps to reproduce

  1. Use the DeepSpeed ZeRO-3 config in this repo
  2. Install the fixed DeepSpeed (see the launch note below): pip install "deepspeed @ git+https://github.com/microsoft/DeepSpeed.git@bc48371c5e1fb8fd70fc79285e66201dbb65679b"
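
(The exact launch command isn't given above; presumably the standard axolotl entry point was used, e.g. accelerate launch -m axolotl.cli.train <config>.yml, which matches the axolotl/cli/train.py frames in the traceback.)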

Config yaml

base_model: models/Meta-Llama-3-8B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: train.jsonl
    ds_type: json # see other options below
    type: sharegpt
    conversation: chatml
    roles:
      input: ['User', 'Information']
      output: ['Layla']

chat_template: chatml
default_system_message: The following is a conversation. Embody the character and personality completely.

dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/out

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-7

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: false
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 5
evals_per_epoch: 10
eval_table_size:
saves_per_epoch: 10
debug:
deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_all.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.