ValueError: Expected a cuda device, but got: cpu when using Deepspeed zero3 #1705

Open

l3utterfly opened this issue Jun 13, 2024 · 0 comments

Labels: bug (Something isn't working)

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Works with DeepSpeed ZeRO-3 out of the box

Current behaviour

Got this error:

[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire                                                                                       [45/1935]
[rank5]:     component, remaining_args = _CallAndUpdateTrace(
[rank5]:                                 ^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank5]:     component = fn(*varargs, **kwargs)
[rank5]:                 ^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/src/axolotl/src/axolotl/cli/train.py", line 38, in do_cli
[rank5]:     return do_train(parsed_cfg, parsed_cli_args)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/src/axolotl/src/axolotl/cli/train.py", line 66, in do_train
[rank5]:     return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/src/axolotl/src/axolotl/train.py", line 170, in train
[rank5]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1885, in train
[rank5]:     return inner_training_loop(
[rank5]:            ^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
[rank5]:     tr_loss_step = self.training_step(model, inputs)
[rank5]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 3250, in training_step
[rank5]:     self.accelerator.backward(loss)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2117, in backward
[rank5]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
[rank5]:     self.engine.step()
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2169, in step
[rank5]:     self._take_model_step(lr_kwargs)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
[rank5]:     self.optimizer.step()
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank5]:     ret_val = func(*args, **kwargs)
[rank5]:               ^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2050, in step
[rank5]:     self._optimizer_step(sub_group_id)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 939, in _optimizer_step
[rank5]:     cpu_loss = self.optimizer.step()
[rank5]:                ^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/src/pytorch/torch/optim/lr_scheduler.py", line 129, in wrapper
[rank5]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/src/pytorch/torch/optim/optimizer.py", line 483, in wrapper
[rank5]:     out = func(*args, **kwargs)
[rank5]:           ^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/src/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
[rank5]:     return func(*args, **kwargs)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/optim/optimizer.py", line 287, in step
[rank5]:     self.update_step(group, p, gindex, pindex)
[rank5]:   File "/home/layla/src/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
[rank5]:     return func(*args, **kwargs)
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/optim/optimizer.py", line 542, in update_step
[rank5]:     F.optimizer_update_8bit_blockwise(
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/functional.py", line 1770, in optimizer_update_8bit_blockwise
[rank5]:     prev_device = pre_call(g.device)
[rank5]:                   ^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/functional.py", line 459, in pre_call
[rank5]:     torch.cuda.set_device(device)
[rank5]:   File "/home/layla/src/pytorch/torch/cuda/__init__.py", line 414, in set_device
[rank5]:     device = _get_device_index(device)
[rank5]:              ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]:   File "/home/layla/src/pytorch/torch/cuda/_utils.py", line 34, in _get_device_index
[rank5]:     raise ValueError(f"Expected a cuda device, but got: {device}")
[rank5]: ValueError: Expected a cuda device, but got: cpu
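
The last frames show bitsandbytes' pre_call() passing the gradient's device straight to torch.cuda.set_device(), which only accepts CUDA devices. A minimal sketch of that failure mode, assuming a CPU-resident gradient tensor such as ZeRO-3 CPU offload would produce (illustrative only, not the actual axolotl/bitsandbytes code path):

import torch

# A gradient tensor living on the CPU, as it would under optimizer CPU offload.
g = torch.zeros(4)

try:
    # bitsandbytes' pre_call() effectively does this with g.device:
    torch.cuda.set_device(g.device)
except ValueError as err:
    print(err)  # ValueError: Expected a cuda device, but got: cpu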

Steps to reproduce

  1. Use the DeepSpeed ZeRO-3 config in this repo
  2. Install the fixed DeepSpeed (see the launch note below): pip install "deepspeed @ git+https://github.com/microsoft/DeepSpeed.git@bc48371c5e1fb8fd70fc79285e66201dbb65679b"
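
(The exact launch command isn't given above; presumably the standard axolotl entry point was used, e.g. accelerate launch -m axolotl.cli.train <config>.yml, which matches the axolotl/cli/train.py frames in the traceback.)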

Config yaml

base_model: models/Meta-Llama-3-8B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: train.jsonl
    ds_type: json # see other options below
    type: sharegpt
    conversation: chatml
    roles:
      input: ['User', 'Information']
      output: ['Layla']

chat_template: chatml
default_system_message: The following is a conversation. Embody the character and personality completely.

dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/out

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-7

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: false
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 5
evals_per_epoch: 10
eval_table_size:
saves_per_epoch: 10
debug:
deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_all.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.11

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.