Expected Behavior

Works with DeepSpeed ZeRO-3 out of the box.

Current behaviour

Got this error:
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
[rank5]:     component, remaining_args = _CallAndUpdateTrace(
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
[rank5]:     component = fn(*varargs, **kwargs)
[rank5]:   File "/home/layla/src/axolotl/src/axolotl/cli/train.py", line 38, in do_cli
[rank5]:     return do_train(parsed_cfg, parsed_cli_args)
[rank5]:   File "/home/layla/src/axolotl/src/axolotl/cli/train.py", line 66, in do_train
[rank5]:     return train(cfg=cfg, cli_args=cli_args, dataset_meta=dataset_meta)
[rank5]:   File "/home/layla/src/axolotl/src/axolotl/train.py", line 170, in train
[rank5]:     trainer.train(resume_from_checkpoint=resume_from_checkpoint)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 1885, in train
[rank5]:     return inner_training_loop(
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
[rank5]:     tr_loss_step = self.training_step(model, inputs)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/transformers/trainer.py", line 3250, in training_step
[rank5]:     self.accelerator.backward(loss)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/accelerator.py", line 2117, in backward
[rank5]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/accelerate/utils/deepspeed.py", line 175, in backward
[rank5]:     self.engine.step()
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2169, in step
[rank5]:     self._take_model_step(lr_kwargs)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 2075, in _take_model_step
[rank5]:     self.optimizer.step()
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
[rank5]:     ret_val = func(*args, **kwargs)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 2050, in step
[rank5]:     self._optimizer_step(sub_group_id)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/deepspeed/runtime/zero/stage3.py", line 939, in _optimizer_step
[rank5]:     cpu_loss = self.optimizer.step()
[rank5]:   File "/home/layla/src/pytorch/torch/optim/lr_scheduler.py", line 129, in wrapper
[rank5]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank5]:   File "/home/layla/src/pytorch/torch/optim/optimizer.py", line 483, in wrapper
[rank5]:     out = func(*args, **kwargs)
[rank5]:   File "/home/layla/src/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
[rank5]:     return func(*args, **kwargs)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/optim/optimizer.py", line 287, in step
[rank5]:     self.update_step(group, p, gindex, pindex)
[rank5]:   File "/home/layla/src/pytorch/torch/utils/_contextlib.py", line 115, in decorate_context
[rank5]:     return func(*args, **kwargs)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/optim/optimizer.py", line 542, in update_step
[rank5]:     F.optimizer_update_8bit_blockwise(
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/functional.py", line 1770, in optimizer_update_8bit_blockwise
[rank5]:     prev_device = pre_call(g.device)
[rank5]:   File "/home/layla/miniconda3/envs/axolotl/lib/python3.11/site-packages/bitsandbytes/functional.py", line 459, in pre_call
[rank5]:     torch.cuda.set_device(device)
[rank5]:   File "/home/layla/src/pytorch/torch/cuda/__init__.py", line 414, in set_device
[rank5]:     device = _get_device_index(device)
[rank5]:   File "/home/layla/src/pytorch/torch/cuda/_utils.py", line 34, in _get_device_index
[rank5]:     raise ValueError(f"Expected a cuda device, but got: {device}")
[rank5]: ValueError: Expected a cuda device, but got: cpu
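The last frames show the mechanics of the failure: with optimizer state offloaded to the CPU, bitsandbytes' `pre_call` receives a CPU-resident gradient and passes its device to `torch.cuda.set_device`, which accepts only CUDA devices. A minimal sketch of that check (the `Device` class and `get_device_index` function are simplified stand-ins for `torch.cuda._utils._get_device_index`, not the real API):

```python
# Simplified sketch (assumption) of torch.cuda._utils._get_device_index:
# it rejects any device whose type is not "cuda", which is exactly what
# happens when a ZeRO-3 CPU-offloaded gradient reaches bitsandbytes.

class Device:
    """Stand-in for torch.device: a type string plus an index."""
    def __init__(self, type_, index=0):
        self.type, self.index = type_, index

def get_device_index(device):
    # torch.cuda.set_device funnels through a check like this one.
    if device.type != "cuda":
        raise ValueError(f"Expected a cuda device, but got: {device.type}")
    return device.index

# A CPU-offloaded parameter's gradient lives on the CPU, so the check fails:
try:
    get_device_index(Device("cpu"))
except ValueError as e:
    print(e)  # -> Expected a cuda device, but got: cpu
```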
Steps to reproduce

pip install "deepspeed @ git+https://github.com/microsoft/DeepSpeed.git@bc48371c5e1fb8fd70fc79285e66201dbb65679b"
Config yaml

base_model: models/Meta-Llama-3-8B
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: train.jsonl
    ds_type: json # see other options below
    type: sharegpt
    conversation: chatml
    roles:
      input: ['User', 'Information']
      output: ['Layla']

chat_template: chatml
default_system_message: The following is a conversation. Embody the character and personality completely.
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/out

sequence_len: 8192
sample_packing: true
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-7

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
auto_resume_from_checkpoints: false
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 5
evals_per_epoch: 10
eval_table_size:
saves_per_epoch: 10
debug:
deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_all.json
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: <|end_of_text|>
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"
Possible solution

No response
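The reporter left this blank; one hedged reading of the traceback is a clash between the bitsandbytes `paged_adamw_8bit` optimizer, whose update kernel requires CUDA tensors, and `zero3_bf16_cpuoffload_all.json`, which offloads optimizer state to the CPU. A sketch of two directions to try (the `zero3_bf16.json` and `adamw_torch` names are assumptions, not taken from the report):

```yaml
# Option A (assumption): keep the 8-bit optimizer, but use a ZeRO-3
# config that does not offload optimizer state to the CPU.
deepspeed: deepspeed_configs/zero3_bf16.json
optimizer: paged_adamw_8bit

# Option B (assumption): keep full CPU offload, but switch to an
# optimizer that can step on CPU tensors.
# deepspeed: deepspeed_configs/zero3_bf16_cpuoffload_all.json
# optimizer: adamw_torch
```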
Python Version

3.11
axolotl branch-commit

main