CUDA 12 Error #817

Closed
generalsvr opened this issue Nov 3, 2023 · 10 comments
Labels
bug Something isn't working

Comments


generalsvr commented Nov 3, 2023

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Expected to run the fine-tuning task.

Current behaviour

Error after loading the shards:

ImportError: libcudart.so.12: cannot open shared object file: No such file or directory

Two days ago this worked perfectly with the same setup; today I hit this error. I tried the Docker image and building from source, openllama3 (as per the repo example) and llama2 13b, and A100, H100, and 3090 GPUs.
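
A quick way to check whether the environment actually provides a CUDA 12 runtime, and which CUDA version the installed torch wheel expects (a rough sketch, assuming a typical Linux setup):

# list the CUDA runtime libraries the dynamic linker can find
ldconfig -p | grep libcudart
# print the CUDA version the installed torch wheel was built against
python -c "import torch; print(torch.version.cuda)"
# system toolkit version, if nvcc is present
nvcc --version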

Steps to reproduce

Same steps as in the repo, both Docker and building from source, running on RunPod instances.

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
@generalsvr generalsvr added the bug Something isn't working label Nov 3, 2023
@truenorth8

Running into the same problem. It was working 2 days ago when I last tried. The base Docker image is winglian/axolotl-runpod:main-latest.

huggingface/peft just released a new version, but installing at the previous tag didn't resolve the issue:

pip3 install -U git+https://github.com/huggingface/peft.git@v0.5.0

@generalsvr
Author

[Screenshot 2023-11-03 at 18:57:04]

Same error on a vast.ai 3090, built from source.

@generalsvr
Author

Compiling flash attention from https://github.com/Dao-AILab/flash-attention also didn't help
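
One way to narrow down which package is pulling in the CUDA 12 runtime is to import the likely suspects one at a time (a sketch; assumes these packages are in the environment):

# each import either succeeds or reproduces the libcudart.so.12 ImportError
python -c "import flash_attn; print('flash_attn ok')"
python -c "import auto_gptq; print('auto_gptq ok')"
python -c "import bitsandbytes; print('bitsandbytes ok')"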

@fpreiss
Contributor

fpreiss commented Nov 3, 2023

I ran into the same issue; apparently auto-gptq gets upgraded to version 0.5.0 when installing axolotl. Downgrading it fixed the issue for me:

pip install auto-gptq==0.4.2
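
To double-check that the pin actually took effect (a quick sanity check):

# confirm the downgraded version is the one installed
pip show auto-gptq | grep -i version
# expected output: Version: 0.4.2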

@IamGianluca

I can confirm. Downgrading auto-gptq also resolved the issue for me. Thank you @fpreiss

@Mihaiii

Mihaiii commented Nov 3, 2023

I can confirm too. Thanks!

@winglian
Collaborator

winglian commented Nov 3, 2023

#818 fixes this

@winglian winglian closed this as completed Nov 3, 2023
@jaywongs

jaywongs commented Nov 8, 2023

#818 fixes this

Tried this PR, but ran into this problem:
ImportError: Found an incompatible version of auto-gptq. Found version 0.4.2, but only versions above {AUTOGPTQ_MINIMUM_VERSION} are supported

Traceback (most recent call last):
  File "/mnt/workspace/qishi/project/axolotl/scripts/finetune.py", line 52, in <module>
    fire.Fire(do_cli)
  File "/root/anaconda3/envs/qishi/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/anaconda3/envs/qishi/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/anaconda3/envs/qishi/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/mnt/workspace/qishi/project/axolotl/scripts/finetune.py", line 48, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/mnt/workspace/qishi/project/axolotl/src/axolotl/train.py", line 62, in train
    model, peft_config = load_model(cfg, tokenizer, inference=cli_args.inference)
  File "/mnt/workspace/qishi/project/axolotl/src/axolotl/utils/models.py", line 440, in load_model
    model, lora_config = load_adapter(model, cfg, cfg.adapter)
  File "/mnt/workspace/qishi/project/axolotl/src/axolotl/utils/models.py", line 475, in load_adapter
    return load_lora(model, cfg, inference=inference)
  File "/mnt/workspace/qishi/project/axolotl/src/axolotl/utils/models.py", line 556, in load_lora
    model = get_peft_model(model, lora_config)
  File "/root/anaconda3/envs/qishi/lib/python3.10/site-packages/peft/mapping.py", line 116, in get_peft_model
    return MODEL_TYPE_TO_PEFT_MODEL_MAPPING[peft_config.task_type](model, peft_config, adapter_name=adapter_name)
  File "/root/anaconda3/envs/qishi/lib/python3.10/site-packages/peft/peft_model.py", line 947, in __init__
    super().__init__(model, peft_config, adapter_name)
  File "/root/anaconda3/envs/qishi/lib/python3.10/site-packages/peft/peft_model.py", line 119, in __init__
    self.base_model = cls(model, {adapter_name: peft_config}, adapter_name)
  File "/root/anaconda3/envs/qishi/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 111, in __init__
    super().__init__(model, config, adapter_name)
  File "/root/anaconda3/envs/qishi/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 93, in __init__
    self.inject_adapter(self.model, adapter_name)
  File "/root/anaconda3/envs/qishi/lib/python3.10/site-packages/peft/tuners/tuners_utils.py", line 231, in inject_adapter
    self._create_and_replace(peft_config, adapter_name, target, target_name, parent, **optional_kwargs)
  File "/root/anaconda3/envs/qishi/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 193, in _create_and_replace
    new_module = self._create_new_module(lora_config, adapter_name, target, **kwargs)
  File "/root/anaconda3/envs/qishi/lib/python3.10/site-packages/peft/tuners/lora/model.py", line 255, in _create_new_module
    AutoGPTQQuantLinear = get_auto_gptq_quant_linear(gptq_quantization_config)
  File "/root/anaconda3/envs/qishi/lib/python3.10/site-packages/peft/utils/other.py", line 415, in get_auto_gptq_quant_linear
    if is_auto_gptq_available():
  File "/root/anaconda3/envs/qishi/lib/python3.10/site-packages/peft/import_utils.py", line 41, in is_auto_gptq_available
    raise ImportError(
ImportError: Found an incompatible version of auto-gptq. Found version 0.4.2, but only versions above {AUTOGPTQ_MINIMUM_VERSION} are supported

Update: peft updated their code to require auto_gptq >= 0.5.0, which leads to this error.
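
If that's the cause, one possible workaround (untested here, and assuming peft v0.5.0 predates the minimum-version check) would be to pin both packages to the older releases:

# pin peft to a release before the auto-gptq version check,
# alongside the CUDA 11 build of auto-gptq
pip install peft==0.5.0 auto-gptq==0.4.2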

@Nixellion

Nixellion commented Nov 8, 2023

This is still a problem; it should probably be reopened. It's impossible to train anything I try.

This is what I get if I downgrade peft:

Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 38, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 34, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 124, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1591, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1984, in _inner_training_loop
    self._maybe_log_save_evaluate(tr_loss, model, trial, epoch, ignore_keys_for_eval)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2328, in _maybe_log_save_evaluate
    metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 3066, in evaluate
    output = eval_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 3214, in evaluation_loop
    if has_length(dataloader):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer_utils.py", line 623, in has_length
    return len(dataset) is not None
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 483, in __len__
    return len(self._index_sampler)
ValueError: __len__() should return >= 0
