Cuda 12 Error #817
Comments
Running into the same problem; it was working two days ago when I last tried. The base docker image is winglian/axolotl-runpod:main-latest. huggingface/peft just released a new version, but installing at the previous tag didn't resolve the issue.
Compiling flash attention from https://github.com/Dao-AILab/flash-attention also didn't help.
I ran into the same issue; apparently auto-gptq gets updated to version 0.5.0 when installing axolotl. Downgrading it fixed the issue for me: pip install auto-gptq==0.4.2
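The fix above is really a version-range problem: per this thread, auto-gptq releases from 0.5.0 onward pull in CUDA 12 builds, while 0.4.2 still works on CUDA 11 images. A minimal sketch of that check (the function name is hypothetical and version strings are assumed to be plain `major.minor.patch`; this is not part of axolotl):

```python
def pulls_cuda12_wheels(version: str) -> bool:
    """Per this thread: auto-gptq >= 0.5.0 ships CUDA 12 builds,
    while 0.4.2 is the last version known to work on CUDA 11 images."""
    parts = tuple(int(x) for x in version.split("."))
    return parts >= (0, 5, 0)

print(pulls_cuda12_wheels("0.5.0"))  # True
print(pulls_cuda12_wheels("0.4.2"))  # False
```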
I can confirm. Downgrading auto-gptq worked.
I can confirm too. Thanks!
#818 fixes this |
Tried this PR, but ran into another problem:
Update: peft updated their code and pinned auto_gptq to 0.5.0, which leads to this error.
This is still a problem; it should probably be reopened. It's impossible to train anything I try. This is what I get if I downgrade peft:
Please check that this issue hasn't been reported before.
Expected Behavior
Expected to run the fine-tuning task.
Current behaviour
Error after loading the shards:
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
Two days ago this worked perfectly with the same setup; today I got this issue. Tried both the docker image and building from source. Tried openllama3 (as per the repo example) and llama2 13b. Tried A100, H100, and 3090 GPUs.
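The ImportError above means the process is trying to load the CUDA 12 runtime (`libcudart.so.12`) inside an image that only ships CUDA 11. A quick way to check which cudart the dynamic loader can actually see (a diagnostic sketch, not part of axolotl; assumes Linux):

```python
import ctypes.util

# find_library consults the dynamic loader (the ldconfig cache on Linux);
# it returns None when no libcudart is visible to this process, and the
# resolved soname (e.g. "libcudart.so.11.0" or "libcudart.so.12") otherwise.
path = ctypes.util.find_library("cudart")
print("cudart visible to loader:", path)
```

If this prints `None` or a `libcudart.so.11` soname while a package expects `libcudart.so.12`, the runtime and the installed wheels are mismatched.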
Steps to reproduce
Same steps as in the repo, both with docker and building from source, running on runpod instances.
Config yaml
No response
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements