VRAM usage regression in 2f2582e #1127
Comments
@adamo1139 I know this seems unrelated, but can you give this PR a try? #1141? You'll need to delete the prepared dataset and rerun the preprocessing before training.
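For example (assuming the prepared data lives in the default last_run_prepared directory and the config file is named config.yml):

# delete the cached, pre-tokenized dataset so it is rebuilt on the new branch
rm -rf last_run_prepared
# rerun preprocessing as an explicit step before launching training
python -m axolotl.cli.preprocess config.yml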
@winglian I updated to cbecf3e, confirmed that the behaviour is still the same (regression over 0ce1a65), then moved to the #1141 branch.
I can't train a 20B Llama with QLoRA on 4x A100. OOM every time with batch size 1 and 8k context. How do I fix this?
@ehartford I suppose it's the llama-fied internlm2-20b, yes?
Yes, exactly. I will try this. Thank you. I am doing SFT, Dolphin.
@adamo1139 single GPU or multi-GPU 24GB?
@winglian single GPU
Is the model in the posted yml basically https://huggingface.co/01-ai/Yi-34B-200K/tree/main?
@adamo1139 I tried #1141 and it has the exact same VRAM usage for me as 0ce1a65, and never spiked. Did you run the latest version of that PR?
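For reference, one way to make sure the newest revision of the PR is what's installed (the local branch name pr-1141 is arbitrary, and this assumes origin points at the axolotl GitHub repo):

# fetch the current head of PR #1141 into a local branch and reinstall
git fetch origin pull/1141/head:pr-1141
git checkout pr-1141
pip3 install -e '.[flash-attn,deepspeed]'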
@winglian I typically don't do preprocessing as a separate step, since I didn't think it makes much difference with small 50-100 MB datasets; I will be doing it from now on. I noticed an issue with my testing of the branch. Thanks for the help.
Please check that this issue hasn't been reported before.
Expected Behavior
Fine-tuning with configurations that were working in previous versions should continue working without OOMs.
Current behaviour
I attempted to replicate my previous fine-tune from a few weeks ago on a slightly modified base model: same parameter count and base, just with a LoRA adapter merged in, using the same config file. I was met with OOMs and traced the regression back to the introduction of commit 2f2582e; the previous commit, 0ce1a65, does not exhibit the issue. When moving between versions, I made sure to run
pip3 install -e '.[flash-attn,deepspeed]'
after moving to a different version.
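A rough sketch of that procedure (assuming plain git checkouts of the two commits; config.yml stands in for the actual config file):

# good commit: training at sequence length 1400 works
git checkout 0ce1a65
pip3 install -e '.[flash-attn,deepspeed]'
accelerate launch -m axolotl.cli.train config.yml

# first bad commit: OOMs a few steps in at sequence length 1400
git checkout 2f2582e
pip3 install -e '.[flash-attn,deepspeed]'
accelerate launch -m axolotl.cli.train config.yml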
I also tested this with a few different configuration parameters to be confident that I identified the right commit. On commit 0ce1a65 and earlier I am able to do a QLoRA SFT fine-tune of Yi-34B at a context length of 1400. On commit 2f2582e and later the limit is around 600 tokens; running with sequence length 1400 results in an OOM a few steps after starting. I have confirmed this issue exists on two different datasets (airoboros 3.1 and AEZAKMI v2 ShareGPT). I am using a 24 GB RTX 3090 Ti, with some VRAM reserved for the desktop environment (XFCE). I also see that, keeping the config file constant at sequence length 600, VRAM usage rose by around 1.3 GB on commit 2f2582e. In the past, I used this config file and a 25-hour training session completed just fine.
Steps to reproduce
accelerate launch -m axolotl.cli.train config.yml
with the supplied config file, adjusting the base model to either the llama-fied Yi-34B 4K model or changing max_position_embeddings in the config of the 200K-context model to 4K; otherwise it will OOM at loading (known separate issue).
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main/9cd27b2f91111e7ff991cfd464bccc3dc9ffa86a
Acknowledgements