Fix Deepspeed Zero3 Config #791

Merged 2 commits on Oct 28, 2023


teknium1
Contributor

Update the DeepSpeed ZeRO-3 config to not use CPU offload (it dramatically slows training down; just reduce the batch size unless you can't fit batch size 1) and replace the LR scheduler with a properly decaying one. With a constant LR scheduler, the loss increases after each epoch.
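As a rough sketch of the two changes described above, the snippet below edits a DeepSpeed-style config dict. The "before" values are illustrative and not copied from this PR's `zero3.json`. The assumptions here are that the constant scheduler was DeepSpeed's `WarmupLR`, that the decaying replacement is `WarmupDecayLR`, and that `"auto"` placeholders are resolved by the Hugging Face Trainer integration. The helper names are hypothetical.

```python
import copy
import json

# Hypothetical "before" config. Key names follow DeepSpeed's config schema;
# the values are illustrative, not the actual contents of zero3.json.
BEFORE = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
    },
    "scheduler": {
        "type": "WarmupLR",  # assumed: constant LR after warmup
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
        },
    },
}


def without_cpu_offload(config: dict) -> dict:
    """Return a copy with the optimizer/parameter offload sections removed."""
    cfg = copy.deepcopy(config)
    zero = cfg.get("zero_optimization", {})
    for key in ("offload_optimizer", "offload_param"):
        zero.pop(key, None)
    return cfg


def with_decaying_scheduler(config: dict) -> dict:
    """Return a copy where WarmupLR is swapped for WarmupDecayLR."""
    cfg = copy.deepcopy(config)
    sched = cfg.get("scheduler")
    if sched is not None and sched.get("type") == "WarmupLR":
        sched["type"] = "WarmupDecayLR"
        # WarmupDecayLR needs the total step count; "auto" is assumed to be
        # filled in by the Hugging Face Trainer integration.
        sched.setdefault("params", {})["total_num_steps"] = "auto"
    return cfg


AFTER = with_decaying_scheduler(without_cpu_offload(BEFORE))
print(json.dumps(AFTER, indent=2))
```

Both helpers copy the input, so the original dict stays available for comparison. If batch size 1 still does not fit in VRAM, the offload sections can simply be left in place.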

Compare the dark pink run (decaying LR schedule) with all the others, including blue, which used the constant LR schedule (the default in this config).
[image: plot comparing the runs]
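To make the difference between the two schedules concrete, here is a simplified sketch of a constant-after-warmup schedule next to a linearly decaying one. It uses linear warmup for both and is not DeepSpeed's exact formula; the function names and the numbers in the demo are hypothetical.

```python
def warmup_constant_lr(step: int, max_lr: float, warmup_steps: int) -> float:
    """Linear warmup to max_lr, then hold max_lr for the rest of training."""
    if step < warmup_steps:
        return max_lr * (step / warmup_steps)
    return max_lr


def warmup_linear_decay_lr(
    step: int, max_lr: float, warmup_steps: int, total_steps: int
) -> float:
    """Linear warmup to max_lr, then linear decay to 0 at total_steps."""
    if step < warmup_steps:
        return max_lr * (step / warmup_steps)
    remaining = max(total_steps - step, 0)
    return max_lr * (remaining / (total_steps - warmup_steps))


if __name__ == "__main__":
    # Illustrative values only.
    max_lr, warmup, total = 2e-5, 100, 1000
    for step in (0, 50, 100, 550, 1000):
        constant = warmup_constant_lr(step, max_lr, warmup)
        decayed = warmup_linear_decay_lr(step, max_lr, warmup, total)
        print(f"step={step:4d} constant={constant:.2e} decaying={decayed:.2e}")
```

With the constant schedule the learning rate is still at its maximum at the start of every later epoch, whereas the decaying schedule keeps shrinking it until the final step.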

Take away CPU Offload by default (Slows things down horribly, better off reducing batchsize), and changes LR Scheduler to a properly decaying one
fix something
@teknium1 changed the title from "Patch 1" to "Fix Deepspeed Zero3 Config" on Oct 27, 2023
@casper-hansen
Collaborator

Probably a good idea to remove CPU offload in zero2 as well. In my tests, the offloading is very minimal and if you turn it off, you get 2x training speed.

https://github.com/OpenAccess-AI-Collective/axolotl/blob/c25ba7939b35dbd9589bc694ea06c3490e8f9b54/deepspeed/zero2.json#L4-L6

@teknium1
Contributor Author

> Probably a good idea to remove CPU offload in zero2 as well. In my tests, the offloading is very minimal and if you turn it off, you get 2x training speed.
>
> https://github.com/OpenAccess-AI-Collective/axolotl/blob/c25ba7939b35dbd9589bc694ea06c3490e8f9b54/deepspeed/zero2.json#L4-L6

Yeah, I think it is only a good idea if and when you can't fit even bs 1 in VRAM.

@casper-hansen
Collaborator

> Yeah, I think it is only a good idea if and when you can't fit even bs 1 in VRAM.

Yes, then it makes sense. Perhaps one idea is to remove it from zero2 and add a disclaimer about it in the README?

@winglian winglian merged commit d3193be into axolotl-ai-cloud:main Oct 28, 2023
mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request Dec 15, 2023