Update README with some explanations #700

Merged
merged 9 commits into axolotl-ai-cloud:main on Oct 8, 2023

Conversation

seungduk-yanolja
Contributor

Description

Added some explanation and examples to the YAML config documentation to help future users.

Motivation and Context

Sharing lessons learned

How has this been tested?

MD file viewer

Screenshots (if appropriate)

N/A

Types of changes

Comments

README.md Outdated
gradient_accumulation_steps: 1
# The number of samples to accumulate gradients for, before performing a backward/update pass.
Collaborator


micro batch size is the per-GPU number of samples processed in each forward pass.
micro batch size * gradient accumulation steps * number of GPUs = total batch size
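For concreteness, a minimal sketch of that arithmetic in Python; the numbers below are illustrative placeholders, not values from this PR:

# Illustrative values only; adjust to your own config and hardware.
micro_batch_size = 2              # samples per GPU per forward pass
gradient_accumulation_steps = 4   # forward passes accumulated before each optimizer step
num_gpus = 2                      # data-parallel workers

# Effective (total) batch size per optimizer update, per the formula above:
total_batch_size = micro_batch_size * gradient_accumulation_steps * num_gpus
print(total_batch_size)  # 2 * 4 * 2 = 16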

Contributor Author


added more explanation below

@flexchar

flexchar commented Oct 7, 2023

I absolutely appreciate the work on this! If I may, I'd really like to ask for a couple of short examples (a sentence or two) on each option for dummies - technical people who are not from a Machine Learning/AI background.

For example:
- lora_r - specifies how many layers should be trained. The more layers, the longer the training will take, but it can yield better results with a bigger dataset. It's recommended to use 32 or 16.
- num_epochs - how many times the whole training should be repeated. It's also known as training steps. The more epochs, the better the model can learn the data. A recommended starting point is 10.
- micro_batch_size - how many trainers run at the same time. Running more can speed up the progress but can also cause OOM... best to leave at 2.
- lr_scheduler - this means abc and xyz. Recommended to use cosine unless you know what you are doing. And so on and so forth.

NOTE: I have no idea if this is correct; I have a technical, non-ML background, and my goal is to learn the practical aspect of training, with some navigation in theory too. :)

@seungduk-yanolja
Contributor Author

I added a link in the doc for more details about the LoRA hyperparameters.
https://www.anyscale.com/blog/fine-tuning-llms-lora-or-full-parameter-an-in-depth-analysis-with-llama-2

PTAL

@winglian
Collaborator

winglian commented Oct 8, 2023

@seungduk-yanolja thanks for doing this! much needed. are you happy with the state of this? Should I go ahead and merge?

@seungduk-yanolja
Contributor Author

yes, please! thanks

@seungduk-yanolja
Contributor Author

I added one more explanation about `lora_modules_to_save` as follows.

# If you added new tokens to the tokenizer, you may need to save some LoRA modules because they need to know the new tokens.
# For LLaMA and Mistral, you need to save `embed_tokens` and `lm_head`. It may vary for other models.
# `embed_tokens` converts tokens to embeddings, and `lm_head` converts embeddings to token probabilities.
# https://github.com/huggingface/peft/issues/334#issuecomment-1561727994
lora_modules_to_save:
#  - embed_tokens
#  - lm_head
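For background, here is a hedged sketch of what saving these modules corresponds to at the PEFT level when new tokens are added; the model id, LoRA targets, and values below are placeholders, and this is not axolotl's actual code path:

# Illustrative sketch assuming a LLaMA-style causal LM; all names/values are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# New tokens add rows to the input embeddings and the output head.
tokenizer.add_tokens(["<new_tok_1>", "<new_tok_2>"])
model.resize_token_embeddings(len(tokenizer))

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # placeholder LoRA targets
    # Train and save the resized embedding/output layers alongside the adapter,
    # mirroring `lora_modules_to_save` in the YAML above.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

Without saving these modules, the newly initialized rows for the added tokens would not be part of the saved adapter, which is roughly what the linked peft issue discusses.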

@winglian winglian merged commit 77c84e0 into axolotl-ai-cloud:main Oct 8, 2023
@flexchar

flexchar commented Oct 8, 2023

Many thanks for this! It means the world to me and to many others. I also found this resource written up for Stable Diffusion LoRAs; it seems to be quite relevant: https://github.com/bmaltais/kohya_ss/wiki/LoRA-training-parameters.

mkeoliya pushed a commit to mkeoliya/axolotl that referenced this pull request on Dec 15, 2023
* Update README with some explanations

* revert commit-hook change

* add more explanation about batch size and gradient accum

* not use latex foromat

* decorate

* git hook again

* Attach a link that explains about LoRA hyperparameters

* update table of content

* Explanation about lora_modules_to_save