Any way we can get dropout on full finetune? #672

Closed
enn-nafnlaus opened this issue Oct 4, 2023 · 20 comments
Labels
enhancement New feature or request

Comments

@enn-nafnlaus

⚠️ Please check that this feature request hasn't been suggested before.

  • I searched previous Ideas in Discussions and didn't find any similar feature requests.
  • I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

Full finetune suffers badly from epoch spikes, which makes it difficult to get any further progress out of training that lasts past the end of an epoch (and especially 2 or more epochs). Dropout should make it possible to reach a deeper understanding of the data. But while there's lora_dropout, we don't have any dropout option available for full finetune. Any way we could get that added?

✔️ Solution

Add dropout

❓ Alternatives

lora_dropout only applies to LoRAs.

📝 Additional Context

No response

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this feature has not been requested yet.
  • I have provided enough information for the maintainers to understand and evaluate this request.
@enn-nafnlaus added the enhancement (New feature or request) label Oct 4, 2023
@NanoCode012
Collaborator

The dropout feature in LoRA comes from PEFT upstream. If we want to add dropout for full finetune, we would need to modify the architecture ourselves, since it's not a built-in thing. Not sure if that's a good idea.

Alternatively, perhaps a lower LR might work well for you, or experimenting with schedulers?

@enn-nafnlaus
Author

enn-nafnlaus commented Oct 5, 2023

A lower LR can hide the spike, but it doesn't help spread out learning or prevent overconcentration of functionality in specific neurons and bottlenecking. And if you use a lower LR throughout the whole training run (we only have limited tools for nuanced LR control over time), you actually get a worse eval loss, because you start hitting epoch boundaries when you're less far along in the training process.

Is there an upstream library where it would be better to add dropout?

Another option would be L2 regularization or any of the other dropout alternatives, though directly dropping out parts of the network (whether through traditional dropout, DropPath, or whatnot) is AFAIK the most effective means.
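
For reference, a minimal PyTorch sketch of the two regularizers being compared here (my illustration, not axolotl code; the layer size, dropout rate, and weight decay value are arbitrary):

    import torch
    import torch.nn as nn

    layer = nn.Linear(4096, 4096)
    drop = nn.Dropout(p=0.1)  # traditional dropout: randomly zeroes 10% of activations during training

    x = torch.randn(8, 4096)
    h = drop(torch.relu(layer(x)))

    # L2 regularization is usually applied as weight decay in the optimizer rather than as a layer:
    opt = torch.optim.AdamW(layer.parameters(), lr=3e-5, weight_decay=0.1)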

I know most people are using axolotl to train LoRAs, but these epoch spikes with finetuning are a big problem.

@NanoCode012
Collaborator

You would need to modify the model architecture code itself in the modeling code. For example, llama: https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py

However, I do not have enough expertise in updating this to add layers.
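
For the record, a rough, hypothetical sketch of what "adding dropout in the modeling code" could look like; this is not the actual HF implementation, and the real LlamaDecoderLayer forward signature differs:

    import torch.nn as nn

    class DecoderLayerWithDropout(nn.Module):
        # Wraps an existing decoder layer and applies dropout to its hidden-state output.
        def __init__(self, base_layer: nn.Module, p: float = 0.1):
            super().__init__()
            self.base_layer = base_layer
            self.dropout = nn.Dropout(p)

        def forward(self, hidden_states, **kwargs):
            outputs = self.base_layer(hidden_states, **kwargs)
            # HF decoder layers typically return a tuple whose first element is the hidden states
            hidden = self.dropout(outputs[0])
            return (hidden,) + outputs[1:]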

@NanoCode012
Collaborator

I will close this for now, as an issue has been opened upstream. Please let us know if this needs to be re-opened later due to an update.

@winglian
Collaborator

winglian commented Oct 5, 2023

@enn-nafnlaus btw, I was doing some experimentation w dropout. I don't know if this iteration works out of the box, but might be a good starting point. main...llama-dropout

@winglian
Collaborator

winglian commented Oct 5, 2023

A similar one for mistral, too:
main...mistral-dropoout

@NanoCode012
Collaborator

Didn't know there was an open branch. Reopening.

@NanoCode012 reopened this Oct 5, 2023
@enn-nafnlaus
Author

> @enn-nafnlaus btw, I was doing some experimentation w dropout. I don't know if this iteration works out of the box, but might be a good starting point. main...llama-dropout

Ooh, nice! I'm regenerating a larger dataset at the moment, but will try it as soon as my cards are freed up! :)

@enn-nafnlaus
Author

Just tried both branches, sadly no luck.

mistral-dropout:

...
File "/home/user/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 594, in init
LlamaAttention(config=config)
File "/home/user/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 283, in init
self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
File "/home/user/.local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 261, in getattribute
return super().getattribute(key)
AttributeError: 'MistralConfig' object has no attribute 'attention_bias

It keeps going after that error, but I run out of memory; it seems I can't fit it on either just an RTX 3090 or an RTX 3090 + 3060. So, dead end.

llama-dropout:

[2023-10-11 22:07:21,996] [INFO] [axolotl.load_model:176] [PID:2617024] [RANK:0] patching _expand_mask
[2023-10-11 22:07:29,473] [ERROR] [axolotl.load_model:334] [PID:2617025] [RANK:1] Exception raised attempting to load model, retrying with AutoModelForCausalLM
[2023-10-11 22:07:29,473] [ERROR] [axolotl.load_model:334] [PID:2617024] [RANK:0] Exception raised attempting to load model, retrying with AutoModelForCausalLM
[2023-10-11 22:07:29,473] [ERROR] [axolotl.load_model:337] [PID:2617025] [RANK:1] 'LlamaConfig' object has no attribute 'dropout_attn'
Traceback (most recent call last):
File "/home/user/axolotl/src/axolotl/utils/models.py", line 227, in load_model
model = LlamaForCausalLM.from_pretrained(
File "/home/user/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3076, in from_pretrained
model = cls(config, *model_args, **model_kwargs)
File "/home/user/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 961, in init
self.model = LlamaModel(config)
File "/home/user/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 786, in init
self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
File "/home/user/.local/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 786, in
self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
File "/home/user/axolotl/src/axolotl/monkeypatch/llama_attn_hijack_flash.py", line 593, in init
if config.dropout_attn:
File "/home/user/.local/lib/python3.10/site-packages/transformers/configuration_utils.py", line 261, in getattribute
return super().getattribute(key)
AttributeError: 'LlamaConfig' object has no attribute 'dropout_attn'

This is with model winglian/llama-2-4b.

Obviously we don't have support for falcon at all, let alone dropout, so I can't try that.

@enn-nafnlaus
Author

Any progress on this? :)

@enn-nafnlaus
Author

enn-nafnlaus commented Nov 1, 2023

Found a small enough mistral model that I can actually try it (hongyin/mistral-0.5b-40k). Unfortunately, it's a terrible model compared to, say, PY007/TinyLlama-1.1B-intermediate-step-480k-1T. I retested both dropout branches. llama-dropout is still broken in the same manner. I tried the mistral-dropout branch. Weirdly, it runs out of VRAM. This does not happen on the main branch. I could try reducing e.g. batch size, but I don't get why its VRAM footprint should be any different from the mainline branch....

Going to try an alternative to dropout to deal with the loss spikes at end-of-epoch: I wrote a script to randomly tweak the input data with synonyms, antonyms, hypernyms and hyponyms, as well as various minor text formatting changes, so as to multiply out the input data and thus hopefully reduce the model's ability to memorize the training data when running multiple epochs. Dunno if it'll work, but it's a stopgap...
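
A rough sketch of that kind of synonym swap using NLTK's WordNet, purely as an illustration of the approach (not the actual script; it ignores word sense and part of speech):

    import random
    from nltk.corpus import wordnet  # requires nltk.download("wordnet") beforehand

    def swap_synonyms(text: str, prob: float = 0.1) -> str:
        out = []
        for word in text.split():
            synsets = wordnet.synsets(word)
            if synsets and random.random() < prob:
                # pick a random lemma from a random synset as a crude synonym
                lemma = random.choice(random.choice(synsets).lemmas()).name()
                out.append(lemma.replace("_", " "))
            else:
                out.append(word)
        return " ".join(out)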

(Unrelated to normal dropout, but it did occur to me that Learning Rate Dropout would be a cool feature. I don't know if it's been mainlined in PyTorch, though. In theory it should allow faster learning with a smaller memory footprint by having only a random fraction of the nodes involved in backpropagation, with the others just running inference and gradient accumulation.)
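
A simplified sketch of that idea, assuming the straightforward reading of "only a random subset of parameters gets updated each step" (not the published Learning Rate Dropout algorithm verbatim):

    import torch

    def masked_step(optimizer: torch.optim.Optimizer, keep_prob: float = 0.5) -> None:
        # zero out a random fraction of each parameter's gradient, then take the optimizer step
        for group in optimizer.param_groups:
            for p in group["params"]:
                if p.grad is not None:
                    mask = (torch.rand_like(p.grad) < keep_prob).to(p.grad.dtype)
                    p.grad.mul_(mask)
        optimizer.step()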

@enn-nafnlaus
Author

Hey, dropout just got added upstream!

huggingface/transformers#27315

Hopefully we can use that in axolotl soon!

@NanoCode012
Collaborator

Cool @enn-nafnlaus !

Seems like you can manually do this by adding the config attention_dropout to your config.json.

Alternatively, if you would like to PR it, it's a matter of adding a new param to the yaml and setting it, like this rope sample for llama:
https://github.com/OpenAccess-AI-Collective/axolotl/blob/614cff41077839a6c1380275ada954537f01c0ed/src/axolotl/utils/models.py#L242
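
A hedged sketch of that plumbing (illustrative only; the attention_dropout parameter name on the axolotl side and the exact code path are assumptions, not existing axolotl code):

    from transformers import LlamaConfig

    def load_llama_config(base_model_config: str, attention_dropout: float | None = None) -> LlamaConfig:
        # attention_dropout would come from a new key in the axolotl yaml (hypothetical);
        # kwargs passed to from_pretrained act as overrides on the loaded HF config
        config_kwargs = {}
        if attention_dropout is not None:
            config_kwargs["attention_dropout"] = attention_dropout
        return LlamaConfig.from_pretrained(base_model_config, **config_kwargs)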

@enn-nafnlaus
Author

Will try adding attention_dropout to my yaml this evening. Any way to know if it's actually being used apart from a difference being visible in the training outputs?

@winglian
Collaborator

I've noticed the train loss can get pretty high, even with a 0.05 dropout rate.

@enn-nafnlaus
Author

enn-nafnlaus commented Nov 18, 2023

Trying it out this evening. I don't have a config.json file. My models.py (just did a git pull this evening) doesn't look like the one you linked - the closest equivalent to the code you pointed at is:

        elif cfg.is_llama_derived_model and not cfg.trust_remote_code:
            from transformers import LlamaForCausalLM

            config = LlamaConfig.from_pretrained(base_model_config)

I tried hacking something to have an equivalent impact - hopefully it works.

        elif cfg.is_llama_derived_model and not cfg.trust_remote_code:
            from transformers import LlamaForCausalLM

            config_kwargs = {}
            if cfg.attention_dropout:
                config_kwargs["attention_dropout"] = cfg.attention_dropout
                LOG.debug("")
                LOG.debug("===================================")
                LOG.debug("ATTENTION DROPOUT ADDED!")
                LOG.debug("===================================")
                LOG.debug("")

            config = LlamaConfig.from_pretrained(base_model_config, **config_kwargs)

Kinda awkward that I can't see if it's being used or not. But I set dropout to 0.2 so hopefully there will be an obvious impact...

(I've given up on doing GitHub PRs... the GitHub side and the approval process are 10 times more effort than doing the actual code changes.)

@enn-nafnlaus
Author

enn-nafnlaus commented Nov 18, 2023

Edit: Nope, not seeing a difference between 0.2 and 0.05... I doubt it's being used. Also not seeing my debug show up.

@NanoCode012
Collaborator

Hey @enn-nafnlaus, a PR has just been merged to facilitate this.

You can just pass the following in the yaml:

model_config:
    attention_dropout: 0.01

Please let us know how it goes
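
A quick way to sanity-check that attention_dropout is a recognized config field (my suggestion, not axolotl code; for the dropout to actually be applied you need a transformers version that includes huggingface/transformers#27315):

    from transformers import LlamaConfig

    config = LlamaConfig(attention_dropout=0.01)
    print(config.attention_dropout)  # prints 0.01 if the override is picked up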

@enn-nafnlaus
Author

enn-nafnlaus commented Nov 19, 2023

> Hey @enn-nafnlaus, a PR has just been merged to facilitate this.
>
> You can just pass the following in the yaml:
>
> model_config:
>     attention_dropout: 0.01
>
> Please let us know how it goes

I can now verify that it does indeed affect training :)

Before I do a serious training run to evaluate its impact on preventing eval spikes at end-of-epoch / overfitting to the training data, I need to create a new training baseline, as my base model (TinyLlama) just had a new release. Will update once I have a good answer.

@enn-nafnlaus
Author

While I still plan to do more test runs with different LRs, LR schedules, and weight decays, I'm prepared to say that this feature is now:

A) implemented, and
B) absolutely useful for dealing with epoch spikes and improving generalization.

Here's 25% dropout (purple) vs. no dropout (orange), both at 0.1 weight decay. The no-dropout (orange) case uses what I previously determined to be an optimal LR and schedule (inv_sqrt, lr=0.00003 - it has to get as much learning done as possible before the epoch boundary while also having a greatly reduced LR afterward to reduce the spike severity). The dropout (purple) case is running on a cos schedule with lr=0.000005, as it has the initial dropout-induced loss to overcome but can run for longer without severe epoch spikes.

[training loss and eval loss curves: 25% dropout (purple) vs. no dropout (orange)]

Note in particular how similar eval_loss is to train_loss on purple (dropout) vs. how far apart they are on orange (no dropout). I may be able to improve learning further with additional tuning of schedulers, LR, and weight decay. Note that the original dropout paper used up to 0.5, although obviously that only makes sense if you're going to finetune for a lot of epochs; if you don't want a big initial loss penalty to overcome, you can stick with a much lower dropout than I used (e.g. single digits, even low single digits).

Anyway, as far as I'm concerned, this can now be closed as successfully implemented!
