Strange Results on first step #24
Comments
Hi, thanks for the datapoint. Do you have a comparison of the commands used for running with Lion and AdamW?
@xiangning-chen Same command, aside from the lr_opt value.
Oh, I meant the learning_rate, lr_end, and weight decay comparison for Lion and AdamW.
They are in the main post. 'Relevant Code' is for Lion.
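For context on how those values usually relate: the Lion paper and the lion-pytorch README suggest a learning rate roughly 3-10x smaller than the AdamW one, with weight decay roughly 3-10x larger, so that the effective decay strength (lr × wd) stays comparable. A minimal sketch of that translation, with illustrative values rather than the ones actually used in this run:

```python
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(8, 8)  # stand-in for the real diffusion model

# Hypothetical AdamW baseline (illustrative values, not taken from this issue):
adamw = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)

# Rough Lion equivalent: lr ~10x smaller, weight decay ~10x larger,
# keeping lr * weight_decay (the effective decay strength) about the same.
lion = Lion(model.parameters(), lr=1e-5, weight_decay=1e-1)
```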
@nbardy Sorry, I'm a bit confused, in
@nbardy Do you get this behaviour without Triton? One thing I noticed here is that the Triton code uses auto-tune plus in-place updates, which may cause issues: on the first step, multiple different kernels are launched that all do the same thing, to see which is fastest. This is unique to the first step. It's usually not a problem when training from scratch, since warmup is used, but it may be here.
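One quick way to test that hypothesis, assuming the optimizer here is lucidrains' lion-pytorch (which exposes a use_triton flag): build the optimizer with the Triton kernel disabled and check whether the first-step artifact disappears. A minimal sketch:

```python
import torch
from lion_pytorch import Lion

model = torch.nn.Linear(8, 8)  # stand-in for the real model

# use_triton=False takes the plain PyTorch update path, so there is no
# auto-tune pass re-running the in-place update once per candidate kernel config.
opt = Lion(model.parameters(), lr=1e-5, weight_decay=1e-1, use_triton=False)

loss = model(torch.randn(4, 8)).sum()
loss.backward()
opt.step()   # if the step-1 texture artifact vanishes, autotune is the likely culprit
opt.zero_grad()
```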
@mitchellnw Thanks for bringing this to my attention, Mitchell!
I have finally got back to training more diffusion models. Tried upgrading to https://wandb.ai/nbardy-facet/sd_xl_train_t2iadapter/runs/eey3bj1n?workspace=user-nbardy-facet

Turned off Lion and it's still there. This is probably something else from my changes. Will test more next week.
Confirmed fixed. |
Original post:
I'm fine-tuning SD 1.5 at a high batch size of 576 (24 gradient accumulation steps right now, for testing on 1 GPU before scaling up).
Trying to get Lion working, and getting very strange results on the first step. It seems to reset the model to a weird, texture-filled state.
Step 0 on the left and Step 1 on the right
Here is one I let run longer. It seems to actually be converging 🤔, but it still has the same reset problem at the start.
Step 500, Step 1000, Step 1500
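One detail worth keeping in mind when reading these images: Lion's update is sign-based, so every parameter moves by exactly ±lr on each step (plus weight decay), regardless of gradient magnitude. At an AdamW-scale learning rate, the very first step therefore perturbs every weight of the pretrained model at once, which could plausibly present as the "reset" above. A sketch of the update rule as given in the Lion paper (illustrative, not the kernel used in this run):

```python
import torch

def lion_step(p, grad, m, lr=1e-4, wd=0.0, beta1=0.9, beta2=0.99):
    """One Lion update, following the paper's algorithm.

    sign() makes every element of p move by exactly +/- lr, independent of
    gradient magnitude, so a too-large lr perturbs ALL weights on step one.
    """
    update = (beta1 * m + (1 - beta1) * grad).sign()
    p = p - lr * (update + wd * p)       # decoupled weight decay, as in AdamW
    m = beta2 * m + (1 - beta2) * grad   # momentum is updated after the step
    return p, m
```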
Relevant Code:
Relevant parameters:
No apparent sharp decrease in loss?
More samples across many prompts:
Note: the black squares are just the NSFW filter, I believe.