
Strange Results on first step #24

Closed
nbardy opened this issue Apr 12, 2023 · 11 comments

nbardy commented Apr 12, 2023

I'm fine-tuning SD 1.5 at a high effective batch size of 576 (batch size 24 with 24 gradient accumulation steps right now, for testing on 1 GPU before scaling up).

I'm trying to get Lion working and getting very strange results on the first step. It seems to reset the model to a weird, texture-filled state.

Step 0 on the left and Step 1 on the right

[image]

Here is one I let run longer. It actually seems to be converging 🤔, but it still has the same reset problem at the start.
Step 500, Step 1000, Step 1500:
[image]

Relevant Code:

    from lion_pytorch import Lion

    optimizer = Lion(
        params_to_optimize,
        lr=args.learning_rate,
        weight_decay=1e-2,
        betas=(0.95, 0.98),
        use_triton=True  # set this to True to use the CUDA kernel w/ Triton lang (Tillet et al)
    )

Relevant parameters:

--batch_size=24 --learning_rate=7.0e-7 --gradient_accumulation_steps=24 --lion_opt --lr_end=7.0e-9 --lr_scheduler=cosine_with_restarts --lr_num_cycles=20 --lr_warmup_steps=0  --max_train_steps=10000 --mixed_precision bf16

No apparent sharp decrease in loss?
[image]

More samples across many prompts:

[image]

Note: the black squares are just the NSFW filter, I believe.

@xiangning-chen
Contributor

Hi, thanks for the datapoint.

Do you have a comparison of the commands used for running with Lion and AdamW?

@nbardy
Author

nbardy commented Apr 21, 2023

@xiangning-chen same command besides the lr_opt value

@xiangning-chen
Contributor

Oh I meant the learning_rate, lr_end, and weight decay comparison for Lion and AdamW.

@nbardy
Author

nbardy commented Apr 22, 2023

They are in the main post.

‘Relevant Code’ is for Lion.
‘Relevant parameters’ is for Adam.

@xiangning-chen
Contributor

@nbardy Sorry, I'm a bit confused: in Relevant parameters you set the --lion_opt flag, but this is for Adam?
Can you please just tell me the learning_rate, lr_end, and weight decay for Lion and AdamW respectively? Thanks!

@mitchellnw

@nbardy do you get this behaviour without triton?

One thing I noticed here is that the Triton code uses autotuning plus in-place updates, which may cause issues. On the first step, multiple different kernels are launched that all do the same thing, to see which is fastest; this is unique to the first step. Usually this is not a problem when training from scratch, since warmup is used, but it may be here.
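A rough illustration of that failure mode (a plain-PyTorch sketch, not the actual Triton kernel): if the same in-place update is benchmarked several times during autotuning, the first "step" effectively applies the update several times.

    import torch

    def inplace_lion_step(p, grad, exp_avg, lr=1e-4, beta1=0.9, beta2=0.99, wd=1e-2):
        # one Lion update, applied in place the way a fused kernel would
        update = exp_avg.mul(beta1).add_(grad, alpha=1 - beta1).sign_()
        p.mul_(1 - lr * wd).add_(update, alpha=-lr)
        exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)

    torch.manual_seed(0)
    p, grad, exp_avg = torch.randn(4), torch.randn(4), torch.zeros(4)
    p_ref, exp_avg_ref = p.clone(), exp_avg.clone()

    # "autotuning": the same step gets benchmarked, say, 5 times on the live tensors
    for _ in range(5):
        inplace_lion_step(p, grad, exp_avg)

    # the single step that was actually intended
    inplace_lion_step(p_ref, grad, exp_avg_ref)

    print((p - p_ref).abs().max())  # nonzero: the tuned path effectively took 5 steps

Setting use_triton=False in the optimizer snippet above is the quickest way to rule this in or out.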

@lucidrains
Owner

@mitchellnw thanks for bringing this to my attention Mitchell!

@nbardy do you want to see if 6ab873a addresses the issue?

@nbardy
Author

nbardy commented Oct 14, 2023

I have finally got back to training more diffusion models.

Tried upgrading to lion-pytorch==0.1.2 and it seems I'm still getting a reset on the first step:

https://wandb.ai/nbardy-facet/sd_xl_train_t2iadapter/runs/eey3bj1n?workspace=user-nbardy-facet
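(A quick sanity check, hypothetical and not from this run, for whether the first optimizer step itself is what perturbs the weights: snapshot the parameters, take one step, and look at the largest relative change. A toy model stands in for the real one just to keep the snippet self-contained.)

    import torch
    from torch import nn
    from lion_pytorch import Lion

    def max_relative_change(model, before):
        # largest relative parameter change since the `before` snapshot
        return max(
            ((p.detach() - before[n]).norm() / (before[n].norm() + 1e-12)).item()
            for n, p in model.named_parameters() if p.requires_grad
        )

    model = nn.Linear(16, 16)  # toy stand-in for the diffusion model
    optimizer = Lion(model.parameters(), lr=7.0e-7, weight_decay=1e-2, betas=(0.95, 0.98))

    before = {n: p.detach().clone() for n, p in model.named_parameters()}
    loss = model(torch.randn(8, 16)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # at lr ~7e-7 this should be tiny; a "reset" on step 1 shows up as a huge ratio
    print(f"largest relative parameter change after step 1: {max_relative_change(model, before):.3e}")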

@nbardy
Author

nbardy commented Oct 14, 2023

lion-pytorch==0.1.2
pytorch-triton==2.1.0+e650d3708b
triton==2.0.0
torch==2.0.1

@nbardy
Author

nbardy commented Oct 14, 2023

Turned off lion and it’s still there. This is probably something else from my changes. Will test more next week.

@nbardy
Author

nbardy commented Oct 21, 2023

Confirmed fixed.

nbardy closed this as completed Oct 21, 2023