
Strange Results on first step #24

Closed
nbardy opened this issue Apr 12, 2023 · 11 comments

nbardy commented Apr 12, 2023

I'm fine-tuning SD 1.5 at a high effective batch size of 576 (batch size 24 with 24 gradient accumulation steps right now, for testing on 1 GPU before scaling up).

I'm trying to get Lion working and getting very strange results on the first step. It seems to reset the model to a weird, texture-filled state.

Step 0 on the left and Step 1 on the right

[image]

Here is one I let run longer. It actually seems to be converging 🤔, but it still has the same reset problem at the start.
Step 500, Step 1000, Step 1500:
[image]

Relevant Code:

    from lion_pytorch import Lion

    optimizer = Lion(
        params_to_optimize,
        lr=args.learning_rate,
        weight_decay=1e-2,
        betas=(0.95, 0.98),
        use_triton=True  # set this to True to use the CUDA kernel w/ Triton lang (Tillet et al)
    )

Relevant parameters:

--batch_size=24 --learning_rate=7.0e-7 --gradient_accumulation_steps=24 --lion_opt --lr_end=7.0e-9 --lr_scheduler=cosine_with_restarts --lr_num_cycles=20 --lr_warmup_steps=0  --max_train_steps=10000 --mixed_precision bf16

No apparent sharp decrease in loss?
[image]

More samples across many prompts:

[image]

Note: the black squares are just the NSFW filter, I believe.

@xiangning-chen
Contributor

Hi, thanks for the datapoint.

Do you have a comparison of the commands used for running with Lion and AdamW?

@nbardy
Author

nbardy commented Apr 21, 2023

@xiangning-chen same command besides the lr_opt value

@xiangning-chen
Contributor

Oh I meant the learning_rate, lr_end, and weight decay comparison for Lion and AdamW.

@nbardy
Author

nbardy commented Apr 22, 2023

They are in the main post.

‘Relevant Code’ is for Lion.
‘Relevant parameters’ is for Adam.

@xiangning-chen
Contributor

@nbardy Sorry, I'm a bit confused: in Relevant parameters you set the --lion_opt flag, but this is for Adam?
Can you please just tell me the learning_rate, lr_end, and weight decay for Lion and AdamW respectively? Thanks!

@mitchellnw

@nbardy do you get this behaviour without triton?

One thing I noticed here is that the Triton code uses autotuning plus in-place updates, which may cause issues. On the first step, multiple different kernels are launched that all do the same thing, to see which is fastest; this is unique to the first step. Usually this is not a problem when training from scratch, since warmup is used, but it may be here.
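A rough illustration of that failure mode (a plain-PyTorch sketch, not the actual Triton kernel): if the same in-place update is benchmarked several times during autotuning, the first "step" effectively applies the update several times.

    import torch

    def inplace_lion_step(p, grad, exp_avg, lr=1e-4, beta1=0.9, beta2=0.99, wd=1e-2):
        # one Lion update, applied in place the way a fused kernel would
        update = exp_avg.mul(beta1).add_(grad, alpha=1 - beta1).sign_()
        p.mul_(1 - lr * wd).add_(update, alpha=-lr)
        exp_avg.mul_(beta2).add_(grad, alpha=1 - beta2)

    torch.manual_seed(0)
    p, grad, exp_avg = torch.randn(4), torch.randn(4), torch.zeros(4)
    p_ref, exp_avg_ref = p.clone(), exp_avg.clone()

    # "autotuning": the same step gets benchmarked, say, 5 times on the live tensors
    for _ in range(5):
        inplace_lion_step(p, grad, exp_avg)

    # the single step that was actually intended
    inplace_lion_step(p_ref, grad, exp_avg_ref)

    print((p - p_ref).abs().max())  # nonzero: the tuned path effectively took 5 steps

Setting use_triton=False in the optimizer snippet above is the quickest way to rule this in or out.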

@lucidrains
Owner

@mitchellnw thanks for bringing this to my attention Mitchell!

@nbardy do you want to see if 6ab873a addresses the issue?

@nbardy
Author

nbardy commented Oct 14, 2023

I have finally got back to training more diffusion models.

Tried upgrading to lion-pytorch==0.1.2 and it seems I'm still getting a reset on the first step:

https://wandb.ai/nbardy-facet/sd_xl_train_t2iadapter/runs/eey3bj1n?workspace=user-nbardy-facet
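(A quick sanity check, hypothetical and not from this run, for whether the first optimizer step itself is what perturbs the weights: snapshot the parameters, take one step, and look at the largest relative change. A toy model stands in for the real one just to keep the snippet self-contained.)

    import torch
    from torch import nn
    from lion_pytorch import Lion

    def max_relative_change(model, before):
        # largest relative parameter change since the `before` snapshot
        return max(
            ((p.detach() - before[n]).norm() / (before[n].norm() + 1e-12)).item()
            for n, p in model.named_parameters() if p.requires_grad
        )

    model = nn.Linear(16, 16)  # toy stand-in for the diffusion model
    optimizer = Lion(model.parameters(), lr=7.0e-7, weight_decay=1e-2, betas=(0.95, 0.98))

    before = {n: p.detach().clone() for n, p in model.named_parameters()}
    loss = model(torch.randn(8, 16)).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    # at lr ~7e-7 this should be tiny; a "reset" on step 1 shows up as a huge ratio
    print(f"largest relative parameter change after step 1: {max_relative_change(model, before):.3e}")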

@nbardy
Author

nbardy commented Oct 14, 2023

lion-pytorch==0.1.2
pytorch-triton==2.1.0+e650d3708b
triton==2.0.0
torch==2.0.1

@nbardy
Author

nbardy commented Oct 14, 2023

Turned off lion and it’s still there. This is probably something else from my changes. Will test more next week.

@nbardy
Author

nbardy commented Oct 21, 2023

Confirmed fixed.

nbardy closed this as completed Oct 21, 2023