fix incorrect learning rate warm-up after restarting from ckpt #182

bozhang-hpc · 2022-07-29T23:08:18Z

A simple fix to #181

gahdritz · 2022-07-30T04:23:39Z

Could you explain how this works? Where is resume_last_lr_step defined? When is self.last_lr_step ever assigned to anything except -1?

bozhang-hpc · 2022-07-30T05:00:43Z

Oh, I forgot to include the definition of resume_last_lr_step(). It's added in commit 69a8a28.
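To illustrate why the stored step matters, here is a minimal sketch rather than the code from this PR. Only the names `last_lr_step` and `resume_last_lr_step` come from this thread. `ToyModule`, `warmup_lr`, `next_lr`, the linear warm-up shape, and the numeric values are all assumptions made for the example.

```python
def warmup_lr(step, max_lr=1e-3, warmup_steps=1000):
    """Linear warm-up to max_lr, then constant (assumed schedule shape)."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    return max_lr


class ToyModule:
    """Hypothetical stand-in for the training module discussed above."""

    def __init__(self):
        # A freshly constructed module always starts from -1.
        self.last_lr_step = -1

    def resume_last_lr_step(self, lr_step):
        # Restore the step counter recorded in the checkpoint.
        self.last_lr_step = lr_step

    def next_lr(self):
        # Advance the counter by one and look up the scheduled rate.
        self.last_lr_step += 1
        return warmup_lr(self.last_lr_step)


fresh = ToyModule()                 # restart without restoring the step
resumed = ToyModule()
resumed.resume_last_lr_step(5000)   # restart with the step restored

lr_without_fix = fresh.next_lr()    # warm-up begins again from the start
lr_with_fix = resumed.next_lr()     # schedule continues past warm-up
```

In this toy setup, the module that never restores its counter re-enters warm-up after a restart, while the one that calls `resume_last_lr_step` continues at the post-warm-up rate.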

gahdritz · 2022-08-02T00:47:45Z

One more thing: this assumes that resume_from_ckpt points to a DeepSpeed ZeRO checkpoint. You should add a quick check that the resume_from_ckpt argument is a directory and not a file.
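A minimal sketch of such a check, assuming that a ZeRO checkpoint is saved as a directory and any other checkpoint as a single file. The helper name `checkpoint_kind` and its return labels are hypothetical and are not part of this PR.

```python
import os


def checkpoint_kind(resume_from_ckpt):
    """Classify a checkpoint path as a directory or a single file.

    The labels returned here are illustrative only.
    """
    if os.path.isdir(resume_from_ckpt):
        # Assumed to be a DeepSpeed ZeRO checkpoint directory.
        return "zero_directory"
    if os.path.isfile(resume_from_ckpt):
        # Assumed to be a regular single-file checkpoint.
        return "single_file"
    raise FileNotFoundError(f"No checkpoint found at {resume_from_ckpt!r}")
```

The caller could then branch on the result to decide how to recover the last step before calling resume_last_lr_step().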

bozhang-hpc · 2022-08-02T17:55:44Z

Non-DeepSpeed checkpoint files are supported now.

fix incorrect learning rate warm-up after restarting from ckpt

a7274ef

add the missing resume_last_lr_step()

69a8a28

fix lr resume for non-deepspeed ckpts

a2e7dab

gahdritz merged commit 87f3cd4 into aqlaboratory:main Aug 2, 2022
