
The saved epoch number seems to be wrong? #296

Closed

btyu opened this issue Oct 4, 2019 · 5 comments
btyu commented Oct 4, 2019

The saved epoch number seems to be wrong. I'm not sure whether this is my mistake.
Specifically, I first train my model for 2 epochs, with the following code:

exp = Experiment(save_dir='.')
trainer = Trainer(experiment=exp, max_nb_epochs=2, gpus=[0], checkpoint_callback=checkpoint_callback)
trainer.fit(model)

During the first epoch, epoch=0. After the training of the first epoch, it shows:

Epoch 00001: avg_val_loss improved from inf to 1.42368, saving model to checkpoints//_ckpt_epoch_1.ckpt

During the second epoch, epoch=1. After the training of the second epoch, it shows:

Epoch 00002: avg_val_loss improved from 1.42368 to 1.23873, saving model to checkpoints//_ckpt_epoch_2.ckpt

At this moment, I save exp with the code:

exp.save()

and it gives:

100%|████| 15000/15000 [04:31<00:00, 454.06it/s, avg_val_loss=1.24, batch_nb=12499, epoch=1, gpu=0, loss=1.283, v_nb=0]

And then, I want to continue my training with the following code:

new_exp = Experiment(save_dir='.', version=0)
new_trainer = Trainer(experiment=new_exp, max_nb_epochs=3, gpus=[0], checkpoint_callback=checkpoint_callback)
new_model = Net()
new_trainer.fit(new_model)

It starts with epoch=1 instead of epoch=2. Therefore, to reach new_trainer's max_nb_epochs=3, another 2 epochs will be run instead of 1.

Obviously, the epoch number in the saved exp is wrong. After the first two epochs, the saved epoch number should be 2, but epoch=1 was saved, which causes the resumed training to start from epoch=1.

It really confused me. Looking forward to your help. Thanks.
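To make the arithmetic of the report concrete, here is a minimal pure-Python sketch of the off-by-one behavior described above. This is not Lightning code; the `train` function and the checkpoint dict are hypothetical stand-ins that only mirror the reported behavior (saving the 0-based loop index instead of the count of completed epochs).

```python
# Minimal sketch of the off-by-one resume bug described in this issue.
# Hypothetical stand-in code, not the Lightning implementation.

def train(max_nb_epochs, start_epoch=0):
    """Run epochs and return the checkpoint saved after the last one."""
    checkpoint = {}
    for epoch in range(start_epoch, max_nb_epochs):
        # Buggy behavior: the 0-based loop index is saved instead of
        # the number of completed epochs (which would be epoch + 1).
        checkpoint["epoch"] = epoch
    return checkpoint

# First run: train for 2 epochs (indices 0 and 1).
ckpt = train(max_nb_epochs=2)
print(ckpt["epoch"])  # prints 1, although 2 epochs have completed

# Resume with max_nb_epochs=3: training restarts at the saved index,
# so 2 more epochs run instead of the expected 1.
print(3 - ckpt["epoch"])  # prints 2

# Fixed behavior: save the count of completed epochs instead.
fixed_epoch = ckpt["epoch"] + 1
print(3 - fixed_epoch)  # prints 1
```

With the fix, resuming after two completed epochs under `max_nb_epochs=3` runs exactly one additional epoch, which is the behavior the reporter expected.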

@williamFalcon
Contributor

The checkpoint number does in fact not match the epoch (it's off by one). We fixed it in another PR which didn't land. Mind submitting a PR?

@williamFalcon williamFalcon added this to Todo (next release) in Key features - Roadmap v1.0 Oct 4, 2019
@btyu
Author

btyu commented Oct 8, 2019

No, but since you have already fixed it in another PR, my PR may not be necessary? Besides, I'm not very familiar with the code. Would you please tell me which file I should help fix? Thank you.

@Ir1d
Contributor

Ir1d commented Nov 4, 2019

#128 changes the epoch numbers and should soon be ready to land.

@awaelchli
Member

@HappyCtest It seems this is fixed now on master after merge of #128.

@btyu
Author

btyu commented Nov 20, 2019

It's great! Thank you for your contributions.

@btyu btyu closed this as completed Nov 20, 2019
Key features - Roadmap v1.0 automation moved this from Todo (next release) to Done Nov 20, 2019
@Borda Borda removed this from Done in Key features - Roadmap v1.0 Mar 30, 2020