
The saved epoch number seems to be wrong? #296

Closed

btyu opened this issue Oct 4, 2019 · 5 comments
btyu commented Oct 4, 2019

The saved epoch number seems to be wrong. I'm not sure whether this is my mistake.
Specifically, I first train my model for 2 epochs, with the following code:

exp = Experiment(save_dir='.')
trainer = Trainer(experiment=exp, max_nb_epochs=2, gpus=[0], checkpoint_callback=checkpoint_callback)
trainer.fit(model)

During the first epoch, epoch=0. After the training of the first epoch, it shows:

Epoch 00001: avg_val_loss improved from inf to 1.42368, saving model to checkpoints//_ckpt_epoch_1.ckpt

During the second epoch, epoch=1. After the training of the second epoch, it shows:

Epoch 00002: avg_val_loss improved from 1.42368 to 1.23873, saving model to checkpoints//_ckpt_epoch_2.ckpt

At this moment, I save exp with the code:

exp.save()

and it gives:

100%|████| 15000/15000 [04:31<00:00, 454.06it/s, avg_val_loss=1.24, batch_nb=12499, epoch=1, gpu=0, loss=1.283, v_nb=0]

And then, I want to continue my training with the following code:

new_exp = Experiment(save_dir='.', version=0)
new_trainer = Trainer(experiment=new_exp, max_nb_epochs=3, gpus=[0], checkpoint_callback=checkpoint_callback)
new_model = Net()
new_trainer.fit(new_model)

It starts with epoch=1 instead of epoch=2. Therefore, to reach new_trainer's max_nb_epochs=3, another 2 epochs will be run instead of 1.

Obviously, the epoch number in the saved exp is wrong. After the first two epochs, the saved epoch number should be 2, but epoch=1 was saved, which causes the resumed training to start from epoch=1.

It really confused me. Looking forward to your help. Thanks.
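To make the arithmetic of the report concrete, here is a minimal pure-Python sketch of the off-by-one behavior described above. This is not Lightning code; the `train` function and the checkpoint dict are hypothetical stand-ins that only mirror the reported behavior (saving the 0-based loop index instead of the count of completed epochs).

```python
# Minimal sketch of the off-by-one resume bug described in this issue.
# Hypothetical stand-in code, not the Lightning implementation.

def train(max_nb_epochs, start_epoch=0):
    """Run epochs and return the checkpoint saved after the last one."""
    checkpoint = {}
    for epoch in range(start_epoch, max_nb_epochs):
        # Buggy behavior: the 0-based loop index is saved instead of
        # the number of completed epochs (which would be epoch + 1).
        checkpoint["epoch"] = epoch
    return checkpoint

# First run: train for 2 epochs (indices 0 and 1).
ckpt = train(max_nb_epochs=2)
print(ckpt["epoch"])  # prints 1, although 2 epochs have completed

# Resume with max_nb_epochs=3: training restarts at the saved index,
# so 2 more epochs run instead of the expected 1.
print(3 - ckpt["epoch"])  # prints 2

# Fixed behavior: save the count of completed epochs instead.
fixed_epoch = ckpt["epoch"] + 1
print(3 - fixed_epoch)  # prints 1
```

With the fix, resuming after two completed epochs under `max_nb_epochs=3` runs exactly one additional epoch, which is the behavior the reporter expected.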

@williamFalcon
Contributor

The checkpoint number does in fact not match the epoch (it's off by one). We fixed it in another PR which didn't land. Mind submitting a PR?

@williamFalcon williamFalcon added this to Todo (next release) in Key features - Roadmap v1.0 Oct 4, 2019
@btyu
Author

btyu commented Oct 8, 2019

No, but since you have already fixed it in another PR, my PR may not be necessary? Besides, I'm not very familiar with the code. Would you please tell me which file I should help fix? Thank you.

@Ir1d
Contributor

Ir1d commented Nov 4, 2019

#128 changes the epoch numbers and should soon be ready to land.

@awaelchli
Member

@HappyCtest It seems this is fixed now on master after merge of #128.

@btyu
Author

btyu commented Nov 20, 2019

It's great! Thank you for your contributions.

@btyu btyu closed this as completed Nov 20, 2019
Key features - Roadmap v1.0 automation moved this from Todo (next release) to Done Nov 20, 2019
@Borda Borda removed this from Done in Key features - Roadmap v1.0 Mar 30, 2020