Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mismatch of displayed 'epoch' #675

Closed
onkyo14taro opened this issue Jan 9, 2020 · 9 comments · Fixed by #786
Closed

Mismatch of displayed 'epoch' #675

onkyo14taro opened this issue Jan 9, 2020 · 9 comments · Fixed by #786
Assignees
Labels
bug Something isn't working good first issue Good for newcomers

Comments

@onkyo14taro
Copy link
Contributor

🐛 Bug

The display of epoch's number mismatches between the progress bar and the checkpoint indicator. I wonder this mismatch could confuse users.

  • progress bar: The number of epochs starts from 1.
  • checkpoint indicator: The number of epochs starts from 0.
  • metrics.csv also starts from 0.

I think that to change checkpoint and metrics.csv causes a serious problem.
So progress bar should be changed in my opinion.

What do you think about it?

Epoch 32: 100%|██████████| 331/331 [00:05<00:00, 88.73batch/s, batch_idx=17, loss=1.148, train_batch_loss=1.02, v_num=0, val_loss=1.05]
AINFO:root:
Epoch 00031: val_loss reached 1.04545 (best 1.04545), saving model to /dummy/version_0/checkpoints/_ckpt_epoch_31.ckpt as top 1
{'loss': 1.022357702255249, 'train_batch_loss': 1.022357702255249, 'val_loss': 1.0454469919204712}
Epoch 33:   5%|▌         | 18/331 [00:00<00:05, 61.06batch/s, batch_idx=17, loss=1.073, train_batch_loss=1.31, v_num=0, val_loss=1.05]

Environment

  • PyTorch Version : 1.3.1
  • OS : macOS 10.14.6
  • How you installed PyTorch : pip install git+https://github.com/williamFalcon/pytorch-lightning.git@master --upgrade
  • Python version : 3.7.3
  • use CPU
@onkyo14taro onkyo14taro added the bug Something isn't working label Jan 9, 2020
@matthew-z
Copy link
Contributor

Indeed, the inconsistency of zero-based/one-based epoch is very confusing.

I think we should use zero-based only.

@neggert neggert added the good first issue Good for newcomers label Jan 21, 2020
@williamFalcon
Copy link
Contributor

@matthew-z i agree it should be zero based. want to submit a PR?
@onkyo14taro @matthew-z

@onkyo14taro
Copy link
Contributor Author

@williamFalcon I'll try this weekend.

@onkyo14taro
Copy link
Contributor Author

onkyo14taro commented Jan 26, 2020

@matthew-z @neggert @williamFalcon @Borda

I found a 'one epoch'-based API in GradientAccumulationScheduler while I was fixing the code.
If we force the API and display to be zero-based, it will break backward compatibility.
However, I think that mixing 0 and 1 will confuse users than changing the API, it should be zero based.
I think it would be better we warn with FutureWarning in version 0.6.x, and then change the API and display to be zero-based in version 0.7.x.
What do you think about it?

@awaelchli
Copy link
Contributor

I can't quite wrap my head around this. If you make it zero based, then when the progress bar shows
Epoch 1: 50/100%
it will no longer mean that the first epoch is in progress, but actually it is the second epoch 50% completed? Is this really what you want? Why should it be this way?

@onkyo14taro
Copy link
Contributor Author

onkyo14taro commented Jan 27, 2020

@awaelchli

I agree with you, but I think zero-based would be better than one-based.

Zero-based implementations are:

  • Progress bar (console display)
  • GradientAccumulationScheduler (console display, API)
  • EarlyStopping (console display)

One-based implementations are:

  • metrics.csv (in the file)
  • _ckpt_epoch_{0-based epoch number}.ckpt (filename)
  • ModelCheckpoint (console display)

I think what of most influensive in these items are "metrics.csv" and "_ckpt_epoch_{0-based epoch number}.ckpt."
That's because we usually use the result file rather than console display.
So I think that zero-based implementations have less confusion than one-based when breaking backward compatibility in order to unify representation of epoch numbers.

@williamFalcon
Copy link
Contributor

agreed. let’s do zero-based

@onkyo14taro
Copy link
Contributor Author

onkyo14taro commented Jan 28, 2020

@williamFalcon Is it ok that FutureWarning is thrown when Progress bar, GradientAccumulationScheduler or EarlyStopping are used in version 0.6.x , and then the API and display renew in version 0.7.0?

@Borda
Copy link
Member

Borda commented Jan 28, 2020

we are using DeprecatedWarning :]

williamFalcon pushed a commit that referenced this issue Feb 5, 2020
* [update] : #675 : set warnings

* [fix] : #675 : remove white space
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good first issue Good for newcomers
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants