Trainer.test() in combination with resume_from_checkpoint is broken #5091
Labels: bug (Something isn't working), checkpointing (Related to checkpointing), help wanted (Open to be worked on), priority: 0 (High priority task), waiting on author (Waiting on user action, correction, or update)
🐛 Bug
When passing `resume_from_checkpoint` to `Trainer` and then training (e.g. calling `trainer.fit()`), the state used for `trainer.test()` is always the checkpoint initially given to `resume_from_checkpoint`, never the newer, better one.

Please reproduce using the BoringModel and post here
https://colab.research.google.com/drive/1ABXnUP10QUqHeUQmFy-FX26cV2w1JILA?usp=sharing
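The Colab notebook above reproduces the issue with the real `Trainer`. As a simplified illustration of the mechanism only (a hypothetical stand-in class, not Lightning's actual internals), both `fit()` and `test()` run through the same restore step, so the initially passed checkpoint wins every time:

```python
class MockTrainer:
    """Hypothetical stand-in for pytorch_lightning.Trainer, reduced to
    the checkpoint-restore behaviour described in this issue."""

    def __init__(self, resume_from_checkpoint=None):
        self.resume_from_checkpoint = resume_from_checkpoint
        self.model_state = None

    def _restore(self):
        # CheckpointConnector-like step: it runs for fit() AND test(),
        # because test() is routed through the same fit() machinery.
        if self.resume_from_checkpoint is not None:
            self.model_state = self.resume_from_checkpoint

    def fit(self):
        self._restore()
        self.model_state = "fine-tuned"  # training improves the weights

    def test(self):
        self._restore()  # bug: clobbers the fine-tuned state
        return self.model_state


trainer = MockTrainer(resume_from_checkpoint="initial-ckpt")
trainer.fit()
print(trainer.test())  # prints "initial-ckpt", not "fine-tuned"
```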
Expected behavior
After fine-tuning, `trainer.test()` should internally look up the best model state (as introduced by #2190) before running on the test dataset.
Environment
Additional context
A hotfix is to manually set `trainer.resume_from_checkpoint = None` between the calls to `trainer.fit()` and `trainer.test()`.

The cause of the issue is that `Trainer.test()` is performed internally by calling `Trainer.fit()` for all configurations.

Long term, the checkpoint passed via `resume_from_checkpoint` should most likely be consumed internally (i.e. reset to `None`) after the state is restored. Alternatively, one could use the `Trainer.testing` attribute to limit the use of `Trainer.resume_from_checkpoint` by `CheckpointConnector` to the training state only.
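A minimal sketch of the consume-internally idea, again on a hypothetical stand-in rather than the actual `CheckpointConnector` code: reset the attribute to `None` as soon as the state has been restored, so only the first restore uses it and the `fit()` pass triggered by `test()` keeps the fine-tuned weights:

```python
class MockTrainer:
    """Hypothetical stand-in illustrating the proposed long-term fix:
    the checkpoint is consumed (reset to None) after the first restore."""

    def __init__(self, resume_from_checkpoint=None):
        self.resume_from_checkpoint = resume_from_checkpoint
        self.model_state = None

    def _restore(self):
        if self.resume_from_checkpoint is not None:
            self.model_state = self.resume_from_checkpoint
            # Proposed fix: consume the checkpoint so later calls
            # (e.g. the fit() pass performed under test()) skip it.
            self.resume_from_checkpoint = None

    def fit(self):
        self._restore()
        self.model_state = "fine-tuned"

    def test(self):
        self._restore()  # no-op now: the checkpoint was already consumed
        return self.model_state


trainer = MockTrainer(resume_from_checkpoint="initial-ckpt")
trainer.fit()
print(trainer.test())  # prints "fine-tuned"
```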