
Trainer.test() in combination with resume_from_checkpoint is broken #5091

Closed
ORippler opened this issue Dec 11, 2020 · 4 comments · Fixed by #5161
Labels: bug Something isn't working · checkpointing Related to checkpointing · help wanted Open to be worked on · priority: 0 High priority task · waiting on author Waiting on user action, correction, or update

Comments


ORippler commented Dec 11, 2020

🐛 Bug

When passing resume_from_checkpoint to Trainer and then training (e.g. via a call to trainer.fit()), the state used for trainer.test() is always the checkpoint initially given to resume_from_checkpoint, never the newer, better one produced during training.

trainer = Trainer(resume_from_checkpoint="path_to_ckpt") # pass ckpt to Trainer for resuming
trainer.fit() # do some fine-tuning/resume training
trainer.test() # should make use of "best" checkpoint, however uses ckpt passed to resume_from_checkpoint

Please reproduce using the BoringModel and post here

https://colab.research.google.com/drive/1ABXnUP10QUqHeUQmFy-FX26cV2w1JILA?usp=sharing

Expected behavior

After fine-tuning, the best model state (the lookup introduced by #2190) should be restored internally before running on the test dataset.

Environment

  • CUDA:
    • GPU:
      • Tesla T4
    • available: True
    • version: 10.1
  • Packages:
    • numpy: 1.18.5
    • pyTorch_debug: True
    • pyTorch_version: 1.7.0+cu101
    • pytorch-lightning: 1.1.0
    • tqdm: 4.41.1
  • System:
    • OS: Linux
    • architecture:
      • 64bit
    • processor: x86_64
    • python: 3.6.9
    • version: 1 SMP Thu Jul 23 08:00:38 PDT 2020

Additional context

A hotfix is to manually set trainer.resume_from_checkpoint = None between calls to trainer.fit() and trainer.test().

trainer = Trainer(resume_from_checkpoint="path_to_ckpt") # pass ckpt to Trainer for resuming
trainer.fit()
trainer.resume_from_checkpoint = None
trainer.test()

The cause of the issue is that Trainer.test() is implemented internally as a call to Trainer.fit() for all configurations, so the same checkpoint-restore path runs again and re-loads the checkpoint passed to resume_from_checkpoint.

Long term, the checkpoint passed via resume_from_checkpoint should most likely be consumed internally (i.e. reset to None) once the state has been restored. Alternatively, one could use the Trainer.testing attribute to restrict CheckpointConnector's use of Trainer.resume_from_checkpoint to training only.
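The "consume after restore" proposal can be sketched with a toy model in plain Python (no Lightning involved; ToyTrainer and its attributes are illustrative stand-ins, not the real internals):

```python
# Toy model of the reported behavior: the restore logic always prefers
# resume_from_checkpoint, even when invoked from test().

class ToyTrainer:
    def __init__(self, resume_from_checkpoint=None):
        self.resume_from_checkpoint = resume_from_checkpoint
        self.best_model_path = None
        self.restored_from = None

    def _restore(self):
        # Mirrors the reported cause: resume_from_checkpoint wins
        # unconditionally whenever it is set.
        if self.resume_from_checkpoint is not None:
            self.restored_from = self.resume_from_checkpoint
        elif self.best_model_path is not None:
            self.restored_from = self.best_model_path

    def fit(self):
        self._restore()                     # resume training from the given ckpt
        self.best_model_path = "best.ckpt"  # fine-tuning produces a better ckpt

    def test(self):
        self._restore()  # test() goes through the same restore path as fit()
        return self.restored_from


# Buggy behavior: test() reloads the stale checkpoint.
t = ToyTrainer(resume_from_checkpoint="old.ckpt")
t.fit()
assert t.test() == "old.ckpt"  # the bug: not "best.ckpt"

# Proposed fix: consume the checkpoint once fit() has restored it.
t = ToyTrainer(resume_from_checkpoint="old.ckpt")
t.fit()
t.resume_from_checkpoint = None  # "consumed" after restore
assert t.test() == "best.ckpt"
```

The Trainer.testing alternative mentioned above would instead add a guard inside `_restore` so the resume path is skipped whenever the trainer is in testing mode.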

@ORippler ORippler added bug Something isn't working help wanted Open to be worked on labels Dec 11, 2020
@edenlightning
Contributor

@awaelchli thoughts?

@edenlightning edenlightning added the checkpointing Related to checkpointing label Dec 11, 2020
@tchaton tchaton added the priority: 0 High priority task label Dec 14, 2020
@tchaton
Contributor

tchaton commented Dec 16, 2020

Hey @ORippler,

Thanks for reporting the bug. Would you mind making the notebook public?
In the meantime, I will try to reproduce the bug locally and will update you if I manage to.

Best regards,
Thomas Chaton.

@tchaton tchaton added the waiting on author Waiting on user action, correction, or update label Dec 16, 2020
@ORippler
Contributor Author

ORippler commented Dec 16, 2020

@tchaton

Link should be updated.

Cheers!

@ananthsub
Contributor

For future reference, #9405 and https://github.com/PyTorchLi.../pytorch-lightning/pull/10061 unified the checkpoint loading paths to avoid this confusion, by deprecating resume_from_checkpoint in the Trainer constructor.
