
Checkpoint gives error #526

Closed
jiequanz opened this issue Nov 19, 2019 · 19 comments
Labels
bug Something isn't working

Comments

@jiequanz

Hi,

I wonder if we can save only the best model (the one with the lowest validation error) and not save the other checkpoints.

I took a look at checkpoint_callback's save_best_only (below); it seems to save a checkpoint at every epoch (the file name changes each epoch) rather than only one best model. So I wonder if we can save only the best model of the whole training process. Thanks!
`checkpoint_callback = ModelCheckpoint(filepath=os.getcwd(), save_best_only=True, verbose=True, monitor='val_loss', mode='min', prefix='')`
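For reference, a minimal sketch of the configuration being asked about, written against a slightly later pytorch-lightning API in which `save_top_k=1` took over the role of `save_best_only` (argument names drifted between releases, so treat this as an assumption about the installed version rather than a drop-in fix):

```python
import os
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep only the single best checkpoint of the whole run, judged by val_loss.
# 0.5.x-era releases exposed this as save_best_only=True; later releases
# express the same intent with save_top_k=1 (shown here).
checkpoint_callback = ModelCheckpoint(
    filepath=os.path.join(os.getcwd(), 'checkpoints'),  # where the checkpoint file is written
    monitor='val_loss',  # quantity compared across epochs
    mode='min',          # lower val_loss is better
    save_top_k=1,        # keep only the best-scoring checkpoint
    verbose=True,
)
```

The callback is then passed to the trainer, e.g. `Trainer(checkpoint_callback=checkpoint_callback)` in releases of that era.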

@jiequanz jiequanz added the question Further information is requested label Nov 19, 2019
@neggert
Contributor

neggert commented Nov 19, 2019

I'm pretty sure that save_best_only does what you're asking.

@jiequanz
Author

Really? It defaults to True, but it is still saving models from every epoch.

@Borda
Member

Borda commented Nov 19, 2019

but it may have been changed in #128 :)

@jiequanz
Author

I used the newest version of pytorch lightning, and got this error:

```
.../lib/python3.7/site-packages/pytorch_lightning/trainer/trainer_io.py:210: UserWarning: Did not find hyperparameters at model.hparams. Saving checkpoint without hyperparameters
  "Did not find hyperparameters at model.hparams. Saving checkpoint without"
Traceback (most recent call last):
  File "main.py", line 315, in <module>
    fire.Fire()
  File ".../lib/python3.7/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File ".../lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File ".../lib/python3.7/site-packages/fire/core.py", line 675, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "main.py", line 287, in train
    trainer.fit(m)
  File ".../lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File ".../lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File ".../lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File ".../lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File ".../lib/python3.7/site-packages/pytorch_lightning/trainer/ddp_mixin.py", line 181, in ddp_train
    self.run_pretrain_routine(model)
  File ".../lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 471, in run_pretrain_routine
    self.train()
  File ".../lib/python3.7/site-packages/pytorch_lightning/trainer/train_loop_mixin.py", line 60, in train
    self.run_training_epoch()
  File ".../lib/python3.7/site-packages/pytorch_lightning/trainer/train_loop_mixin.py", line 120, in run_training_epoch
    self.logger.save()
  File ".../lib/python3.7/site-packages/pytorch_lightning/logging/base.py", line 13, in wrapped_fn
    fn(self, *args, **kwargs)
  File ".../lib/python3.7/site-packages/pytorch_lightning/logging/test_tube_logger.py", line 57, in save
    self.experiment.save()
  File ".../lib/python3.7/site-packages/test_tube/log.py", line 346, in save
    with open(self.__get_log_name(), 'w') as file:
FileNotFoundError: [Errno 2] No such file or directory: './sandbox/debug11127712/lightning_logs/version_0/meta.experiment'

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File ".../lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File ".../lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File ".../lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File ".../lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File ".../lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File ".../lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
.../lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 12 leaked semaphores to clean up at shutdown
  len(cache))
.../lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 12 leaked semaphores to clean up at shutdown
  len(cache))
.../lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 12 leaked semaphores to clean up at shutdown
  len(cache))
```

Thanks for helping!

@Borda
Member

Borda commented Nov 20, 2019

@Jiequannnnnnnnnn can you report it as a bug, with a reproducible example...?

@jiequanz jiequanz changed the title from "Can we only save the best model in training?" to "Checkpoint gives error" Nov 20, 2019
@jiequanz
Author

> @Jiequannnnnnnnnn can you report it as a bug, with a reproducible example...?

Sorry... I tried, but I'm not sure how to add a bug label. What does a reproducible example mean in this case?

@Borda
Member

Borda commented Nov 20, 2019

you probably can't change a label; you'd need to create a new issue... by example I mean the sample code that gave you this error, plus the library version you used...
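For anyone following along, a skeleton of what such a minimal example could look like. This is only a sketch against a pytorch-lightning API of roughly that era (hook and Trainer argument names changed across releases) and is not the reporter's actual code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class MinimalModel(pl.LightningModule):
    """Smallest possible module: one linear layer on random data."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.layer(x), y)
        return {'loss': loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        # random data keeps the example self-contained
        x = torch.randn(64, 32)
        y = torch.randint(0, 2, (64,))
        return DataLoader(TensorDataset(x, y), batch_size=8)

if __name__ == '__main__':
    model = MinimalModel()
    # same multi-GPU / DDP setup that triggers mp.spawn in the traceback above
    trainer = pl.Trainer(max_epochs=2, gpus=2, distributed_backend='ddp')
    trainer.fit(model)
```

Posting something of this size, together with the output of `pip freeze | grep lightning`, usually makes it much easier to tell whether the bug is in the library or in the surrounding code.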

@williamFalcon williamFalcon added bug Something isn't working and removed question Further information is requested labels Nov 20, 2019
@williamFalcon
Contributor

@Borda added label.
@Jiequannnnnnnnnn a minimal example to reproduce would be great!
Thanks :)

@Borda
Member

Borda commented Nov 20, 2019

> @Borda added label.

@williamFalcon unfortunately, we do not have the rights/permission to do it :]

Screenshot at 2019-11-20 15-07-05

@neggert
Contributor

neggert commented Nov 20, 2019

Same issue as #525?

@jiequanz
Author

> Same issue as #525?

No, it is not the same issue.

@jiequanz
Author

> @Borda added label.
> @Jiequannnnnnnnnn a minimal example to reproduce would be great!
> Thanks :)

I made a file that reproduces most of the errors above (except for the last traceback).
I put it here; please take a look when you have free time! Thanks.
main-Copy1.py.zip

@jiequanz
Author

Also, I am using the newest version of lightning

@jiequanz
Author

If I comment out the checkpoint_callback in the trainer, training ends at the 8th epoch; I'm not sure why this happens.

@jeffling
Contributor

Just covering the obvious things first:

The main issue is the FileNotFoundError: [Errno 2] No such file or directory: './sandbox/debug11127712/lightning_logs/version_0/meta.experiment'

Can you make sure that the parent directories exist?

Can you turn off DDP and see if it works?
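
A rough sketch of how one could act on both checks, assuming the path from the error message and a hypothetical `debug_run` helper that receives the reporter's LightningModule (how the logger gets pointed at `log_root` depends on their setup):

```python
import os
from pytorch_lightning import Trainer

def debug_run(model, log_root='./sandbox/debug11127712'):
    """Two quick checks for the FileNotFoundError above.

    `model` is the reporter's LightningModule; `log_root` is copied from the
    error message and should match wherever the logger is pointed.
    """
    # 1) Make sure the directory tree the logger writes into exists up front;
    #    the traceback shows test-tube failing to open
    #    .../lightning_logs/version_0/meta.experiment inside it.
    os.makedirs(os.path.join(log_root, 'lightning_logs'), exist_ok=True)

    # 2) Rule out DDP: train on a single process/GPU so mp.spawn is never used.
    trainer = Trainer(gpus=1)  # no distributed_backend / no ddp
    trainer.fit(model)
```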

@williamFalcon
Contributor

williamFalcon commented Nov 25, 2019

@Jiequannnnnnnnnn still having issues? If so, I'll take a look at your code.

@jiequanz
Author

Yeah, still having the issue. Couldn't fix it. Thanks!

@williamFalcon
Contributor

@Jiequannnnnnnnnn still having issues? This should be fixed on master.

@vanpersie32

> I used the newest version of pytorch lightning, and got this error:
> […same traceback as quoted in full above…]
> Thanks for helping!

I have also run into this problem. Did you ever solve it?
