
Checkpoint gives error #526

Closed
jiequanz opened this issue Nov 19, 2019 · 19 comments
Labels
bug Something isn't working

Comments

@jiequanz

Hi,

I wonder if we can save only the best model (the one with the lowest validation error) and not save the other checkpoints.

I took a look at checkpoint_callback's save_best_only (below); it seems to save a checkpoint at every epoch (the file name changes each epoch) rather than only one best model. So I wonder if we can save only the best model of the whole training process. Thanks!
`checkpoint_callback = ModelCheckpoint(filepath=os.getcwd(), save_best_only=True, verbose=True, monitor='val_loss', mode='min', prefix='')`
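For reference, a minimal sketch of the configuration being asked about, written against a slightly later pytorch-lightning API in which `save_top_k=1` took over the role of `save_best_only` (argument names drifted between releases, so treat this as an assumption about the installed version rather than a drop-in fix):

```python
import os
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep only the single best checkpoint of the whole run, judged by val_loss.
# 0.5.x-era releases exposed this as save_best_only=True; later releases
# express the same intent with save_top_k=1 (shown here).
checkpoint_callback = ModelCheckpoint(
    filepath=os.path.join(os.getcwd(), 'checkpoints'),  # where the checkpoint file is written
    monitor='val_loss',  # quantity compared across epochs
    mode='min',          # lower val_loss is better
    save_top_k=1,        # keep only the best-scoring checkpoint
    verbose=True,
)
```

The callback is then passed to the trainer, e.g. `Trainer(checkpoint_callback=checkpoint_callback)` in releases of that era.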

@jiequanz jiequanz added the question Further information is requested label Nov 19, 2019
@neggert
Contributor

neggert commented Nov 19, 2019

I'm pretty sure that save_best_only does what you're asking.

@jiequanz
Author

Really? It defaults to True, but it is still saving models from every epoch.

@Borda
Member

Borda commented Nov 19, 2019

but it may have been changed in #128 :)

@jiequanz
Author

I used the newest version of pytorch lightning, and got this error:

```
.../lib/python3.7/site-packages/pytorch_lightning/trainer/trainer_io.py:210: UserWarning: Did not find hyperparameters at model.hparams. Saving checkpoint without hyperparameters
  "Did not find hyperparameters at model.hparams. Saving checkpoint without"
Traceback (most recent call last):
  File "main.py", line 315, in <module>
    fire.Fire()
  File ".../lib/python3.7/site-packages/fire/core.py", line 138, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File ".../lib/python3.7/site-packages/fire/core.py", line 471, in _Fire
    target=component.__name__)
  File ".../lib/python3.7/site-packages/fire/core.py", line 675, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "main.py", line 287, in train
    trainer.fit(m)
  File ".../lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 343, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File ".../lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File ".../lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File ".../lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File ".../lib/python3.7/site-packages/pytorch_lightning/trainer/ddp_mixin.py", line 181, in ddp_train
    self.run_pretrain_routine(model)
  File ".../lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 471, in run_pretrain_routine
    self.train()
  File ".../lib/python3.7/site-packages/pytorch_lightning/trainer/train_loop_mixin.py", line 60, in train
    self.run_training_epoch()
  File ".../lib/python3.7/site-packages/pytorch_lightning/trainer/train_loop_mixin.py", line 120, in run_training_epoch
    self.logger.save()
  File ".../lib/python3.7/site-packages/pytorch_lightning/logging/base.py", line 13, in wrapped_fn
    fn(self, *args, **kwargs)
  File ".../lib/python3.7/site-packages/pytorch_lightning/logging/test_tube_logger.py", line 57, in save
    self.experiment.save()
  File ".../lib/python3.7/site-packages/test_tube/log.py", line 346, in save
    with open(self.__get_log_name(), 'w') as file:
FileNotFoundError: [Errno 2] No such file or directory: './sandbox/debug11127712/lightning_logs/version_0/meta.experiment'

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File ".../lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File ".../lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File ".../lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File ".../lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File ".../lib/python3.7/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File ".../lib/python3.7/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
_pickle.UnpicklingError: pickle data was truncated
.../lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 12 leaked semaphores to clean up at shutdown
  len(cache))
.../lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 12 leaked semaphores to clean up at shutdown
  len(cache))
.../lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 12 leaked semaphores to clean up at shutdown
  len(cache))
```

Thanks for helping!

@Borda
Member

Borda commented Nov 20, 2019

@Jiequannnnnnnnnn can you report it as a bug, with a reproducible example...?

@jiequanz jiequanz changed the title from "Can we only save the best model in training?" to "Checkpoint gives error" Nov 20, 2019
@jiequanz
Author

> @Jiequannnnnnnnnn can you report it as a bug, with a reproducible example...?

Sorry... I tried, but I'm not sure how to add a bug label. What does a reproducible example mean in this case?

@Borda
Member

Borda commented Nov 20, 2019

you probably can't change a label; you'd need to create a new issue... by example I mean the sample code that gave you this error, plus the library version you used...
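For anyone following along, a skeleton of what such a minimal example could look like. This is only a sketch against a pytorch-lightning API of roughly that era (hook and Trainer argument names changed across releases) and is not the reporter's actual code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class MinimalModel(pl.LightningModule):
    """Smallest possible module: one linear layer on random data."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.cross_entropy(self.layer(x), y)
        return {'loss': loss}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

    def train_dataloader(self):
        # random data keeps the example self-contained
        x = torch.randn(64, 32)
        y = torch.randint(0, 2, (64,))
        return DataLoader(TensorDataset(x, y), batch_size=8)

if __name__ == '__main__':
    model = MinimalModel()
    # same multi-GPU / DDP setup that triggers mp.spawn in the traceback above
    trainer = pl.Trainer(max_epochs=2, gpus=2, distributed_backend='ddp')
    trainer.fit(model)
```

Posting something of this size, together with the output of `pip freeze | grep lightning`, usually makes it much easier to tell whether the bug is in the library or in the surrounding code.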

@williamFalcon williamFalcon added bug Something isn't working and removed question Further information is requested labels Nov 20, 2019
@williamFalcon
Contributor

@Borda added label.
@Jiequannnnnnnnnn a minimal example to reproduce would be great!
Thanks :)

@Borda
Member

Borda commented Nov 20, 2019

> @Borda added label.

@williamFalcon unfortunately, we do not have the rights/permission to do it :]

Screenshot at 2019-11-20 15-07-05

@neggert
Contributor

neggert commented Nov 20, 2019

Same issue as #525?

@jiequanz
Author

> Same issue as #525?

No, it is not the same issue.

@jiequanz
Author

> @Borda added label.
> @Jiequannnnnnnnnn a minimal example to reproduce would be great!
> Thanks :)

I made a file that reproduces most of the errors above (except for the last traceback).
I put it here; please take a look when you have free time! Thanks.
main-Copy1.py.zip

@jiequanz
Author

Also, I am using the newest version of lightning

@jiequanz
Author

If I comment out the checkpoint_callback in the trainer, training ends at the 8th epoch; I'm not sure why this happens.

@jeffling
Contributor

Just covering the obvious things first:

The main issue is the FileNotFoundError: [Errno 2] No such file or directory: './sandbox/debug11127712/lightning_logs/version_0/meta.experiment'

Can you make sure that the parent directories exist?

Can you turn off DDP and see if it works?
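
A rough sketch of how one could act on both checks, assuming the path from the error message and a hypothetical `debug_run` helper that receives the reporter's LightningModule (how the logger gets pointed at `log_root` depends on their setup):

```python
import os
from pytorch_lightning import Trainer

def debug_run(model, log_root='./sandbox/debug11127712'):
    """Two quick checks for the FileNotFoundError above.

    `model` is the reporter's LightningModule; `log_root` is copied from the
    error message and should match wherever the logger is pointed.
    """
    # 1) Make sure the directory tree the logger writes into exists up front;
    #    the traceback shows test-tube failing to open
    #    .../lightning_logs/version_0/meta.experiment inside it.
    os.makedirs(os.path.join(log_root, 'lightning_logs'), exist_ok=True)

    # 2) Rule out DDP: train on a single process/GPU so mp.spawn is never used.
    trainer = Trainer(gpus=1)  # no distributed_backend / no ddp
    trainer.fit(model)
```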

@williamFalcon
Contributor

williamFalcon commented Nov 25, 2019

@Jiequannnnnnnnnn still having issues? If so, I'll take a look at your code.

@jiequanz
Author

Yeah, still having the issue. Couldn't fix it. Thanks!

@williamFalcon
Contributor

@Jiequannnnnnnnnn still having issues? This should be fixed on master.

@vanpersie32

> I used the newest version of pytorch lightning, and got this error:
> […same traceback as quoted in full above…]
> Thanks for helping!

I have also run into this problem. Did you ever solve it?
