change Checkpoint callback's `save_best_only` to `save_top_k` #128

Ir1d · 2019-08-17T03:07:10Z

this pr should close #70
bringing save_function outside of pytorch_lightning/callbacks/pt_callbacks.py is really confusing, since in that case pt_callbacks.py cannot be run on its own 🌚
not sure where to add the tests, so I appended them at the end of the file.

fix Lightning-AI#70

williamFalcon · 2019-08-17T11:51:03Z

add the test to test_models.py

williamFalcon · 2019-08-17T11:52:50Z

to test, set the save_function manually. do a few cases of k (k=0, k=1, k=2) and inspect the folder contents as expected for every case (those are the assertions)
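A minimal sketch of that test, assuming a lower-is-better metric such as `val_loss`. `TopKKeeper`, `fake_save`, `run_case` and the `epoch_N.ckpt` naming are hypothetical stand-ins for illustration, not the actual `ModelCheckpoint` code in this PR:

```python
import os
import tempfile


class TopKKeeper:
    """Hypothetical stand-in for the checkpoint callback (not the Lightning class)."""

    def __init__(self, dirpath, save_top_k, save_function):
        self.dirpath = dirpath
        self.save_top_k = save_top_k
        self.save_function = save_function  # set manually, as suggested above
        self.best = {}  # filepath -> monitored value (lower is better)

    def on_epoch_end(self, epoch, val_loss):
        if self.save_top_k == 0:
            return  # k=0: never save anything
        path = os.path.join(self.dirpath, f"epoch_{epoch}.ckpt")
        if len(self.best) < self.save_top_k:
            # Still room among the top k: always save.
            self.best[path] = val_loss
            self.save_function(path)
            return
        worst = max(self.best, key=self.best.get)
        if val_loss < self.best[worst]:
            # Replace the worst kept checkpoint; remove only that one file.
            del self.best[worst]
            os.remove(worst)
            self.best[path] = val_loss
            self.save_function(path)


def fake_save(path):
    # Stand-in save function: writes an empty file instead of model weights.
    open(path, "w").close()


def run_case(k, losses):
    """Feed one loss per epoch and return the sorted folder contents."""
    with tempfile.TemporaryDirectory() as tmp:
        keeper = TopKKeeper(tmp, save_top_k=k, save_function=fake_save)
        for epoch, loss in enumerate(losses):
            keeper.on_epoch_end(epoch, loss)
        return sorted(os.listdir(tmp))


losses = [2.8, 2.5, 5.0, 2.3, 2.6]
assert run_case(0, losses) == []
assert run_case(1, losses) == ["epoch_3.ckpt"]
assert run_case(2, losses) == ["epoch_1.ckpt", "epoch_3.ckpt"]
```

The assertions are just the expected folder contents for each k, so a real test could keep the same shape and swap in the actual callback.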

williamFalcon · 2019-08-17T14:22:14Z

@Ir1d i'll take a look at the PR once the tests are added. awesome PR!

williamFalcon · 2019-08-20T17:18:29Z

pytorch_lightning/callbacks/pt_callbacks.py

-                        shutil.rmtree(path_to_delete)
-                    except OSError:
-                        os.remove(path_to_delete)
+        try:


you should only remove files this callback saved. For instance this would remove other checkpoints the user drags in manually or the ones saved by slurm

I'm not sure what del_model and save_model are intended to be. In my experiments I noticed that the original implementation simply deletes all the models in the corresponding folder. I modified the functions so that they delete only the model at the given filepath. AFAIK, these two functions are called with the exact model filepath.
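To illustrate the difference, here is a rough sketch of deleting exactly one file instead of clearing the folder. `remove_checkpoint` and the file names are made up for this example and are not the PR's actual code:

```python
import os
import tempfile


def remove_checkpoint(filepath):
    """Delete exactly one checkpoint file; never touch its siblings."""
    try:
        os.remove(filepath)
    except FileNotFoundError:
        pass  # already gone, nothing to clean up


with tempfile.TemporaryDirectory() as ckpt_dir:
    # One file written by the callback, one dropped in by the user (or slurm).
    ours = os.path.join(ckpt_dir, "saved_by_callback.ckpt")
    manual = os.path.join(ckpt_dir, "copied_in_by_user.ckpt")
    for path in (ours, manual):
        open(path, "w").close()

    remove_checkpoint(ours)
    remaining = os.listdir(ckpt_dir)

# Only the callback's own file was removed.
assert remaining == ["copied_in_by_user.ckpt"]
```

A `shutil.rmtree` on the checkpoint directory would have removed both files, which is the behavior the review comment above warns about.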

Ir1d · 2019-11-05T13:56:45Z

@williamFalcon ping

williamFalcon · 2019-11-05T16:10:07Z

@Ir1d looks like GPU tests fail

__________________________________________________________________________________ test_amp_gpu_dp __________________________________________________________________________________

    def test_amp_gpu_dp():
        """
        Make sure DP + AMP work
        :return:
        """
        testing_utils.reset_seed()

        if not testing_utils.can_run_gpu_test():
            return

        model, hparams = testing_utils.get_model()
        trainer_options = dict(
            max_nb_epochs=1,
            gpus='0, 1',  # test init with gpu string
            distributed_backend='dp',
            use_amp=True
        )
        with pytest.raises(MisconfigurationException):
>           testing_utils.run_gpu_model_test(trainer_options, model, hparams)

tests/test_z_amp.py:204:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/testing_utils.py:62: in run_gpu_model_test
    checkpoint = init_checkpoint_callback(logger)
tests/testing_utils.py:238: in init_checkpoint_callback
    checkpoint = ModelCheckpoint(ckpt_dir)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pytorch_lightning.callbacks.pt_callbacks.ModelCheckpoint object at 0x7efdb03242e8>
filepath = '/private/home/falc/Developer/pytorch-lightning/tests/save_dir/lightning_logs/version_22/checkpoints', monitor = 'val_loss', verbose = 0, save_top_k = 1
save_weights_only = False, mode = 'auto', period = 1, prefix = ''

    def __init__(self, filepath, monitor='val_loss', verbose=0,
                 save_top_k=1, save_weights_only=False,
                 mode='auto', period=1, prefix=''):
        super(ModelCheckpoint, self).__init__()
        if (
>           save_best_only and
            os.path.isdir(filepath) and
            len(os.listdir(filepath)) > 0
        ):
E       NameError: name 'save_best_only' is not defined

pytorch_lightning/callbacks/pt_callbacks.py:189: NameError
================================================================================= warnings summary ==================================================================================
/private/home/falc/.conda/envs/lightning/lib/python3.7/site-packages/av/container/__init__.py:1
  /private/home/falc/.conda/envs/lightning/lib/python3.7/site-packages/av/container/__init__.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
    from .core import Container, open

tests/test_a_restore_models.py::test_running_test_pretrained_model_ddp
  /private/home/falc/.conda/envs/lightning/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216, got 192
    return f(*args, **kwds)

tests/test_a_restore_models.py::test_running_test_pretrained_model_ddp
tests/test_a_restore_models.py::test_running_test_pretrained_model_ddp
tests/test_a_restore_models.py::test_running_test_pretrained_model_ddp
  /private/home/falc/.conda/envs/lightning/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
    return f(*args, **kwds)

-- Docs: https://docs.pytest.org/en/latest/warnings.html
================================================================= 38 failed, 48 passed, 5 warnings in 27.44 seconds ====

Ir1d · 2019-11-05T16:12:55Z

it's because master changed. a fix will be ready soon

Ir1d · 2019-11-05T16:22:56Z

Preview of the edits on docs

awaelchli · 2019-11-05T18:08:01Z

@Ir1d how about just writing k >= 1 instead of k (k >= 1)? It's cleaner.

williamFalcon · 2019-11-06T10:37:32Z

@Ir1d i don't think this can go in this release. let's land it soon and test for a bit on master before releasing.

Ir1d · 2019-11-06T10:48:24Z

Fine. I just don't feel like merging master branch into this PR again.

williamFalcon · 2019-11-06T10:51:38Z

totally. but i imagine the longer we delay that the harder it’ll become to get this PR added.

checkpointing is very important for research, i don’t want to rush this and mess up a bunch of projects

jeffling · 2019-11-15T17:58:53Z

docs/Trainer/Checkpointing.md

+
+Also, if `save_top_k` >= 2 and the callback is called multiple
+times inside an epoch, the name of the saved file will be
+appended with a version count starting with `v0`.
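A small sketch of how such a version suffix could be derived. The `ckpt_epoch_N` base name and the `versioned_name` helper are assumptions for illustration; the PR's real naming scheme may differ:

```python
def versioned_name(epoch, taken):
    """Return a checkpoint name, adding _v0, _v1, ... if the base name is taken."""
    base = f"ckpt_epoch_{epoch}"
    if f"{base}.ckpt" not in taken:
        return f"{base}.ckpt"
    version = 0
    while f"{base}_v{version}.ckpt" in taken:
        version += 1
    return f"{base}_v{version}.ckpt"


# Simulate the callback firing three times inside epoch 4.
taken = set()
names = []
for _ in range(3):
    name = versioned_name(4, taken)
    taken.add(name)
    names.append(name)

assert names == ["ckpt_epoch_4.ckpt", "ckpt_epoch_4_v0.ckpt", "ckpt_epoch_4_v1.ckpt"]
```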


I understand this PR has been out for a while so it's okay if this change isn't implemented, but at latent space we've implemented a checkpointing system within an epoch (we have a use case of epochless datasets) and we rely heavily on having the iteration in the name for some of our checkpoint analysis. If we don't expect to save multiple times in the same iteration, we could consider using the iteration number in the file name instead of a version count.

+1 on considering iteration numbers.
But I think it's too much for this PR; perhaps consider a separate issue?
You see, we expect so many things from this single PR, and it has been blocked like forever. And I have to merge master from time to time.

williamFalcon · 2019-11-15T18:03:58Z

@jeffling feel free to bring this PR up to speed. Add your use case here as well as other production teams will find it helpful.

accidentally pressed wrong button when solving conflicts

Ir1d · 2019-11-17T04:20:26Z

@williamFalcon ping

jeffling · 2019-11-22T23:55:49Z

> @jeffling feel free to bring this PR up to speed. Add your use case here as well as other production teams will find it helpful.

Will be adding in follow-up PR

Ir1d added 6 commits August 14, 2019 02:54

docs: enable syntax highlight

6b54c1e

feat: change Checkpoint callback's save_best_only to save_top_k

49e35b1

fix Lightning-AI#70

docs: update docs for save_top_k

08f57b6

revert other files

4855bf8

style: lint for travis-ci

2f6f784

fix typo

daae566

Ir1d changed the title ~~change Checkpoint callback's save_best_only to `save_top_k~~ change Checkpoint callback's save_best_only to save_top_k Aug 17, 2019

make flake8 happy

a7da269

Borda requested changes Aug 19, 2019

williamFalcon added this to To do in Key features - Roadmap v1.0 Aug 19, 2019

Ir1d added 6 commits August 20, 2019 00:47

update according to review

373cd1d

add tests

49f78d6

Merge remote-tracking branch 'wf/master' into save_top_k

a34811a

rename func to private

fbc8a4e

add doc on save_top_k == 0

52bfcb7

make flake8 happy

38afd66