change Checkpoint callback's save_best_only to save_top_k #128

Merged on Nov 19, 2019 (50 commits)

Changes shown from 48 commits

Commits:
6b54c1e
docs: enable syntax highlight
Ir1d Aug 13, 2019
49e35b1
feat: change Checkpoint callback's `save_best_only` to `save_top_k`
Ir1d Aug 17, 2019
08f57b6
docs: update docs for save_top_k
Ir1d Aug 17, 2019
4855bf8
revert other files
Ir1d Aug 17, 2019
2f6f784
style: lint for travis-ci
Ir1d Aug 17, 2019
daae566
fix typo
Ir1d Aug 17, 2019
a7da269
make flake8 happy
Ir1d Aug 17, 2019
373cd1d
update according to review
Ir1d Aug 19, 2019
49f78d6
add tests
Ir1d Aug 19, 2019
a34811a
Merge remote-tracking branch 'wf/master' into save_top_k
Ir1d Aug 19, 2019
fbc8a4e
rename func to private
Ir1d Aug 19, 2019
52bfcb7
add doc on `save_top_k == 0`
Ir1d Aug 19, 2019
38afd66
make flake8 happy
Ir1d Aug 20, 2019
ab92212
update according to PR comments
Ir1d Aug 22, 2019
7e8767f
change some f-strings
Ir1d Aug 22, 2019
3079b71
Update pt_callbacks.py
williamFalcon Aug 23, 2019
3bb6e3e
Update test_models.py
williamFalcon Aug 23, 2019
4626e1a
update options
Ir1d Aug 23, 2019
6a47d55
create folders
Ir1d Aug 23, 2019
4b3d5bf
Update test_models.py
williamFalcon Aug 23, 2019
d29eff7
change epoch num
Ir1d Aug 23, 2019
22cfeb5
Merge remote-tracking branch 'origin/save_top_k' into save_top_k
Ir1d Aug 23, 2019
d15b03b
support calling multiple times, add docs and tests
Ir1d Aug 23, 2019
58a8410
update docs
Ir1d Aug 23, 2019
b9a855d
Merge remote-tracking branch 'wf/master' into save_top_k
Ir1d Aug 23, 2019
e1f7a48
Merge remote-tracking branch 'wf/master' into save_top_k
Ir1d Aug 24, 2019
2dd65e9
roll back changes in earlystopping
Ir1d Aug 24, 2019
634ad80
clean test files
Ir1d Aug 24, 2019
1861f7d
rebase upstream
Ir1d Oct 22, 2019
b66b33c
make flake8 happy
Ir1d Oct 22, 2019
aed9ad9
Merge remote-tracking branch 'will/master' into save_top_k
Ir1d Nov 4, 2019
abb9629
fix epoch number
Ir1d Nov 4, 2019
a0c2269
update tests about epoch numbers
Ir1d Nov 4, 2019
4490466
clean debugging code
Ir1d Nov 4, 2019
0e98423
fix testing utils codes
Ir1d Nov 4, 2019
2eab575
fix testing utils codes
Ir1d Nov 4, 2019
27ebd1c
fix testing utils codes
Ir1d Nov 4, 2019
37cfedf
fix testing utils codes
Ir1d Nov 4, 2019
3f49122
change save_dir to tests/tests according to previous lines
Ir1d Nov 4, 2019
585fff0
remove unused overwrite option
Ir1d Nov 4, 2019
94a6c49
make flake8 happy
Ir1d Nov 4, 2019
b983860
change var name as per review
Ir1d Nov 4, 2019
c3f56dd
Merge remote-tracking branch 'will/master' into save_top_k
Ir1d Nov 5, 2019
13803ff
make flake8 happy
Ir1d Nov 5, 2019
7c58669
Merge remote-tracking branch 'will/master' into save_top_k
Ir1d Nov 5, 2019
235e5f1
update property name to work on master
Ir1d Nov 5, 2019
8635612
elaborate in the docs
Ir1d Nov 5, 2019
e59d5bd
update docs as per review
Ir1d Nov 6, 2019
3f05419
Merge branch 'master' into save_top_k
Ir1d Nov 16, 2019
b000db9
revert previous commit
Ir1d Nov 16, 2019
45 changes: 31 additions & 14 deletions docs/Trainer/Checkpointing.md
@@ -1,6 +1,7 @@
Lightning can automate saving and loading checkpoints.

---

### Model saving
Checkpointing is enabled by default to the current working directory.
To change the checkpoint path, pass in:
@@ -10,13 +11,13 @@ Trainer(default_save_path='/your/path/to/save/checkpoints')

To modify the behavior of checkpointing pass in your own callback.

``` {.python}
```{.python}
from pytorch_lightning.callbacks import ModelCheckpoint

# DEFAULTS used by the Trainer
checkpoint_callback = ModelCheckpoint(
filepath=os.getcwd(),
save_best_only=True,
save_top_k=1,
verbose=True,
monitor='val_loss',
mode='min',
@@ -26,11 +27,26 @@ checkpoint_callback = ModelCheckpoint(
trainer = Trainer(checkpoint_callback=checkpoint_callback)
```

The `save_top_k` option works in the following ways:

| save_top_k | behavior |
| -------- | ----- |
| 0 | no models are saved |
| -1 | all models are saved |
| k >= 1 | the best k models are saved |


Also, if `save_top_k` >= 2 and the callback is called multiple
times inside an epoch, the name of the saved file will be
appended with a version count starting with `v0`.
Contributor: I understand this PR has been out for a while, so it's okay if this change isn't implemented, but at latent space we've implemented a checkpointing system within an epoch (we have a use case of epochless datasets), and we rely heavily on having the iteration in the name for some of our checkpoint analysis. If we don't think we're saving multiple times in the same iteration, we could consider using the iteration number.

Contributor Author: +1 on considering iteration numbers, but I think that is too much for this PR; perhaps it deserves a separate issue? We already expect so many things from this single PR, and it has been blocked for a long time, so I have to keep merging master in from time to time.
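
For illustration only, here is a minimal sketch of how the new `save_top_k` option might be configured. It is not part of the PR's docs; the filenames mentioned in the comments are assumptions based on the naming scheme described above.

```python
import os

from pytorch_lightning import Trainer
from pytorch_lightning.callbacks import ModelCheckpoint

# Keep only the 3 best checkpoints, ranked by validation loss (lower is better).
checkpoint_callback = ModelCheckpoint(
    filepath=os.path.join(os.getcwd(), 'checkpoints'),
    save_top_k=3,        # 0 -> save nothing, -1 -> save every checkpoint
    verbose=True,
    monitor='val_loss',
    mode='min',
    prefix='',
)

trainer = Trainer(checkpoint_callback=checkpoint_callback)
# If the callback fires more than once within the same epoch and save_top_k >= 2,
# later files get a version suffix, e.g. _ckpt_epoch_2.ckpt, then _ckpt_epoch_2_v0.ckpt.
```

Here `mode='min'` makes smaller `val_loss` values count as better checkpoints; with an accuracy-style metric you would pass `mode='max'`.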


---
### Restoring training session

### Restoring training session

You might want to not only load a model but also continue training it. Use this method to
restore the trainer state as well. This will continue from the epoch and global step you last left off.
However, the dataloaders will start from the first batch again (if you shuffled it shouldn't matter).
However, the dataloaders will start from the first batch again (if you shuffled it shouldn't matter).

Lightning will restore the session if you pass a logger with the same version and there's a saved checkpoint.
``` {.python}
@@ -52,18 +68,19 @@ trainer = Trainer(
trainer.fit(model)
```

The trainer restores:
The trainer restores:

- global_step
- current_epoch
- All optimizers
- All lr_schedulers
- global_step
- current_epoch
- All optimizers
- All lr_schedulers
- Model weights

You can even change the logic of your model as long as the weights and "architecture" of
the system isn't different. If you add a layer, for instance, it might not work.
You can even change the logic of your model as long as the weights and "architecture" of
the system isn't different. If you add a layer, for instance, it might not work.

At a rough level, here's [what happens inside Trainer](https://github.com/williamFalcon/pytorch-lightning/blob/master/pytorch_lightning/root_module/model_saving.py#L63):

At a rough level, here's [what happens inside Trainer](https://github.com/williamFalcon/pytorch-lightning/blob/master/pytorch_lightning/root_module/model_saving.py#L63):
```python

self.global_step = checkpoint['global_step']
@@ -79,6 +96,6 @@ lr_schedulers = checkpoint['lr_schedulers']
for scheduler, lrs_state in zip(self.lr_schedulers, lr_schedulers):
scheduler.load_state_dict(lrs_state)

# uses the model you passed into trainer
# uses the model you passed into trainer
model.load_state_dict(checkpoint['state_dict'])
```
```
41 changes: 25 additions & 16 deletions docs/examples/Examples.md
@@ -6,14 +6,15 @@ In 99% of cases you want to just copy [one of the examples](https://github.com/w
wget https://github.com/raw/williamFalcon/pytorch-lightning/master/pl_examples/new_project_templates/lightning_module_template.py
```

---
### Trainer Example
---

### Trainer Example

** \_\_main__ function**
** \_\_main\_\_ function**

Normally, we want to let the \_\_main__ function start the training.
Inside the main we parse training arguments with whatever hyperparameters we want. Your LightningModule will have a
chance to add hyperparameters.
Normally, we want to let the \_\_main\_\_ function start the training.
Inside the main we parse training arguments with whatever hyperparameters we want. Your LightningModule will have a
chance to add hyperparameters.

```{.python}
from test_tube import HyperOptArgumentParser
@@ -32,13 +33,15 @@ if __name__ == '__main__':
# train model
main(hyperparams)
```
**Main Function**

**Main Function**

The main function is your entry into the program. This is where you init your model, checkpoint directory, and launch the training.
The main function should have 3 arguments:
- hparams: a configuration of hyperparameters.
The main function should have 3 arguments:

- hparams: a configuration of hyperparameters.
- slurm_manager: Slurm cluster manager object (can be None)
- dict: for you to return any values you want (useful in meta-learning, otherwise set to _)
- dict: for you to return any values you want (useful in meta-learning, otherwise set to \_)

```python
def main(hparams, cluster, results_dict):
@@ -62,13 +65,15 @@ The __main__ function will start training on your **main** function. If you use
in hyper parameter optimization mode, this main function will get one set of hyperparameters. If you use it as a simple
argument parser you get the default arguments in the argument parser.

So, calling main(hyperparams) runs the model with the default argparse arguments.
So, calling main(hyperparams) runs the model with the default argparse arguments.

```{.python}
main(hyperparams)
```

---
#### CPU hyperparameter search

#### CPU hyperparameter search

```{.python}
# run a grid search over 20 hyperparameter combinations.
@@ -80,7 +85,9 @@ hyperparams.optimize_parallel_cpu(
```

---
#### Hyperparameter search on a single or multiple GPUs

#### Hyperparameter search on a single or multiple GPUs

```{.python}
# run a grid search over 20 hyperparameter combinations.
hyperparams.optimize_parallel_gpu(
@@ -92,8 +99,10 @@ hyperparams.optimize_parallel_gpu(
```

---
#### Hyperparameter search on a SLURM HPC cluster
```{.python}

#### Hyperparameter search on a SLURM HPC cluster

```{.python}
def optimize_on_cluster(hyperparams):
# enable cluster training
cluster = SlurmCluster(
@@ -126,6 +135,6 @@ def optimize_on_cluster(hyperparams):
job_name=job_display_name
)

# run cluster hyperparameter search
# run cluster hyperparameter search
optimize_on_cluster(hyperparams)
```
134 changes: 96 additions & 38 deletions pytorch_lightning/callbacks/pt_callbacks.py
@@ -158,11 +158,17 @@ class ModelCheckpoint(Callback):
filepath: string, path to save the model file.
monitor: quantity to monitor.
verbose: verbosity mode, 0 or 1.
save_best_only: if `save_best_only=True`,
the latest best model according to
the quantity monitored will not be overwritten.
save_top_k: if `save_top_k == k`,
the best k models according to
the quantity monitored will be saved.
if `save_top_k == 0`, no models are saved.
if `save_top_k == -1`, all models are saved.
Please note that the monitors are checked every `period` epochs.
if `save_top_k >= 2` and the callback is called multiple
times inside an epoch, the name of the saved file will be
appended with a version count starting with `v0`.
mode: one of {auto, min, max}.
If `save_best_only=True`, the decision
If `save_top_k != 0`, the decision
to overwrite the current save file is made
based on either the maximization or the
minimization of the monitored quantity. For `val_acc`,
@@ -176,27 +182,33 @@ class ModelCheckpoint(Callback):
"""

def __init__(self, filepath, monitor='val_loss', verbose=0,
save_best_only=True, save_weights_only=False,
save_top_k=1, save_weights_only=False,
mode='auto', period=1, prefix=''):
super(ModelCheckpoint, self).__init__()
if (
save_best_only and
save_top_k and
os.path.isdir(filepath) and
len(os.listdir(filepath)) > 0
):
warnings.warn(
f"Checkpoint directory {filepath} exists and is not empty with save_best_only=True."
f"Checkpoint directory {filepath} exists and is not empty with save_top_k != 0."
"All files in this directory will be deleted when a checkpoint is saved!"
)

self.monitor = monitor
self.verbose = verbose
self.filepath = filepath
self.save_best_only = save_best_only
if not os.path.exists(filepath):
os.makedirs(filepath)
self.save_top_k = save_top_k
self.save_weights_only = save_weights_only
self.period = period
self.epochs_since_last_save = 0
self.epochs_since_last_check = 0
self.prefix = prefix
self.best_k_models = {}
# {filename: monitor}
self.kth_best_model = ''
self.best = 0

if mode not in ['auto', 'min', 'max']:
warnings.warn(
@@ -206,66 +218,112 @@ def __init__(self, filepath, monitor='val_loss', verbose=0,

if mode == 'min':
self.monitor_op = np.less
self.best = np.Inf
self.kth_value = np.Inf
self.mode = 'min'
elif mode == 'max':
self.monitor_op = np.greater
self.best = -np.Inf
self.kth_value = -np.Inf
self.mode = 'max'
else:
if 'acc' in self.monitor or self.monitor.startswith('fmeasure'):
self.monitor_op = np.greater
self.best = -np.Inf
self.kth_value = -np.Inf
self.mode = 'max'
else:
self.monitor_op = np.less
self.best = np.Inf
self.kth_value = np.Inf
self.mode = 'min'

def save_model(self, filepath, overwrite):
dirpath = '/'.join(filepath.split('/')[:-1])
def _del_model(self, filepath):
dirpath = os.path.dirname(filepath)

# make paths
os.makedirs(os.path.dirname(filepath), exist_ok=True)
os.makedirs(dirpath, exist_ok=True)

if overwrite:
for filename in os.listdir(dirpath):
if self.prefix in filename:
path_to_delete = os.path.join(dirpath, filename)
try:
shutil.rmtree(path_to_delete)
except OSError:
os.remove(path_to_delete)
try:
Contributor: You should only remove files this callback saved. For instance, this would remove other checkpoints the user drags in manually, or the ones saved by SLURM.

Contributor Author: I'm not sure what del_model and save_model are intended to be. Through my experiments I noticed that the original implementation simply deletes all the models in the corresponding folder. I modified the functions so that they delete only the model at the given filepath. AFAIK, these two functions are called with the exact model filepath.

shutil.rmtree(filepath)
Member: shutil.rmtree has a parameter ignore_errors, so there is no need for this try/except: https://docs.python.org/2/library/shutil.html
except OSError:
os.remove(filepath)

def _save_model(self, filepath):
dirpath = os.path.dirname(filepath)

# make paths
os.makedirs(dirpath, exist_ok=True)

# delegate the saving to the model
self.save_function(filepath)

def check_monitor_top_k(self, current):
less_than_k_models = len(self.best_k_models.keys()) < self.save_top_k
if less_than_k_models:
return True
return self.monitor_op(current, self.best_k_models[self.kth_best_model])

def on_epoch_end(self, epoch, logs=None):
logs = logs or {}
self.epochs_since_last_save += 1
if self.epochs_since_last_save >= self.period:
self.epochs_since_last_save = 0
filepath = '{}/{}_ckpt_epoch_{}.ckpt'.format(self.filepath, self.prefix, epoch + 1)
if self.save_best_only:
self.epochs_since_last_check += 1

if self.save_top_k == 0:
# no models are saved
return
if self.epochs_since_last_check >= self.period:
self.epochs_since_last_check = 0
filepath = f'{self.filepath}/{self.prefix}_ckpt_epoch_{epoch}.ckpt'
version_cnt = 0
while os.path.isfile(filepath):
# this epoch called before
filepath = f'{self.filepath}/{self.prefix}_ckpt_epoch_{epoch}_v{version_cnt}.ckpt'
version_cnt += 1

print(filepath)

if self.save_top_k != -1:
current = logs.get(self.monitor)

if current is None:
warnings.warn(
f'Can save best model only with {self.monitor} available,'
' skipping.', RuntimeWarning)
else:
if self.monitor_op(current, self.best):
if self.check_monitor_top_k(current):

# remove kth
if len(self.best_k_models.keys()) == self.save_top_k:
Member: You have a method _del_model which is quite empty, so move this logic about removing the k-th model there.

Contributor Author: I don't think so :) _save_model handles a filename to save, and _del_model should do the same and handle a filename to delete.

delpath = self.kth_best_model
self.best_k_models.pop(self.kth_best_model)
self._del_model(delpath)

self.best_k_models[filepath] = current
if len(self.best_k_models.keys()) == self.save_top_k:
# monitor dict has reached k elements
if self.mode == 'min':
self.kth_best_model = max(self.best_k_models, key=self.best_k_models.get)
else:
self.kth_best_model = min(self.best_k_models, key=self.best_k_models.get)
self.kth_value = self.best_k_models[self.kth_best_model]

if self.mode == 'min':
self.best = min(self.best_k_models.values())
else:
self.best = max(self.best_k_models.values())
if self.verbose > 0:
logging.info(
f'\nEpoch {epoch + 1:05d}: {self.monitor} improved'
f' from {self.best:0.5f} to {current:0.5f},',
f' saving model to {filepath}')
self.best = current
self.save_model(filepath, overwrite=True)
f'\nEpoch {epoch:05d}: {self.monitor} reached',
f'{current:0.5f} (best {self.best:0.5f}), saving model to',
f'{filepath} as top {self.save_top_k}')
self._save_model(filepath)

else:
if self.verbose > 0:
logging.info(
f'\nEpoch {epoch + 1:05d}: {self.monitor} did not improve')
f'\nEpoch {epoch:05d}: {self.monitor}',
f'was not in top {self.save_top_k}')

else:
if self.verbose > 0:
logging.info(f'\nEpoch {epoch + 1:05d}: saving model to {filepath}')
self.save_model(filepath, overwrite=False)
logging.info(f'\nEpoch {epoch:05d}: saving model to {filepath}')
self._save_model(filepath)


class GradientAccumulationScheduler(Callback):
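
As an aside for readers of the diff above, the following standalone sketch illustrates the top-k bookkeeping the new callback performs: a dict mapping saved paths to monitored values, with the current worst entry evicted when a better score arrives. The class and method names here are illustrative only; the real callback additionally handles `period`, version suffixes, and the `save_top_k == 0` / `-1` special cases.

```python
import numpy as np


class TopKTracker:
    """Simplified illustration of the best_k_models bookkeeping in ModelCheckpoint."""

    def __init__(self, k, mode='min'):
        self.k = k
        self.monitor_op = np.less if mode == 'min' else np.greater
        self.worst_of_best = max if mode == 'min' else min  # which retained entry to evict
        self.best_k_models = {}  # filepath -> monitored value
        self.kth_best_model = ''

    def should_save(self, current):
        # Always save while fewer than k checkpoints are retained,
        # otherwise only if `current` beats the worst retained score.
        if len(self.best_k_models) < self.k:
            return True
        return self.monitor_op(current, self.best_k_models[self.kth_best_model])

    def update(self, filepath, current, delete_fn):
        # Evict the current k-th best checkpoint before recording the new one.
        if len(self.best_k_models) == self.k and self.kth_best_model:
            delete_fn(self.kth_best_model)
            self.best_k_models.pop(self.kth_best_model)
        self.best_k_models[filepath] = current
        if len(self.best_k_models) == self.k:
            self.kth_best_model = self.worst_of_best(
                self.best_k_models, key=self.best_k_models.get)
```

A caller would use it roughly as: `if tracker.should_save(val_loss): save(path); tracker.update(path, val_loss, os.remove)`.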
2 changes: 1 addition & 1 deletion tests/test_a_restore_models.py
@@ -131,7 +131,7 @@ def test_load_model_from_checkpoint():
# correct result and ok accuracy
assert result == 1, 'training failed to complete'
pretrained_model = LightningTestModel.load_from_checkpoint(
os.path.join(trainer.checkpoint_callback.filepath, "_ckpt_epoch_1.ckpt")
os.path.join(trainer.checkpoint_callback.filepath, "_ckpt_epoch_0.ckpt")
)

# test that hparams loaded correctly