
Early Stopping stops too early when using SLURM #2038

Closed
Dunrar opened this issue Jun 1, 2020 · 7 comments
Labels: help wanted (Open to be worked on)

Comments


Dunrar commented Jun 1, 2020

🐛 Bug

I have a really strange bug where the Early Stopping callback seems to fire too early, but only when I use my university's Slurm cluster. When I train the same model locally on my laptop, this does not happen. Sadly, I can't run the code directly on the login node to see whether it happens on all of their systems or only when Slurm is being used. What is really strange: when I use a higher patience, training lasts longer, but early stopping never stops training sooner than hparams.patience/2 (in fact it happens weirdly close to hparams.patience/2) and almost never as late as hparams.patience. I tried to create a minimal working example; the code is below.

To Reproduce

Steps to reproduce the behavior:

  1. Create a custom EarlyStopping callback and use it to initialise the trainer
  2. Run the code on a Slurm cluster

Code sample

import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping
from pytorch_lightning.loggers import test_tube  # provides TestTubeLogger
from test_tube import HyperOptArgumentParser


class RNNLightning(pl.LightningModule):
    def __init__(self, hp):
        super(RNNLightning, self).__init__()
        self.sequence_length = hp.seq_len
        self.input_size = hp.inp_size
        self.hidden_size = hp.hidden_size
        self.num_layers = hp.num_layers
        self.learning_rate = hp.learning_rate
        self.batch_size = hp.batch_size
        self.lstm = nn.LSTM(hp.inp_size, hp.hidden_size, hp.num_layers, batch_first=True)
        self.fc = nn.Linear(hp.hidden_size, hp.num_classes)
        self.training_losses = []
    
    def forward(self, x):
        # Set initial hidden and cell states 
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        
        # Forward propagate LSTM
        out, _ = self.lstm(x, (h0, c0))  # out: tensor of shape (batch_size, seq_length, hidden_size)
        
        # Decode the hidden state of the last time step
        out = self.fc(out[:, -1, :])
        return out
    
    def training_step(self, batch, batch_idx):
        images, labels = batch
        images = images.reshape(-1, self.sequence_length, self.input_size)
        outputs = self(images)
        criterion = nn.CrossEntropyLoss()
        loss = criterion(outputs, labels)
        # Saving loss for epoch-wise logging
        self.training_losses.append(loss.item())
        return {'loss': loss}

    def on_epoch_end(self):
        # Logging mean loss of epoch
        train_loss_mean = np.mean(self.training_losses)
        self.logger.experiment.log({'epoch/mean_loss': train_loss_mean, 'epoch': self.current_epoch}, global_step=self.current_epoch)
        self.training_losses = []  # reset for next epoch

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer

    def train_dataloader(self):
        train_dataset = torchvision.datasets.MNIST(root='data', train=True, transform=transforms.ToTensor(), download=True)
        train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=self.batch_size, shuffle=True)
        return train_loader

    @staticmethod
    def add_model_specific_args(parent_parser):
        model_parser = HyperOptArgumentParser(parents=[parent_parser])
        model_parser.add_argument('--seq_len', default=28, type=int)
        model_parser.add_argument('--inp_size', default=28, type=int)
        model_parser.add_argument('--hidden_size', default=128, type=int)
        model_parser.add_argument('--num_layers', default=2, type=int)
        model_parser.add_argument('--num_classes', default=10, type=int)
        model_parser.add_argument('--batch_size', default=100, type=int)
        model_parser.add_argument('--num_epochs', default=30, type=int)
        model_parser.add_argument('--learning_rate', default=0.1, type=float)
        model_parser.add_argument('--patience', default=6, type=int)
        model_parser.add_argument('--min_delta', default=0.9, type=float)

        return model_parser


def main(hparams):
    print(hparams)

    model = RNNLightning(hparams)
    model.parameters()
    testtube_logger = test_tube.TestTubeLogger(
        name='test',
        save_dir='logs'
    )
    early_stopping = EarlyStopping(
        monitor='loss',
        min_delta=hparams.min_delta,
        # TODO: Find out why early stopping stops too early
        patience=hparams.patience,
        mode='min'
    )

    trainer = pl.Trainer(
        logger=testtube_logger,
        max_epochs=hparams.num_epochs,
        row_log_interval=hparams.batch_size,
        log_save_interval=hparams.batch_size,
        early_stop_callback=early_stopping,
        gpus=None
    )
    trainer.fit(model)


if __name__ == '__main__':
    main_arg_parser = HyperOptArgumentParser(description="parser for min_example", add_help=False)
    parser = RNNLightning.add_model_specific_args(main_arg_parser)
    hyperparams = parser.parse_args()
    main(hyperparams)

And here is my .sh file, which I submit via sbatch slurm_script.sh:

#!/bin/bash

#SBATCH -e logs/early-stopping-test.err
#SBATCH -o logs/early-stopping-test.out
#SBATCH -J early-stopping

#SBATCH --partition=All
#SBATCH --time=0-02:00:00



export PATH=~/anaconda3/bin:$PATH
###
source activate pytorch-bac
~/anaconda3/envs/pytorch-bac/bin/python min_example.py

Expected behavior

Training should last at least as many epochs as the patience value of the EarlyStopping callback.

I'm using PyTorch Lightning 0.7.7.dev0.

Dunrar added the help wanted label on Jun 1, 2020

Dunrar commented Jun 1, 2020

Updated Lightning to the current master; now early stopping doesn't work at all.

HansBambel (Contributor) commented

When you are using 0.7.6, early stopping is called twice in the training loop, so a patience of 50 is effectively 25. See #1751

This was supposed to be fixed by now, but there are still some issues with it.
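
A toy sketch (a hypothetical helper, not Lightning's internals) of why a patience check that runs twice per training loop stops after roughly half the configured patience:

def epochs_until_stop(patience, checks_per_epoch):
    # Count epochs until the wait counter hits the patience limit,
    # assuming the monitored loss never improves.
    wait, epoch = 0, 0
    while True:
        epoch += 1
        for _ in range(checks_per_epoch):
            wait += 1
            if wait >= patience:
                return epoch

print(epochs_until_stop(patience=6, checks_per_epoch=1))  # 6 epochs
print(epochs_until_stop(patience=6, checks_per_epoch=2))  # 3 epochs

Until the fix lands, passing roughly double the intended patience to EarlyStopping should compensate on the affected versions (an assumption based on the doubled check described above).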


Dunrar commented Jun 2, 2020

@HansBambel thanks. Strange that it worked locally but not on the cluster, though. Maybe I was using a slightly different version of Lightning locally. Should I close this and open a new bug report because early stopping is not working at all right now? I had a similar problem before because I didn't use a val_step; maybe something like that crept in again?

HansBambel (Contributor) commented

I think having no val_step could definitely be an issue, since early stopping relies on the validation metric (to my knowledge).
I haven't tested with the latest master, so I don't know whether early stopping stopped working altogether. If so, I think you should open a bug report, yes.
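
As a minimal sketch of that suggestion, assuming the 0.7.x LightningModule API and reusing the names from the code sample above (the MNIST test split and the 'val_loss' key are illustrative choices, not part of the original report), RNNLightning could gain a validation step so early stopping has a validation metric to watch:

    def validation_step(self, batch, batch_idx):
        images, labels = batch
        images = images.reshape(-1, self.sequence_length, self.input_size)
        outputs = self(images)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        return {'val_loss': loss}

    def validation_epoch_end(self, outputs):
        # Keys returned here become visible to callbacks such as EarlyStopping.
        avg_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
        return {'val_loss': avg_loss, 'log': {'val_loss': avg_loss}}

    def val_dataloader(self):
        val_dataset = torchvision.datasets.MNIST(root='data', train=False, transform=transforms.ToTensor(), download=True)
        return torch.utils.data.DataLoader(dataset=val_dataset, batch_size=self.batch_size)

and the callback would then monitor that metric:

early_stopping = EarlyStopping(monitor='val_loss', min_delta=hparams.min_delta, patience=hparams.patience, mode='min')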


Dunrar commented Jun 2, 2020

@HansBambel Okay, thanks, will do. I'll leave this open until I've tested it, once early stopping works again at all. Early stopping is supposed to also work with metrics from the training step.
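
A rough sketch of that alternative, assuming the 0.7.x behaviour that metrics returned under the 'log' key of training_step end up in trainer.callback_metrics (which is where EarlyStopping looks); the 'train_loss' key is an illustrative choice:

    def training_step(self, batch, batch_idx):
        images, labels = batch
        images = images.reshape(-1, self.sequence_length, self.input_size)
        outputs = self(images)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        # Expose the metric under 'log' as well as returning it, so callbacks can see it.
        return {'loss': loss, 'log': {'train_loss': loss}}

early_stopping = EarlyStopping(monitor='train_loss', patience=hparams.patience, mode='min')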

HansBambel (Contributor) commented

Alright!

williamFalcon (Contributor) commented

Closed via #2119
