
Early Stopping stops too early when using SLURM #2038

Closed
Dunrar opened this issue Jun 1, 2020 · 7 comments
Labels: help wanted (Open to be worked on)

Comments


Dunrar commented Jun 1, 2020

🐛 Bug

I have a really strange bug where the Early Stopping callback seems to fire too early, but only when I use my university's Slurm cluster. When I train the same model locally on my laptop, this does not happen. Sadly, I can't run the code directly on the login node to see whether it happens on all of their systems or only when Slurm is being used. What is really strange: when I use a higher patience, training lasts longer, but early stopping never stops training sooner than hparams.patience/2 (in fact it happens weirdly close to hparams.patience/2) and almost never as late as hparams.patience. I tried to create a minimal working example; the code is below.

To Reproduce

Steps to reproduce the behavior:

  1. Create a custom EarlyStopping callback and use it to initialise the trainer
  2. Run the code on a Slurm cluster

Code sample

import numpy as np
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping
from pytorch_lightning.loggers import test_tube  # provides TestTubeLogger
from test_tube import HyperOptArgumentParser


class RNNLightning(pl.LightningModule):
    def __init__(self, hp):
        super(RNNLightning, self).__init__()
        self.sequence_length = hp.seq_len
        self.input_size = hp.inp_size
        self.hidden_size = hp.hidden_size
        self.num_layers = hp.num_layers
        self.learning_rate = hp.learning_rate
        self.batch_size = hp.batch_size
        self.lstm = nn.LSTM(hp.inp_size, hp.hidden_size, hp.num_layers, batch_first=True)
        self.fc = nn.Linear(hp.hidden_size, hp.num_classes)
        self.training_losses = []
    
    def forward(self, x):
        # Set initial hidden and cell states 
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size)
        
        # Forward propagate LSTM
        out, _ = self.lstm(x, (h0, c0))  # out: tensor of shape (batch_size, seq_length, hidden_size)
        
        # Decode the hidden state of the last time step
        out = self.fc(out[:, -1, :])
        return out
    
    def training_step(self, batch, batch_idx):
        images, labels = batch
        images = images.reshape(-1, self.sequence_length, self.input_size)
        outputs = self(images)
        criterion = nn.CrossEntropyLoss()
        loss = criterion(outputs, labels)
        # Saving loss for epoch-wise logging
        self.training_losses.append(loss.item())
        return {'loss': loss}

    def on_epoch_end(self):
        # Logging mean loss of epoch
        train_loss_mean = np.mean(self.training_losses)
        self.logger.experiment.log({'epoch/mean_loss': train_loss_mean, 'epoch': self.current_epoch}, global_step=self.current_epoch)
        self.training_losses = []  # reset for next epoch

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer

    def train_dataloader(self):
        train_dataset = torchvision.datasets.MNIST(root='data', train=True, transform=transforms.ToTensor(), download=True)
        train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=self.batch_size, shuffle=True)
        return train_loader

    @staticmethod
    def add_model_specific_args(parent_parser):
        model_parser = HyperOptArgumentParser(parents=[parent_parser])
        model_parser.add_argument('--seq_len', default=28, type=int)
        model_parser.add_argument('--inp_size', default=28, type=int)
        model_parser.add_argument('--hidden_size', default=128, type=int)
        model_parser.add_argument('--num_layers', default=2, type=int)
        model_parser.add_argument('--num_classes', default=10, type=int)
        model_parser.add_argument('--batch_size', default=100, type=int)
        model_parser.add_argument('--num_epochs', default=30, type=int)
        model_parser.add_argument('--learning_rate', default=0.1, type=float)
        model_parser.add_argument('--patience', default=6, type=int)
        model_parser.add_argument('--min_delta', default=0.9, type=float)

        return model_parser


def main(hparams):
    print(hparams)

    model = RNNLightning(hparams)
    model.parameters()
    testtube_logger = test_tube.TestTubeLogger(
        name='test',
        save_dir='logs'
    )
    early_stopping = EarlyStopping(
        monitor='loss',
        min_delta=hparams.min_delta,
        # TODO: Find out why early stopping stops too early
        patience=hparams.patience,
        mode='min'
    )

    trainer = pl.Trainer(
        logger=testtube_logger,
        max_epochs=hparams.num_epochs,
        row_log_interval=hparams.batch_size,
        log_save_interval=hparams.batch_size,
        early_stop_callback=early_stopping,
        gpus=None
    )
    trainer.fit(model)


if __name__ == '__main__':
    main_arg_parser = HyperOptArgumentParser(description="parser for min_example", add_help=False)
    parser = RNNLightning.add_model_specific_args(main_arg_parser)
    hyperparams = parser.parse_args()
    main(hyperparams)

And here is my .sh file, which I submit via sbatch slurm_script.sh:

#!/bin/bash

#SBATCH -e logs/early-stopping-test.err
#SBATCH -o logs/early-stopping-test.out
#SBATCH -J early-stopping

#SBATCH --partition=All
#SBATCH --time=0-02:00:00



export PATH=~/anaconda3/bin:$PATH
###
source activate pytorch-bac
~/anaconda3/envs/pytorch-bac/bin/python min_example.py

Expected behavior

Training should last at least as many epochs as the patience value of the EarlyStopping callback.

I'm using PyTorch Lightning 0.7.7.dev0.

Dunrar added the help wanted label on Jun 1, 2020

Dunrar commented Jun 1, 2020

Updated Lightning to the current master; now early stopping doesn't work at all.

HansBambel (Contributor) commented

When you are using 0.7.6, early stopping is called twice in the training loop, so a patience of 50 is effectively 25. See #1751

This was supposed to be fixed by now, but there are still some issues with it.
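
A toy sketch (a hypothetical helper, not Lightning's internals) of why a patience check that runs twice per training loop stops after roughly half the configured patience:

def epochs_until_stop(patience, checks_per_epoch):
    # Count epochs until the wait counter hits the patience limit,
    # assuming the monitored loss never improves.
    wait, epoch = 0, 0
    while True:
        epoch += 1
        for _ in range(checks_per_epoch):
            wait += 1
            if wait >= patience:
                return epoch

print(epochs_until_stop(patience=6, checks_per_epoch=1))  # 6 epochs
print(epochs_until_stop(patience=6, checks_per_epoch=2))  # 3 epochs

Until the fix lands, passing roughly double the intended patience to EarlyStopping should compensate on the affected versions (an assumption based on the doubled check described above).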


Dunrar commented Jun 2, 2020

@HansBambel thanks. Strange that it worked locally but not on the cluster, though. Maybe I was using a slightly different version of Lightning locally. Should I close this and open a new bug report because early stopping is not working at all right now? I had a similar problem before because I didn't use a val_step; maybe something like that crept in again?

HansBambel (Contributor) commented

I think having no val_step could definitely be an issue, since early stopping relies on the validation metric (to my knowledge).
I haven't tested with the latest master, so I don't know whether early stopping stopped working altogether. If so, I think you should open a bug report, yes.
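
As a minimal sketch of that suggestion, assuming the 0.7.x LightningModule API and reusing the names from the code sample above (the MNIST test split and the 'val_loss' key are illustrative choices, not part of the original report), RNNLightning could gain a validation step so early stopping has a validation metric to watch:

    def validation_step(self, batch, batch_idx):
        images, labels = batch
        images = images.reshape(-1, self.sequence_length, self.input_size)
        outputs = self(images)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        return {'val_loss': loss}

    def validation_epoch_end(self, outputs):
        # Keys returned here become visible to callbacks such as EarlyStopping.
        avg_loss = torch.stack([o['val_loss'] for o in outputs]).mean()
        return {'val_loss': avg_loss, 'log': {'val_loss': avg_loss}}

    def val_dataloader(self):
        val_dataset = torchvision.datasets.MNIST(root='data', train=False, transform=transforms.ToTensor(), download=True)
        return torch.utils.data.DataLoader(dataset=val_dataset, batch_size=self.batch_size)

and the callback would then monitor that metric:

early_stopping = EarlyStopping(monitor='val_loss', min_delta=hparams.min_delta, patience=hparams.patience, mode='min')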


Dunrar commented Jun 2, 2020

@HansBambel Okay, thanks, will do. I'll leave this open until I've tested it, once early stopping works again at all. Early stopping is supposed to also work with metrics from the training step.
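
A rough sketch of that alternative, assuming the 0.7.x behaviour that metrics returned under the 'log' key of training_step end up in trainer.callback_metrics (which is where EarlyStopping looks); the 'train_loss' key is an illustrative choice:

    def training_step(self, batch, batch_idx):
        images, labels = batch
        images = images.reshape(-1, self.sequence_length, self.input_size)
        outputs = self(images)
        loss = nn.CrossEntropyLoss()(outputs, labels)
        # Expose the metric under 'log' as well as returning it, so callbacks can see it.
        return {'loss': loss, 'log': {'train_loss': loss}}

early_stopping = EarlyStopping(monitor='train_loss', patience=hparams.patience, mode='min')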

HansBambel (Contributor) commented

Alright!

williamFalcon (Contributor) commented

Closed via #2119
