
[WIP] Reduction when batch size < num gpus #1609

Merged: 4 commits into master, May 2, 2020

Conversation

@awaelchli (Contributor) commented Apr 26, 2020

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

This fixes a problem where the metrics don't get reduced if batch size < num gpus #1218.
However I don't know if this is the correct thing to do, since PL explicitly checks that batch_size == num_gpus. Is there a reason for this?

This bug also happens when the batch size dynamically changes during training, or when drop_last = False and the last batch happens to be smaller than num_gpus.
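
A rough sketch of the kind of DP reduction step this PR touches (the function name and exact conditions below are illustrative assumptions, not the actual Lightning source):

import torch

def reduce_distributed_output(output, num_gpus):
    """Average per-GPU scalars gathered by DP back into single scalars (sketch)."""
    if num_gpus <= 1:
        return output
    for k, v in output.items():
        if isinstance(v, dict):
            output[k] = reduce_distributed_output(v, num_gpus)
        elif isinstance(v, torch.Tensor) and v.dim() > 0 and v.size(0) <= num_gpus:
            # before this fix the check was effectively `size(0) == num_gpus`, so a
            # last batch smaller than the number of GPUs was never averaged and later
            # crashed when `.item()` was called on the multi-element tensor
            output[k] = v.mean()
    return output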

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@awaelchli (Contributor, Author):

@williamFalcon what are your thoughts on this one?

@awaelchli awaelchli added the bug Something isn't working label Apr 26, 2020
@awaelchli awaelchli marked this pull request as draft April 26, 2020 07:36
@codecov (bot) commented Apr 26, 2020

Codecov Report

Merging #1609 into master will decrease coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #1609   +/-   ##
======================================
- Coverage      88%     88%   -0%     
======================================
  Files          69      69           
  Lines        4133    4132    -1     
======================================
- Hits         3656    3655    -1     
  Misses        477     477           

@Borda (Member) commented Apr 26, 2020

It would be nice to verify the bug with a test; do you have a minimal example for it?

@awaelchli (Contributor, Author):

I verified that the test I added here fails on master but passes with the fix. However, demonstrating it needs at least 3 GPUs, so CI won't help much :)

@awaelchli awaelchli marked this pull request as ready for review April 27, 2020 17:38
@awaelchli awaelchli changed the title DP reduction when batch size < num gpus Reduction when batch size < num gpus Apr 27, 2020
@Borda (Member) left a review:

LGTM 🦝

@Borda Borda added this to the 0.7.6 milestone Apr 30, 2020
@Borda Borda added the ready PRs ready to be merged label Apr 30, 2020
@mergify (bot) commented Apr 30, 2020

This pull request is now in conflict... :(

num_gpus = 3
batch_size = 3

class CurrentTestModel(

Review comment on the diff above (Contributor):

@awaelchli @Borda this needs the new test syntax not mixins...

@williamFalcon (Contributor) left a review:

use latest test syntax please

@awaelchli (Contributor, Author):

@williamFalcon I was not aware of any new syntax; I will have a look.
As for the bugfix itself, is this the correct place to fix it?

elif output[k].size(0) == num_gpus:
    reduced = torch.mean(output[k])
    output[k] = reduced
# do not reduce metrics that have batch size > num gpus

Review comment on the diff above (Contributor):

batch size has nothing to do with dp.... why is this fix even needed?

size(0) should be the number of GPUs in DP... NOT batch size

@awaelchli (Contributor, Author) replied May 2, 2020:

Simple example: batch_size = 2, num_gpus = 3. Lightning will forward the batch on only 2 GPUs, so there are only 2 outputs and size(0) = 2. Therefore Lightning will not reduce the output, and we get a problem later when the progress bar metrics call .item() on that tensor.

Is my explanation correct or not?
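
A standalone illustration of the failure mode described above (plain PyTorch, not Lightning code):

import torch

num_gpus = 3
# DP gathered one scalar per *active* replica; only 2 replicas ran because the
# batch had 2 samples, so the gathered tensor has size(0) == 2
gathered = torch.tensor([0.51, 0.47])

# the reduction only kicks in when size(0) == num_gpus, so this tensor is skipped
if gathered.size(0) == num_gpus:
    gathered = gathered.mean()

# the progress bar metrics then call .item() on it:
gathered.item()  # ValueError: only one element tensors can be converted to Python scalars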

@williamFalcon (Contributor) commented May 2, 2020

split_batch = ...   # the batch split into one chunk per GPU (pseudocode)
outs = []
for batch_split, model_copy in zip(split_batch, model_copies):
    out = model_copy(batch_split)
    outs.append(out)

outs = torch.stack(outs, dim=0)
# size(0) is the number of GPUs NOT batch size...
# please debug this live to convince yourself.

this has been working forever... can you give me a colab or an example where this is a problem?
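
A small experiment that gets at the disagreement here (it needs a machine with at least 3 GPUs): with torch.nn.DataParallel, the gathered output has one entry per replica that actually received data, i.e. min(batch_size, num_gpus), not always num_gpus.

import torch
import torch.nn as nn

class ScalarModel(nn.Module):
    def forward(self, x):
        return x.sum()  # one scalar per replica, like a per-GPU loss

model = nn.DataParallel(ScalarModel().cuda(), device_ids=[0, 1, 2])
out = model(torch.zeros(2, 32).cuda())  # a batch of 2 samples scattered over 3 devices
print(out.shape)  # torch.Size([2]) -- only 2 replicas got a chunk, so size(0) == 2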

@awaelchli (Contributor, Author) commented May 2, 2020

Ok, thanks.
To show the issue I added the test; it fails on master when batch_size < num_gpus, which can happen e.g. when drop_last=False is set in the dataloader.
I will show it in Colab.

@awaelchli awaelchli changed the title Reduction when batch size < num gpus [WIP] Reduction when batch size < num gpus May 2, 2020
@awaelchli (Contributor, Author) commented May 2, 2020

@williamFalcon Here is the minimal example.

from argparse import Namespace
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import DataLoader, TensorDataset

from pytorch_lightning import Trainer
from pytorch_lightning.core import LightningModule


class LightningTemplateModel(LightningModule):

    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams
        self.layer = nn.Linear(32, 32)

    def forward(self, x):
        # dummy forward
        return self.layer(x).sum()

    def loss(self, labels, logits):
        nll = F.nll_loss(logits, labels)
        return nll

    def training_step(self, batch, batch_idx):
        loss = self.forward(*batch)
        output = OrderedDict({
            'loss': loss,
            'progress_bar': {'some_val': loss},  # note: if we comment out this line, everything works
        })
        return output

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters())
        return optimizer

    def train_dataloader(self):
        num_gpus = self.hparams.num_gpus
        batch_size = self.hparams.batch_size
        # construct a dataset whose size is not divisible by the batch size,
        # so the last batch will be smaller than num_gpus
        size = num_gpus * batch_size + (num_gpus - 1)
        print(size)
        dataset = TensorDataset(torch.zeros(size, 32))
        loader = DataLoader(
            dataset=dataset,
            batch_size=self.hparams.batch_size,
            drop_last=False,
        )
        return loader


if __name__ == '__main__':
    hparams = Namespace(num_gpus=3, batch_size=3)
    model = LightningTemplateModel(hparams)
    trainer = Trainer(
        checkpoint_callback=False,
        early_stop_callback=False,
        gpus=hparams.num_gpus
    )
    trainer.fit(model)  # this will crash

The progress_bar metrics don't get reduced when size(0) < num_gpus, which happens for the very last batch here: it has size 2, but we have 3 GPUs. If you comment out the line I marked above, everything works fine.
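
A quick CPU-only check of the batch sizes produced by the dataloader in the example above (just the arithmetic, no GPUs needed):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(11, 32))  # size = 3 * 3 + (3 - 1) = 11
loader = DataLoader(dataset, batch_size=3, drop_last=False)
print([x.size(0) for (x,) in loader])  # [3, 3, 3, 2] -- the last batch is smaller than num_gpus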

@awaelchli (Contributor, Author):

Sorry, but Colab won't let me run with 3 GPUs, and to show this bug we need at least 3.

@williamFalcon williamFalcon merged commit e6b34ef into Lightning-AI:master May 2, 2020
@williamFalcon (Contributor):

OK, I see; this makes sense.
Let's fix the test model thing in a separate PR.

@awaelchli (Contributor, Author):

Thanks! Big relief.
Will adjust the test syntax ASAP!

@awaelchli awaelchli deleted the bugfix/reduce_batch branch May 2, 2020 18:44
@lolaclinton:

Can someone explain the logic of this? I fail to see why it should matter, and it causes inconsistent behavior depending on your batch size.

@awaelchli (Contributor, Author):

@lolaclinton I gave an example in this comment: #1609 (comment)

All of this logic only applies to DP. In data parallel we need to reduce the outputs returned from all GPUs, even when the batch size is smaller than the number of GPUs.

A batch size smaller than num_gpus is not optimal, but it can happen, for example, when the dataset size is not evenly divisible by the batch size and drop_last is not set in the dataloader.
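
To make the drop_last point concrete, a small CPU-only comparison (the numbers mirror the earlier minimal example):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(11, 32))  # 11 is not divisible by batch_size = 3
keep = [x.size(0) for (x,) in DataLoader(dataset, batch_size=3, drop_last=False)]
drop = [x.size(0) for (x,) in DataLoader(dataset, batch_size=3, drop_last=True)]
print(keep)  # [3, 3, 3, 2] -- the trailing batch of 2 hits the DP edge case on 3 GPUs
print(drop)  # [3, 3, 3]    -- dropping the remainder avoids it, at the cost of unused samples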
