
[WIP] Reduction when batch size < num gpus #1609

Merged: 4 commits into master, May 2, 2020

Conversation

@awaelchli (Contributor) commented Apr 26, 2020

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

This fixes a problem where the metrics don't get reduced if batch size < num gpus #1218.
However I don't know if this is the correct thing to do, since PL explicitly checks that batch_size == num_gpus. Is there a reason for this?

This bug also happens when the batch size dynamically changes during training, or when drop_last = False and the last batch happens to be smaller than num_gpus.
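
A rough sketch of the kind of DP reduction step this PR touches (the function name and exact conditions below are illustrative assumptions, not the actual Lightning source):

import torch

def reduce_distributed_output(output, num_gpus):
    """Average per-GPU scalars gathered by DP back into single scalars (sketch)."""
    if num_gpus <= 1:
        return output
    for k, v in output.items():
        if isinstance(v, dict):
            output[k] = reduce_distributed_output(v, num_gpus)
        elif isinstance(v, torch.Tensor) and v.dim() > 0 and v.size(0) <= num_gpus:
            # before this fix the check was effectively `size(0) == num_gpus`, so a
            # last batch smaller than the number of GPUs was never averaged and later
            # crashed when `.item()` was called on the multi-element tensor
            output[k] = v.mean()
    return output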

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@awaelchli (Contributor, Author):

@williamFalcon what are your thoughts on this one?

@awaelchli awaelchli added the bug Something isn't working label Apr 26, 2020
@awaelchli awaelchli marked this pull request as draft April 26, 2020 07:36
@codecov (bot) commented Apr 26, 2020

Codecov Report

Merging #1609 into master will decrease coverage by 0%.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #1609   +/-   ##
======================================
- Coverage      88%     88%   -0%     
======================================
  Files          69      69           
  Lines        4133    4132    -1     
======================================
- Hits         3656    3655    -1     
  Misses        477     477           

@Borda (Member) commented Apr 26, 2020

It would be nice to verify the bug with a test; do you have a minimal example for it?

@awaelchli (Contributor, Author):

I verified that the test I added here fails on master but passes with the fix. However, demonstrating it needs at least 3 GPUs, so CI won't help much :)

@awaelchli awaelchli marked this pull request as ready for review April 27, 2020 17:38
@awaelchli awaelchli changed the title DP reduction when batch size < num gpus Reduction when batch size < num gpus Apr 27, 2020
@Borda (Member) left a review:

LGTM 🦝

@Borda Borda added this to the 0.7.6 milestone Apr 30, 2020
@Borda Borda added the ready PRs ready to be merged label Apr 30, 2020
@mergify (bot) commented Apr 30, 2020

This pull request is now in conflict... :(

num_gpus = 3
batch_size = 3

class CurrentTestModel(

Review comment on the diff above (Contributor):

@awaelchli @Borda this needs the new test syntax not mixins...

@williamFalcon (Contributor) left a review:

use latest test syntax please

@awaelchli (Contributor, Author):

@williamFalcon I was not aware of any new syntax; I will have a look.
As for the bugfix itself, is this the correct place to fix it?

elif output[k].size(0) == num_gpus:
    reduced = torch.mean(output[k])
    output[k] = reduced
# do not reduce metrics that have batch size > num gpus

Review comment on the diff above (Contributor):

batch size has nothing to do with dp.... why is this fix even needed?

size(0) should be the number of GPUs in DP... NOT batch size

@awaelchli (Contributor, Author) replied May 2, 2020:

Simple example: batch_size = 2, num_gpus = 3. Lightning will forward the batch on only 2 GPUs, so there are only 2 outputs and size(0) = 2. Therefore Lightning will not reduce the output, and we get a problem later when the progress bar metrics call .item() on that tensor.

Is my explanation correct or not?
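
A standalone illustration of the failure mode described above (plain PyTorch, not Lightning code):

import torch

num_gpus = 3
# DP gathered one scalar per *active* replica; only 2 replicas ran because the
# batch had 2 samples, so the gathered tensor has size(0) == 2
gathered = torch.tensor([0.51, 0.47])

# the reduction only kicks in when size(0) == num_gpus, so this tensor is skipped
if gathered.size(0) == num_gpus:
    gathered = gathered.mean()

# the progress bar metrics then call .item() on it:
gathered.item()  # ValueError: only one element tensors can be converted to Python scalars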

@williamFalcon (Contributor) commented May 2, 2020

split_batch = ...   # the batch split into one chunk per GPU (pseudocode)
outs = []
for batch_split, model_copy in zip(split_batch, model_copies):
    out = model_copy(batch_split)
    outs.append(out)

outs = torch.stack(outs, dim=0)
# size(0) is the number of GPUs NOT batch size...
# please debug this live to convince yourself.

this has been working forever... can you give me a colab or an example where this is a problem?
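
A small experiment that gets at the disagreement here (it needs a machine with at least 3 GPUs): with torch.nn.DataParallel, the gathered output has one entry per replica that actually received data, i.e. min(batch_size, num_gpus), not always num_gpus.

import torch
import torch.nn as nn

class ScalarModel(nn.Module):
    def forward(self, x):
        return x.sum()  # one scalar per replica, like a per-GPU loss

model = nn.DataParallel(ScalarModel().cuda(), device_ids=[0, 1, 2])
out = model(torch.zeros(2, 32).cuda())  # a batch of 2 samples scattered over 3 devices
print(out.shape)  # torch.Size([2]) -- only 2 replicas got a chunk, so size(0) == 2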

@awaelchli (Contributor, Author) commented May 2, 2020

Ok, thanks.
To show the issue I added the test; it fails on master when batch_size < num_gpus, which can happen e.g. when drop_last=False is set in the dataloader.
I will show it in Colab.

@awaelchli awaelchli changed the title Reduction when batch size < num gpus [WIP] Reduction when batch size < num gpus May 2, 2020
@awaelchli (Contributor, Author) commented May 2, 2020

@williamFalcon Here is the minimal example.

from argparse import Namespace
from collections import OrderedDict

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import DataLoader, TensorDataset

from pytorch_lightning import Trainer
from pytorch_lightning.core import LightningModule


class LightningTemplateModel(LightningModule):

    def __init__(self, hparams):
        super().__init__()
        self.hparams = hparams
        self.layer = nn.Linear(32, 32)

    def forward(self, x):
        # dummy forward
        return self.layer(x).sum()

    def loss(self, labels, logits):
        nll = F.nll_loss(logits, labels)
        return nll

    def training_step(self, batch, batch_idx):
        loss = self.forward(*batch)
        output = OrderedDict({
            'loss': loss,
            'progress_bar': {'some_val': loss},  # note: if we comment out this line, everything works
        })
        return output

    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters())
        return optimizer

    def train_dataloader(self):
        num_gpus = self.hparams.num_gpus
        batch_size = self.hparams.batch_size
        # construct a dataset whose size is not divisible by the batch size,
        # so the last batch will be smaller than num_gpus
        size = num_gpus * batch_size + (num_gpus - 1)
        print(size)
        dataset = TensorDataset(torch.zeros(size, 32))
        loader = DataLoader(
            dataset=dataset,
            batch_size=self.hparams.batch_size,
            drop_last=False,
        )
        return loader


if __name__ == '__main__':
    hparams = Namespace(num_gpus=3, batch_size=3)
    model = LightningTemplateModel(hparams)
    trainer = Trainer(
        checkpoint_callback=False,
        early_stop_callback=False,
        gpus=hparams.num_gpus
    )
    trainer.fit(model)  # this will crash

The progress_bar metrics don't get reduced when size(0) < num_gpus, which happens for the very last batch here: it has size 2, but we have 3 GPUs. If you comment out the line I marked above, everything works fine.
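
A quick CPU-only check of the batch sizes produced by the dataloader in the example above (just the arithmetic, no GPUs needed):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(11, 32))  # size = 3 * 3 + (3 - 1) = 11
loader = DataLoader(dataset, batch_size=3, drop_last=False)
print([x.size(0) for (x,) in loader])  # [3, 3, 3, 2] -- the last batch is smaller than num_gpus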

@awaelchli (Contributor, Author):

Sorry, but Colab won't let me run with 3 GPUs, and to show this bug we need at least 3.

@williamFalcon williamFalcon merged commit e6b34ef into Lightning-AI:master May 2, 2020
@williamFalcon (Contributor):

OK, I see; this makes sense.
Let's fix the test model thing in a separate PR.

@awaelchli (Contributor, Author):

Thanks! Big relief.
Will adjust the test syntax ASAP!

@awaelchli awaelchli deleted the bugfix/reduce_batch branch May 2, 2020 18:44
@lolaclinton:

Can someone explain the logic of this? I fail to see why it should matter, and it causes inconsistent behavior depending on your batch size.

@awaelchli (Contributor, Author):

@lolaclinton I gave an example in this comment: #1609 (comment)

All of this logic only applies to DP. In data parallel we need to reduce the outputs returned from all GPUs, even when the batch size is smaller than the number of GPUs.

A batch size smaller than num_gpus is not optimal, but it can happen, for example, when the dataset size is not evenly divisible by the batch size and drop_last is not set in the dataloader.
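
To make the drop_last point concrete, a small CPU-only comparison (the numbers mirror the earlier minimal example):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(11, 32))  # 11 is not divisible by batch_size = 3
keep = [x.size(0) for (x,) in DataLoader(dataset, batch_size=3, drop_last=False)]
drop = [x.size(0) for (x,) in DataLoader(dataset, batch_size=3, drop_last=True)]
print(keep)  # [3, 3, 3, 2] -- the trailing batch of 2 hits the DP edge case on 3 GPUs
print(drop)  # [3, 3, 3]    -- dropping the remainder avoids it, at the cost of unused samples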
