slow new epoch start with setting ddp, num_workers, gpus #1884
-
❓ Questions and Help

What is your question?

I am training MNIST with the code below. Training on 1 GPU is OK, but each new epoch starts slowly when using ddp. Code (abridged):

```python
import torch

class LightningMNISTClassifier(pl.LightningModule):
    ...

if __name__ == '__main__':
    ...
```

What have you tried?

The Horovod backend does not show the slow start of new epochs.

What's your environment?
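For context, a run like the one described is usually launched roughly as follows. This is a hedged sketch, not the poster's actual code: the variable names (`mnist_train`, `model`) and all parameter values are assumptions, and `distributed_backend="ddp"` was the Trainer argument for DDP in Lightning versions of that era.

```python
# Sketch only -- names and values are assumptions, not the poster's settings.
train_loader = torch.utils.data.DataLoader(
    mnist_train,          # assumed: an MNIST Dataset instance
    batch_size=128,
    num_workers=8,        # one of the knobs the title says affects the slow start
)

trainer = pl.Trainer(
    gpus=2,                      # multi-GPU, as in the title
    distributed_backend="ddp",   # DDP backend, as in the title
    max_epochs=5,
)
trainer.fit(model, train_loader)
```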
Replies: 4 comments 1 reply
-
I experience similar things: when running with ddp, it seems the higher num_workers is, the longer it takes before data starts reaching the GPUs.
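A startup pause that scales with num_workers is consistent with the worker processes being re-created at every epoch boundary. A stdlib-only sketch of that fixed cost (this is not Lightning code; `_worker` and `epoch_startup_time` are made-up names mimicking what a non-persistent DataLoader pays per epoch):

```python
import time
from multiprocessing import Process, Queue

def _worker(q):
    # Stand-in for a DataLoader worker process: start up, signal
    # readiness, and exit.
    q.put("ready")

def epoch_startup_time(num_workers):
    """Measure how long it takes to spin up `num_workers` fresh
    processes and hear back from all of them, mimicking the
    per-epoch worker (re)creation of a non-persistent DataLoader."""
    start = time.perf_counter()
    q = Queue()
    procs = [Process(target=_worker, args=(q,)) for _ in range(num_workers)]
    for p in procs:
        p.start()
    for _ in procs:
        q.get()  # block until every worker has reported in
    for p in procs:
        p.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    for n in (1, 4, 8):
        print(f"num_workers={n}: startup took {epoch_startup_time(n):.3f}s")
```

For what it's worth, newer PyTorch releases than the ones in this thread added `DataLoader(..., persistent_workers=True)`, which keeps workers alive across epochs and avoids exactly this per-epoch cost.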
-
@mpaepper Your comment may be right. At the start of epoch 2 I see:

```
Epoch 2:   0%|          | 0/470 [00:00<?, ?it/s, loss=0.180, v_num=113]
/opt/conda/lib/python3.7/site-packages/torchvision/io/video.py:2: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
```
-
I found that the slow DeprecationWarnings shown above are caused by the torchvision library. I switched to a simple dataset and the slow start has disappeared so far.
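This fits the re-created-workers picture: every freshly spawned worker re-imports the dataset's modules, so a heavyweight library inflates each epoch's startup. A stdlib sketch that makes the cost visible (`fresh_import_time` is a made-up helper, not a real API):

```python
import subprocess
import sys
import time

def fresh_import_time(module_name):
    """Time how long a brand-new Python interpreter needs to import
    `module_name` -- roughly the import cost each freshly spawned
    DataLoader worker pays again."""
    start = time.perf_counter()
    subprocess.run(
        [sys.executable, "-c", f"import {module_name}"],
        check=True,
    )
    return time.perf_counter() - start

if __name__ == "__main__":
    # Compare a light stdlib module against any heavy library you have
    # installed (e.g. torchvision) to see the difference.
    print(f"import json took {fresh_import_time('json'):.3f}s")
```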
-
The issue is not due to PyTorch Lightning itself.