Summing multiple losses with single machine ddp #1846

Closed
jamesjjcondon opened this issue May 15, 2020 · 2 comments
Labels
bug Something isn't working

Comments

@jamesjjcondon
Contributor

jamesjjcondon commented May 15, 2020

Hi,

I'm summing multiple different losses using DDP on a single machine with 2 GPUs.
I've been struggling to reduce my loss to zero as a sanity check on a subset of my images.
Is there something I should be calling to synchronise the loss across GPUs?
I've done this with MNIST, no worries.

My model output is a dictionary with 8 components and I'm calling F.nll_loss on each of them before summing. (One training example consists of 4 images, and each example can have zero, one, or two classes.)

Code

Both my training and validation steps look like this:

x, y = batch
out = self.forward(x)

# One NLL loss per view ('CC' / 'MLO') and per target key ('ben' / 'ca'), per side
loss1 = F.nll_loss(out['CC'][:, 0], y['L-CC']['ben'])
loss2 = F.nll_loss(out['CC'][:, 1], y['R-CC']['ben'])
loss3 = F.nll_loss(out['CC'][:, 2], y['L-CC']['ca'])
loss4 = F.nll_loss(out['CC'][:, 3], y['R-CC']['ca'])
loss5 = F.nll_loss(out['MLO'][:, 0], y['L-MLO']['ben'])
loss6 = F.nll_loss(out['MLO'][:, 1], y['R-MLO']['ben'])
loss7 = F.nll_loss(out['MLO'][:, 2], y['L-MLO']['ca'])
loss8 = F.nll_loss(out['MLO'][:, 3], y['R-MLO']['ca'])

lossCa = loss3 + loss4 + loss7 + loss8
lossb = loss1 + loss2 + loss5 + loss6

train_loss = lossCa + lossb

What have you tried?

I've tried running the following on each loss (before the sum):

losses = [loss1, loss2, loss3, loss4, loss5, loss6, loss7, loss8]

for i, loss in enumerate(losses):
    # all_reduce sums in place and returns None, so don't reassign its result
    dist.all_reduce(loss)
    losses[i] = loss / dist.get_world_size()

and, after the sum:

dist.all_reduce(train_loss)            # in-place sum across processes
train_loss /= dist.get_world_size()    # then average

Neither makes any difference.
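For reference, here is a minimal sketch (not from the original post) of how a scalar loss could be averaged across processes with torch.distributed, e.g. for logging or a sanity check. The helper name average_across_processes is made up, and as the reply below points out, none of this is needed for training itself, since DDP syncs the gradients:

import torch
import torch.distributed as dist

def average_across_processes(loss: torch.Tensor) -> torch.Tensor:
    """Average a scalar loss tensor across all DDP processes (logging/sanity checks only)."""
    # all_reduce sums into the tensor in place and returns None for the
    # synchronous call, so work on a detached copy and don't reassign its result.
    reduced = loss.detach().clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    reduced /= dist.get_world_size()
    return reduced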

What's your environment?

  • OS: Ubuntu 18.04
  • Packaging: pip
    torch 1.5.0
    torchvision 0.6.0
  • PL version: happens with both 0.7.1 and 0.7.2

Any tips / thoughts much appreciated. Cheers.

@jamesjjcondon jamesjjcondon added the question Further information is requested label May 15, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@williamFalcon
Contributor

williamFalcon commented May 15, 2020

In DDP you don't need to call dist.all_reduce; each process is independent of the others.

Just return:
{'loss': train_loss} from the training_step.

What syncs in DDP are the gradients: each process calculates its own loss and gradients.
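A minimal sketch of that pattern, assuming a LightningModule whose forward produces the 'CC'/'MLO' outputs from the question (the class name MammoModule and the pairs table are illustrative, not the author's code; PL 0.7.x-style dict return, and the validation_step would mirror it):

import pytorch_lightning as pl
import torch.nn.functional as F

class MammoModule(pl.LightningModule):  # hypothetical module standing in for the author's
    def training_step(self, batch, batch_idx):
        x, y = batch
        out = self.forward(x)
        # Same per-view / per-target NLL losses as in the question.
        pairs = [('CC', 0, 'L-CC', 'ben'), ('CC', 1, 'R-CC', 'ben'),
                 ('CC', 2, 'L-CC', 'ca'),  ('CC', 3, 'R-CC', 'ca'),
                 ('MLO', 0, 'L-MLO', 'ben'), ('MLO', 1, 'R-MLO', 'ben'),
                 ('MLO', 2, 'L-MLO', 'ca'),  ('MLO', 3, 'R-MLO', 'ca')]
        train_loss = sum(F.nll_loss(out[view][:, col], y[side][key])
                         for view, col, side, key in pairs)
        # No manual all_reduce: each DDP process returns its own loss, and the
        # gradients are averaged across processes during backward().
        return {'loss': train_loss}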

@Borda Borda added bug Something isn't working and removed question Further information is requested labels Dec 23, 2020