Summing multiple losses with single machine ddp #1846

Closed
jamesjjcondon opened this issue May 15, 2020 · 2 comments
Labels
bug Something isn't working

Comments

@jamesjjcondon
Contributor

jamesjjcondon commented May 15, 2020

Hi,

I'm summing multiple different losses using DDP on a single machine with 2 GPUs.
I've been struggling to reduce my loss to zero as a sanity check on a subset of my images.
Is there something I should be calling to synchronise the loss across GPUs?
I've done this with MNIST, no worries.

My model output is a dictionary with 8 components and I'm calling F.nll_loss on each of them before summing. (One training example consists of 4 images, and each example can have zero, one, or two classes.)

Code

Both my training and validation steps look like this:

x, y = batch
out = self.forward(x)

# One NLL loss per view ('CC' / 'MLO') and per target key ('ben' / 'ca'), per side
loss1 = F.nll_loss(out['CC'][:, 0], y['L-CC']['ben'])
loss2 = F.nll_loss(out['CC'][:, 1], y['R-CC']['ben'])
loss3 = F.nll_loss(out['CC'][:, 2], y['L-CC']['ca'])
loss4 = F.nll_loss(out['CC'][:, 3], y['R-CC']['ca'])
loss5 = F.nll_loss(out['MLO'][:, 0], y['L-MLO']['ben'])
loss6 = F.nll_loss(out['MLO'][:, 1], y['R-MLO']['ben'])
loss7 = F.nll_loss(out['MLO'][:, 2], y['L-MLO']['ca'])
loss8 = F.nll_loss(out['MLO'][:, 3], y['R-MLO']['ca'])

lossCa = loss3 + loss4 + loss7 + loss8
lossb = loss1 + loss2 + loss5 + loss6

train_loss = lossCa + lossb

What have you tried?

I've tried running the following on each loss (before the sum):

losses = [loss1, loss2, loss3, loss4, loss5, loss6, loss7, loss8]

for i, loss in enumerate(losses):
    # all_reduce sums in place and returns None, so don't reassign its result
    dist.all_reduce(loss)
    losses[i] = loss / dist.get_world_size()

and, after the sum:

dist.all_reduce(train_loss)            # in-place sum across processes
train_loss /= dist.get_world_size()    # then average

Neither makes any difference.
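For reference, here is a minimal sketch (not from the original post) of how a scalar loss could be averaged across processes with torch.distributed, e.g. for logging or a sanity check. The helper name average_across_processes is made up, and as the reply below points out, none of this is needed for training itself, since DDP syncs the gradients:

import torch
import torch.distributed as dist

def average_across_processes(loss: torch.Tensor) -> torch.Tensor:
    """Average a scalar loss tensor across all DDP processes (logging/sanity checks only)."""
    # all_reduce sums into the tensor in place and returns None for the
    # synchronous call, so work on a detached copy and don't reassign its result.
    reduced = loss.detach().clone()
    dist.all_reduce(reduced, op=dist.ReduceOp.SUM)
    reduced /= dist.get_world_size()
    return reduced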

What's your environment?

  • OS: Ubuntu 18.04
  • Packaging: pip
    torch 1.5.0
    torchvision 0.6.0
  • PL version: happens with both 0.7.1 and 0.7.2

Any tips / thoughts much appreciated. Cheers.

@jamesjjcondon jamesjjcondon added the question Further information is requested label May 15, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@williamFalcon
Contributor

williamFalcon commented May 15, 2020

In DDP you don't need to call dist.all_reduce; each process is independent of the others.

Just return:
{'loss': train_loss} from the training_step.

What syncs in DDP are the gradients: each process calculates its own loss and gradients.
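A minimal sketch of that pattern, assuming a LightningModule whose forward produces the 'CC'/'MLO' outputs from the question (the class name MammoModule and the pairs table are illustrative, not the author's code; PL 0.7.x-style dict return, and the validation_step would mirror it):

import pytorch_lightning as pl
import torch.nn.functional as F

class MammoModule(pl.LightningModule):  # hypothetical module standing in for the author's
    def training_step(self, batch, batch_idx):
        x, y = batch
        out = self.forward(x)
        # Same per-view / per-target NLL losses as in the question.
        pairs = [('CC', 0, 'L-CC', 'ben'), ('CC', 1, 'R-CC', 'ben'),
                 ('CC', 2, 'L-CC', 'ca'),  ('CC', 3, 'R-CC', 'ca'),
                 ('MLO', 0, 'L-MLO', 'ben'), ('MLO', 1, 'R-MLO', 'ben'),
                 ('MLO', 2, 'L-MLO', 'ca'),  ('MLO', 3, 'R-MLO', 'ca')]
        train_loss = sum(F.nll_loss(out[view][:, col], y[side][key])
                         for view, col, side, key in pairs)
        # No manual all_reduce: each DDP process returns its own loss, and the
        # gradients are averaged across processes during backward().
        return {'loss': train_loss}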

@Borda Borda added bug Something isn't working and removed question Further information is requested labels Dec 23, 2020