Error while training on multi gpus #3273

nrjvarshney · 2020-08-30T16:54:40Z

I get the following error when training with multiple GPUs. Training on a single GPU works.

avg_loss = torch.stack([x['val_loss'] for x in outputs_of_validation_steps]).mean()
RuntimeError: stack expects each tensor to be equal size, but got [2] at entry 0 and [1] at entry 343

    def validation_step(self, batch, batch_idx):
        logits, softmax_logits = self.forward(**batch)
        loss, prediction_label_count = self.loss_function(logits, batch["labels"])

        accuracy = self.compute_accuracy(logits, batch["labels"])

        return {
            "val_loss": loss,
            "accuracy": accuracy,
            "prediction_label_count": prediction_label_count,
        }

    def validation_epoch_end(self, outputs_of_validation_steps):
        avg_loss = torch.stack([x['val_loss'] for x in outputs_of_validation_steps]).mean()
        val_accuracy = torch.stack([x['accuracy'] for x in outputs_of_validation_steps]).mean()

        log = {'val_loss': avg_loss, "val_accuracy": val_accuracy}

        return {'val_loss': avg_loss, "val_accuracy": val_accuracy, 'log': log}


nrjvarshney · 2020-08-30T18:38:07Z

Using drop_last = True is not acceptable

awaelchli · 2020-08-30T18:58:12Z

Hi, I think I have solved that recently #3020. Which version are you on? Please try to upgrade and let me know.

RahulSajnani · 2020-09-04T12:37:30Z

Quoting @awaelchli: "Hi, I think I have solved that recently #3020. Which version are you on? Please try to upgrade and let me know."

This issue persists in PyTorch Lightning v0.9.0.

awaelchli · 2020-09-04T20:50:34Z

@RahulSajnani are you using the Result object, or the same kind of manual reduction as shown in @nrjvarshney's code?
In the latter case this error is expected: because you reduce manually, you need to use torch.cat instead of torch.stack.
However, I recommend you use the Results API: https://pytorch-lightning.readthedocs.io/en/latest/results.html

RahulSajnani · 2020-09-04T21:13:06Z

@awaelchli I am using the same kind of manual reduction as @nrjvarshney. The reduction is as shown here:

epoch_train_loss = torch.stack([x['val_epoch_logger']['train_val_loss'] for x in outputs]).mean()

awaelchli · 2020-09-04T22:09:28Z

Yes, then it makes sense that it fails: torch.stack requires all tensors to have the same shape, so if the last tensor comes from a batch of a different size, it fails.
Solution: use torch.cat or the Result object to reduce. Example using the code above:

import torch
from pytorch_lightning import EvalResult

    def validation_step(self, batch, batch_idx):
        logits, softmax_logits = self(**batch)
        loss, prediction_label_count = self.loss_function(logits, batch["labels"])

        accuracy = self.compute_accuracy(logits, batch["labels"])

        result = EvalResult()
        # mean is also the default reduce_fx, so it does not need to be written out
        result.log('val_accuracy', accuracy, reduce_fx=torch.mean)
        result.log('val_loss', loss)
        return result

    # validation_epoch_end is not needed: the result object collects all
    # accuracies/losses, reduces them, then logs.
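
For the manual route, here is a minimal sketch of the torch.cat reduction in plain PyTorch. It assumes the shapes in the error message ([2] and [1]) come from one loss value per GPU that received data, with the last, smaller batch reaching only one GPU. The `outputs` list and the two helper functions are made up for illustration and are not part of Lightning. The `reshape(-1)` is an extra precaution, since torch.cat rejects 0-dim tensors such as a scalar loss from single-GPU training.

```python
import torch

# Hypothetical per-step outputs: two steps with one loss per GPU (shape [2])
# and a final, smaller step that produced a single loss (shape [1]).
outputs = [
    {"val_loss": torch.tensor([0.5, 0.7])},
    {"val_loss": torch.tensor([0.6, 0.4])},
    {"val_loss": torch.tensor([0.3])},
]


def reduce_with_stack(outputs):
    # torch.stack needs every tensor to have exactly the same shape.
    return torch.stack([x["val_loss"] for x in outputs]).mean()


def reduce_with_cat(outputs):
    # torch.cat only needs the non-concatenated dimensions to match.
    # reshape(-1) turns 0-dim scalars into shape [1] so they can be joined too.
    return torch.cat([x["val_loss"].reshape(-1) for x in outputs]).mean()


try:
    reduce_with_stack(outputs)
    stack_failed = False
except RuntimeError:
    # Same failure mode as in the report: unequal tensor sizes.
    stack_failed = True

avg_loss = reduce_with_cat(outputs)  # mean over all five loss values
```

Note that this averages over the individual loss values, so it treats every entry equally regardless of how many samples each one was computed from.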

Hope this helps.

stale · 2020-10-21T15:43:51Z

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

nrjvarshney added the question Further information is requested label Aug 30, 2020

stale bot added the won't fix This will not be worked on label Oct 21, 2020

nrjvarshney closed this as completed Oct 24, 2020
