Error while training on multi gpus #3273

Closed
nrjvarshney opened this issue Aug 30, 2020 · 7 comments
Labels
question Further information is requested won't fix This will not be worked on

Comments

@nrjvarshney

I get the following error when training on multiple GPUs. Training on a single GPU works.

avg_loss = torch.stack([x['val_loss'] for x in outputs_of_validation_steps]).mean()
RuntimeError: stack expects each tensor to be equal size, but got [2] at entry 0 and [1] at entry 343
def validation_step(self, batch, batch_idx):
    logits, softmax_logits = self.forward(**batch)
    loss, prediction_label_count = self.loss_function(logits, batch["labels"])

    accuracy = self.compute_accuracy(logits, batch["labels"])

    return {
        "val_loss": loss,
        "accuracy": accuracy,
        "prediction_label_count": prediction_label_count,
    }

def validation_epoch_end(self, outputs_of_validation_steps):
    avg_loss = torch.stack([x['val_loss'] for x in outputs_of_validation_steps]).mean()
    val_accuracy = torch.stack([x['accuracy'] for x in outputs_of_validation_steps]).mean()

    log = {'val_loss': avg_loss, "val_accuracy": val_accuracy}

    return {'val_loss': avg_loss, "val_accuracy": val_accuracy, 'log': log}

@nrjvarshney nrjvarshney added the question Further information is requested label Aug 30, 2020
@nrjvarshney
Author

Using drop_last=True is not an acceptable option for me.

@awaelchli
Member

Hi, I think I solved that recently in #3020. Which version are you on? Please try upgrading and let me know.

@RahulSajnani

RahulSajnani commented Sep 4, 2020

> Hi, I think I have solved that recently #3020. Which version are you on? Please try to upgrade and let me know.

This issue persists in PyTorch Lightning v0.9.0.

@awaelchli
Member

@RahulSajnani are you using the Result object or the same kind of manual reduction as shown in @nrjvarshney's code?
In the latter case this error is expected: because you reduce manually, you need to use torch.cat instead of torch.stack.
However, I recommend using the Results API: https://pytorch-lightning.readthedocs.io/en/latest/results.html
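
To illustrate the difference between the two calls, here is a minimal sketch. The loss values are made up, and the shapes ([2], [2], [1]) are an assumption chosen to mirror the sizes in the error message above; this is not taken from the reporter's actual run.

```python
import torch

# Hypothetical per-step losses: two steps produced 2 values each,
# the last step produced only 1 (shapes mirror the error message).
step_losses = [
    torch.tensor([0.5, 0.7]),
    torch.tensor([0.4, 0.6]),
    torch.tensor([0.3]),
]

# torch.stack adds a new dimension, so every tensor must have the same shape.
try:
    torch.stack(step_losses)
    stack_failed = False
except RuntimeError:
    stack_failed = True

# torch.cat joins along an existing dimension, so the lengths may differ.
avg_loss = torch.cat(step_losses).mean()
```

With torch.cat every individual value contributes equally to the mean, regardless of which step it came from.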

@RahulSajnani

@awaelchli I am using the same kind of manual reduction as @nrjvarshney. The reduction is shown here:

epoch_train_loss = torch.stack([x['val_epoch_logger']['train_val_loss'] for x in outputs]).mean()

@awaelchli
Member

awaelchli commented Sep 4, 2020

Yes, then it makes sense that it fails: torch.stack requires all tensors to have the same shape, so it fails when the last tensor has a different batch size.
Solution: reduce with torch.cat or with a Result object. Example based on the code above:

import torch
from pytorch_lightning import EvalResult

def validation_step(self, batch, batch_idx):
    logits, softmax_logits = self(**batch)
    loss, prediction_label_count = self.loss_function(logits, batch["labels"])

    accuracy = self.compute_accuracy(logits, batch["labels"])

    result = EvalResult()
    # mean is also the default reduce_fx, so it does not need to be written out
    result.log('val_accuracy', accuracy, reduce_fx=torch.mean)
    result.log('val_loss', loss)
    return result

# validation_epoch_end is not needed! Everything is done by the result object:
# it collects all accuracies/losses, reduces them, then logs them.

Hope this helps.
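
For anyone who prefers to keep the manual reduction, the torch.cat route could look roughly like the sketch below. The helper name mean_over_steps is hypothetical (it is not part of PyTorch Lightning), and flattening with reshape(-1) is an added assumption so that the same code also handles 0-dim losses from single-GPU runs, which torch.cat would otherwise reject.

```python
import torch

def mean_over_steps(outputs, key):
    """Average outputs[i][key] over all steps (hypothetical helper)."""
    # reshape(-1) turns a 0-dim scalar into shape [1] and leaves
    # 1-dim tensors unchanged, so torch.cat accepts every entry.
    flat = [step[key].reshape(-1) for step in outputs]
    return torch.cat(flat).mean()

# Made-up step outputs: one step with two values, one with a scalar.
outputs = [
    {"val_loss": torch.tensor([1.0, 3.0])},
    {"val_loss": torch.tensor(2.0)},
]
avg = mean_over_steps(outputs, "val_loss")
```

Inside validation_epoch_end this would be called as mean_over_steps(outputs_of_validation_steps, 'val_loss'), replacing the torch.stack(...).mean() line.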

@stale

stale bot commented Oct 21, 2020

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

@stale stale bot added the won't fix This will not be worked on label Oct 21, 2020