validation_step_end and training_step_end usage #2435
Comments
@pamparana34 I believe even with `validation_step_end` in version 0.8.5, you still cannot get the metrics over the entire dataset. What you can get with `validation_step_end` is the metrics over one complete batch (one complete batch being the sum of the sub-batches on all GPUs at a given time point). See recent comments in #973
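For reference, a minimal sketch of how the hook behaves (assuming 0.8.x-style hooks and `dp`/`ddp2` mode, where `validation_step_end` receives the outputs of `validation_step` from every GPU for a single batch; the dict keys are illustrative):

```python
def validation_step(self, batch, batch_idx):
    x, y = batch
    loss = self.loss_fn(self(x), y)
    return {'loss': loss}

def validation_step_end(self, outputs):
    # in dp/ddp2 mode, outputs['loss'] holds one value per GPU,
    # but only for the current batch -- not for the whole dataset
    return {'val_loss': outputs['loss'].mean()}
```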
@pamparana34 @junwen-austin

```python
import torch
from pytorch_lightning import EvalResult

def validation_step(self, batch, batch_idx):
    anchor, positive, negatives = batch
    negatives = negatives.transpose(0, 1)
    losses = []
    for i in range(len(negatives)):
        anchor_out, positive_out, negative_out = self.forward_train(anchor,
                                                                    positive,
                                                                    negatives[i])
        loss_val = self.lossfn(anchor_out, positive_out, negative_out)
        losses.append(loss_val)
    loss_val = torch.stack(losses).mean()
    result = EvalResult()
    result.log("val_loss", loss_val, sync_dist=True)  # sync_dist computes the mean over all processes
    return result
```

and no need for `validation_step_end`.
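If you are on a version without `sync_dist`, a rough manual alternative is to do the same reduction yourself in `validation_epoch_end` — a minimal sketch, assuming `validation_step` returns a dict containing a scalar `val_loss` tensor (this is an illustration, not the API used above):

```python
import torch
import torch.distributed as dist

def validation_epoch_end(self, outputs):
    # outputs is the list of dicts this process returned from validation_step
    loss = torch.stack([o['val_loss'] for o in outputs]).mean()
    if dist.is_available() and dist.is_initialized():
        # sum the per-process means, then divide by the number of processes;
        # this equals the dataset mean only if every process sees the same
        # number of samples
        dist.all_reduce(loss, op=dist.ReduceOp.SUM)
        loss = loss / dist.get_world_size()
    return {'val_loss': loss, 'log': {'val_loss': loss}}
```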
Closing this, I am confident my answer applies to your use case. But if something does not work, let me know.
I am not sure if I should start a new issue or continue this one. My query is about the usage of `training_step_end`. My use-case involves collecting outputs from multiple batches (say 5) before I calculate the loss, so I am hoping I can use `training_step_end` for this. What I am not sure about is whether I can use it that way.
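Note that `training_step_end` only ever sees one batch's outputs (gathered across GPUs), so a rough sketch of one way to accumulate over several batches is to buffer inside `training_step` itself. The buffer attribute, the window size of 5, and the `compute_loss` helper are assumptions for illustration, and returning `None` to skip a step is behavior of newer Lightning versions — check yours:

```python
def training_step(self, batch, batch_idx):
    out = self(batch)                # keep the graph alive in the buffer
    self.step_buffer.append(out)     # self.step_buffer assumed initialized to [] in __init__
    if len(self.step_buffer) < 5:
        # not enough batches collected yet; returning None skips the
        # optimizer step in recent Lightning versions
        return None
    loss = self.compute_loss(self.step_buffer)  # hypothetical loss over 5 batches
    self.step_buffer = []
    return {'loss': loss}
```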
I cannot seem to find any examples of how to collect all the batches from the validation and training steps when using `ddp`. I am currently training on 4 GPUs with `ddp`, and my validation loop is as follows. At the moment, when computing the `val_loss`, only one of the processes is taken into account, so my statistics are not over the whole validation dataset (I think the same holds for my training set), and I would like them to be over the whole dataset. To that effect, I need to gather the outputs from all the GPUs. I see that there are `validation_step_end` and `training_step_end` callbacks, but I do not see many examples of their usage. Could someone please comment on whether they can be used for what I am trying to do, i.e. computing my training loss and validation loss over the whole dataset when reporting? A small example would be really useful for newbies like me. For completeness, my training loop is as follows:
I see there are several versions of this same question here. So, I think it would really help to have a small example of how to do this.
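As an illustration of the gathering step itself, a minimal sketch using `torch.distributed` directly (the helper name and tensor shapes are assumptions; it presumes an initialized `ddp` process group):

```python
import torch
import torch.distributed as dist

def gather_across_processes(t: torch.Tensor) -> torch.Tensor:
    """Collect a tensor from every ddp process and concatenate the copies."""
    world_size = dist.get_world_size()
    # all_gather requires a pre-allocated slot for each process's tensor
    gathered = [torch.zeros_like(t) for _ in range(world_size)]
    dist.all_gather(gathered, t)
    return torch.cat(gathered, dim=0)
```

Each rank then holds the outputs from every process and, assuming the distributed sampler splits the dataset evenly, can compute the full-dataset statistic locally.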