How to count training batches with support for distributed training #1581

Closed
mikerossgithub opened this issue on Apr 23, 2020 · 3 comments
Labels: question, won't fix

Comments

mikerossgithub commented on Apr 23, 2020

I am trying to write minimal code that tracks the total number of training batches seen so far, so that I can report it in the logs during validation.

For non-distributed training, I simply add a training_batches_so_far variable in my LightningModule's __init__, increment it in training_step(), and add it to the progress_bar and log fields of the output.

However, I want to make sure I am doing this properly for distributed training. What is the simplest way to do this? Ideally, I would like to be able to control how various metrics are accumulated (sum, avg, max); in this case, the aggregation would be to sum the training steps seen by each worker into the central total. I found related issues #702 and #1165, but it is unclear to me what the simplest / best-practice approach is.
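
For concreteness, here is a rough sketch of the kind of helper I have in mind, built directly on torch.distributed rather than on any Lightning API; the name reduce_metric and its signature are made up for illustration, and it simply falls back to the local value when no process group is initialized:

```python
import torch
import torch.distributed as dist


def reduce_metric(value, device, reduction="sum"):
    """Combine a per-process scalar across workers.

    Returns the local value unchanged when torch.distributed is not
    initialized (plain single-process training).
    """
    t = torch.tensor(float(value), device=device)
    if dist.is_available() and dist.is_initialized():
        if reduction == "max":
            dist.all_reduce(t, op=dist.ReduceOp.MAX)
        else:
            dist.all_reduce(t, op=dist.ReduceOp.SUM)
            if reduction == "mean":
                t /= dist.get_world_size()
    return t.item()
```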

mikerossgithub added the question label on Apr 23, 2020
mikerossgithub (Author) commented

I thought I had this figured out by accumulating batch counts in training_epoch_end() -- however, that hook is called after validation, so my validation epoch did not have access to the total number of training batches. Any help would be appreciated.

My goal is just to write simple code that properly accumulates batch counts regardless of which type of distributed training I am using. I'm sure PyTorch Lightning makes this simple, but I am having a difficult time figuring out exactly where to do the increments and accumulations.
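
One arrangement that might work is sketched below, reusing the illustrative reduce_metric helper from above and the dict-style training_step / validation_epoch_end return values: the per-process counter is incremented in training_step, and the cross-worker sum happens when the validation results are assembled, rather than in training_epoch_end. The module, its toy layer, and the attribute names are made up for the example:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class CountingModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
        self.train_batches_seen = 0  # batches seen by *this* process only

    def training_step(self, batch, batch_idx):
        self.train_batches_seen += 1
        x, y = batch
        loss = F.mse_loss(self.layer(x), y)
        return {"loss": loss,
                "progress_bar": {"train_batches_seen": self.train_batches_seen}}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {"val_loss": F.mse_loss(self.layer(x), y)}

    def validation_epoch_end(self, outputs):
        # Summed here rather than in training_epoch_end, so the value is
        # up to date when the validation epoch is logged.
        device = next(self.parameters()).device
        total = reduce_metric(self.train_batches_seen, device, reduction="sum")
        val_loss = torch.stack([o["val_loss"] for o in outputs]).mean()
        return {"val_loss": val_loss,
                "log": {"val_loss": val_loss, "total_train_batches": total}}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
```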

williamFalcon (Contributor) commented on Apr 24, 2020

Since all the processes across GPUs have to stay in sync, isn't the total batch count just the count from one GPU * num_gpus?

i.e., what you did (counting in training_step) makes sense; now just multiply by the world size (gpus * num_nodes).
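
In code, that suggestion might look something like this sketch (one possible reading of it; the helper name is made up, and the world size is read from torch.distributed, which equals gpus * num_nodes under DDP):

```python
import torch.distributed as dist


def total_train_batches(local_batches_seen):
    """Scale the per-process batch count by the world size, assuming DDP
    feeds every process the same number of batches per step."""
    world_size = 1
    if dist.is_available() and dist.is_initialized():
        world_size = dist.get_world_size()
    return local_batches_seen * world_size
```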

stale bot commented on Jun 23, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the won't fix label on Jun 23, 2020
stale bot closed this as completed on Jul 2, 2020