How to count training batches with support for distributed training #1581

Closed
mikerossgithub opened this issue on Apr 23, 2020 · 3 comments
Labels: question, won't fix

Comments

mikerossgithub commented on Apr 23, 2020

I am trying to write minimal code that tracks the total number of training batches seen so far, so that I can report it in the logs during validation.

For non-distributed training, I simply add a training_batches_so_far variable in my LightningModule's __init__, increment it in training_step(), and add it to the progress_bar and log fields of the output.

However, I want to make sure I am doing this properly for distributed training. What is the simplest way to do this? Ideally, I would like to be able to control how various metrics are accumulated (sum, avg, max); in this case, the aggregation would be to sum the training steps seen by each worker into the central total. I found related issues #702 and #1165, but it is unclear to me what the simplest / best-practice approach is.
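
For concreteness, here is a rough sketch of the kind of helper I have in mind, built directly on torch.distributed rather than on any Lightning API; the name reduce_metric and its signature are made up for illustration, and it simply falls back to the local value when no process group is initialized:

```python
import torch
import torch.distributed as dist


def reduce_metric(value, device, reduction="sum"):
    """Combine a per-process scalar across workers.

    Returns the local value unchanged when torch.distributed is not
    initialized (plain single-process training).
    """
    t = torch.tensor(float(value), device=device)
    if dist.is_available() and dist.is_initialized():
        if reduction == "max":
            dist.all_reduce(t, op=dist.ReduceOp.MAX)
        else:
            dist.all_reduce(t, op=dist.ReduceOp.SUM)
            if reduction == "mean":
                t /= dist.get_world_size()
    return t.item()
```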

mikerossgithub added the question label on Apr 23, 2020
mikerossgithub (Author) commented

I thought I had this figured out by accumulating batch counts in training_epoch_end() -- however, that hook is called after validation, so my validation epoch did not have access to the total number of training batches. Any help would be appreciated.

My goal is just to write simple code that properly accumulates batch counts regardless of which type of distributed training I am using. I'm sure PyTorch Lightning makes this simple, but I am having a difficult time figuring out exactly where to do the increments and accumulations.
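
One arrangement that might work is sketched below, reusing the illustrative reduce_metric helper from above and the dict-style training_step / validation_epoch_end return values: the per-process counter is incremented in training_step, and the cross-worker sum happens when the validation results are assembled, rather than in training_epoch_end. The module, its toy layer, and the attribute names are made up for the example:

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class CountingModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
        self.train_batches_seen = 0  # batches seen by *this* process only

    def training_step(self, batch, batch_idx):
        self.train_batches_seen += 1
        x, y = batch
        loss = F.mse_loss(self.layer(x), y)
        return {"loss": loss,
                "progress_bar": {"train_batches_seen": self.train_batches_seen}}

    def validation_step(self, batch, batch_idx):
        x, y = batch
        return {"val_loss": F.mse_loss(self.layer(x), y)}

    def validation_epoch_end(self, outputs):
        # Summed here rather than in training_epoch_end, so the value is
        # up to date when the validation epoch is logged.
        device = next(self.parameters()).device
        total = reduce_metric(self.train_batches_seen, device, reduction="sum")
        val_loss = torch.stack([o["val_loss"] for o in outputs]).mean()
        return {"val_loss": val_loss,
                "log": {"val_loss": val_loss, "total_train_batches": total}}

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)
```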

williamFalcon (Contributor) commented on Apr 24, 2020

Since all the processes across GPUs have to stay in sync, isn't the total batch count just the count from one GPU * num_gpus?

i.e., what you did (counting in training_step) makes sense; now just multiply by the world size (gpus * num_nodes).
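
In code, that suggestion might look something like this sketch (one possible reading of it; the helper name is made up, and the world size is read from torch.distributed, which equals gpus * num_nodes under DDP):

```python
import torch.distributed as dist


def total_train_batches(local_batches_seen):
    """Scale the per-process batch count by the world size, assuming DDP
    feeds every process the same number of batches per step."""
    world_size = 1
    if dist.is_available() and dist.is_initialized():
        world_size = dist.get_world_size()
    return local_batches_seen * world_size
```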

stale bot commented on Jun 23, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the won't fix label on Jun 23, 2020
stale bot closed this as completed on Jul 2, 2020