Adding NVIDIA-SMI like information #2074

Closed
groadabike opened this issue Jun 4, 2020 · 9 comments
Labels
feature (Is an improvement or enhancement), good first issue (Good for newcomers), help wanted (Open to be worked on), let's do it! (approved to implement)

Comments

@groadabike
Contributor

🚀 Feature

  • Add GPU usage information during training.

Motivation

Most of the research is done on HPC clusters. Therefore, if I want to see the GPU RAM and utilisation of my job, I have to open a secondary screen and run "watch nvidia-smi" or "nvidia-smi dmon".
Having this info saved in the logs would help me to:

  1. See whether I have room for larger batches.
  2. Report the correct resources needed to replicate my experiment.

Pitch

When training starts, report the GPU RAM and GPU usage together with the loss and v_num.

Alternatives

Once the first epoch is running on the GPU, log the GPU RAM and GPU usage.

Additional context

groadabike added the feature and help wanted labels on Jun 4, 2020
@github-actions
Contributor

github-actions bot commented Jun 4, 2020

Hi! Thanks for your contribution! Great first issue!

@Borda
Member

Borda commented Jun 10, 2020

Re 1) you can use the batch size finder (see the sketch below)
Re 2) how is it different from a logger?
cc: @jeremyjordan @SkafteNicki
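
A minimal sketch of the batch size finder mentioned in 1), assuming a Lightning version that exposes the auto_scale_batch_size Trainer flag and trainer.tune(); the exact API differs between releases, and the model and data below are placeholders rather than anything from this issue.

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class LitModel(pl.LightningModule):
    def __init__(self, batch_size=32):
        super().__init__()
        self.batch_size = batch_size  # the finder tunes this attribute in place
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

    def train_dataloader(self):
        ds = TensorDataset(torch.randn(1024, 32), torch.randint(0, 2, (1024,)))
        return DataLoader(ds, batch_size=self.batch_size)

model = LitModel()
trainer = pl.Trainer(gpus=1, auto_scale_batch_size="power", max_epochs=1)
trainer.tune(model)  # runs the batch size finder and updates model.batch_size
trainer.fit(model)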

@SkafteNicki
Member

If your goal is just to optimize the batch size, then the batch size finder may be what you are looking for.
If we were to log the resource usage, I guess we could write a callback, similar to LearningRateLogger, that extracts this information (through nvidia-smi or maybe gpustat?) and logs these numbers to the logger of the trainer.
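
A rough, non-authoritative sketch of that idea: poll nvidia-smi at the end of each batch from a Callback and forward the numbers to the Trainer's logger. The GPUStatsLogger name and the per-batch polling are illustrative assumptions; only standard nvidia-smi query fields and the Callback/logger hooks of the Lightning versions of that time are used.

import subprocess
from pytorch_lightning.callbacks import Callback

class GPUStatsLogger(Callback):
    """Poll nvidia-smi at the end of every batch and log the numbers."""

    def __init__(self, query=("utilization.gpu", "memory.used", "memory.free")):
        self.query = list(query)

    def _gpu_stats(self):
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=" + ",".join(self.query),
             "--format=csv,nounits,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        # one CSV line per GPU, values in the same order as self.query
        return [[float(v) for v in line.split(", ")] for line in out.splitlines()]

    def on_batch_end(self, trainer, pl_module):
        for gpu_id, values in enumerate(self._gpu_stats()):
            metrics = {"gpu_%d/%s" % (gpu_id, k): v for k, v in zip(self.query, values)}
            trainer.logger.log_metrics(metrics, step=trainer.global_step)

It could then be attached with something like Trainer(gpus=1, callbacks=[GPUStatsLogger()]); since spawning nvidia-smi every batch adds overhead, a sampling interval would be a sensible refinement.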

@groadabike
Contributor Author

groadabike commented Jun 12, 2020

Hi @SkafteNicki @Borda,
In fact, what I am looking for is both things: first, to find the batch size that fits in my GPU, and then to keep logging the GPU usage.
Recently, I came across https://github.com/sicara/gpumonitor, which implements a PL callback.
I will test whether gpumonitor has what I am looking for.
Thank you for your comments.

@Borda
Member

Borda commented Aug 4, 2020

@groadabike mind sending a PR with a PL callback?

Borda added the good first issue label on Aug 4, 2020
@groadabike
Contributor Author

Hi @Borda, I tried to use the gpumonitor callback, but it didn't work on my HPC.
For some reason, training stops and waits for something.
I can't send a PR as I don't have any callback implemented yet.
I still need to know the GPU utilisation because I know I have a bottleneck in the dataloader (found with the profiler), but I don't know how long the GPU waits for the next batch.
I will get back to you when I solve this issue.

@groadabike
Contributor Author

Hi @Borda,
I have a first attempt at a callback that measures the GPU usage and the GPU "dead" periods.
Can you take a look at it and give me your feedback?
I am taking several measurements and logging them to TensorBoard:

1- Time between batches - the time between the end of one batch and the start of the next.
2- Time in batch - the time between the start and end of one batch.
3- GPU utilisation - % of GPU utilisation, measured at the beginning and end of each batch.
4- GPU memory used.
5- GPU memory free.

(TensorBoard screenshots of each metric attached.)

gpuusage_callback.zip
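
The attached callback itself is only available in the zip above; the following is a hedged sketch of how the two timing metrics (1 and 2) could be measured with Lightning Callback hooks. The BatchTimingLogger name and metric keys are illustrative, not taken from the attachment.

import time
from pytorch_lightning.callbacks import Callback

class BatchTimingLogger(Callback):
    """Log the time spent inside a batch and the gap between consecutive batches."""

    def __init__(self):
        self._batch_start = None
        self._batch_end = None

    def on_batch_start(self, trainer, pl_module):
        now = time.monotonic()
        if self._batch_end is not None:
            # gap since the previous batch finished, i.e. time the GPU
            # spent waiting for the dataloader
            trainer.logger.log_metrics(
                {"time_between_batches": now - self._batch_end},
                step=trainer.global_step,
            )
        self._batch_start = now

    def on_batch_end(self, trainer, pl_module):
        self._batch_end = time.monotonic()
        trainer.logger.log_metrics(
            {"time_in_batch": self._batch_end - self._batch_start},
            step=trainer.global_step,
        )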

@SkafteNicki
Member

@groadabike I think that looks like a great addition. If you want to submit a PR, feel free :)
Personally, I would also add flags for temperature (the temperature.gpu and temperature.memory queries) and fans (the fan.speed query), both disabled by default. For the memory_utilization flag I would also log utilization.memory.
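
For reference, the suggested query names can be checked directly against nvidia-smi; this is only an illustrative snippet, and some fields report "N/A" on GPUs that lack the corresponding sensor.

import subprocess

# standard nvidia-smi query fields suggested above; some (e.g. temperature.memory,
# fan.speed) may report "N/A" on GPUs without the sensor, so parsing should tolerate that
fields = "temperature.gpu,temperature.memory,fan.speed,utilization.memory"
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=" + fields, "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)  # one CSV line per GPU, values in the order of `fields`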

Borda added the let's do it! label on Aug 11, 2020
@SkafteNicki
Member

Closing this as it was solved by PR #2932.
