
Update docs to be clear on --gpus behaviour. #563

Closed
jeffling opened this issue Nov 29, 2019 · 18 comments
Labels
feature Is an improvement or enhancement help wanted Open to be worked on

Comments

@jeffling
Contributor

jeffling commented Nov 29, 2019

Final resolution:

The resolution should be alternative 1, since we agree that we don't want to get rid of the 'number of gpus' functionality (which was the original proposed aggressive solution).

If we detect --gpus 0 with int, a warning should suffice alongside updated docs.


Is your feature request related to a problem? Please describe.
Trainer.gpus can currently be used to specify a number of GPUs or specific GPUs to run on. This makes values like

0 (run on CPU), "0" (Run on GPU 0), [0] (run on GPU 0)

confusing for newcomers.
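To make the ambiguity concrete, here is a minimal sketch of the three interpretations. The helper `parse_gpus` is hypothetical and for illustration only; it is not Lightning's actual implementation.

```python
def parse_gpus(gpus):
    # Hypothetical helper mimicking the documented interpretations
    # of Trainer(gpus=...); NOT Lightning's real code.
    if gpus is None or gpus == 0:
        return None  # 0 -> run on CPU
    if isinstance(gpus, str):
        # "0" -> [0]: run on GPU 0
        return [int(g) for g in gpus.split(",") if g.strip()]
    if isinstance(gpus, int):
        # 2 -> [0, 1]: use the first two GPUs
        return list(range(gpus))
    # [0] -> [0]: run on GPU 0
    return list(gpus)
```

Three inputs that look nearly identical thus produce very different device selections.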

Describe the solution you'd like
As an aggressive solution to this issue, we propose that gpus always specify specific GPUs, as that is the more encompassing case. Going forward, we can show a deprecation notice when a single int is passed in:

"In the future, gpus will specify the specific GPUs the model will run on. If you would like to run on CPU, pass in None or an empty list."

Then, in the next breaking version, we can simplify the behaviour.
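A sketch of what the proposed deprecation path could look like. The function name `check_gpus_arg` and its behaviour are assumptions for illustration, not Lightning code.

```python
import warnings

def check_gpus_arg(gpus):
    # Illustrative sketch: warn when a bare int is passed, since ints
    # would be deprecated in favour of explicit device lists.
    if isinstance(gpus, int) and not isinstance(gpus, bool):
        warnings.warn(
            "In the future, `gpus` will specify the specific GPUs the model "
            "will run on. To run on CPU, pass None or an empty list.",
            DeprecationWarning,
        )
    return gpus
```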

Describe alternatives you've considered

  1. Keep as is: This is a viable solution. We could just document it more carefully. However, anecdotally, this is confusing for our team and most likely for other users.
  2. Have gpus mean the number of GPUs: There are many cases where researchers need to run multiple experiments at the same time on a multi-GPU machine, so being able to easily specify which GPU to use is useful. An argument for this alternative is that one could use 'CUDA_VISIBLE_DEVICES' to select devices instead.
  3. Create a new num_gpus argument: This would be self-documenting and allow for both workflows. However, it would be an additional argument to maintain.

Additional context

@jeffling jeffling added feature Is an improvement or enhancement help wanted Open to be worked on labels Nov 29, 2019
@jeffling jeffling changed the title Trainer.gpus should have only one meaning Proposed change to Trainer.gpus to have singular intent Nov 29, 2019
@mpariente
Contributor

mpariente commented Nov 30, 2019

My two cents: I would be against 2 and 3.
Why not 2? If the GPU ids are passed with CUDA_VISIBLE_DEVICES, then what's the point of specifying the number of GPUs? CUDA_VISIBLE_DEVICES=0,1 python train.py --gpus 2 asks for double the work from the user, and would CUDA_VISIBLE_DEVICES=0,1 python train.py --gpus 1 even make sense?
Why not 3? The arguments are not orthogonal, and there are already too many arguments in Trainer.
Solution 1 would be OK IMO; it's nice to have different possibilities.
I'm also not against the solution you suggest:

  • It will be clearer for sure.
  • It should additionally support something like gpus=-1 or gpus=[-1] to use all available GPUs
  • The downside I see for this solution is the interface with argparse from bash; it's not straightforward to parse a list of integers from the command line.

BTW, this line means that when gpus is an integer, only the first gpus GPUs will be used. This makes the integer usage a little less useful.

@williamFalcon
Contributor

@jeffling Let's keep the first option. As mentioned by @mpariente here are the reasons:

Case 1:
User uses an integer to indicate how many GPUs to use.
This is likely the most common case. User also doesn't want to deal with choosing the GPUs.

gpus=2

Case 2:
User cares about choosing the GPUs. In this case this already implies 2 gpus, so it'd be redundant to have another argument.

gpus=[0, 3]

Case 3:
These arguments likely come from command line.
In this case, we need to support strings for these users.

python main.py --gpus "0,1,2,3"
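Parsing that string form is simple; a sketch of what the Trainer would have to do internally (illustrative only, not the actual implementation):

```python
def gpus_from_string(value):
    # Turn "0,1,2,3" into [0, 1, 2, 3]; "-1" means all available GPUs.
    value = value.strip()
    if value == "-1":
        return -1  # sentinel: use all available GPUs
    return [int(g) for g in value.split(",") if g.strip()]
```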

I do agree that we need to support -1 for all (I thought we already did).

So, the resolution seems to be updated docs, maybe a table with examples?

@mpariente
Contributor

I do agree that we need to support -1 for all (i thought we already did).

I said that in the case where the suggestion was adopted. -1 is supported as an integer and as a string, but not as a list, which seems consistent with the current behaviour.

So, the resolution seems to be updated docs, maybe a table with examples?

Can the docs be updated on master directly?

@Borda
Member

Borda commented Nov 30, 2019

Case 3:
These arguments likely come from command line.
In this case, we need to support strings for these users.

python main.py --gpus "0,1,2,3"

I would not use a string for multiple options, argparse supports multiple values directly:

parser.add_argument('--gpus', type=int, nargs='*', help='list of GPUs')

to be used as

python main.py --gpus 0 1 2 3
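A self-contained version of that parser (the flag name follows the thread; the rest is standard argparse boilerplate):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--gpus', type=int, nargs='*', help='list of GPUs')

# Simulate `python main.py --gpus 0 1 2 3`
args = parser.parse_args(['--gpus', '0', '1', '2', '3'])
# args.gpus is now the list [0, 1, 2, 3]
```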

@williamFalcon
Contributor

williamFalcon commented Nov 30, 2019

@Borda yeah, but this assumes users know argparse super well.... again, we need to keep in mind non-expert users.

I have however thought about providing a default parser with args for each trainer flag so users don't have to remember them (maybe a new issue)?

@Borda
Member

Borda commented Nov 30, 2019

@Borda yeah, but this assumes users know argparse super well.... again, we need to keep in mind non-expert users.

well, you should know it well as a developer, but for the users you have the help message to guide them... then these default parameters can be documented in the docs or README :]

@williamFalcon
Contributor

williamFalcon commented Nov 30, 2019

I'm talking about scientists, physicists, etc... deep learning is not done only by engineers or developers.

That level of usability is critical and at the core of Lightning.

@Borda
Member

Borda commented Nov 30, 2019

I meant that the people writing the argparser should make it easier, e.g. with listed parameters. I personally would be very confused passing it as a string with a separator... but I am open to discussion :]

@jeffling
Contributor Author

jeffling commented Dec 1, 2019

From commandline, an issue happens with using nargs syntax --gpus 0 1 2 vs --gpus 0,1,2.

Let's say I'm doing two runs on 3 GPUs, one that uses 2 GPUs and one that uses 1.

First run: python train.py --gpus 0 1. cool.
Second run: python train.py --gpus 2. Whoops, how do we parse this?

If our ideal for the framework user is to be able to do straight pass-throughs to lightning, the string is a better choice. Otherwise everyone will need to deal with the 'list vs int' logic their own way.

The original issue: if we're using a string, a user can easily mix up python train.py --gpus "0" and python train.py --gpus 0. But we can mitigate that through other means.


I still have a few PRs I want to do before this, so if anybody would like an easy one feel free to take it :)

@jeffling jeffling changed the title Proposed change to Trainer.gpus to have singular intent Update docs to be clear on --gpus behaviour. Dec 1, 2019
@Borda
Member

Borda commented Dec 1, 2019

From commandline, an issue happens with using nargs syntax --gpus 0 1 2 vs --gpus 0,1,2.
Let's say I'm doing two runs on 3 GPUs, one that uses 2 GPUs and one that uses 1.
First run: python train.py --gpus 0 1. cool.
Second run: python train.py --gpus 2. Whoops, how do we parse this?

This is a simple fix, and we shall correct it on the Lightning side:
after parsing with argparse, check if it is a list, else wrap it in a list... I did the same a couple of times in the past.
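The normalization Borda describes could look like the following sketch (the function name is an assumption, not Lightning's code):

```python
def normalize_gpus(gpus):
    # After argparse, wrap a bare value in a list so that both
    # `--gpus 2` (an int) and `--gpus 0 1` (a list) are handled uniformly.
    if gpus is None:
        return []
    if isinstance(gpus, int):
        return [gpus]
    return list(gpus)
```

Note this still commits to one interpretation: a lone value becomes a device index, which is exactly the 'list vs int' ambiguity raised above.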

@williamFalcon
Contributor

yeah, either way we need to keep support for passing in a string because this is exactly the friction we need to avoid with users
:)

@awaelchli
Member

How about we make a list in the docs with all accepted inputs and how they are parsed/interpreted, with the edge cases discussed here? Then we write a test for each case to make sure that the parsing works properly. The parsing is already quite complicated and it takes a bit of time to see the edge cases.

@williamFalcon
Contributor

@awaelchli yes, mind submitting a PR?

@awaelchli
Member

sure. I will try to deliver it before next release.

@Borda
Member

Borda commented Dec 4, 2019

next release on 6th of Dec? :)

@awaelchli
Member

@Borda yep, today or tomorrow :)

@stale

stale bot commented Mar 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Mar 3, 2020
@Borda Borda closed this as completed Mar 3, 2020
@Borda Borda removed the won't fix This will not be worked on label Mar 3, 2020
@junjy007

I have spent quite a while browsing a couple of discussions on this issue, and the latest referred official doc is missing:

https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html

I had to write a small script to be sure of the behaviour of "python train.py --gpus 0" when the trainer is created using ".from_argparse_args(args)". The document is not doing its job well.

Basically, I think that one needs to be explicit, no matter whether one is an expert or non-expert in coding. So "--gpus" may not be a good parameter name in the first place. If we stick to it, there should be examples of using it

  • from bash command line
  • by specifying in Trainer() constructor
  • by configuration in vs-code/pycharm launch settings
    with cases of
  • specifying the number of devices for [a single node | multiple nodes]
  • specifying the specific devices for [a single node | multiple nodes using the same set of device indexes | multiple nodes using respective device indexes]
