
Update docs to be clear on --gpus behaviour. #563

Closed
jeffling opened this issue Nov 29, 2019 · 18 comments
Labels
feature Is an improvement or enhancement help wanted Open to be worked on

Comments

@jeffling
Contributor

jeffling commented Nov 29, 2019

Final resolution:

The resolution should be alternative 1, since we agree that we don't want to get rid of the 'number of gpus' functionality (which was the original proposed aggressive solution).

If we detect --gpus 0 with int, a warning should suffice alongside updated docs.


Is your feature request related to a problem? Please describe.
Trainer.gpus can currently be used to specify a number of GPUs or specific GPUs to run on. This makes values like

0 (run on CPU), "0" (Run on GPU 0), [0] (run on GPU 0)

confusing for newcomers.
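To make the ambiguity concrete, here is a minimal sketch of the three interpretations. The helper `parse_gpus` is hypothetical and for illustration only; it is not Lightning's actual implementation.

```python
def parse_gpus(gpus):
    # Hypothetical helper mimicking the documented interpretations
    # of Trainer(gpus=...); NOT Lightning's real code.
    if gpus is None or gpus == 0:
        return None  # 0 -> run on CPU
    if isinstance(gpus, str):
        # "0" -> [0]: run on GPU 0
        return [int(g) for g in gpus.split(",") if g.strip()]
    if isinstance(gpus, int):
        # 2 -> [0, 1]: use the first two GPUs
        return list(range(gpus))
    # [0] -> [0]: run on GPU 0
    return list(gpus)
```

Three inputs that look nearly identical thus produce very different device selections.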

Describe the solution you'd like
As an aggressive solution to this issue, we propose that gpus always specify specific GPUs, as that is the more encompassing case. Going forward, we can show a deprecation notice when a single int is passed in:

"In the future, gpus will specify the specific GPUs the model will run on. If you would like to run on CPU, pass in None or an empty list."

Then, in the next breaking version, we can simplify the behaviour.
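A sketch of what the proposed deprecation path could look like. The function name `check_gpus_arg` and its behaviour are assumptions for illustration, not Lightning code.

```python
import warnings

def check_gpus_arg(gpus):
    # Illustrative sketch: warn when a bare int is passed, since ints
    # would be deprecated in favour of explicit device lists.
    if isinstance(gpus, int) and not isinstance(gpus, bool):
        warnings.warn(
            "In the future, `gpus` will specify the specific GPUs the model "
            "will run on. To run on CPU, pass None or an empty list.",
            DeprecationWarning,
        )
    return gpus
```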

Describe alternatives you've considered

  1. Keep as is: This is a viable solution. We could just document it more carefully. However, anecdotally, this is confusing for our team and most likely for other users.
  2. Have gpus mean the number of GPUs: There are many cases where researchers need to run multiple experiments at the same time on a multi-GPU machine, so being able to easily specify which GPU to use is useful. An argument for this alternative is that one could use 'CUDA_VISIBLE_DEVICES' to select devices instead.
  3. Create a new num_gpus argument: This would be self-documenting and allow for both workflows. However, it would be an additional argument to maintain.

Additional context

@jeffling jeffling added feature Is an improvement or enhancement help wanted Open to be worked on labels Nov 29, 2019
@jeffling jeffling changed the title Trainer.gpus should have only one meaning Proposed change to Trainer.gpus to have singular intent Nov 29, 2019
@mpariente
Contributor

mpariente commented Nov 30, 2019

My two cents: I would be against 2 and 3.
Why not 2? If the GPU ids are passed with CUDA_VISIBLE_DEVICES, then what's the point of specifying the number of GPUs? CUDA_VISIBLE_DEVICES=0,1 python train.py --gpus 2 asks for double the work from the user, and would CUDA_VISIBLE_DEVICES=0,1 python train.py --gpus 1 even make sense?
Why not 3? The arguments are not orthogonal, and there are already too many arguments in Trainer.
Solution 1 would be OK IMO; it's nice to have different possibilities.
I'm also not against the solution you suggest:

  • It will be clearer for sure.
  • It should additionally support something like gpus=-1 or gpus=[-1] to use all available GPUs
  • The downside I see for this solution is the interface with argparse from bash; it's not straightforward to parse a list of integers from the command line.

BTW, this line means that when gpus is an integer, only the first gpus GPUs will be used. This makes the integer usage a little less useful.

@williamFalcon
Contributor

@jeffling Let's keep the first option. As mentioned by @mpariente here are the reasons:

Case 1:
User uses an integer to indicate how many GPUs to use.
This is likely the most common case. User also doesn't want to deal with choosing the GPUs.

gpus=2

Case 2:
User cares about choosing the GPUs. In this case this already implies 2 gpus, so it'd be redundant to have another argument.

gpus=[0, 3]

Case 3:
These arguments likely come from command line.
In this case, we need to support strings for these users.

python main.py --gpus "0,1,2,3"
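Parsing that string form is simple; a sketch of what the Trainer would have to do internally (illustrative only, not the actual implementation):

```python
def gpus_from_string(value):
    # Turn "0,1,2,3" into [0, 1, 2, 3]; "-1" means all available GPUs.
    value = value.strip()
    if value == "-1":
        return -1  # sentinel: use all available GPUs
    return [int(g) for g in value.split(",") if g.strip()]
```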

I do agree that we need to support -1 for all (I thought we already did).

So, the resolution seems to be updated docs, maybe a table with examples?

@mpariente
Contributor

I do agree that we need to support -1 for all (i thought we already did).

I said that in the case where the suggestion was adopted. -1 is supported as an integer and as a string, but not as a list, which seems consistent with the current behaviour.

So, the resolution seems to be updated docs, maybe a table with examples?

Can the docs be updated on master directly?

@Borda
Member

Borda commented Nov 30, 2019

Case 3:
These arguments likely come from command line.
In this case, we need to support strings for these users.

python main.py --gpus "0,1,2,3"

I would not use a string for multiple options, argparse supports multiple values directly:

parser.add_argument('--gpus', type=int, nargs='*', help='list of GPUs')

to be used as

python main.py --gpus 0 1 2 3
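A self-contained version of that parser (the flag name follows the thread; the rest is standard argparse boilerplate):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--gpus', type=int, nargs='*', help='list of GPUs')

# Simulate `python main.py --gpus 0 1 2 3`
args = parser.parse_args(['--gpus', '0', '1', '2', '3'])
# args.gpus is now the list [0, 1, 2, 3]
```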

@williamFalcon
Contributor

williamFalcon commented Nov 30, 2019

@Borda yeah, but this assumes users know argparse super well.... again, we need to keep in mind non-expert users.

I have however thought about providing a default parser with args for each trainer flag so users don't have to remember them (maybe a new issue)?

@Borda
Member

Borda commented Nov 30, 2019

@Borda yeah, but this assumes users know argparse super well.... again, we need to keep in mind non-expert users.

well, you should know it well as a developer, but for the users you have the help message to guide them... then these default parameters can be documented in the docs or README :]

@williamFalcon
Contributor

williamFalcon commented Nov 30, 2019

I'm talking about scientists, physicists, etc... deep learning is not done only by engineers or developers.

That level of usability is critical and at the core of Lightning.

@Borda
Member

Borda commented Nov 30, 2019

I meant that the people writing the argparser should make it easier, e.g. with listed parameters. I personally would be very confused passing it as a string with a separator... but I am open to discussion :]

@jeffling
Contributor Author

jeffling commented Dec 1, 2019

From commandline, an issue happens with using nargs syntax --gpus 0 1 2 vs --gpus 0,1,2.

Let's say I'm doing two runs on 3 GPUs, one that uses 2 GPUs and one that uses 1.

First run: python train.py --gpus 0 1. cool.
Second run: python train.py --gpus 2. Whoops, how do we parse this?

If our ideal for the framework user is to be able to do straight pass-throughs to lightning, the string is a better choice. Otherwise everyone will need to deal with the 'list vs int' logic their own way.

The original issue: if we're using a string, a user can easily mix up python train.py --gpus "0" and python train.py --gpus 0. But we can mitigate that through other means.


I still have a few PRs I want to do before this, so if anybody would like an easy one feel free to take it :)

@jeffling jeffling changed the title Proposed change to Trainer.gpus to have singular intent Update docs to be clear on --gpus behaviour. Dec 1, 2019
@Borda
Member

Borda commented Dec 1, 2019

From commandline, an issue happens with using nargs syntax --gpus 0 1 2 vs --gpus 0,1,2.
Let's say I'm doing two runs on 3 GPUs, one that uses 2 GPUs and one that uses 1.
First run: python train.py --gpus 0 1. cool.
Second run: python train.py --gpus 2. Whoops, how do we parse this?

This is a simple fix, and we shall correct it on the Lightning side:
after parsing with argparse, check if it is a list, else wrap it in a list... I did the same a couple of times in the past.
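The normalization Borda describes could look like the following sketch (the function name is an assumption, not Lightning's code):

```python
def normalize_gpus(gpus):
    # After argparse, wrap a bare value in a list so that both
    # `--gpus 2` (an int) and `--gpus 0 1` (a list) are handled uniformly.
    if gpus is None:
        return []
    if isinstance(gpus, int):
        return [gpus]
    return list(gpus)
```

Note this still commits to one interpretation: a lone value becomes a device index, which is exactly the 'list vs int' ambiguity raised above.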

@williamFalcon
Contributor

yeah, either way we need to keep support for passing in a string because this is exactly the friction we need to avoid with users
:)

@awaelchli
Member

How about we make a list in the docs with all accepted inputs and how they are parsed/interpreted, with the edge cases discussed here? Then we write a test for each case to make sure that the parsing works properly. The parsing is already quite complicated and it takes a bit of time to see the edge cases.

@williamFalcon
Contributor

@awaelchli yes, mind submitting a PR?

@awaelchli
Member

sure. I will try to deliver it before next release.

@Borda
Member

Borda commented Dec 4, 2019

next release on 6th of Dec? :)

@awaelchli
Member

@Borda yep, today or tomorrow :)

@stale

stale bot commented Mar 3, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the won't fix This will not be worked on label Mar 3, 2020
@Borda Borda closed this as completed Mar 3, 2020
@Borda Borda removed the won't fix This will not be worked on label Mar 3, 2020
@junjy007

I have spent quite a while browsing a couple of discussions on this issue, and the latest referred official doc is missing:

https://pytorch-lightning.readthedocs.io/en/latest/multi_gpu.html

I had to write a small script to be sure of the behaviour of "python train.py --gpus 0" when the trainer is created using ".from_argparse_args(args)". The document is not doing its job well.

Basically, I think that one needs to be explicit, no matter whether one is an expert or non-expert in coding. So "--gpus" may not be a good parameter name in the first place. If we stick to it, there should be examples of using it

  • from bash command line
  • by specifying in Trainer() constructor
  • by configuration in vs-code/pycharm launch settings
    with cases of
  • specifying the number of devices for [a single node | multiple nodes]
  • specifying the specific devices for [a single node | multiple nodes using the same set of device indexes | multiple nodes using respective device indexes]
