
Feature: auto scale batch size #1638

Merged

Conversation

SkafteNicki
Member

@SkafteNicki SkafteNicki commented Apr 27, 2020

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

What does this PR do?

Fixes #1615 and #1444.
This implements an algorithm that can automatically find the largest batch size that fits in memory (without an OOM error). Currently two modes are supported: power and binsearch. power iteratively multiplies the batch size by 2 until an OOM is encountered and then stops. binsearch additionally tries to refine the batch size from there through a binary search strategy.
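
For illustration, a minimal standalone sketch of the two search strategies (simplified, not the PR's actual implementation; run_one_step is a hypothetical callable that raises a CUDA out-of-memory RuntimeError when the batch does not fit):

def find_batch_size(run_one_step, init_size=2, max_trials=25, mode='power'):
    """Search for the largest batch size that runs without an OOM error."""
    low, high = None, None              # largest known-good / smallest known-bad size
    size = init_size
    for _ in range(max_trials):
        try:
            run_one_step(size)
        except RuntimeError as err:
            if 'out of memory' not in str(err):
                raise                   # not an OOM error -> re-raise
            high = size
            if mode == 'power' or low is None:
                return low              # power mode stops at the last size that fitted
            size = (low + high) // 2    # binsearch mode refines between low and high
        else:
            low = size
            if high is None:
                size *= 2               # power phase: keep doubling
            elif high - low <= 1:
                break                   # binsearch phase: converged
            else:
                size = (low + high) // 2
    return low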

In power (default) the terminal output currently looks something like this (running the LightningTemplateModel):
[Screenshot of the terminal output, 2020-04-27]

The interface for this feature is currently very much like the learning rate finder introduced some time ago. In the basic case the user can set the trainer flag auto_scale_batch_size=True and the batch size finder will run when .fit() is called. Similar to the learning rate finder, this assumes the user has a field called model.hparams.batch_size that can be overridden with whatever batch size is found. If the user instead wants to write to another field, this can be done with auto_scale_batch_size='my_field' (corresponding to model.hparams.my_field).

Power users can, after initializing the trainer, invoke the method scale_batch_size directly and thereby control the search through the method's parameters.
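
A minimal usage sketch of the two entry points (MyLightningModule stands in for the user's model; keyword arguments such as mode and max_trials follow the description above, but the exact signature of scale_batch_size may differ):

import pytorch_lightning as pl

model = MyLightningModule(hparams)        # assumes hparams has a batch_size field

# Basic use: the batch size finder runs automatically when .fit() is called
# and overrides model.hparams.batch_size with the found value.
trainer = pl.Trainer(auto_scale_batch_size=True)
trainer.fit(model)

# Power-user use: invoke the search directly and control it through its parameters.
trainer = pl.Trainer()
new_size = trainer.scale_batch_size(model, mode='binsearch', max_trials=25)
model.hparams.batch_size = new_size
trainer.fit(model)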

WIP right now as tests and better documentation are missing. I also need to figure out exactly where this should live in the codebase: currently it is in TrainerTrainingTricksMixin, but it should maybe be its own mixin.

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks

pep8speaks commented Apr 27, 2020

Hello @SkafteNicki! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-05-09 08:43:53 UTC

@mergify
Contributor

mergify bot commented May 1, 2020

This pull request is now in conflict... :(

@SkafteNicki SkafteNicki changed the title [WIP] Feature: auto scale batch size Feature: auto scale batch size May 4, 2020
@Borda Borda added the feature Is an improvement or enhancement label May 4, 2020
@Borda Borda added this to the 0.7.6 milestone May 4, 2020
@williamFalcon
Contributor

williamFalcon commented May 5, 2020

/rebase

@williamFalcon
Contributor

@SkafteNicki I really want to try this haha... can we merge?

@Borda Borda force-pushed the feature/auto_batch_size branch from 0a639b5 to 2d3988a Compare May 5, 2020 19:08
@SkafteNicki
Member Author

I think it works as it should (at least for me). I could maybe add more tests if you want. You are also welcome to check out the branch before the merge.

@williamFalcon
Contributor

all good. let’s get all the tests to pass

Member

@Borda Borda left a comment


This is a great addition ❤️

Review comments (now outdated/resolved) on: docs/source/training_tricks.rst, pytorch_lightning/trainer/__init__.py, pytorch_lightning/trainer/training_tricks.py, tests/trainer/test_trainer_tricks.py
Member

@awaelchli awaelchli left a comment


Went on a typo hunt :) hope you don't mind my pedantic behaviour :)
The batch size finder function is quite complex and could be abstracted a bit further. How hard would it be to add a third mode if we needed to do so in the future?

Review comments (now outdated/resolved) on: docs/source/training_tricks.rst, pytorch_lightning/trainer/__init__.py, pytorch_lightning/trainer/training_tricks.py
@SkafteNicki
Member Author

@Borda & @williamFalcon I messed up when I tried to pull the latest changes that you made, and now the latest commit edits a lot more files than intended. Is there a way to fix this? Basically revert one commit. I am no git expert.

@Borda
Member

Borda commented May 6, 2020

@Borda & @williamFalcon I messed up when I tried to pull the latest changes that you made, and now the latest commit edits a lot more files than intended. Is there a way to fix this? Basically revert one commit. I am no git expert.

Sure, just drop the last commit :] with git rebase -i HEAD~2, or I can do it for you...

EDIT: it does not allow me to fetch your branch, it crashes...

@Borda Borda force-pushed the feature/auto_batch_size branch from df02a2c to 810ecc8 Compare May 6, 2020 13:47
@Borda
Member

Borda commented May 6, 2020

@SkafteNicki I have tried to get it back and it seems to be fine now, but please check it...

@SkafteNicki
Member Author

On a side note, in the future I think both this feature and the learning rate finder should be redone (I made them both, so this is my own fault). I realized that both follow a pattern:

  1. dump the current state of the model and trainer
  2. alter some trainer args/variables to suit the feature
  3. run the feature by calling .fit() internally
  4. save the results from the feature
  5. restore the initial settings

Instead of all the hassle of saving, altering and restoring, it is probably a better idea to just initialize a new instance of pl.Trainer inside the feature and copy over important settings (like device) to the new instance. Then only the initial state of the model (which can easily be done if issue #1619 is solved) needs to be saved/restored.
This pattern is not only present in these two features; future features will also follow it. For example, I have looked a bit at incorporating a cross-validation feature into pl (issue #839) and it follows the exact same pattern.
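
For illustration, a rough sketch of the proposed pattern (the helper run_tuning_feature and the copied settings are hypothetical, not an agreed design):

import copy

import pytorch_lightning as pl

def run_tuning_feature(trainer, model, tuning_fn):
    """Run a tuning feature in a throwaway Trainer instead of mutating the user's one."""
    # Only the model state needs to be saved; the user's Trainer is left untouched.
    initial_state = copy.deepcopy(model.state_dict())

    # Fresh Trainer that copies over a (hypothetical) subset of important settings.
    tune_trainer = pl.Trainer(gpus=getattr(trainer, 'gpus', None), max_epochs=1, logger=False)
    result = tuning_fn(tune_trainer, model)   # e.g. a batch size or learning rate search

    # Restore the model so the subsequent real .fit() starts from scratch.
    model.load_state_dict(initial_state)
    return result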

@Borda
Member

Borda commented May 8, 2020

There is a suggestion from @tullie (which I like) that we may consider splitting this hyperparameter tuning (batch size, learning rate) into a separate class/object Tuner, to lower the code complexity a bit and make it more transparent...

@@ -474,7 +482,7 @@ def __init__(
         self.show_progress_bar = show_progress_bar

         self.progress_bar_refresh_rate = progress_bar_refresh_rate
-        self.progress_bar_callback = None
+        self.progress_bar_callback = progress_bar_callback
Member


the arg progress_bar_callback was not used anywhere - forgotten; I hope this is the right place...

Comment on lines +298 to +309
count += 1
if count > max_trials:
    break
# Double in size
low = new_size
if high:
    # a failure has already been seen -> binary search phase
    if high - low <= 1:
        break
    midval = (high + low) // 2
    new_size = _adjust_batch_size(trainer, batch_arg_name, value=midval, desc='succeeded')
else:
    # no failure yet -> power phase, keep doubling
    new_size = _adjust_batch_size(trainer, batch_arg_name, factor=2.0, desc='succeeded')
Member


I would rather move this to the else section, as here we do not expect any failures, right?

Member Author


this is meant to check whether we are still in the initial phase of doubling the batch size or whether we have failed once (i.e. high is defined) and thus are in the binary search phase

Member


Sure, I mean

try:
    do_something()
except:
    do_this_if_it_failed()
else:
    do_this_only_if_it_passed()
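
To make this concrete, a small self-contained sketch of how the success branch could live in the else clause (run_trial and the window handling are hypothetical, not the PR's actual restructuring):

def trial_step(run_trial, size, low, high):
    """One search step; ``low`` is the largest passing size (start at 0), ``high`` the smallest failing size (start at None)."""
    try:
        run_trial(size)                       # may raise a CUDA OOM RuntimeError
    except RuntimeError as err:
        if 'out of memory' not in str(err):
            raise                             # not an OOM error -> propagate
        high = size                           # trial failed: shrink the search window
        size = (low + high) // 2
    else:
        low = size                            # trial passed: grow or refine
        size = size * 2 if high is None else (low + high) // 2
    return size, low, high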

@Borda Borda added the ready PRs ready to be merged label May 8, 2020
@SkafteNicki
Member Author

There is a suggestion from @tullie (which I like) that we may consider splitting this hyperparameter tuning (batch size, learning rate) into a separate class/object Tuner, to lower the code complexity a bit and make it more transparent...

I also like this. I really think it is great that lightning has kept its interface so minimalist for most users, e.g. most users only need to interact with LightningModule and Trainer. However, I also think the time has come to extend the interface for these more advanced features.

@williamFalcon
Contributor

We can explore this on @tullie's GH issue. In the meantime, let's get this merged haha.

@mergify
Contributor

mergify bot commented May 9, 2020

This pull request is now in conflict... :(

@williamFalcon williamFalcon merged commit 4970927 into Lightning-AI:master May 9, 2020
def is_cudnn_snafu(exception):
    return isinstance(exception, RuntimeError) \
        and len(exception.args) == 1 \
        and "cuDNN error: CUDNN_STATUS_NOT_SUPPORTED." in exception.args[0]
Contributor

@BlackHC BlackHC May 25, 2020


Thanks for this PR to implement toma in PyTorch Lightning!

If you copy code and ideas from my projects, could you please add a mention of it, too? I see that you're a fellow PhD student, so you are aware of the importance of credit assignment.

In particular, if you copy code verbatim and remove helpful comments... maybe add them back.

def is_cuda_out_of_memory(exception):
    return (
        isinstance(exception, RuntimeError) and len(exception.args) == 1 and "CUDA out of memory." in exception.args[0]
    )


def is_cudnn_snafu(exception):
    # For/because of https://github.com/pytorch/pytorch/issues/4107
    return (
        isinstance(exception, RuntimeError)
        and len(exception.args) == 1
        and "cuDNN error: CUDNN_STATUS_NOT_SUPPORTED." in exception.args[0]
    )

def gc_cuda():
    """Garbage collect Torch (CUDA) memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

is from https://github.com/BlackHC/toma/blob/master/toma/torch_cuda_memory.py.

Now, I think this PR contains lots of other code, and I think it's great, but maybe add a link or a mention.

Thank you,
Andreas

PS: Keeping my rather random method names is a bit of a give-away.

@williamFalcon
Contributor

williamFalcon commented May 25, 2020

@BlackHC I'm sorry, I had no idea this code was copied... @SkafteNicki generally we like to do our own implementations of features, and under no circumstances do we allow code copying.

I suggest a few things to rectify this:

  1. we use toma as is and add them as a dependency for this feature
    or
  2. we come up with our own actual implementation.

But seeing how the code was copied from the toma repo, I would rather play nice and bring them in as an actual dependency.

@BlackHC my deepest apologies, I was not aware that this code came from your repo!

@PyTorchLightning/core-contributors thoughts?

@BlackHC
Contributor

BlackHC commented May 25, 2020

The rest of the code seems quite original/I haven't reviewed it in detail. I'm sure you have a good understanding of it and its quality because it follows a slightly different approach than toma. With the binary search and the potential of using a higher batch size than specified, it might be worth looking into ghost batch norm in the future if this is used for training.

What would be great is:

  • add an "inspired by Andreas Kirsch's https://github.com/BlackHC/toma" comment in the source/feature docs,
  • add the # For/because of https://github.com/pytorch/pytorch/issues/4107 comment back to explain why it checks for that exception (it's a bit magical otherwise); and
  • add a 'based on https://github.com/BlackHC/toma/blob/master/toma/torch_cuda_memory.py' comment to utilities/memory.py?
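
For illustration, the requested changes might look roughly like this in utilities/memory.py (a sketch only, not an actual commit from this PR):

# Based on BlackHC/toma (torch_cuda_memory.py) by Andreas Kirsch - see the links above.
import gc

import torch

def is_cudnn_snafu(exception):
    # For/because of pytorch/pytorch#4107 (link in the comment above)
    return (
        isinstance(exception, RuntimeError)
        and len(exception.args) == 1
        and "cuDNN error: CUDNN_STATUS_NOT_SUPPORTED." in exception.args[0]
    )

def gc_cuda():
    """Garbage collect Torch (CUDA) memory."""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()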

I can grant you license for those lines of code outside of the MIT license used in toma, so it should be fine. No need to rewrite things.

Please let me know what you think.

Thanks,
Andreas

@Borda
Member

Borda commented May 25, 2020

@BlackHC I am sorry for this, I was not aware of it...
I would add it as a dependency; there is no need to reinvent the wheel unless we can get a better wheel :]
@SkafteNicki or @BlackHC (to also be a mentioned contributor in the next release), mind sending a PR importing the mentioned util functions from your lib?

@BlackHC
Contributor

BlackHC commented May 25, 2020

Thank you very much!

From a dependency point of view, it's only the functions I mentioned (and is_out_of_cpu_memory from https://github.com/BlackHC/toma/blob/master/toma/cpu_memory.py), so I'm not sure it's worth including the full dependency at this point. Just mentioning the original source might be enough.

If toma adds lots of functionality, it might be worth having another look. It does not scale batch sizes up but only down at the moment, as it follows a slightly different paradigm.

I'm really impressed by how quickly you have replied and reacted. I think it's amazing.

Thanks,
Andreas

@Borda
Member

Borda commented May 25, 2020

@williamFalcon we may think about using some other functionalities but

which is a very limiting factor for us...
the good side is that the only dependencies are torch and psutil

@SkafteNicki
Member Author

I am so sorry about this. Let me try to explain. I originally tried to integrate toma into lightning (because it is an awesome library) but could not figure out how to get some functionality to work with the lightning interface, especially the binary search and hparams. I therefore ended up doing a custom implementation myself. Everything except the utility functions for determining when we are out of memory I wrote myself, but I am truly sorry that in the heat of programming I forgot to reference the original source code for these functions. I am sorry if I have offended you @BlackHC; it was never my intent, and I would never do that to a fellow PhD student.
@williamFalcon and @Borda I will gladly rectify my mistake in a PR by updating all the code with correct references.

@Borda
Member

Borda commented May 25, 2020

@williamFalcon and @Borda I will gladly rectify my mistake in a PR by updating all the code with correct references.

We need to resolve the license case in the dependencies; then I would suggest that @BlackHC make this simple PR so he is also on the contributor list (as it is generated from PR authors).

@williamFalcon
Contributor

williamFalcon commented May 25, 2020

@SkafteNicki no worries, I'm sure it wasn't malicious - we just want to play fair with the broader set of tools.

Let's follow @BlackHC's suggestions here and make the correct PRs/adjustments to our codebase.

@BlackHC thank you for providing the rights to use! We will make sure to follow the considerations you requested.

Thanks!

@BlackHC
Contributor

BlackHC commented May 26, 2020

Thank you very much! I'll prepare a PR shortly (probably this evening as we have an internal NeurIPS deadline before that 😬).

@SkafteNicki thanks for explaining and no worries! I think it's great that you implemented it in a way that targets PyTorch Lightning specifically, and I'm glad I was able to provide inspiration with toma and that the utility functions were useful. This is a big part of why open-source is great. I hope you'll keep contributing to PyTorch Lightning!

Thanks,
Andreas

Labels
feature (Is an improvement or enhancement), ready (PRs ready to be merged)
Development

Successfully merging this pull request may close these issues.

Feature to automatically choose batch size
7 participants