Removed torch.cuda.empty_cache from train loop. #31530

Merged

Conversation

FoamoftheSea
Contributor

What does this PR do?

Removes the torch.cuda.empty_cache call from the training loop (added in #28769).

This line caused the training slowdowns reported in issue #31372.

This thread in the PyTorch forums recommends against using the function because it is slow, yet many commenters there still find it necessary to avoid OOMs in their training jobs. It might be nice to offer the option, but users can add it on their own if they're in a jam.
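
For reference, a minimal sketch of how a user could restore the old behaviour themselves via the TrainerCallback API (this callback is illustrative and not part of this PR):

```python
import torch
from transformers import TrainerCallback


class EmptyCacheCallback(TrainerCallback):
    """Illustrative callback: release cached CUDA memory after every optimizer step."""

    def on_step_end(self, args, state, control, **kwargs):
        if torch.cuda.is_available():
            torch.cuda.empty_cache()


# Usage (sketch): pass it to the Trainer
# trainer = Trainer(model=model, args=training_args, callbacks=[EmptyCacheCallback()])
```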

Fixes #31372

@muellerzr @SunMarc

Member

@SunMarc SunMarc left a comment


Thanks again for your investigation @FoamoftheSea! LGTM!

Contributor

@muellerzr muellerzr left a comment


Overall I agree with it. If users decide this isn't enough, the next step IMO would be a toggleable "after n steps" empty_cache() of some sort, to at least delay it and give users control.
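
One way that toggle might look on the user side, sketched as a callback with a hypothetical empty_cache_steps interval (not an existing Trainer or TrainingArguments option):

```python
import torch
from transformers import TrainerCallback


class PeriodicEmptyCacheCallback(TrainerCallback):
    """Illustrative callback: clear the CUDA cache only every `empty_cache_steps` steps."""

    def __init__(self, empty_cache_steps: int = 100):
        self.empty_cache_steps = empty_cache_steps

    def on_step_end(self, args, state, control, **kwargs):
        # state.global_step counts completed optimizer steps
        if (
            torch.cuda.is_available()
            and self.empty_cache_steps > 0
            and state.global_step % self.empty_cache_steps == 0
        ):
            torch.cuda.empty_cache()
```

This keeps the expensive empty_cache() call off the per-step hot path while still bounding how long cached-but-unused memory can accumulate.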

Collaborator

@amyeroberts amyeroberts left a comment


Thanks for the detailed PR description @FoamoftheSea and the fix ❤️

Agreed - let's remove it for now, and if we find that users need it we can think about smarter ways to reintroduce it.

@amyeroberts amyeroberts merged commit 8b7cd40 into huggingface:main Jun 21, 2024
20 checks passed
@aliencaocao
Contributor

Does it make sense to add it as a TrainingArgument defaulting to False, with a tip to turn it on if VRAM usage is near the limit? It would be useful because many OOMs only happen after some unpredictable number of steps, and many users don't watch their runs all the way through before stepping away.

This may also cause behaviour changes: hyperparameters/models that worked previously may OOM after the change.

@amyeroberts
Collaborator

@aliencaocao I'm not sure we necessarily want to actively monitor the memory and trigger a tip (I suspect this is more fiddly and flaky than it sounds, as you have to balance catching it in time vs. not spamming, making sure the values are correct, etc.).

A flag we can configure for clearing after every n steps seems reasonable. Would you like to open a PR with a proposal, and we can iterate from there?

@aliencaocao
Contributor

Sure, I can do it.
