
Slow startup and OOMs when using device_train_microbatch_size with torch.compile #2214

Open
tbenthompson opened this issue May 9, 2023 · 1 comment
Labels: bug (Something isn't working)

Comments


tbenthompson commented May 9, 2023

When I combine device_train_microbatch_size="auto" with a model compiled via torch.compile(...), I run into two predictable problems:

  1. Each distinct microbatch size requires re-compiling the model, so launch time is very slow (see the sketch below).
  2. I get OOM errors after several compilations have failed. My guess is that torch is caching some values on the GPU from each compilation, but I'm not sure exactly what is going on.
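To illustrate (1), here's a toy sketch, not Composer-specific, of how probing several batch sizes interacts with torch.compile's default static-shape behavior:

```python
import torch

def toy_forward(x):
    # Stand-in for a real model's forward pass
    return torch.nn.functional.relu(x @ x.T)

compiled = torch.compile(toy_forward)

# The "auto" microbatch search effectively probes several batch sizes.
# With torch.compile's default static-shape behavior, each new input
# shape is a new graph, so each call below can trigger a fresh
# compilation and its associated memory overhead.
for batch_size in (64, 32, 16, 8):
    compiled(torch.randn(batch_size, 128))
```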

My best guess at a clean solution is to determine the microbatch size with the uncompiled model, so that only one compilation is needed. Another possible solution is to wait for torch to improve its dynamic-shape support and treat the batch size as a dynamic dimension, which would likewise require only one compilation.
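A rough sketch of the dynamic-shape idea, assuming torch.compile(dynamic=True) (or torch._dynamo.mark_dynamic on the batch dimension) works reliably for the model in question:

```python
import torch

model = torch.nn.Linear(128, 64)

# Sketch of the dynamic-shape idea: compile once with dynamic shapes so
# different microbatch sizes reuse the same compiled graph.
# (torch._dynamo.mark_dynamic(x, 0) on the batch dimension is another
# way to express the same intent.)
compiled = torch.compile(model, dynamic=True)

compiled(torch.randn(32, 128))
compiled(torch.randn(8, 128))   # ideally no recompilation here
```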

For now, I'm avoiding this issue by just manually specifying a microbatch size.
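Roughly, that workaround looks like the following. This is a minimal sketch: the toy model, dataset, and the value 16 are placeholders, and compiling the inner module before wrapping it is just how I happen to set things up.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import ComposerClassifier

# Toy module/dataset purely for illustration.
module = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16, 4))
module = torch.compile(module)  # compile before wrapping in a ComposerModel
model = ComposerClassifier(module, num_classes=4)

dataset = TensorDataset(torch.randn(256, 16), torch.randint(0, 4, (256,)))
train_dataloader = DataLoader(dataset, batch_size=64)

trainer = Trainer(
    model=model,
    train_dataloader=train_dataloader,
    max_duration="1ep",
    # Fixed microbatch size instead of "auto": only one input shape is
    # ever seen, so the compiled model is compiled exactly once.
    device_train_microbatch_size=16,
)
trainer.fit()
```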

tbenthompson added the bug (Something isn't working) label on May 9, 2023

mvpatel2000 commented May 9, 2023

Thanks for flagging this!

> My best guess at a clean solution is to determine the microbatch size with the uncompiled model, so that only one compilation is needed. Another possible solution is to wait for torch to improve its dynamic-shape support and treat the batch size as a dynamic dimension, which would likewise require only one compilation.

This seems like a reasonable path forward. We have put it on our roadmap, but we may not get to it for a few weeks.

We also welcome contributions; feel free to open a PR implementing option 1!
