Add support for torch 2.0 #2172
Conversation
Why is there a gap in the resumption tests?
Mostly LGTM / minor comments. Will do one more pass after comments are resolved, before approval, because the PR is massive.
@mvpatel2000 The resumptions with a gap are because run 1 went for 10 batches, and then run 2 was run with autoresume set to true and an increased max duration, rather than by deleting checkpoints. Let me know if that makes sense.
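Roughly, the scenario looks like the sketch below. This is a minimal illustration with toy fixtures, not the actual test code; the model, dataset, folder names, and durations are assumptions.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from composer import Trainer
from composer.models import ComposerClassifier

def make_trainer(max_duration: str) -> Trainer:
    # Tiny synthetic classifier + dataset; stand-ins for the real test fixtures.
    model = ComposerClassifier(torch.nn.Linear(8, 2), num_classes=2)
    dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
    return Trainer(
        model=model,
        train_dataloader=DataLoader(dataset, batch_size=4),
        max_duration=max_duration,
        save_folder='checkpoints',
        save_interval='5ba',
        run_name='resumption-test',
        autoresume=True,
    )

# Run 1: train for 10 batches, saving checkpoints along the way.
make_trainer('10ba').fit()

# Run 2: same run_name/save_folder with autoresume enabled and a larger
# max_duration, so training resumes from run 1's latest checkpoint and
# continues. That resumption point is the "gap" seen in the tests; no
# checkpoints were deleted.
make_trainer('20ba').fit()
```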
Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>
Serialize and load torchmetrics through state_dict() and load_state_dict() instead of pickle
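A minimal sketch of the idea, using a standalone torchmetrics metric for illustration; the PR applies this inside Composer's state serialization rather than in user code, and the metric class and file name here are arbitrary:

```python
import torch
import torchmetrics

metric = torchmetrics.classification.BinaryAccuracy()
metric.update(torch.tensor([1, 0, 1]), torch.tensor([1, 1, 1]))

# Serialize only the metric's state tensors via state_dict() instead of
# pickling the whole object, which is brittle across torchmetrics/torch versions.
torch.save(metric.state_dict(), 'metric.pt')

# Reconstruct the metric from its class and load the saved state back in.
restored = torchmetrics.classification.BinaryAccuracy()
restored.load_state_dict(torch.load('metric.pt'))
assert torch.allclose(metric.compute(), restored.compute())
```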
Some minor comments. Overall looks good. I liked the detailed comments that you have added in multiple places to give a better understanding of the code.
Minor nit: a more descriptive PR title?
LGTM. Let's get more approvals though before merging. The only outstanding item is adding GPU daily tests.
LGTM. Thanks!
Should we add helpful error messages to the ONNX export method if a user runs into an error and is running PyTorch 2.0? We could suggest they try downgrading PyTorch versions if one of their model operators is not supported.
This is not a blocking suggestion --- we can merge without this.
I think the message you get from ONNX directly is about as clear as it gets, since we don't know which operator they're having trouble with and how that corresponds to opset version and torch version. But open to another suggestion.
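For reference, one way such a hint could look, as a rough sketch around torch.onnx.export. This is illustrative only, not what the PR implements; the function name and wording are made up:

```python
import torch

def export_with_hint(model, sample_input, path: str) -> None:
    """Export to ONNX, appending a PyTorch-2.0-specific hint on failure."""
    try:
        torch.onnx.export(model, sample_input, path)
    except Exception as e:
        # Operator-support failures surface as errors raised by torch.onnx.export.
        if torch.__version__.startswith('2.'):
            raise RuntimeError(
                f'ONNX export failed: {e}. If the failure is due to an unsupported '
                'operator, consider trying an older PyTorch release, since ONNX '
                'operator coverage for torch 2.0 may still be catching up.'
            ) from e
        raise
```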
What does this PR do?
This PR upgrades the torch pin to support torch 2.0. It includes related fixes, mostly resulting from the use of use_orig_params=True with FSDP, which is necessary to support compile.
Changes:
- _LRScheduler -> LRScheduler
- summon_full_params to get HF generate to work with FSDP (see the sketch below)
Manual tests:
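To make the two FSDP-related changes above more concrete, here is a minimal sketch using plain PyTorch FSDP rather than Composer's wrapping. It assumes a process group has already been initialized (e.g. via torchrun), and `model` / `input_ids` are placeholders for a HuggingFace model and its tokenized inputs, not code from this PR:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# use_orig_params=True keeps the original (unflattened) parameters visible on
# the wrapped module, which is what torch.compile needs in order to trace it.
fsdp_model = FSDP(model, use_orig_params=True)  # `model`: placeholder HF model
compiled_model = torch.compile(fsdp_model)

# summon_full_params temporarily gathers the full, unsharded parameters on each
# rank, so methods that bypass forward(), such as HuggingFace generate(), see
# the complete weights.
with FSDP.summon_full_params(fsdp_model):
    outputs = fsdp_model.module.generate(input_ids, max_new_tokens=16)
```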
What issue(s) does this change relate to?
Closes CO-2029
Closes #2147
Before submitting
- Did you run pre-commit on your change? (see the pre-commit section of prerequisites)