Add support for torch 2.0 #2172

dakinggg · 2023-04-26T19:02:03Z

What does this PR do?

This PR upgrades the torch pin to support torch 2.0. It includes related fixes, mostly resulting from the use of use_orig_params=True with FSDP, which is necessary to support compile.

Changes:

Different way of loading FSDP state dicts for torch 2
_LRScheduler -> LRScheduler
using summon_full_params to get HF generate to work with FSDP
fixing the way we compute optimizer metrics to allow for the fact that not all ranks have all params
assorted minor fixes and test fixes

Manual tests:

sharded multinode autoresume

local autoresume

full resume

optimizer monitor (https://wandb.ai/mosaic-ml/daniel-debug/runs/34o29laf?workspace=user-danielking)
base 1b run (https://wandb.ai/mosaic-ml/daniel-debug/runs/37e2g60h?workspace=user-danielking)
hf generate for gpt neo (https://wandb.ai/mosaic-ml/daniel-debug/groups/torch2-evals/workspace?workspace=user-danielking)
hf generate for mosaicgpt (https://wandb.ai/mosaic-ml/daniel-debug/runs/10lknf0v?workspace=user-danielking)
added remote tests pass

What issue(s) does this change relate to?

Closes CO-2029
Closes #2147

Before submitting

Have you read the contributor guidelines?
Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
Did you update any related docs and document your change?
Did you update any related tests and add any new tests related to your change? (see testing)
Did you run the tests locally to make sure they pass?
Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

.github/workflows/pr-gpu.yaml

mvpatel2000

Why is there a gap in the resumption tests?

Mostly LGTM / minor comments. will do one more pass after comments are resolved before approval because PR is massive

.github/workflows/pr-cpu.yaml

.github/workflows/pr-gpu.yaml

composer/callbacks/optimizer_monitor.py

composer/core/state.py

tests/algorithms/test_gradient_clipping.py

tests/callbacks/test_optimizer_monitor.py

tests/common/models.py

tests/trainer/test_sharded_checkpoint.py

composer/models/huggingface.py

dakinggg · 2023-04-27T03:49:08Z

@mvpatel2000 the resumptions with a gap are because run 1 went for 10 batches, and then run 2 was run with autoresume true and an increased max duration, rather than deleting some checkpoints or something. Lmk if that makes sense.

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

Serialize and load torchmetrics through state_dict() and load_state_dict() instead of pickle

karan6181

Some minor comments. Overall looks good. I liked the detailed comments that you have added at multiple places to get better understanding of a code.

Minor nit: More descriptive PR header name ?

composer/core/state.py

composer/core/types.py

composer/utils/auto_log_hparams.py

mvpatel2000

LGTM. Let's get more approvals though before merging. Only outstanding is adding GPU daily tests

karan6181

LGTM. Thanks!

nik-mosaic

Should we add helpful error messages to the ONNX export method if a user runs into an error and is running PyTorch 2.0? We could suggest they try downgrading PyTorch versions if one of their model operators is not supported.

This is not a blocking suggestion --- we can merge without this.

dakinggg · 2023-04-27T21:51:26Z

I think the message you get from ONNX directly is about as clear as it gets...since we don't know which operator they're having trouble with and how that correspond to opset version and torch version. but open to another suggestion

dakinggg commented Apr 27, 2023

View reviewed changes

.github/workflows/pr-gpu.yaml Outdated Show resolved Hide resolved

dakinggg marked this pull request as ready for review April 27, 2023 02:38

dakinggg requested review from a team as code owners April 27, 2023 02:38

dakinggg requested review from eracah, karan6181, mvpatel2000, bandish-shah and nik-mosaic April 27, 2023 02:38

mvpatel2000 reviewed Apr 27, 2023

View reviewed changes

dakinggg commented Apr 27, 2023

View reviewed changes

composer/models/huggingface.py Show resolved Hide resolved

dakinggg requested a review from dskhudia as a code owner April 27, 2023 07:12

dakinggg added 17 commits April 27, 2023 01:01

fix prefetch default

8b838cb

fix pyright issues with torch2

6aca00c

make tests conditional

582c6fe

make tests conditional

7b49c3b

go back to old lr type

18130cc

xfail onnx test on torch2

685e2de

more torch2 test fixes

bd12aec

attempt to add torch2 workflows

7f6de3a

switch to pull_request

c706b13

summon full params for generate

e75068a

scope the hack to torch2

62a342d

better fix for hf generate

8a49238

fix optimizer monitor

a8a8191

merge

2d7feae

refix optimizer monitor

a4017b7

remove print

3b47546

bump pin

a357b55

dakinggg and others added 13 commits April 27, 2023 01:02

Update composer/models/huggingface.py

317ab4d

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

Update composer/models/huggingface.py

5cc4309

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

Update composer/models/huggingface.py

e699b5d

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

Update composer/models/huggingface.py

791d98a

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

Update composer/utils/module_surgery.py

7e397c4

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

Update .github/workflows/pr-cpu.yaml

5ff03ad

Co-authored-by: Mihir Patel <mihir.v.patel7@gmail.com>

rename using_torch_2

646d212

some pr comments

fd51277

remove min_params

41212a0

Use state_dict Torchmetrics Serialization (#2116)

b9922c5

Serialize and load torchmetrics through state_dict() and load_state_dict() instead of pickle

add comment

a78bc99

remove print

1547db3

adjust hf generate comment

b6b5de7

dakinggg force-pushed the torch2branch2 branch from 6b69fcd to b6b5de7 Compare April 27, 2023 08:04

dakinggg added 2 commits April 27, 2023 01:12

precommit

64957be

revert unsafe pull_request

bbd6c32

dakinggg requested a review from mvpatel2000 April 27, 2023 08:31

karan6181 reviewed Apr 27, 2023

View reviewed changes

composer/core/state.py Outdated Show resolved Hide resolved

composer/core/types.py Show resolved Hide resolved

composer/utils/auto_log_hparams.py Show resolved Hide resolved

dakinggg changed the title ~~Torch2~~ Add support for torch 2.0 Apr 27, 2023

dakinggg added 3 commits April 27, 2023 10:00

Merge branch 'dev' into torch2branch2

029483a

add pyright comment

a954565

fix s3 bucket path

6cf947d

mvpatel2000 approved these changes Apr 27, 2023

View reviewed changes

dakinggg added 2 commits April 27, 2023 11:24

add torch2 to daily gpu

322f5d0

fix workflow name

1aaf156

karan6181 approved these changes Apr 27, 2023

View reviewed changes

nik-mosaic approved these changes Apr 27, 2023

View reviewed changes

dakinggg merged commit 6180ef0 into dev Apr 27, 2023

dakinggg deleted the torch2branch2 branch April 27, 2023 22:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for torch 2.0 #2172

Add support for torch 2.0 #2172

dakinggg commented Apr 26, 2023 •

edited

Loading

mvpatel2000 left a comment

dakinggg commented Apr 27, 2023 •

edited

Loading

karan6181 left a comment •

edited

Loading

mvpatel2000 left a comment •

edited

Loading

karan6181 left a comment

nik-mosaic left a comment •

edited

Loading

dakinggg commented Apr 27, 2023

Add support for torch 2.0 #2172

Add support for torch 2.0 #2172

Conversation

dakinggg commented Apr 26, 2023 • edited Loading

What does this PR do?

What issue(s) does this change relate to?

Before submitting

mvpatel2000 left a comment

Choose a reason for hiding this comment

dakinggg commented Apr 27, 2023 • edited Loading

karan6181 left a comment • edited Loading

Choose a reason for hiding this comment

mvpatel2000 left a comment • edited Loading

Choose a reason for hiding this comment

karan6181 left a comment

Choose a reason for hiding this comment

nik-mosaic left a comment • edited Loading

Choose a reason for hiding this comment

dakinggg commented Apr 27, 2023

dakinggg commented Apr 26, 2023 •

edited

Loading

dakinggg commented Apr 27, 2023 •

edited

Loading

karan6181 left a comment •

edited

Loading

mvpatel2000 left a comment •

edited

Loading

nik-mosaic left a comment •

edited

Loading