Updating how we load metrics in a state_dict so we don't add extra memory overhead #1892

bcui19 · 2023-01-18T23:54:01Z

What does this PR do?

When loading Composer state dicts, we load the train and eval metrics. However, each of these metrics has a defined _device, which is usually the rank 0 device, since that is what is stored in the state_dict. As a result, we need to manually copy over each metric (so we don't override a user's current metrics) and manually set its device.

Now, our memory overhead when loading models is normal:
| 0 N/A N/A 3770555 C 31567MiB |
| 1 N/A N/A 3770556 C 31567MiB |
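
To illustrate the idea, here is a minimal sketch in plain Python. It is not the code in composer/core/state.py: ToyMetric, load_metrics, and the device strings are hypothetical stand-ins, and a real metric would hold tensors rather than floats. It only shows the pattern described above, copying each saved metric individually and pointing it at the local rank's device rather than the device recorded in the checkpoint.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ToyMetric:
    """Hypothetical stand-in for a metric; not a Composer or torchmetrics class."""
    total: float = 0.0
    count: int = 0
    _device: str = "cpu"  # device the metric's state lives on


def load_metrics(current: Dict[str, ToyMetric],
                 serialized: Dict[str, ToyMetric],
                 local_device: str) -> None:
    """Copy saved metrics into `current` one at a time, retargeted to `local_device`."""
    for name, saved in serialized.items():
        # Rebuild the metric on the local device instead of keeping saved._device.
        current[name] = ToyMetric(total=saved.total, count=saved.count, _device=local_device)


# The checkpoint was written by rank 0, so the saved metric points at cuda:0.
checkpoint_metrics = {"accuracy": ToyMetric(total=90.0, count=100, _device="cuda:0")}

# Rank 1 has its own metrics, including one the checkpoint does not contain.
rank1_metrics = {
    "accuracy": ToyMetric(_device="cuda:1"),
    "user_metric": ToyMetric(total=1.0, count=1, _device="cuda:1"),
}

load_metrics(rank1_metrics, checkpoint_metrics, local_device="cuda:1")
```

Because the loop assigns per key instead of replacing the whole dictionary, metrics that exist only on the current run (user_metric here) are left in place, and no loaded metric keeps a reference to the rank 0 device.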

What issue(s) does this change relate to?

CO-1485

Before submitting

[x] Have you read the contributor guidelines?
[ ] Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
[ ] Did you update any related tests and add any new tests related to your change? (see testing)
[x] Did you run the tests locally to make sure they pass?
[x] Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

dakinggg

LGTM, could you add a test that loading a checkpoint doesn't use extra memory? I hope this is not hard and can just be done as part of this PR. If it is hard, feel free to make a JIRA and do it later.

eracah

LGTM. Is there an easy way to add a test?

Updating device to satisfy access requirements

12e531e

bcui19 requested review from eracah and dakinggg January 18, 2023 23:54

dakinggg approved these changes Jan 19, 2023

Review thread on composer/core/state.py (outdated, resolved)

removing extraneous continue

d6f7182

eracah approved these changes Jan 19, 2023

Review thread on composer/devices/device.py (resolved)

dakinggg reviewed Jan 19, 2023

Review thread on composer/core/state.py (resolved)

bcui19 added 2 commits January 19, 2023 00:58

Fixing for eval_metrics

711bc69

Cleaning up code

0d620bd

bcui19 merged commit 2c956d0 into mosaicml:dev Jan 19, 2023

bcui19 deleted the fix_metrics_memory_leak branch March 10, 2023 18:02
