Updating how we load metrics in a state_dict so we don't add extra memory overhead #1892

Merged
merged 4 commits into from
Jan 19, 2023

@bcui19 bcui19 commented Jan 18, 2023

What does this PR do?

When loading Composer state dicts, we load in train and eval metrics. However, these metrics have a defined `_device`, which is usually the rank 0 device since that's what is stored in the state_dict. As a result, we need to manually copy over each metric (so we don't override a user's current metrics) and manually set its device.
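
The copy-and-retarget idea can be illustrated with a minimal, dependency-free sketch. It is not the code in this PR: `ToyMetric`, `load_metrics`, and the string device names are hypothetical stand-ins for a real metric object, the state-loading logic, and real devices.

```python
import copy


class ToyMetric:
    """Hypothetical stand-in for a metric that remembers which device it lives on."""

    def __init__(self, device):
        self._device = device
        self.total = 0.0

    def to(self, device):
        # A real metric would also move its tensors; here we only retag it.
        self._device = device
        return self


def load_metrics(current, serialized, local_device):
    """Copy each serialized metric into `current`, retargeted to `local_device`."""
    for name, saved in serialized.items():
        restored = copy.deepcopy(saved)  # leave the checkpoint's objects untouched
        restored.to(local_device)        # ignore the device recorded at save time
        current[name] = restored         # update per metric, keep the user's other metrics
    return current


# Checkpoint written by rank 0, then loaded on rank 1.
checkpoint_metrics = {"accuracy": ToyMetric("cuda:0")}
checkpoint_metrics["accuracy"].total = 0.75

rank1_metrics = load_metrics({"loss": ToyMetric("cuda:1")}, checkpoint_metrics, "cuda:1")
```

After the call, `rank1_metrics["accuracy"]` reports `cuda:1`, the user's existing `loss` metric is still present, and the checkpoint's own object still reports `cuda:0`.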

Before, our memory overhead was (note the extra 965MiB that process 3770556 holds on GPU 0):
| 0 N/A N/A 3770555 C 31567MiB |
| 0 N/A N/A 3770556 C 965MiB |
| 1 N/A N/A 3770556 C 31567MiB |

Now, our memory overhead when loading models is normal:
| 0 N/A N/A 3770555 C 31567MiB |
| 1 N/A N/A 3770556 C 31567MiB |

What issue(s) does this change relate to?

CO-1485

Before submitting

  • [x] Have you read the contributor guidelines?
  • [ ] Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • [ ] Did you update any related tests and add any new tests related to your change? (see testing)
  • [x] Did you run the tests locally to make sure they pass?
  • [x] Did you run pre-commit on your change? (see the pre-commit section of prerequisites)


@dakinggg dakinggg left a comment

LGTM, could you add a test that loading a checkpoint doesn't use extra memory? I hope this is not hard and can just be done as part of this PR. If it is hard, feel free to make a JIRA and do it later.

composer/core/state.py (outdated, resolved)

@eracah eracah left a comment

LGTM. Is there an easy way to add a test?

composer/devices/device.py (resolved)
@bcui19 bcui19 merged commit 2c956d0 into mosaicml:dev Jan 19, 2023
@bcui19 bcui19 deleted the fix_metrics_memory_leak branch March 10, 2023 18:02