Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature] Add the low level SSD APIs #829

Merged
merged 7 commits into from
Nov 1, 2021
Merged

[feature] Add the low level SSD APIs #829

merged 7 commits into from
Nov 1, 2021

Conversation

anj-s
Copy link
Contributor

@anj-s anj-s commented Oct 24, 2021

What does this PR do?

Adds the low level SSD APIs required for representing a tensor, reading/writing to/from memory/disk. These will be used in a future PR to implement SSD offload with FSDP.

The two main concepts here are SsdTensorHandle and SsdBuffer. The SsdTensorHandle represents a tensor that can be in memory or on disk. There are APIs to enable us to set the right metadata (memory offset, file offset, file params etc.). The SsdBuffer consists of a list of SsdTensorHandles and represents all the parameters in a module (as an example). We should be able to read all tensors to and from disk as part of the SsdBuffer.

Note: This PR is a join effort between @another-pjohnson and @anj-s .

Before submitting

  • Did you have fun?
    • Make sure you had fun coding 🙃
  • Did you read the contributor guideline?
  • Was this discussed/approved via a Github issue? (no need for typos, doc improvements)
    • N/A
  • Did you make sure to update the docs?
    • N/A
  • Did you write any new necessary tests?
    • N/A
  • Did you update the changelog? (if needed)
    • N/A

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in Github issues there's a high chance it will not be merged.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Oct 24, 2021
@anj-s anj-s marked this pull request as draft October 24, 2021 19:13
@anj-s anj-s marked this pull request as ready for review October 25, 2021 23:52
@anj-s anj-s requested a review from min-xu-ai October 25, 2021 23:52
@anj-s anj-s marked this pull request as draft October 26, 2021 00:14
@anj-s anj-s marked this pull request as ready for review October 26, 2021 22:07
@anj-s
Copy link
Contributor Author

anj-s commented Oct 27, 2021

Friendly ping for a review!

Copy link
Contributor

@another-pjohnson another-pjohnson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, although it's code that I've mostly written :-)

@anj-s anj-s merged commit a9fcaa2 into main Nov 1, 2021
@anj-s anj-s deleted the ssd-offload-api branch November 1, 2021 19:22
@min-xu-ai
Copy link
Contributor

It seems that pytorch nightly version is a bit flaky with the new test?

@anj-s
Copy link
Contributor Author

anj-s commented Nov 1, 2021

Existing tests are flaky, not the new one:
tests.nn.checkpoint.test_checkpoint_activations and tests.nn.checkpoint.test_checkpoint_activations.

@min-xu-ai
Copy link
Contributor

My bad. I saw "test_offload_memory" and assumed it was a new test. I triggered a rerun. I haven't seen this one fail on CI for a long time. Is it very flaky and frequently fails?

@anj-s
Copy link
Contributor Author

anj-s commented Nov 1, 2021

My bad. I saw "test_offload_memory" and assumed it was a new test. I triggered a rerun. I haven't seen this one fail on CI for a long time. Is it very flaky and frequently fails?

No worries! No idea, am also looking at it but couldn't figure out why it was failing now. Hasn't been flaky recently. I've seen it fail once or twice but not in the last few weeks.

@min-xu-ai
Copy link
Contributor

The rerun is still failing the same way. It starts to look like something changed on CI...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. FSDP + SSD offload
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants