[feature] Add the low level SSD APIs #829
Friendly ping for a review!
This looks good to me, although it's code that I've mostly written :-)
It seems that the PyTorch nightly version is a bit flaky with the new test?
Existing tests are flaky, not the new one:
My bad. I saw "test_offload_memory" and assumed it was a new test. I triggered a rerun. I haven't seen this one fail on CI for a long time. Is it very flaky and frequently fails?
No worries! No idea, am also looking at it but couldn't figure out why it was failing now. Hasn't been flaky recently. I've seen it fail once or twice but not in the last few weeks.
The rerun is still failing the same way. It starts to look like something changed on CI...
What does this PR do?
Adds the low-level SSD APIs required for representing a tensor and for reading it from and writing it to memory or disk. These will be used in a future PR to implement SSD offload with FSDP.
The two main concepts here are SsdTensorHandle and SsdBuffer. The SsdTensorHandle represents a tensor that can be in memory or on disk. There are APIs to enable us to set the right metadata (memory offset, file offset, file params etc.). The SsdBuffer consists of a list of SsdTensorHandles and represents all the parameters in a module (as an example). We should be able to read all tensors to and from disk as part of the SsdBuffer.
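To make the relationship between the two concepts concrete, here is a minimal sketch that uses only the Python standard library. It is not the code in this PR: `ToyTensorHandle`, `ToyBuffer`, and all of their method names are hypothetical, and a flat `array` of floats stands in for a real tensor. It only illustrates the idea of a handle that tracks a file and a file offset, and a buffer that moves all of its handles to and from disk together.

```python
import os
import tempfile
from array import array
from typing import List, Optional

ITEMSIZE = array("f").itemsize  # bytes per stored float


class ToyTensorHandle:
    """Hypothetical stand-in for a tensor that is either in memory or on disk."""

    def __init__(self, num_elements: int) -> None:
        self.num_elements = num_elements
        self.data: Optional[array] = None  # None while the values live on disk
        self.filename: Optional[str] = None
        self.file_offset = 0  # byte offset of this tensor in the backing file

    def set_file_params(self, filename: str, file_offset: int) -> None:
        self.filename = filename
        self.file_offset = file_offset

    def to_file(self) -> None:
        """Write the values at the handle's file offset, then free the memory."""
        if self.data is None:
            return  # already on disk
        with open(self.filename, "r+b") as f:
            f.seek(self.file_offset)
            f.write(self.data.tobytes())
        self.data = None

    def to_tensor(self) -> array:
        """Load the values back from disk if they are not in memory."""
        if self.data is None:
            with open(self.filename, "rb") as f:
                f.seek(self.file_offset)
                raw = f.read(self.num_elements * ITEMSIZE)
            self.data = array("f")
            self.data.frombytes(raw)
        return self.data


class ToyBuffer:
    """Hypothetical container that packs several handles into one file."""

    def __init__(self, filename: str) -> None:
        self.filename = filename
        self.handles: List[ToyTensorHandle] = []
        self.next_offset = 0
        open(filename, "wb").close()  # create an empty backing file

    def insert(self, values: List[float]) -> ToyTensorHandle:
        data = array("f", values)
        handle = ToyTensorHandle(len(data))
        handle.data = data
        handle.set_file_params(self.filename, self.next_offset)
        self.next_offset += len(data) * ITEMSIZE
        self.handles.append(handle)
        return handle

    def to_disk(self) -> None:
        for handle in self.handles:
            handle.to_file()

    def from_disk(self) -> None:
        for handle in self.handles:
            handle.to_tensor()


def demo():
    """Round-trip two small 'parameters' through a temporary file."""
    with tempfile.TemporaryDirectory() as tmp:
        buf = ToyBuffer(os.path.join(tmp, "params.bin"))
        weight = buf.insert([1.0, 2.0, 3.0])
        bias = buf.insert([0.5])
        buf.to_disk()
        offloaded = weight.data is None and bias.data is None
        buf.from_disk()
        return offloaded, list(weight.data), list(bias.data)
```

Calling `demo()` writes both toy parameters into one file at consecutive offsets, drops the in-memory copies, and reads them back. The memory offset mentioned above is left out of this sketch to keep it short.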
Note: This PR is a joint effort between @another-pjohnson and @anj-s.
Before submitting
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.