
Cagra memory optimizations #1790

Merged: 15 commits merged into rapidsai:branch-23.10 on Sep 10, 2023
Conversation

@benfred (Member) commented Aug 30, 2023

When trying to build a CAGRA index with 500M embeddings, we were running out of memory, even when using managed memory.

This PR contains some changes to reduce the memory usage:

  • For certain large matrices, don't make a second copy on the device or host if the memory is already accessible via UVM/ATS/HMM. For instance, we were copying the intermediate graph from host to device memory, and in certain cases (500M dataset, intermediate_graph_degree=128) the intermediate graph alone was 256GB.
  • Don’t create a separate ‘pruned_graph’ host matrix in the optimize call; instead, use the host memory passed in by the caller
  • Free the intermediate graph before creating the index
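The zero-copy idea behind the first bullet can be sketched in plain C++. This is a minimal illustration, not RAFT's actual `device_matrix_view_from_host`: here `std::vector` stands in for a device allocation, and the boolean stands in for a runtime accessibility check (e.g. querying CUDA pointer attributes under UVM/HMM/ATS):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the conditional-copy pattern: alias the caller's
// buffer when it is already accessible from the target, and make a private
// copy only when it is not.
template <typename T>
class maybe_copied_view {
 public:
  maybe_copied_view(const T* src, std::size_t n, bool directly_accessible)
  {
    if (directly_accessible) {
      ptr_ = src;  // zero-copy: reuse the existing buffer as-is
    } else {
      owned_.assign(src, src + n);  // fall back to a second, private copy
      ptr_ = owned_.data();
    }
  }
  const T* view() const { return ptr_; }
  bool owns_copy() const { return !owned_.empty(); }

 private:
  const T* ptr_ = nullptr;
  std::vector<T> owned_;
};
```

In the accessible case the wrapper holds no allocation at all, which is exactly what keeps a 256GB intermediate graph from being duplicated.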

@benfred benfred requested a review from a team as a code owner August 30, 2023 23:40
@github-actions github-actions bot added the cpp label Aug 30, 2023
@benfred benfred added the non-breaking (Non-breaking change) and improvement (Improvement / enhancement to an existing function) labels Aug 30, 2023
template <typename T, typename IdxT>
class device_matrix_view_from_host {
public:
device_matrix_view_from_host(raft::resources const& res, host_matrix_view<T, IdxT> host_view)

cc @wphicks this pattern seems a lot like the mdbuffer to me. The goal here is to make a device_mdspan when the pointer can be accessed from device or copy memory to device when it can't.

Comment on lines 155 to 166
/**
* Utility to sync memory from a host_matrix_view to a device_matrix_view
*
* In certain situations (UVM/HMM/ATS) host memory might be directly accessible on the
* device, and no extra allocations need to be performed. This class checks
* if the host_matrix_view is already accessible on the device, and only creates device
memory and copies the data over if necessary. In memory-limited situations this is preferable
* to having both a host and device copy
*/
template <typename T, typename IdxT>
class device_matrix_view_from_host {
public:
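One plausible way to implement the "is this host pointer already accessible on the device" check described above is `cudaPointerGetAttributes`. This is a hedged sketch, not the code from this PR, using field names from the CUDA 10+ runtime API; note it would not detect plain pageable memory that is device-accessible only via system-level ATS/HMM address translation:

```cuda
#include <cuda_runtime.h>

// Sketch (assumption, not RAFT's implementation): managed (UVM) allocations
// and page-locked host memory mapped into the device address space report a
// non-null devicePointer; plain pageable host memory does not.
inline bool host_ptr_device_accessible(const void* p)
{
  cudaPointerAttributes attr{};
  if (cudaPointerGetAttributes(&attr, p) != cudaSuccess) {
    cudaGetLastError();  // clear the error left by an unregistered pointer
    return false;
  }
  return attr.devicePointer != nullptr || attr.type == cudaMemoryTypeManaged;
}
```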

I'm definitely okay keeping this as an internal utility for now. Could you add a todo to the docs here (and for the host->device conversion function) to use mdbuffer for this once it's available?

@benfred (Member Author) replied:

added a TODO here 6552c66

@cjnolet cjnolet left a comment

LGTM. Thanks @benfred!

cjnolet commented Sep 10, 2023

/merge

@rapids-bot rapids-bot bot merged commit 12480cf into rapidsai:branch-23.10 Sep 10, 2023
57 checks passed