GPUDirect RDMA #18

Open
omor1 opened this issue May 2, 2020 · 5 comments

Comments

@omor1
Member

omor1 commented May 2, 2020

I'm having trouble getting GPUDirect RDMA (GDR) working. My current understanding is that the InfiniBand driver has a plugin that interacts with the CUDA driver and runtime, and that the IB Verbs memory registration/deregistration functions are extended to be aware of GPU memory. From the application's perspective, nothing really needs to change: it can pass a CUDA device pointer to the memory registration function and then use it for RDMA.

I can't get this to work, though: ibv_post_send segfaults after the sendd/recvd rendezvous. Specifically, the segfault occurs inside the libc memcpy implementation, which suggests that I'm either missing something or GDR isn't set up right.

Vu, do you have any idea what could be going on? I unfortunately only have access to Comet's GPU nodes, so I can't try another platform.

@omor1
Member Author

omor1 commented May 3, 2020

Actually I've discovered that Comet's OFED stack might not support GDR.

@omor1
Member Author

omor1 commented May 3, 2020

Actually, scratch that, this is on me. Small messages are (rightly) sent inline, which means a memcpy, but that obviously doesn't work if the memory is on a CUDA device. I can add a check in my code to ensure the pointer isn't on a CUDA device.

To be honest, does it make sense to get rid of this check in lc_server_rma_rtr? I think that this path is only taken by direct-send, meaning that the user wants to do RDMA. If an inline send were wanted, the user would have used immediate-send...

@danghvu
Member

danghvu commented May 11, 2020

I don't completely understand the issue. Can you answer these questions:

  1. Have you tried bare-metal GPUDirect and confirmed that it works on the cluster?

  2. Are you talking about this line for the inline check: https://github.com/uiuc-hpc/LC/blob/51ef5280a5cc5a8b7e23501d6fc273ce5f0d8b28/src/include/server/server_ibv.h#L367 ? This is purely a performance consideration, since registration is expensive. If it needs to work with GPUDirect, we may want to annotate the buffer somehow.

@omor1
Member Author

omor1 commented May 11, 2020

  1. Have you tried the bare-metal GPUDirect and confirm it works on the cluster?

As long as I make sure to send a buffer larger than the s->max_inline parameter above, GPUDirect RDMA works.

  2. Are you talking about this line for inline check:

Yes. My thought is that if someone is choosing to use the direct communication type, they are intentionally opting into RDMA—we shouldn't hide this inline decision from them.

The workaround for GPUDirect RDMA is to a) detect if the buffer is in GPU memory and b) if so, either ignore the inline check or first copy it to host memory.
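The two-step workaround above can be sketched as a small path-selection helper. This is only an illustration, not LC's code: `send_path_t`, `choose_send_path`, the `stage_small_gpu` policy flag, and the stand-in predicate `fake_is_device_ptr` are all hypothetical names. In a CUDA build, the predicate could plausibly wrap `cudaPointerGetAttributes`; here it is stubbed so the sketch compiles without CUDA.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; none of these are part of LC's API. */
typedef enum {
    SEND_PATH_INLINE,        /* CPU copies the payload into the work request */
    SEND_PATH_REGISTERED,    /* register the buffer; the NIC reads it via DMA */
    SEND_PATH_STAGE_TO_HOST  /* copy GPU data to host memory first, then send */
} send_path_t;

/* Caller-supplied predicate: true if ptr refers to GPU memory. */
typedef bool (*is_device_ptr_fn)(const void *ptr);

/* Stand-in predicate (assumption): one fake address range plays the role
 * of GPU memory so the sketch can run on a machine without CUDA. */
static char fake_gpu_heap[64];

static bool fake_is_device_ptr(const void *ptr)
{
    uintptr_t p = (uintptr_t)ptr;
    uintptr_t lo = (uintptr_t)fake_gpu_heap;
    return p >= lo && p < lo + sizeof fake_gpu_heap;
}

static send_path_t choose_send_path(const void *buf, size_t len,
                                    size_t max_inline, bool stage_small_gpu,
                                    is_device_ptr_fn is_device_ptr)
{
    assert(buf != NULL && is_device_ptr != NULL);

    /* Large messages never go inline, regardless of where they live. */
    if (len > max_inline)
        return SEND_PATH_REGISTERED;

    /* Small host buffers keep the cheap inline path. */
    if (!is_device_ptr(buf))
        return SEND_PATH_INLINE;

    /* Small GPU buffers: an inline send would memcpy from a device
     * pointer on the CPU, so either stage to host or register anyway. */
    return stage_small_gpu ? SEND_PATH_STAGE_TO_HOST : SEND_PATH_REGISTERED;
}
```

Passing `stage_small_gpu = false` corresponds to option (b) "ignore the inline check", and `true` corresponds to copying to host memory first. Which is faster for small GPU messages is not something this thread establishes.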

@danghvu
Member

danghvu commented May 11, 2020

I get your point, though this is purely an implementation choice, since the interface does not (yet) say whether a registration is to be performed. The fact is that the buffer is small, so you don't need to register it before you send. You may, but it wastes cycles.

Maybe a better choice would be to require the user to register the buffer with the runtime first; then we just get the lkey from the user or from a registration table.

The fix can be as simple as adding a condition: if it is a GPU buffer, register it anyway.
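One way to picture the registration-table idea is a lookup from an address range to its lkey. The sketch below is an assumption about how such a table might look, not something LC provides: `reg_table_t`, `reg_table_add`, and `reg_table_lookup` are hypothetical, the table is a fixed-size array for brevity, and the lkey values would in practice come from whatever registration call the runtime uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define REG_TABLE_CAP 16  /* arbitrary capacity for this sketch */

/* One registered region: [base, base + len) and its local key. */
typedef struct {
    uintptr_t base;
    size_t    len;
    uint32_t  lkey;
} reg_entry_t;

typedef struct {
    reg_entry_t entries[REG_TABLE_CAP];
    size_t      count;
} reg_table_t;

/* Record a region the user has already registered. Returns false if the
 * table is full or the region is empty. */
static bool reg_table_add(reg_table_t *t, const void *base, size_t len,
                          uint32_t lkey)
{
    assert(t != NULL);
    if (t->count == REG_TABLE_CAP || len == 0)
        return false;
    t->entries[t->count++] = (reg_entry_t){ (uintptr_t)base, len, lkey };
    return true;
}

/* Find the lkey of a registered region that fully contains [buf, buf + len).
 * Returns false if no such region exists, in which case the caller would
 * have to register the buffer (or reject the send). */
static bool reg_table_lookup(const reg_table_t *t, const void *buf, size_t len,
                             uint32_t *lkey_out)
{
    assert(t != NULL && lkey_out != NULL);
    uintptr_t p = (uintptr_t)buf;
    for (size_t i = 0; i < t->count; i++) {
        const reg_entry_t *e = &t->entries[i];
        /* Written to avoid overflow: offset into region + len must fit. */
        if (p >= e->base && len <= e->len && p - e->base <= e->len - len) {
            *lkey_out = e->lkey;
            return true;
        }
    }
    return false;
}
```

With a table like this, the send path would not need to know whether the memory is on the host or the GPU: a hit means the buffer is already registered and can be used for RDMA directly. A linear scan is used only to keep the sketch short.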


2 participants