GPU support with CuPy #37

Open: wants to merge 18 commits into master

Conversation

brownbaerchen

As discussed in #14, GPU support can be achieved relatively easily by swapping numpy for cupy in a number of places. However, the communication is a bit tricky, because even CUDA-aware MPI is not as streamlined with GPU data as with CPU data. On JUWELS Booster, calling Alltoallw from mpi4py directly leads to many small individual send/receive operations with copies between host and device. This is very slow, to the point that the CPU implementation is faster. This may be specific to the Jülich machines; I cannot test elsewhere.
A remedy was to implement a custom Alltoallw replacement. However, I am not an expert on this and chose a fairly simple scheme which roughly follows this. It gives decent weak scaling but poor strong scaling. In particular, MPI requires synchronisation between device and host after the send buffer has been prepared, which makes strong scaling inherently difficult.
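To make that concrete, the scheme is roughly of the following shape (a hypothetical sketch, not the code in this PR; `send_chunks` and `recv_chunks` stand in for the per-rank pieces of the buffers, and a CUDA-aware MPI build is assumed):

```python
from mpi4py import MPI
import cupy as cp

def pairwise_alltoallw(comm, send_chunks, recv_chunks):
    """Exchange one contiguous chunk with every other rank, one pair per step."""
    rank, size = comm.Get_rank(), comm.Get_size()
    # The host has to wait until the device is done filling the send buffers
    # before handing them to MPI -- this is the synchronisation point that
    # makes strong scaling hard.
    cp.cuda.runtime.deviceSynchronize()
    for shift in range(size):
        dest = (rank + shift) % size
        source = (rank - shift) % size
        comm.Sendrecv(send_chunks[dest], dest=dest,
                      recvbuf=recv_chunks[source], source=source)
```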
I also played around with NCCL for Alltoallw using the same simple communication scheme (NCCL doesn't have Alltoallw). NCCL allows forgoing a lot of synchronisation and strong-scales better. However, I am a beginner with GPUs, and my feeling is that an expert could point out a much better communication scheme with NCCL, possibly with no explicit synchronisation at all. Below is a plot of strong scaling, compared to NVIDIA's cuFFTMp:

[Plot: strong scaling of the mpi4py-fft + CuPy implementation (different communication backends) vs. NVIDIA's cuFFTMp]

Note that cuFFTMp does not have easy-to-use Python bindings, and that data was generated without Python at all. Because it uses NVSHMEM, it can do without much synchronisation and strong-scales really well. The orange lines use the mpi4py-fft plus CuPy version in this PR with different communication backends.

I am unfortunately not an expert on GPUs or FFTs. I am sure there is plenty of room for improvement in both the communication and maybe also the calls to the CuPy FFT functions. For more details, please see the discussion in #14. Any help is appreciated!
Please point out issues with my programming and inconsistencies with your conventions as well!

@brownbaerchen
Author

I was experimenting with CUDA graphs and it seems to work really well as the following plots show:
[Plots: scaling results with CUDA graphs]

I record separate graphs for forward and backward operations for each transfer class. Once these are recorded, they seem very efficient. Keep in mind that the measured times do not include recording the graph, but this overhead should not be a large concern in practice.
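For reference, the capture-and-replay pattern looks roughly like this (a minimal sketch with placeholder shapes, not the actual code in this PR):

```python
import cupy as cp

a = cp.random.random((64, 64, 64)).astype(complex)
out = cp.empty_like(a)

# Warm-up call outside the capture so the cuFFT plan and the memory-pool
# blocks already exist; creating them during capture can fail.
out[...] = cp.fft.fftn(a, axes=(0,))

stream = cp.cuda.Stream(non_blocking=True)
with stream:
    stream.begin_capture()
    out[...] = cp.fft.fftn(a, axes=(0,))  # recorded, not executed yet
    graph = stream.end_capture()

# Replaying the graph re-launches the recorded kernels with very little
# launch overhead; recording it is the one-time cost mentioned above.
graph.launch(stream)
stream.synchronize()
```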

I ran the unit tests, and I always make sure that the error after transforming back and forth is on the order of machine precision. So, if you don't find any more errors, this implementation works well enough for my needs.
I am not entirely happy with all my changes. In particular, the implementation of copyto in the wrappers of the FFT backends is a bit messy. If you have a better idea for this, please tell me. Also, I am sure you can point out a lot more places where the code could be neater.

I am not sure whether it is worth keeping the CustomMPI communication backend. Since the NCCL-plus-graphs version is so much better, I can remove it if you prefer.
The cupyx-scipy backend for serial FFTs is also somewhat unnecessary, as it behaves the same as the cupy one.

Please let me know all the necessary steps to merge this PR regarding testing, code style, ...
I have to say that the library is really well written and approachable! I want to degrade that as little as possible.

@brownbaerchen brownbaerchen marked this pull request as ready for review February 29, 2024 13:51
@brownbaerchen
Author

I want to comment a bit more on Alltoallw, since I anticipate a comment on this. To quote the paper you wrote to accompany mpi4py-fft:

MPI_ALLTOALL(V) works on contiguous data arrays in both ends, both for sending and receiving ranks, and there are highly optimised versions available on several architectures. On the contrary, the subarray type used by MPI_ALLTOALLW is in general discontiguous, and there are to the authors' knowledge no architecture-specific optimizations available.

Indeed, I observe much better performance with Alltoall(v) compared to Alltoallw on GPUs, presumably due to the lack of architecture-specific optimisation. I wrote a small script for measuring the performance on contiguous arrays with a size divisible by the number of processes, such that I can use either communication method. The results are as follows:
[Plot: cost of the Alltoall variants vs. data size]
At the largest data size, which is close to maxing out the memory of the GPUs, there is more than two orders of magnitude difference between Alltoall(v) and Alltoallw. Clearly, this difference is larger than the cost of first copying to a contiguous buffer.
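The measurement is roughly of this form (a sketch, not the exact script; `N` is a placeholder size and a CUDA-aware MPI build is assumed):

```python
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

N = 1 << 20              # placeholder size, divisible by the number of ranks
chunk = N // size

send = cp.arange(N, dtype='f8')
recv = cp.empty_like(send)

def timeit(fn, repeat=5):
    cp.cuda.runtime.deviceSynchronize()
    comm.Barrier()
    t0 = MPI.Wtime()
    for _ in range(repeat):
        fn()
    cp.cuda.runtime.deviceSynchronize()
    comm.Barrier()
    return (MPI.Wtime() - t0) / repeat

# Contiguous, equal-sized chunks: plain Alltoall.
t_a2a = timeit(lambda: comm.Alltoall(send, recv))

# The same exchange expressed through Alltoallw (counts in elements,
# displacements in bytes, one datatype per rank).
counts = [chunk] * size
displs = [i * chunk * 8 for i in range(size)]   # 8 bytes per float64
types = [MPI.DOUBLE] * size
t_a2aw = timeit(lambda: comm.Alltoallw([send, counts, displs, types],
                                       [recv, counts, displs, types]))

if rank == 0:
    print(f"Alltoall : {t_a2a:.3e} s")
    print(f"Alltoallw: {t_a2aw:.3e} s")
```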

I write this as a justification of the communication backend I implemented with NCCL. It is neither MPI, nor does it avoid the intermediate step of copying to contiguous buffers that Alltoallw would avoid. I can see that you may reasonably object to adding features to your code that don't fit the premise in the name or in the accompanying paper. However, this is really crucial for good performance on GPUs on my specific machine, and I hope you agree that the roughly 10x speedup of GPUs vs. CPUs that I measured for FFTs on a given number of compute nodes is worth it.

@mikaem
Member

mikaem commented Apr 8, 2024

Hi.

This looks great, but unfortunately I don't know how to test this myself. I'm on a Mac, and as far as I can tell cupy is not supported there.

However, this PR makes cupy a hard dependency, and that is not ok. You need to hide the cupy imports such that regular usage of mpi4py-fft does not break for anyone without cupy (like myself).
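One common pattern for this (just a sketch of the idea, not a prescription for how it has to look in this PR) is to defer the import and only fail when the cupy backend is actually requested:

```python
import numpy as np

try:
    import cupy as cp
except ImportError:      # cupy stays an optional dependency
    cp = None

def get_array_module(backend='numpy'):
    """Return the array module for the requested backend (hypothetical helper)."""
    if backend == 'cupy':
        if cp is None:
            raise ImportError("backend='cupy' requires CuPy to be installed")
        return cp
    return np
```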

@brownbaerchen
Copy link
Author

By the way, I talked to a member of the MPI Forum and OpenMPI developer at a conference and showed him the plots with the poor Alltoallw performance. He was not surprised at all. Apparently, Alltoallw on GPUs is very low priority for them, so there is no point in waiting for this.
Also, I don't think that MPI will become stream-aware anytime soon, which means explicit synchronisation between host and device is required before every MPI operation to make sure the buffers are ready. Instead of submitting a communication kernel to a stream on the GPU, MPI just does the communication directly. This synchronisation means we cannot submit the next computation kernel before communication has finished, lowering throughput on the GPU. This kills strong scaling.
Hence, even if a performant Alltoallw becomes available in MPI implementations, it still cannot compete with NCCL. NCCL does submit the communication to a stream and does not require the synchronisation.
The reason the Forum doesn't want stream awareness in the standard, as far as I know, is that they don't want any vendor-specific stuff in there...
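For illustration, a stream-ordered exchange with CuPy's NCCL bindings looks roughly like this (a sketch, not the exact code in this PR; `comm` is an already initialised NcclCommunicator and the chunk lists are placeholders):

```python
import cupy as cp
from cupy.cuda import nccl

def nccl_alltoall(comm, send_chunks, recv_chunks, stream):
    """Enqueue a pairwise exchange of float64 chunks onto `stream`."""
    nccl.groupStart()
    for peer, (sbuf, rbuf) in enumerate(zip(send_chunks, recv_chunks)):
        comm.send(sbuf.data.ptr, sbuf.size, nccl.NCCL_FLOAT64, peer, stream.ptr)
        comm.recv(rbuf.data.ptr, rbuf.size, nccl.NCCL_FLOAT64, peer, stream.ptr)
    nccl.groupEnd()
    # The communication is now ordered on `stream`: kernels launched on the
    # same stream afterwards wait for it on the device, and the host never
    # has to block.
```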

So, while it's not great that this implementation is totally specific to NVIDIA GPUs, especially with the CUDA graphs, I really see no other way to get competitive performance on NVIDIA hardware.

@brownbaerchen
Author

It seems there have been some developments in testing open-source code on GPUs, see here. If I understand correctly, you could apply to this program to have the code tested on GPUs free of charge. Does this sound like an option for you?
