Feature complete Metal FFT #1102
Conversation
```diff
@@ -255,6 +257,96 @@ void AsType::eval_cpu(const std::vector<array>& inputs, array& out) {
   eval(inputs, out);
 }
```
Feels a bit wrong having this as a primitive, but I wasn't sure if there's a better way to do it.
mlx/backend/metal/device.cpp
Outdated
```diff
@@ -357,7 +357,6 @@ MTL::Function* Device::get_function_(
   }

-  mtl_func_consts->release();
```
@jagrit06 I was getting segfaults caused by this release when using function constants, but couldn't figure out the best place in the code to move it to. Any idea where it should fit in?
I think it's a bug to release that, and deleting it is correct. https://github.com/bkaradzic/metal-cpp/blob/metal-cpp_macOS14.2_iOS17.2/README.md#memory-allocation-policy
Very impressive perf! Regarding the design, there is a big style difference from other MLX ops which we should change if possible. Basically you do the dispatch at the op level rather than the Primitive level. I see how this might be easier since you have access to all the ops you need for the different FFT algorithms, but I don't think we should do it this way. The compute graph should be more independent of the implementation details. Also, I don't think the FFT plans themselves should be part of the compute graph (they are an implementation detail). This redesign may require some changes to our existing backend to make it workable for you to use the requisite back-end ops from within the FFT primitive.
That makes sense to me. It did feel a little anti-pattern bloating out the graph, but the MLX API is just really convenient!

We have really bad support for doing stuff on arrays inside primitives (MLX wasn't really designed with that in mind 😓). But I think we can improve it a lot if needed.
Awesome perf and generally very nice work! Kudos!
I left a comment on BluesteinFFTSetup to maybe avoid the double precision math. I think it should be doable, let me know if I am missing something or if it feels too experimental.
```cpp
// In numpy:
// w_k = np.exp(-1j * np.pi / N * (np.arange(-N + 1, N) ** 2))
// w_q = np.fft.fft(1/w_k)
// return w_k, w_q
```
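For context, the `w_k`/`w_q` quantities above are the chirp and its transform in Bluestein's algorithm, which turns an arbitrary-length DFT into a convolution that can be done with power-of-two FFTs. A runnable NumPy sketch of the idea (the helper name `bluestein_fft` and the padding choice are mine, not the kernel's):

```python
import numpy as np

def bluestein_fft(x):
    """Length-N DFT via Bluestein's chirp-z trick (illustrative, not the MLX kernel)."""
    x = np.asarray(x, dtype=complex)
    N = len(x)
    # Chirp over [-N+1, N-1]: exactly the w_k from the kernel comment.
    w_k = np.exp(-1j * np.pi / N * (np.arange(-N + 1, N) ** 2))
    # Pad to a power of two >= 3N - 2 so circular convolution equals linear.
    M = 1 << (3 * N - 3).bit_length()
    a = np.zeros(M, dtype=complex)
    a[:N] = x * w_k[N - 1:]              # x_n * exp(-i*pi*n^2/N)
    b = np.zeros(M, dtype=complex)
    b[:2 * N - 1] = 1 / w_k              # the inverse chirp
    w_q = np.fft.fft(b)                  # precomputable, like w_q above
    c = np.fft.ifft(np.fft.fft(a) * w_q)
    # Post-multiply by the chirp to undo the quadratic phase.
    return w_k[N - 1:] * c[N - 1:2 * N - 1]
```

Because `w_k` depends only on `N`, both it and `w_q` can be computed once per plan, which is why their precision matters here.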
What do you think of section IV.E of https://mc.stanford.edu/cgi-bin/images/7/75/SC08_FFT_on_GPUs.pdf? Would it solve our problem here of avoiding double precision arithmetic?
Fig 6 is very promising :-)
This looks nice! I simplified the double precision part a bit so I think I'm going to keep it for now since it's not really an accuracy or performance bottleneck. Happy to revisit in the future though.
Force-pushed fd1c1a3 to 81096cf
OK, that took a little while, but I think the FFTs are in a reasonable state now.
mlx/fft.cpp
Outdated
```cpp
// GPU scatter for complex64 is NYI
in = scatter(tmp, std::vector<array>{}, in, std::vector<int>{}, Device::cpu);
```
Can we do that with a `slice_update` instead?
That sounds nicer -- I'll update it
mlx/backend/metal/kernels/fft.h
Outdated
```cpp
#include "mlx/backend/metal/kernels/fft/radix.h"
#include "mlx/backend/metal/kernels/fft/readwrite.h"
#include "mlx/backend/metal/kernels/steel/defines.h"
#include "mlx/backend/metal/kernels/utils.h"
```
So this is why you don't need to use the `utils()` in the JIT: it's already included here by the preprocessor.

To keep the JIT source small, it would be better to move the includes that we already have in the JIT out of this file (e.g. `kernels/utils.h`) and use the `utils()` when constructing the JIT source.

You can include `kernels/utils.h` in `fft.metal` before you include `fft.h`. I would just turn off clang formatting for that whole file so it won't mess with the include order.
```cpp
#include <metal_common>

#include "mlx/backend/metal/kernels/utils.h"
```
Note you should also remove the include here.
mlx/backend/metal/kernels/utils.h
Outdated
```cpp
METAL_FUNC float2 complex_mul(float2 a, float2 b) {
  return float2(a.x * b.x - a.y * b.y, a.x * b.y + a.y * b.x);
}

// Complex mul followed by conjugate
METAL_FUNC float2 complex_mul_conj(float2 a, float2 b) {
  return float2(a.x * b.x - a.y * b.y, -a.x * b.y - a.y * b.x);
}

// Compute an FFT twiddle factor
METAL_FUNC float2 get_twiddle(int k, int p) {
  float theta = -2.0f * k * M_PI_F / p;

  float2 twiddle = {metal::fast::cos(theta), metal::fast::sin(theta)};
  return twiddle;
}
```
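As a sanity check on the arithmetic, these three helpers translate directly to Python. This sketch (mine, mirroring the `float2` re/im pairs as tuples) can be compared against Python's built-in complex type:

```python
import cmath
import math

def complex_mul(a, b):
    # (re, im) tuples standing in for Metal's float2
    return (a[0] * b[0] - a[1] * b[1], a[0] * b[1] + a[1] * b[0])

def complex_mul_conj(a, b):
    # complex multiply, then conjugate the result
    return (a[0] * b[0] - a[1] * b[1], -a[0] * b[1] - a[1] * b[0])

def get_twiddle(k, p):
    # k-th twiddle factor of a length-p FFT: exp(-2*pi*i*k/p)
    theta = -2.0 * k * math.pi / p
    return (math.cos(theta), math.sin(theta))
```

For example, `complex_mul((1, 2), (3, 4))` matches `(1+2j)*(3+4j)`, and `get_twiddle(k, p)` matches `cmath.exp(-2j*cmath.pi*k/p)` up to the precision of Metal's `fast::` trig.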
If the only reason you are using `utils.h` is for these, it might be cleaner to just put them in `fft.h` instead. I think they also just fit better in `fft.h` if that works; we have `complex64_t`, which should be used in general for complex muls.
Agreed, it's definitely a bit confusing otherwise. I've removed the `utils.h` include.
🚀 🚀
Proposed changes
A feature complete GPU FFT implementation in Metal.
Supports `n < 2^20` for:

- `fft`, `ifft`, `rfft`, `irfft`
- `fft2`, `ifft2`, `rfft2`, `irfft2`
- `fftn`, `ifftn`, `rfftn`, `irfftn`
Algorithms

- Stockham's algorithm for `n` where all prime factors `p` have `2 <= p <= 13`.
- Rader's algorithm for `n` with one prime factor `p > 13`, where `p - 1` can be computed via Stockham.
- Bluestein's algorithm for all other `n`.
- A four step FFT for `n > 4096`, when the FFT can no longer be done purely in GPU shared memory.

Performance
For `2 <= n < 512`, 1D complex-to-complex FFTs on my M1 Max, the average bandwidths are: [benchmark chart omitted]

So this implementation is about 2.3x faster than MPS on average, and about 27x faster than CPU MLX, which uses `pocketfft`.

This implementation specializes for different values of `n` with Metal function constants, so it will have more overhead than MPS on the first call for new Stockham/Rader sizes.
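For readers unfamiliar with the Stockham formulation used here: unlike plain Cooley-Tukey, Stockham folds the reordering into every stage by ping-ponging between two buffers, so no bit-reversal pass is needed, which is part of what makes it a good fit for GPU shared memory. A minimal radix-2 NumPy sketch of the stage structure (mine, purely illustrative, not the Metal kernel):

```python
import numpy as np

def stockham_fft(x):
    """Radix-2 Stockham autosort FFT; len(x) must be a power of two."""
    x = np.asarray(x, dtype=complex).copy()
    N = len(x)
    y = np.empty_like(x)
    n, s = N, 1          # n: current sub-transform length, s: stride
    while n >= 2:
        m = n // 2
        for p in range(m):
            w = np.exp(-2j * np.pi * p / n)   # stage twiddle (cf. get_twiddle)
            for q in range(s):
                a = x[q + s * p]
                b = x[q + s * (p + m)]
                # Butterfly writes to permuted positions, so the output of
                # the final stage is already in natural order.
                y[q + s * (2 * p)] = a + b
                y[q + s * (2 * p + 1)] = (a - b) * w
        x, y = y, x      # ping-pong buffers between stages
        n, s = m, 2 * s
    return x
```

In the Metal kernel the per-stage twiddles come from `get_twiddle` and the butterflies run across threadgroups, but the buffer ping-pong and implicit reordering are the same idea.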