RVV/R5V implementation and optimization #371

ken-unger · 2024-10-08T05:50:49Z

We’ve recently been using your RISC-V optimized FFTW3 implementation, found at https://github.com/rdolbeau/fftw3/tree/riscv-v-clean

I’m interested in your assessment of the opportunity for further optimization of the vector implementation within simd-support/simd-r5v.h. Is there additional work that might make sense here, or do you believe you have achieved most of the gains? For example, I see that lmul=1 (m1) is used whereas I would have anticipated a higher lmul might be possible. Or perhaps that has been explored already and rejected.

Any thoughts appreciated.

Thanks,
Ken

rdolbeau · 2024-10-08T09:31:16Z

FFTs (including FFTW3's codelets) tend to require a lot of registers unless they are quite small. Anything that increases register pressure, including split mode and larger LMUL (which is effectively equivalent to unrolling for this use case), is likely to cause excessive register spilling, which will degrade performance. Also, some instructions do not perform well with larger LMUL: vrgather in particular is likely to misbehave in many implementations (e.g. it takes time quadratic in LMUL on the SpacemiT K1).
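As a rough illustration of the register-pressure argument, here is a minimal C sketch. It relies on two things: RVV has 32 architectural vector registers, and an integer LMUL of m groups m of them per operand. The function names are hypothetical, and the vrgather figure is only a toy model of the "quadratic in LMUL" behaviour mentioned above, not a measurement.

```c
#include <assert.h>

/* RVV defines 32 architectural vector registers (v0..v31). */
enum { RVV_NUM_VREGS = 32 };

/* Number of independent register groups for an integer LMUL (1, 2, 4 or 8).
   Each operand occupies LMUL consecutive registers, so fewer operands can
   be live at once as LMUL grows. */
int rvv_register_groups(int lmul)
{
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return RVV_NUM_VREGS / lmul;
}

/* Toy cost model (an assumption, not a measurement): if a single vrgather
   takes time proportional to LMUL^2 while handling LMUL registers' worth
   of elements, the relative cost per element grows linearly with LMUL. */
int vrgather_relative_cost_per_element(int lmul)
{
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    int cost_per_instruction = lmul * lmul; /* quadratic in LMUL */
    return cost_per_instruction / lmul;     /* work done scales with LMUL */
}
```

With m1 a codelet can keep up to 32 vector values live. With m4 that drops to 8 register groups, which is where spilling becomes likely for anything but small transforms.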

Nonetheless, there's certainly room for improvement (it has only been tested on my K1), but I'd say that currently the compilers should be the primary target. I'm attaching a subset of benchmarks on the K1, with a few remarks:

(a) gcc (blue/cyan) is better than LLVM (red/orange) overall
(b) gcc is better when using the 'interleaved' representation ('r5v') than when using the 'split' representation ('r5vsplit')
(c) LLVM is better when using the 'split' representation than when using the 'interleaved' one
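For readers unfamiliar with the two representations, the following scalar C sketch shows the difference in data layout using a complex multiply. It is only an illustration under the usual meaning of the terms (interleaved stores re/im pairs contiguously, split keeps real and imaginary parts in separate arrays). It is not the code in simd-r5v.h, and the function names are hypothetical.

```c
#include <stddef.h>

/* Interleaved layout: x[2*i] is the real part and x[2*i + 1] the imaginary
   part of element i. A vectorized version typically needs to permute
   lanes to pair real and imaginary parts. */
void cmul_interleaved(const double *a, const double *b, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        double ar = a[2 * i], ai = a[2 * i + 1];
        double br = b[2 * i], bi = b[2 * i + 1];
        out[2 * i]     = ar * br - ai * bi;
        out[2 * i + 1] = ar * bi + ai * br;
    }
}

/* Split layout: real and imaginary parts live in separate arrays. No lane
   permutation is needed, but every complex operand now occupies two
   vectors instead of one. */
void cmul_split(const double *ar, const double *ai,
                const double *br, const double *bi,
                double *outr, double *outi, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        outr[i] = ar[i] * br[i] - ai[i] * bi[i];
        outi[i] = ar[i] * bi[i] + ai[i] * br[i];
    }
}
```

The trade-off is the one described earlier in this comment: split avoids shuffles but keeps more values live at once, so which variant wins depends heavily on how well the compiler allocates registers.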

The software stack should behave more consistently before much more tuning is done on the library itself. This is gcc-14 and clang-18 from a recent Debian Sid, so it's not an issue of outdated compilers (though LLVM moves quickly). A quick glance at the ASM suggests that excessive spilling is already an issue in some of the slower variants. Under a similar test setup, the SVE implementation behaves much more consistently across compilers.

doit.ALL.double.1d.drxx.p2.pdf

Edit: it was clang-18, not -16; the numbers were produced in mid-August '24, when 18 was still current.
