RVV/R5V implementation and optimization #371

Open
ken-unger opened this issue Oct 8, 2024 · 1 comment


@ken-unger

Hi @rdolbeau

We’ve recently been using your RISC-V-optimized FFTW3 implementation, found at https://github.com/rdolbeau/fftw3/tree/riscv-v-clean.

I’m interested in your assessment of the opportunity for further optimization of the vector implementation within simd-support/simd-r5v.h. Is there additional work that might make sense here, or do you believe you have achieved most of the gains? For example, I see that lmul=1 (m1) is used whereas I would have anticipated a higher lmul might be possible. Or perhaps that has been explored already and rejected.

Any thoughts appreciated.

Thanks,
Ken

@rdolbeau
Contributor

rdolbeau commented Oct 8, 2024

FFTs (including FFTW3's codelets) tend to require a lot of registers unless they are quite small. Anything that increases register pressure, including split mode and larger LMUL (which is effectively equivalent to unrolling for this use case), is likely to cause excessive register spilling, which will degrade performance. Also, some instructions perform poorly with larger LMUL: vrgather in particular is likely to misbehave in many implementations (e.g. its execution time is quadratic in LMUL on the SpacemiT K1).
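To put rough numbers on the register-pressure argument, here is a minimal sketch. It relies on one fact from the RVV specification (32 architectural vector registers, used in aligned groups of LMUL registers) and on two illustrative assumptions: a relative vrgather cost that grows as LMUL squared, mirroring the K1 observation above, and a naive spill estimate. The function names are hypothetical and none of this is code from simd-r5v.h.

```c
#include <assert.h>

/* RVV has 32 architectural vector registers (v0..v31). With LMUL = m,
 * they are used in aligned groups of m, so only 32/m operands can be named. */
enum { RVV_NUM_VREGS = 32 };

/* Number of addressable register groups for an integer LMUL of 1, 2, 4 or 8. */
int rvv_register_groups(int lmul)
{
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return RVV_NUM_VREGS / lmul;
}

/* ASSUMPTION: relative vrgather cost modeled as LMUL^2, to illustrate the
 * quadratic behavior reported for the K1. Not a measured figure. */
int vrgather_relative_cost(int lmul)
{
    assert(lmul == 1 || lmul == 2 || lmul == 4 || lmul == 8);
    return lmul * lmul;
}

/* ASSUMPTION: naive estimate of how many simultaneously live vector values
 * cannot be kept in registers and would have to be spilled to memory. */
int estimated_spills(int live_values, int lmul)
{
    int groups = rvv_register_groups(lmul);
    return live_values > groups ? live_values - groups : 0;
}
```

Under this model, a codelet that keeps 20 vector values live fits in registers at LMUL=1 (32 groups) but would spill 4 values at LMUL=2 (16 groups), while the modeled vrgather cost goes from 1 to 4.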

Nonetheless, there's certainly room for improvement (it has only been tested on my K1), but I'd say the compilers should currently be the primary target. I'm attaching a subset of benchmarks on the K1, with a few remarks:

(a) gcc (blue/cyan) is better than LLVM (red/orange) overall
(b) gcc is better when using the 'interleaved' representation ('r5v') than when using the 'split' representation ('r5vsplit')
(c) LLVM is better when using the 'split' representation than when using the 'interleaved' one
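For readers unfamiliar with the two representations, here is a minimal scalar sketch of a complex multiply in each layout. It assumes 'interleaved' means real and imaginary parts alternate within one array (or vector register), and 'split' means they are held separately. The function names are hypothetical and this is not code from simd-r5v.h.

```c
#include <assert.h>
#include <stddef.h>

/* Interleaved layout: re0, im0, re1, im1, ... in one array of 2*n doubles.
 * A vector version has to permute lanes to pair each real part with its
 * imaginary part. */
void cmul_interleaved(size_t n, const double *a, const double *b, double *out)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; i++) {
        double ar = a[2 * i], ai = a[2 * i + 1];
        double br = b[2 * i], bi = b[2 * i + 1];
        out[2 * i]     = ar * br - ai * bi;  /* real part */
        out[2 * i + 1] = ar * bi + ai * br;  /* imaginary part */
    }
}

/* Split layout: real and imaginary parts live in separate arrays.
 * A vector version needs no lane permutation, but every complex value
 * occupies two vector registers instead of one. */
void cmul_split(size_t n,
                const double *ar, const double *ai,
                const double *br, const double *bi,
                double *outr, double *outi)
{
    assert(ar && ai && br && bi && outr && outi);
    for (size_t i = 0; i < n; i++) {
        outr[i] = ar[i] * br[i] - ai[i] * bi[i];
        outi[i] = ar[i] * bi[i] + ai[i] * br[i];
    }
}
```

The trade-off this is meant to show: the interleaved form tends to need shuffles (such as vrgather) in vector code, whereas the split form avoids them at the cost of more live registers, so which one wins can depend heavily on how well the compiler allocates registers.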

The software stack should behave more consistently before much more tuning is done on the library itself. This is gcc-14 and clang-18 from a recent Debian Sid, so outdated compilers are not the issue (though LLVM moves quickly). A quick glance at the ASM suggests that excessive spillage is already an issue in some of the slower variants. Under a similar test setup, the SVE implementation is much more stable in behavior across compilers.

doit.ALL.double.1d.drxx.p2.pdf

Edit: it was clang-18, not clang-16; the numbers were produced in mid-August '24, when 18 was still current.
