Romain Dolbeau edited this page Mar 3, 2023 · 17 revisions

What's going on here

On branch arm-sve-clean, masking is always used. --enable-sve creates codelets for 128-, 256-, 512-, 1024- and 2048-bit SIMD; a codelet is only used if the hardware vector length is equal to or larger than the codelet's width. Because the code relies on the Arm C Language Extensions (ACLE) for SVE, it requires ARM HPC Compiler version 19.3 or newer (earlier versions have a minor bug triggered by this code) or GCC 10 or newer.
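
The selection rule above can be sketched as follows. This is a minimal illustration, not the branch's actual dispatch code: hw_vector_bits, sve_codelet_usable and print_usable_codelets are hypothetical helpers. When the file is compiled with SVE enabled (so the ACLE macro __ARM_FEATURE_SVE is defined), the hardware vector length is queried with the ACLE intrinsic svcntb(), which returns the number of bytes in an SVE vector; otherwise the sketch reports 0 and no codelet qualifies.

```c
#include <assert.h>
#include <stdio.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Codelet widths built by --enable-sve, in bits. */
static const unsigned codelet_bits[] = {128, 256, 512, 1024, 2048};

/* Hypothetical helper: hardware SVE vector length in bits, or 0 if this
   file was not compiled with SVE enabled. */
static unsigned hw_vector_bits(void) {
#if defined(__ARM_FEATURE_SVE)
    return (unsigned)svcntb() * 8u; /* svcntb() = bytes per SVE vector */
#else
    return 0u;
#endif
}

/* Hypothetical helper: a codelet is usable only if the hardware vector
   is at least as wide as the codelet. */
static int sve_codelet_usable(unsigned codelet, unsigned hw) {
    /* SVE vector lengths are multiples of 128 bits, up to 2048. */
    assert(codelet % 128u == 0 && codelet >= 128u && codelet <= 2048u);
    return hw >= codelet;
}

/* Print which codelet widths this machine could use. */
static void print_usable_codelets(void) {
    unsigned hw = hw_vector_bits();
    printf("hardware SVE vector length: %u bits\n", hw);
    for (size_t i = 0; i < sizeof codelet_bits / sizeof codelet_bits[0]; i++)
        printf("  %4u-bit codelet: %s\n", codelet_bits[i],
               sve_codelet_usable(codelet_bits[i], hw) ? "usable" : "skipped");
}
```

On a machine with 512-bit SVE, print_usable_codelets() would list the 128-, 256- and 512-bit codelets as usable and skip the 1024- and 2048-bit ones; on a build without SVE enabled, every codelet is reported as skipped.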

Branch riscv-v adds support for the RISC-V cycle counter (for RV64 and RV32) and some basic support for a draft of the 'V' extension using built-in functions, for which you need the EPI (European Processor Initiative) version of LLVM. The source drop of the EPI LLVM for revision 0.8 of the V extension can be found at https://repo.hca.bsc.es/gitlab/rferrer/llvm-epi-0.8 (see https://lists.riscv.org/g/tech-vector-ext/message/24 for more details).
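
Reading the cycle counter differs between RV64 and RV32, which is presumably why both are mentioned. The sketch below shows the usual technique and is not the riscv-v branch's code; join_halves, cycles_between and read_cycles are hypothetical names. On RV64 a single rdcycle returns the full 64-bit count; on RV32 the high half (rdcycleh) is read before and after the low half, retrying if it changed, so a carry between the two reads cannot produce a torn value. Depending on the kernel and its configuration, user-space reads of the cycle counter may trap.

```c
#include <assert.h>
#include <stdint.h>

/* Join the high and low 32-bit halves of a 64-bit counter value. */
static inline uint64_t join_halves(uint32_t hi, uint32_t lo) {
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}

/* Cycles elapsed between two reads. A 64-bit counter at 1 GHz wraps
   only after about 584 years, so end >= start is assumed. */
static inline uint64_t cycles_between(uint64_t start, uint64_t end) {
    assert(end >= start);
    return end - start;
}

#if defined(__riscv) && __riscv_xlen == 64
/* RV64: the whole counter is read with a single rdcycle. */
static inline uint64_t read_cycles(void) {
    uint64_t c;
    __asm__ volatile("rdcycle %0" : "=r"(c));
    return c;
}
#elif defined(__riscv) && __riscv_xlen == 32
/* RV32: read high, low, then high again; if the high half changed,
   the low half wrapped in between, so retry. */
static inline uint64_t read_cycles(void) {
    uint32_t hi, lo, hi2;
    do {
        __asm__ volatile("rdcycleh %0" : "=r"(hi));
        __asm__ volatile("rdcycle %0" : "=r"(lo));
        __asm__ volatile("rdcycleh %0" : "=r"(hi2));
    } while (hi != hi2);
    return join_halves(hi, lo);
}
#endif
```

On a RISC-V target, timing a region would look like uint64_t t0 = read_cycles(); /* work */ uint64_t n = cycles_between(t0, read_cycles());. On other architectures only the portable helpers are compiled.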

Both branches use a generated file, simd-support/vtw.h, which the build does not generate automatically. Before compiling, you must:

  • in simd-support/, compile generate_vtw.c into generate_vtw (e.g. gcc generate_vtw.c -o generate_vtw)
  • in simd-support/, generate vtw.h with ./generate_vtw.sh > vtw.h

SVE configuration

Currently, configure does not add the SVE option to the compiler flags automatically, so when configuring FFTW3 you need to:

  • explicitly enable SVE (and probably NEON)
  • enable a counter for performance evaluation: cntvct (recommended) is always available but its accuracy is sometimes dubious [1], while pmccntr is cycle-accurate but privileged by default (see https://github.com/rdolbeau/enable_arm_pmu)
  • enable SVE in the compiler flags

For instance:

./configure --enable-neon --enable-sve --enable-fma --enable-armv8-cntvct-el0  CFLAGS="-O3 -march=armv8.2-a+sve" CXXFLAGS="-O3 -march=armv8.2-a+sve" FFLAGS="-O3 -march=armv8.2-a+sve"

[1] The counter increments at an implementation-specific rate, which Linux reports in its kernel log, e.g. "arch_timer: cp15 timer(s) running at 54.00MHz (phys)." on a Raspberry Pi 4. The Fujitsu A64FX counter runs at 100 MHz, the Graviton 3 counter at 1050 MHz and the Ampere Altra counter at 25 MHz.
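
To see what these rates mean for timing, the sketch below reads the AArch64 virtual counter (cntvct_el0) and its frequency register (cntfrq_el0), and converts a tick delta to nanoseconds. It is an illustration under the assumption of an AArch64 Linux system where user-space counter access is enabled (the usual default); read_cntvct, read_cntfrq and ticks_to_ns are hypothetical names, not the functions FFTW uses for --enable-armv8-cntvct-el0.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__aarch64__)
/* Virtual counter; the isb keeps the read from being hoisted early. */
static inline uint64_t read_cntvct(void) {
    uint64_t v;
    __asm__ volatile("isb; mrs %0, cntvct_el0" : "=r"(v) : : "memory");
    return v;
}

/* Counter frequency in Hz, as programmed by firmware. */
static inline uint64_t read_cntfrq(void) {
    uint64_t f;
    __asm__ volatile("mrs %0, cntfrq_el0" : "=r"(f));
    return f;
}
#endif

/* Convert a tick delta to nanoseconds for a counter running at freq_hz.
   Whole seconds and the remainder are scaled separately to avoid
   overflowing 64 bits for large tick counts. */
static inline uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq_hz) {
    assert(freq_hz > 0);
    uint64_t whole = ticks / freq_hz;
    uint64_t rem = ticks % freq_hz;
    return whole * 1000000000u + rem * 1000000000u / freq_hz;
}
```

At 54 MHz one tick is about 18.5 ns and at 25 MHz it is 40 ns, so a short codelet timed with cntvct may span only a few ticks; at 1050 MHz a single tick is below one nanosecond and truncates to 0 ns.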

Acknowledgements

This work has partly been done as part of the European Processor Initiative project.

The European Processor Initiative (EPI) (FPA: 800928) has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement EPI-SGA1: 826647.
