Intrinsics: Faster cuberoot scaling functions #557

This patch replaces the intrinsic-based exponent rescaling with explicit bit manipulation of the floating point number. This appears to produce a ~2.5x speedup of the solver, reducing its time from embarassingly slow to disappointingly slow. It is slightly faster than the GNU cbrt function, but still about 3x slower than the Intel SVML cbrt function. Timings (s) (16M array, -O3 -mavx -mfma) | Solver | -O2 | -O3 | |---------------------|-------|-------| | GNU x**1/3 | 0.225 | 0.198 | | GNU cuberoot before | 0.418 | 0.412 | | GNU cuberoot after | 0.208 | 0.187 | | Intel x**1/3 | 0.068 | 0.067 | | Intel before | 0.514 | 0.507 | | Intel after | 0.213 | 0.189 | At least one issue here is that Intel SVML is using fast vectorized logic operators whereas the Fortran intrinsics are replaced with slower legacy scalar versions. Not sure there is much we could even do about that without complaining to vendors. Also, I'm sure there's magic in their solvers which we are not capturing. Regardless, I think this is a major improvement. I do not believe it will change answers, but probably a good idea to verify this and get it in before committing any solutions using cuberoot().

Some modifications were made to the cuberoot rescale and descale functions: * The machine parameters were moved from function to module parameters. This could dangerously expose them to other functions, but it prevents multiple definitions of the same numbers. * The exponent is now cube-rooted in rescale rather than descale. * The exponent expressions are broken into more explicit steps, rather than combining multiple steps and assumptions into a single expression. * The bias is no longer assumed to be a multiple of three. This is true for double precision but not single precision. A new test of quasi-random number was also added to the cuberoot test suite. These numbers were able to detect the differences in GNU and Intel compiler output. A potential error in the return value of the test was also fixed. The volatile test of 1 - 0.5*ULP has been added. The cube root of this value rounds to 1, and needs to be handled carefully. The unit test function `cuberoot(v**3)` was reversed to `cuberoot(v)**`, to include testing of this value. (Cubing would wipe out the anomaly.)

In separate testing, we observed that Intel would use the `pow()` function to evaluate the cubes of some numbers, causing different answers with GNU. In this patch, I replace the cubic x**3 operations with explicit x*x*x multiplication, which appears to avoid this substitution. Well, for the moment, at least.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Intrinsics: Faster cuberoot scaling functions #557

Intrinsics: Faster cuberoot scaling functions #557

Commits on Feb 2, 2024