Intrinsics: Faster cuberoot scaling functions #557

Merged

Conversation

marshallward
Member

This patch replaces the intrinsic-based exponent rescaling with explicit bit manipulation of the floating point number.

This appears to produce a ~2.5x speedup of the solver, reducing its time from embarrassingly slow to disappointingly slow. It is slightly faster than the GNU cbrt function, but still about 3x slower than the Intel SVML cbrt function.
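As a rough sketch of the kind of bit manipulation involved (a minimal, hypothetical module with made-up names, not the code in this patch), the exponent field of a double can be read and overwritten directly:

```fortran
module exponent_bits_sketch
  use, intrinsic :: iso_fortran_env, only : real64, int64
  implicit none

  integer, parameter :: expbits = 11   ! Exponent field width of a double
  integer, parameter :: manbits = 52   ! Mantissa field width of a double
  integer(int64), parameter :: expmask = shiftl(int(z'7FF', int64), manbits)

contains

  !> Biased exponent field of x, equal to EXPONENT(x) + 1022 for normal x.
  pure function raw_exponent(x) result(e)
    real(real64), intent(in) :: x
    integer(int64) :: e
    e = ibits(transfer(x, 1_int64), manbits, expbits)
  end function raw_exponent

  !> Return x with its biased exponent field overwritten by e, leaving the
  !! sign and mantissa bits untouched.
  pure function set_exponent(x, e) result(y)
    real(real64), intent(in) :: x
    integer(int64), intent(in) :: e
    real(real64) :: y
    integer(int64) :: bits

    bits = transfer(x, 1_int64)
    bits = ior(iand(bits, not(expmask)), shiftl(e, manbits))
    y = transfer(bits, 1.0_real64)
  end function set_exponent

end module exponent_bits_sketch
```

TRANSFER, IBITS and the shift intrinsics reduce to plain integer operations, which is presumably where the speedup over the EXPONENT/SCALE/FRACTION intrinsics comes from.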

Timings (s) (16M array, -O3 -mavx -mfma, obviously many details omitted...)

| Solver              |  -O2  |  -O3  |
|---------------------|-------|-------|
| GNU x**1/3          | 0.225 | 0.198 |
| GNU cuberoot before | 0.418 | 0.412 |
| GNU cuberoot after  | 0.208 | 0.187 |
| Intel x**1/3        | 0.068 | 0.067 |
| Intel before        | 0.514 | 0.507 |
| Intel after         | 0.213 | 0.189 |

At least one issue here is that Intel SVML is using fast vectorized logic operators whereas the Fortran intrinsics are replaced with slower legacy scalar versions. Not sure there is much we could even do about that without complaining to vendors.

Also, I'm sure there's magic in their solvers which we are not capturing. Regardless, I think this is a major improvement.

I do not believe it will change answers, but it is probably a good idea to verify this and to get it in before committing any solutions that use cuberoot().


codecov bot commented Jan 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (76f0668) 37.20% compared to head (da9dd8a) 37.22%.

Additional details and impacted files
@@             Coverage Diff              @@
##           dev/gfdl     #557      +/-   ##
============================================
+ Coverage     37.20%   37.22%   +0.02%     
============================================
  Files           271      271              
  Lines         80355    80384      +29     
  Branches      14985    14985              
============================================
+ Hits          29897    29925      +28     
  Misses        44902    44902              
- Partials       5556     5557       +1     


@marshallward
Member Author

Some comments from @Hallberg-NOAA

  • The "machine" FP parameters in rescale and descale could probably be sensibly moved into the module, rather than redefining them in each function.

  • The cube root of the rescaled number in [0.125,1) will be in [0.5,1), so its exponent will always be -1. This will lead to some simplification.

  • Using (2), the new exponent can be computed immediately by rescale, so that descale can be transformed into a general-purpose function (e.g. for a fifth root?). A sketch of (2) and (3) follows this list.
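Expressed with the EXPONENT and SCALE intrinsics purely to show the arithmetic (illustrative names, not the MOM6 routines; the patch itself does the equivalent with bit operations), points (2) and (3) amount to something like:

```fortran
!> Scale x (assumed finite, positive and normal) into asx in [0.125, 1) and
!! return e_div3, one third of the removed exponent, so that
!! cuberoot(x) = cuberoot(asx) * 2**e_div3 with cuberoot(asx) in [0.5, 1).
subroutine rescale(x, asx, e_div3)
  use, intrinsic :: iso_fortran_env, only : real64
  real(real64), intent(in)  :: x
  real(real64), intent(out) :: asx
  integer,      intent(out) :: e_div3

  ! Remove a multiple of 3 from the exponent, leaving asx in [0.125, 1).
  e_div3 = ceiling(real(exponent(x)) / 3.0)
  asx = scale(x, -3*e_div3)
end subroutine rescale

!> Reapply the saved exponent to the root of the rescaled argument.  Nothing
!! here is specific to the cube root, so a fifth-root solver could reuse it.
function descale(root_asx, e_div3) result(root)
  use, intrinsic :: iso_fortran_env, only : real64
  real(real64), intent(in) :: root_asx
  integer,      intent(in) :: e_div3
  real(real64) :: root

  root = scale(root_asx, e_div3)
end function descale
```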

@marshallward
Member Author

I added two commits which address the requested changes:

  • The first refactors the rescale/descale functions so that rescale manages the cuberoot of the number's exponent, and descale is now a generic exponent substitution.

  • The second commit addresses an observed difference between GNU and Intel, where Intel would use pow() from its SVML library to evaluate x**3. Replacing this with explicit x*x*x multiplication appears to prevent that substitution (see the sketch below).

The second will probably change answers (again), and it does add some more weird variables to hold the cube values, so we can decide if it's really worth the fight.
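A hedged sketch of the second change, as a fragment with illustrative names (a generic Halley iteration for the cube root of `a`, not the MOM6 solver):

```fortran
! One Halley step toward root ~ a**(1/3).  The cube is formed by explicit
! multiplication so the compiler has no root**3 expression to lower to an
! SVML pow() call.
root_cubed = root * root * root
root = root * (root_cubed + 2.0*a) / (2.0*root_cubed + a)
```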

@marshallward
Member Author

marshallward commented Jan 30, 2024

I forgot to add the test case that we talked about. I also added @Hallberg-NOAA as a coauthor to the rescale refactor commit.

@marshallward
Member Author

Good thing I added those tests... looking into it now.

@marshallward
Member Author

marshallward commented Jan 31, 2024

This seems to be passing now, with no regression on GitHub. (This is consistent, since GNU seems to always evaluate x**3 as x*x*x.)

@Hallberg-NOAA
Member

I think that we need to add one more test of the cuberoot function: x = (1.0 - EPSILON(1.0)), which fits neatly in the range 0.125 <= x < 1.0 but which would be the hardest test of the assumption that 0.5 <= cuberoot(x) < 1.0, and hence that the root has an exponent of -1. Because mathematically (1-eps)**3 = 1 - 3*eps + 3*eps**2 - eps**3 ~= 1 - 3*eps, I suspect that the best answer for any iterative solver of cuberoot(1.0-epsilon(1.0)) would be 1.0, which of course has an exponent of 0. If so, we might need to revisit line 177 of descale().
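To spell out the arithmetic behind this concern, with $\varepsilon = \mathrm{EPSILON}(1.0) = 2^{-52}$:

$$(1-\varepsilon)^{1/3} = 1 - \tfrac{\varepsilon}{3} - \tfrac{\varepsilon^{2}}{9} - \cdots \approx 1 - \tfrac{\varepsilon}{3},$$

which lies between the two adjacent representable doubles $1 - \tfrac{\varepsilon}{2}$ and $1.0$; if the solver's result rounds up to $1.0$, its exponent is 0 rather than the assumed -1.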

@marshallward force-pushed the faster_cuberoot_rescale branch 2 times, most recently from be4f2c2 to cc36692 on January 31, 2024
Hallberg-NOAA previously approved these changes Feb 1, 2024

@Hallberg-NOAA
Member

I have examined these changes, and (to the extent that I can follow all of the clever bit-manipulations) I am convinced that they are correct. Moreover, I think that we now have pretty thorough testing of this new capability that would detect anything that is not working as intended. I am therefore approving this code.

I am still a little nervous about our assumption that we will have 0.5 <= root_asx < 1.0 on line 112, and whether there is a tiny (of order 1 in 10^16) chance that we would actually generate some case where root_asx == 1.0, in which case our assumption that the base-2 exponent of root_asx is always -1 would be wrong, and we do not have any code to handle this case. Adding such a test should not be too hard (a previous version of the code did include the extra test), but it would make this new routine slower. I think that we have done our best to test for this, and it appears not to occur, but I am still nervous anyway.

@marshallward
Member Author

FWIW, there is an opportunity to write the exponent as e/3 + 1 - e_x/3, which may address the problem you are concerned about: if it somehow got bumped from -1 to zero, then that would carry over to the exponent.

@marshallward
Member Author

marshallward commented Feb 1, 2024

You were right: the value 0.99999999999999989 (0x3FEFFFFFFFFFFFFF) incorrectly returns 0.5 for all of my solvers. epsilon() for 1.0 is double epsilon() for 0.5, so there was still a "gap"; if that gap is removed, the root rounds to 1, which incorrectly gets morphed to 0.5.

Edit: we can write this value in the test as 1.0 - 0.5*epsilon(1.). As for how to fix it... pending.
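For reference, a small standalone check of this value (the printed digits and the final rounding are compiler- and library-dependent, so treat the expected outputs as approximate):

```fortran
program near_one
  use, intrinsic :: iso_fortran_env, only : real64
  implicit none
  real(real64) :: x

  ! The largest double below 1, i.e. 1 - 0.5*ULP, with bits 0x3FEFFFFFFFFFFFFF.
  x = 1.0_real64 - 0.5_real64*epsilon(1.0_real64)
  print *, x                                      ! ~0.99999999999999989
  print *, x == nearest(1.0_real64, -1.0_real64)  ! T

  ! Its true cube root is ~1 - ULP/6, whose nearest double is exactly 1.0, so
  ! a descale step that assumes the root's exponent is -1 will halve it to 0.5.
  print *, x**(1.0_real64/3.0_real64)             ! 1.0 with an accurate power
end program near_one
```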

@Hallberg-NOAA dismissed their stale review on February 1, 2024

I am withdrawing the approval in light of the further investigations showing that the current code is not quite right for specific values very near to 1.

@marshallward
Member Author

I've addressed the 1 - 0.5*ULP problem by going back to the old method and reading the exponent of the scaled result. Maybe silly for one value, but it's still 2.5x faster than the current implementation.

I also changed the unit test so that it's actually capable of detecting this problem.
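A hedged sketch of what the fix amounts to, reusing the hypothetical raw_exponent/set_exponent helpers sketched earlier (the names and interface are illustrative, not the merged code):

```fortran
!> Descale the solver's root by combining its own exponent field with the
!! exponent third saved by rescale, rather than assuming that field encodes -1.
function descale(root_asx, e_div3) result(root)
  use exponent_bits_sketch, only : raw_exponent, set_exponent
  use, intrinsic :: iso_fortran_env, only : real64
  real(real64), intent(in) :: root_asx   ! Cube root of the rescaled argument
  integer,      intent(in) :: e_div3     ! One third of the removed exponent
  real(real64) :: root

  ! raw_exponent() reads the biased encoding of -1 for roots in [0.5, 1), and
  ! of 0 in the one case where the root has rounded up to exactly 1.0.
  root = set_exponent(root_asx, raw_exponent(root_asx) + e_div3)
end function descale
```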

@Hallberg-NOAA
Member

I am now happy to approve this valuable and thorough PR without any further concerns.

Commit messages
Some modifications were made to the cuberoot rescale and descale
functions:

* The machine parameters were moved from function to module parameters.
  This could dangerously expose them to other functions, but it prevents
  multiple definitions of the same numbers.

* The exponent is now cube-rooted in rescale rather than descale.

* The exponent expressions are broken into more explicit steps, rather
  than combining multiple steps and assumptions into a single
  expression.

* The bias is no longer assumed to be a multiple of three.  This is true
  for double precision but not single precision.

A new test of quasi-random numbers was also added to the cuberoot test
suite.  These numbers were able to detect the differences between GNU
and Intel compiler output.  A potential error in the return value of the
test was also fixed.

A test of the sensitive value 1 - 0.5*ULP has been added.  The cube root
of this value rounds to 1 and needs to be handled carefully.

The unit test comparison `cuberoot(v**3)` was reversed to `cuberoot(v)**3`,
to include testing of this value.  (Cubing first would wipe out the
anomaly.)  In separate testing, we observed that Intel would use the
`pow()` function to evaluate the cubes of some numbers, causing its
answers to differ from GNU.

In this patch, I replace the cubic x**3 operations with explicit x*x*x
multiplication, which appears to avoid this substitution.

Well, for the moment, at least.
@Hallberg-NOAA
Copy link
Member

This PR has passed pipeline testing at https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/22161.

@Hallberg-NOAA merged commit 736ef16 into NOAA-GFDL:dev/gfdl on Feb 2, 2024
12 checks passed
@marshallward deleted the faster_cuberoot_rescale branch on May 8, 2024