Intrinsics: Faster cuberoot scaling functions #557

Merged

Conversation

marshallward
Member

This patch replaces the intrinsic-based exponent rescaling with explicit bit manipulation of the floating point number.

This appears to produce a ~2.5x speedup of the solver, reducing its time from embarrassingly slow to disappointingly slow. It is slightly faster than the GNU cbrt function, but still about 3x slower than the Intel SVML cbrt function.
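As a rough sketch of the kind of bit manipulation involved (a minimal, hypothetical module with made-up names, not the code in this patch), the exponent field of a double can be read and overwritten directly:

```fortran
module exponent_bits_sketch
  use, intrinsic :: iso_fortran_env, only : real64, int64
  implicit none

  integer, parameter :: expbits = 11   ! Exponent field width of a double
  integer, parameter :: manbits = 52   ! Mantissa field width of a double
  integer(int64), parameter :: expmask = shiftl(int(z'7FF', int64), manbits)

contains

  !> Biased exponent field of x, equal to EXPONENT(x) + 1022 for normal x.
  pure function raw_exponent(x) result(e)
    real(real64), intent(in) :: x
    integer(int64) :: e
    e = ibits(transfer(x, 1_int64), manbits, expbits)
  end function raw_exponent

  !> Return x with its biased exponent field overwritten by e, leaving the
  !! sign and mantissa bits untouched.
  pure function set_exponent(x, e) result(y)
    real(real64), intent(in) :: x
    integer(int64), intent(in) :: e
    real(real64) :: y
    integer(int64) :: bits

    bits = transfer(x, 1_int64)
    bits = ior(iand(bits, not(expmask)), shiftl(e, manbits))
    y = transfer(bits, 1.0_real64)
  end function set_exponent

end module exponent_bits_sketch
```

TRANSFER, IBITS and the shift intrinsics reduce to plain integer operations, which is presumably where the speedup over the EXPONENT/SCALE/FRACTION intrinsics comes from.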

Timings (s) (16M array, -O3 -mavx -mfma, obviously many details omitted...)

| Solver              |  -O2  |  -O3  |
|---------------------|-------|-------|
| GNU x**1/3          | 0.225 | 0.198 |
| GNU cuberoot before | 0.418 | 0.412 |
| GNU cuberoot after  | 0.208 | 0.187 |
| Intel x**1/3        | 0.068 | 0.067 |
| Intel before        | 0.514 | 0.507 |
| Intel after         | 0.213 | 0.189 |

At least one issue here is that Intel SVML is using fast vectorized logic operators whereas the Fortran intrinsics are replaced with slower legacy scalar versions. Not sure there is much we could even do about that without complaining to vendors.

Also, I'm sure there's magic in their solvers which we are not capturing. Regardless, I think this is a major improvement.

I do not believe it will change answers, but it is probably a good idea to verify this and to get it in before committing any solutions that use cuberoot().


codecov bot commented Jan 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (76f0668) 37.20% compared to head (da9dd8a) 37.22%.

Additional details and impacted files
@@             Coverage Diff              @@
##           dev/gfdl     #557      +/-   ##
============================================
+ Coverage     37.20%   37.22%   +0.02%     
============================================
  Files           271      271              
  Lines         80355    80384      +29     
  Branches      14985    14985              
============================================
+ Hits          29897    29925      +28     
  Misses        44902    44902              
- Partials       5556     5557       +1     


@marshallward
Member Author

Some comments from @Hallberg-NOAA

  • The "machine" FP parameters in rescale and descale could probably be sensibly moved into the module, rather than redefining them in each function.

  • The cube root of the rescaled number in [0.125,1) will be in [0.5,1), so its exponent will always be -1. This will lead to some simplification.

  • Using (2), the new exponent can be computed immediately by rescale, so that descale can be transformed into a general-purpose function (e.g. for a fifth root?). A sketch of (2) and (3) follows this list.
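Expressed with the EXPONENT and SCALE intrinsics purely to show the arithmetic (illustrative names, not the MOM6 routines; the patch itself does the equivalent with bit operations), points (2) and (3) amount to something like:

```fortran
!> Scale x (assumed finite, positive and normal) into asx in [0.125, 1) and
!! return e_div3, one third of the removed exponent, so that
!! cuberoot(x) = cuberoot(asx) * 2**e_div3 with cuberoot(asx) in [0.5, 1).
subroutine rescale(x, asx, e_div3)
  use, intrinsic :: iso_fortran_env, only : real64
  real(real64), intent(in)  :: x
  real(real64), intent(out) :: asx
  integer,      intent(out) :: e_div3

  ! Remove a multiple of 3 from the exponent, leaving asx in [0.125, 1).
  e_div3 = ceiling(real(exponent(x)) / 3.0)
  asx = scale(x, -3*e_div3)
end subroutine rescale

!> Reapply the saved exponent to the root of the rescaled argument.  Nothing
!! here is specific to the cube root, so a fifth-root solver could reuse it.
function descale(root_asx, e_div3) result(root)
  use, intrinsic :: iso_fortran_env, only : real64
  real(real64), intent(in) :: root_asx
  integer,      intent(in) :: e_div3
  real(real64) :: root

  root = scale(root_asx, e_div3)
end function descale
```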

@marshallward
Member Author

I added two commits which address the requested changes:

  • The first refactors the rescale/descale functions so that rescale manages the cuberoot of the number's exponent, and descale is now a generic exponent substitution.

  • The second commit addresses an observed difference between GNU and Intel, where Intel would use pow() from its SVML library to evaluate x**3. Replacing this with explicit x*x*x multiplication appears to prevent that substitution (see the sketch below).

The second will probably change answers (again), and it does add some more weird variables to hold the cube values, so we can decide if it's really worth the fight.
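A hedged sketch of the second change, as a fragment with illustrative names (a generic Halley iteration for the cube root of `a`, not the MOM6 solver):

```fortran
! One Halley step toward root ~ a**(1/3).  The cube is formed by explicit
! multiplication so the compiler has no root**3 expression to lower to an
! SVML pow() call.
root_cubed = root * root * root
root = root * (root_cubed + 2.0*a) / (2.0*root_cubed + a)
```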

@marshallward
Member Author

marshallward commented Jan 30, 2024

I forgot to add the test case that we talked about. I also added @Hallberg-NOAA as a coauthor to the rescale refactor commit.

@marshallward
Member Author

Good thing I added those tests... looking into it now.

@marshallward
Member Author

marshallward commented Jan 31, 2024

This seems to be passing now, with no regression on GitHub. (This is consistent, since GNU seems to always evaluate x**3 as x*x*x.)

@Hallberg-NOAA
Member

I think that we need to add one more test of the cuberoot function: x = (1.0 - EPSILON(1.0)), which fits neatly in the range 0.125 <= x < 1.0 but which would be the hardest test of the assumption that 0.5 <= cuberoot(x) < 1.0, and hence that the root has an exponent of -1. Because mathematically (1-eps)**3 = 1 - 3*eps + 3*eps**2 - eps**3 ~= 1 - 3*eps, I suspect that the best answer for any iterative solver of cuberoot(1.0-epsilon(1.0)) would be 1.0, which of course has an exponent of 0. If so, we might need to revisit line 177 of descale().
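To spell out the arithmetic behind this concern, with $\varepsilon = \mathrm{EPSILON}(1.0) = 2^{-52}$:

$$(1-\varepsilon)^{1/3} = 1 - \tfrac{\varepsilon}{3} - \tfrac{\varepsilon^{2}}{9} - \cdots \approx 1 - \tfrac{\varepsilon}{3},$$

which lies between the two adjacent representable doubles $1 - \tfrac{\varepsilon}{2}$ and $1.0$; if the solver's result rounds up to $1.0$, its exponent is 0 rather than the assumed -1.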

@marshallward force-pushed the faster_cuberoot_rescale branch 2 times, most recently from be4f2c2 to cc36692 on January 31, 2024
Hallberg-NOAA previously approved these changes Feb 1, 2024

@Hallberg-NOAA
Member

I have examined these changes, and (to the extent that I can follow all of the clever bit-manipulations) I am convinced that they are correct. Moreover, I think that we now have pretty thorough testing of this new capability that would detect anything that is not working as intended. I am therefore approving this code.

I am still a little nervous about our assumption that we will have 0.5 <= root_asx < 1.0 on line 112, and whether there is a tiny (of order 1 in 10^16) chance that we would actually generate some case where root_asx == 1.0, in which case our assumption that the base-2 exponent of root_asx is always -1 would be wrong, and we do not have any code to handle this case. Adding such a test should not be too hard (a previous version of the code did include the extra test), but it would make this new routine slower. I think that we have done our best to test for this, and it appears not to occur, but I am still nervous anyway.

@marshallward
Member Author

FWIW, there is an opportunity to write the exponent as e/3 + 1 - e_x/3, which may address the problem you are concerned about: if it somehow got bumped from -1 to zero, then that would carry over to the exponent.

@marshallward
Member Author

marshallward commented Feb 1, 2024

You were right: the value 0.99999999999999989 (0x3FEFFFFFFFFFFFFF) incorrectly returns 0.5 for all of my solvers. epsilon() for 1.0 is double epsilon() for 0.5, so there was still a "gap"; if that gap is removed, the root rounds to 1, which incorrectly gets morphed to 0.5.

Edit: we can write this value in the test as 1.0 - 0.5*epsilon(1.). As for how to fix it... pending.
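For reference, a small standalone check of this value (the printed digits and the final rounding are compiler- and library-dependent, so treat the expected outputs as approximate):

```fortran
program near_one
  use, intrinsic :: iso_fortran_env, only : real64
  implicit none
  real(real64) :: x

  ! The largest double below 1, i.e. 1 - 0.5*ULP, with bits 0x3FEFFFFFFFFFFFFF.
  x = 1.0_real64 - 0.5_real64*epsilon(1.0_real64)
  print *, x                                      ! ~0.99999999999999989
  print *, x == nearest(1.0_real64, -1.0_real64)  ! T

  ! Its true cube root is ~1 - ULP/6, whose nearest double is exactly 1.0, so
  ! a descale step that assumes the root's exponent is -1 will halve it to 0.5.
  print *, x**(1.0_real64/3.0_real64)             ! 1.0 with an accurate power
end program near_one
```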

@Hallberg-NOAA dismissed their stale review on February 1, 2024

I am withdrawing the approval in light of the further investigations showing that the current code is not quite right for specific values very near to 1.

@marshallward
Member Author

I've addressed the 1 - 0.5*ULP problem by going back to the old method and reading the exponent of the scaled result. Maybe silly for one value, but it's still 2.5x faster than the current implementation.

I also changed the unit test so that it's actually capable of detecting this problem.
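A hedged sketch of what the fix amounts to, reusing the hypothetical raw_exponent/set_exponent helpers sketched earlier (the names and interface are illustrative, not the merged code):

```fortran
!> Descale the solver's root by combining its own exponent field with the
!! exponent third saved by rescale, rather than assuming that field encodes -1.
function descale(root_asx, e_div3) result(root)
  use exponent_bits_sketch, only : raw_exponent, set_exponent
  use, intrinsic :: iso_fortran_env, only : real64
  real(real64), intent(in) :: root_asx   ! Cube root of the rescaled argument
  integer,      intent(in) :: e_div3     ! One third of the removed exponent
  real(real64) :: root

  ! raw_exponent() reads the biased encoding of -1 for roots in [0.5, 1), and
  ! of 0 in the one case where the root has rounded up to exactly 1.0.
  root = set_exponent(root_asx, raw_exponent(root_asx) + e_div3)
end function descale
```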

@Hallberg-NOAA
Member

I am now happy to approve this valuable and thorough PR without any further concerns.

Commit messages
Some modifications were made to the cuberoot rescale and descale
functions:

* The machine parameters were moved from function to module parameters.
  This could dangerously expose them to other functions, but it prevents
  multiple definitions of the same numbers.

* The exponent is now cube-rooted in rescale rather than descale.

* The exponent expressions are broken into more explicit steps, rather
  than combining multiple steps and assumptions into a single
  expression.

* The bias is no longer assumed to be a multiple of three.  This is true
  for double precision but not single precision.

A new test of quasi-random numbers was also added to the cuberoot test
suite.  These numbers were able to detect the differences between GNU
and Intel compiler output.  A potential error in the return value of the
test was also fixed.

A test of the sensitive value 1 - 0.5*ULP has been added.  The cube root
of this value rounds to 1 and needs to be handled carefully.

The unit test comparison `cuberoot(v**3)` was reversed to `cuberoot(v)**3`,
to include testing of this value.  (Cubing first would wipe out the
anomaly.)  In separate testing, we observed that Intel would use the
`pow()` function to evaluate the cubes of some numbers, causing its
answers to differ from GNU.

In this patch, I replace the cubic x**3 operations with explicit x*x*x
multiplication, which appears to avoid this substitution.

Well, for the moment, at least.
@Hallberg-NOAA
Copy link
Member

This PR has passed pipeline testing at https://gitlab.gfdl.noaa.gov/ogrp/MOM6/-/pipelines/22161.

@Hallberg-NOAA merged commit 736ef16 into NOAA-GFDL:dev/gfdl on Feb 2, 2024
12 checks passed
@marshallward deleted the faster_cuberoot_rescale branch on May 8, 2024