-
Notifications
You must be signed in to change notification settings - Fork 0
Computation Accuracy of Various sin Implemenations
On the CPU, we use an implementation of the sin function
(more precisely sin(pi * x) / pi
, but will be scaled to sin
for all tests below)
that uses the lowest order of polynomial to give us enough accuracy.
The polynomial is numerically optimized to minimize the maximum error for x
within [0, 0.5]
and is slightly different from the truncated Taylor series,
which has the maximum error at large x
instead of distributing it evenly across the whole range.
On the GPU, there are more than one choices with potentially different accuracy
and speed tradeoffs. The OpenCL standard provides sin
, half_sin
and native_sin
functions and we could potentially also port our implementation to the GPU as well.
We are the most interested in the sin
function, which should have a guaranteed accuracy of
4 ulp
or the native_sin
, which should be the fastest option available
(although it seems that AMD's GPU implements sinpi
natively so if we are not limited to
OpenCL C, we might be able to get a little faster by calling the native sinpi
instead,
e.g. via the llvm.amdgcn.sin.f32
intrinsic).
If none of these seem like the best option, we can then have a look at half_sin
.
In order to test this, we'll compare the computation with the double precision result
from the system libm
or libmvec
when available.
For both the double precision (i.e. baseline)
and the single precision (i.e. to be tested) version, the input to the function
is calculated as double precision. For some input values, this might cause a round off error
at the input of the function which might cause a few ULP
of difference
(one ULP
for the input value which could be more than the ULP
at 1.0
).
However, over the range of our input, this should not cause more than 16 ULP
of error.
We'll measure the error in terms of the single precision ULP
at 1.0
(i.e. std::numeric_limits<float>::epsilon()
).
Since the resolution of the AWG is 16-bit or 256 ULP
, anything around 1 ULP
basically
won't show up without more than 100 traps. Even around 10 ULP
should be pretty safe
after averaging over all the channels.
In terms of the implementations to test, we have our CPU implementation for scalars and SSE2, AVX, AVX2 and AVX512 on x86 as well as ASIMD/NEON on AArch64 (the scalar version on x86 and AArch64 are also slightly different in the rounding). All of these use the same algorithm but could have slightly different rounding. For OpenCL, we have the following platforms and devices.
-
Intel's OpenCL CPU runtime on the i7-6700K and i9-10885H CPUs.
-
Intel's Compute OpenCL runtime for the GPUs on the two CPUs above which are UHD 530 and UHD 640 respectively.
-
AMD's ROCm OpenCL driver on AMD Radeon RX 5500 XT
This is the most interesting one to test.
The results of the tests can be found in data/cl-compute-accuracy
and data/cpu-compute-accuracy
for the OpenCL implementations and our CPU implementation respectively. The plot for the results are shown below.
The x
axes are the input to the sin
function
and the y
axes are the error in unit of single precision ULP
at 1.0
.
Errors around and below 10
are generally acceptable.
For our CPU version, we see that all the implementations shows very similar results and
the error is no more than 10 to 11 ULP. We can also see the polynomial behavior of the error
within each pi/2
range and maxes out on both sides within the range
to minimize the worst error.
The AMD GPU implementation, especially the native one, is also remarkably good.
Some of the errors at larger input could be caused by the rounding of the input
as mentioned before. The rounding should be easier when calculating sinpi
so it'll be interesting to check it's behavior there after we figured out a way to
directly use that instruction on the AMD GPU.
The Intel GPU sin
function behaves OK although it seems that it might be out of spec.
The native_sin
error is a little too large for us.
The Intel CPU sin
and native_sin
implementations appears to be terrible
which is also a bit surprising. It seems to exceed the error bound from the OpenCL spec
even within the first period. I guess we are not using it to replace our implementation...
Now we know what accuracy each option has, let's check the performance.