Skip to content

Computation Accuracy of Various sin Implemenations

Yichao Yu edited this page Jan 9, 2021 · 3 revisions

On the CPU, we use an implementation of the sin function (more precisely sin(pi * x) / pi, but will be scaled to sin for all tests below) that uses the lowest order of polynomial to give us enough accuracy. The polynomial is numerically optimized to minimize the maximum error for x within [0, 0.5] and is slightly different from the truncated Taylor series, which has the maximum error at large x instead of distributing it evenly across the whole range.

On the GPU, there are more than one choices with potentially different accuracy and speed tradeoffs. The OpenCL standard provides sin, half_sin and native_sin functions and we could potentially also port our implementation to the GPU as well. We are the most interested in the sin function, which should have a guaranteed accuracy of 4 ulp or the native_sin, which should be the fastest option available (although it seems that AMD's GPU implements sinpi natively so if we are not limited to OpenCL C, we might be able to get a little faster by calling the native sinpi instead, e.g. via the llvm.amdgcn.sin.f32 intrinsic). If none of these seem like the best option, we can then have a look at half_sin.

In order to test this, we'll compare the computation with the double precision result from the system libm or libmvec when available. For both the double precision (i.e. baseline) and the single precision (i.e. to be tested) version, the input to the function is calculated as double precision. For some input values, this might cause a round off error at the input of the function which might cause a few ULP of difference (one ULP for the input value which could be more than the ULP at 1.0). However, over the range of our input, this should not cause more than 16 ULP of error.

We'll measure the error in terms of the single precision ULP at 1.0 (i.e. std::numeric_limits<float>::epsilon()). Since the resolution of the AWG is 16-bit or 256 ULP, anything around 1 ULP basically won't show up without more than 100 traps. Even around 10 ULP should be pretty safe after averaging over all the channels.

In terms of the implementations to test, we have our CPU implementation for scalars and SSE2, AVX, AVX2 and AVX512 on x86 as well as ASIMD/NEON on AArch64 (the scalar version on x86 and AArch64 are also slightly different in the rounding). All of these use the same algorithm but could have slightly different rounding. For OpenCL, we have the following platforms and devices.

  1. Intel's OpenCL CPU runtime on the i7-6700K and i9-10885H CPUs.

  2. Intel's Compute OpenCL runtime for the GPUs on the two CPUs above which are UHD 530 and UHD 640 respectively.

  3. AMD's ROCm OpenCL driver on AMD Radeon RX 5500 XT

    This is the most interesting one to test.

The results of the tests can be found in data/cl-compute-accuracy and data/cpu-compute-accuracy for the OpenCL implementations and our CPU implementation respectively. The plot for the results are shown below. The x axes are the input to the sin function and the y axes are the error in unit of single precision ULP at 1.0. Errors around and below 10 are generally acceptable.

Accuracy Result

For our CPU version, we see that all the implementations shows very similar results and the error is no more than 10 to 11 ULP. We can also see the polynomial behavior of the error within each pi/2 range and maxes out on both sides within the range to minimize the worst error.

The AMD GPU implementation, especially the native one, is also remarkably good. Some of the errors at larger input could be caused by the rounding of the input as mentioned before. The rounding should be easier when calculating sinpi so it'll be interesting to check it's behavior there after we figured out a way to directly use that instruction on the AMD GPU.

The Intel GPU sin function behaves OK although it seems that it might be out of spec. The native_sin error is a little too large for us.

The Intel CPU sin and native_sin implementations appears to be terrible which is also a bit surprising. It seems to exceed the error bound from the OpenCL spec even within the first period. I guess we are not using it to replace our implementation...

Now we know what accuracy each option has, let's check the performance.