ch06.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<chapter id="ch06">
  <title>Polynomial multiplication</title>
  <para><indexterm><primary>Polynomial multiplication</primary></indexterm>This chapter discusses polynomial multiplication used with CRC codes and the GCM mode of operation for AES.</para>
  <section id="poly_crc">
    <title>CRC-32 and CRC-32C</title>
    <para><indexterm><primary>CRC-32</primary></indexterm><indexterm><primary>Polynomial multiplication</primary><secondary>CRC-32</secondary></indexterm>CRC checksums on POWER8 are nothing like the SSE4 or Aarch64 CRC instructions. <indexterm><primary>Anton Blanchard</primary></indexterm>Please refer to Anton Blanchard's GitHub at <ulink url="https://github.com/antonblanchard/crc32-vpmsum">CRC32/vpmsum</ulink> for the discussion and sample code.</para>
  </section>
  <section id="poly_gcm">
    <title>GCM mode</title>
    <para><indexterm><primary>GCM mode</primary></indexterm><indexterm><primary>Polynomial multiplication</primary><secondary>GCM mode</secondary></indexterm>POWER8 GCM mode is implemented using the <systemitem>vpmsumd</systemitem> instruction. GCM uses the double-word variant and to perform 64x64 → 128-bit polynomial multiplies. The GCC builtin is <indexterm><primary>__builtin_crypto_vpmsumd</primary></indexterm><systemitem>__builtin_crypto_vpmsumd</systemitem>, and the XLC intrinsic is <indexterm><primary>__vpmsumd</primary></indexterm><systemitem>__vpmsumd</systemitem>.</para>
    <para><systemitem>vpmsumd</systemitem> creates two 128-bit products and xor's them together. One product is from the multiplication of the low dwords, and the second product is from the multiplication of the high dwords. The trick to using <systemitem>vpmsumd</systemitem> is to ensure one of the products is 0. You can ensure one of the products is 0 by setting one of the 64-bit dwords to 0.</para>
    <para>The output of <systemitem>vpmsumd</systemitem> is endian sensitive like the <systemitem>vec_sld</systemitem> instruction. On big-endian systems the result of a multiply is <systemitem>{a,b}</systemitem>, where <systemitem>a</systemitem> and <systemitem>b</systemitem> are double words in a vector. The same multiplication on a little-endian system produces <systemitem>{b,a}</systemitem>. The swapping can be seen with the test program below.</para>
    <programlisting><?code-font-size 75% ?>$ cat test.cxx
#include &lt;altivec.h&gt;
#undef vector
typedef __vector unsigned long long uint64x2_p;

int main(int argc, char* argv[])
{
  uint64x2_p a = {0, 04};
  uint64x2_p b = {0, 64};
  uint64x2_p c = __builtin_crypto_vpmsumd(a, b);

  return 0;
}</programlisting>
    <para>Running the program on <systemitem>gcc112</systemitem>, which is <systemitem>ppc64-le</systemitem>, results in the following output. Notice the output is <systemitem>{0x100, 0x0}</systemitem>.</para>
    <programlisting><?code-font-size 75% ?>(gdb) r
Starting program: /home/test/test.exe
Breakpoint 1, main (argc=0x1, argv=0x3ffffffff4b8) at test.cxx:11
11        return 0;
(gdb) p c
$1 = {0x100, 0x0}</programlisting>
    <para>And running the program on <systemitem>gcc119</systemitem>, which is <systemitem>ppc64-be</systemitem>, results in the following output. Notice the output is <systemitem>{0x0, 0x100}</systemitem>.</para>
    <programlisting><?code-font-size 75% ?>(gdb) r
Starting program: /home/test/test.exe
Breakpoint 1, main (argc=0x1, argv=0x2ff22bb8) at test.cxx:11
11        return 0;
(gdb) p c
$1 = {0x0, 0x100}</programlisting>
    <para>Three helper functions are needed for GCM mode. The first two are <systemitem>VecGetHigh</systemitem> and <systemitem>VecGetLow</systemitem>. The functions extract the high and low 64-bit double words and return them in a vector. The third is <systemitem>VecSwapWords</systemitem>. <systemitem>VecSwapWords</systemitem> exchanges 64-bit double words on little-endian systems after the multiplication.</para>
    <para><systemitem>VecGetHigh</systemitem> and <systemitem>VecGetLow</systemitem> return a vector padded on the left with 0's. That means the return vector already has one of the terms set to 0 so <systemitem>vpmsumd</systemitem> will behave like ARM's <indexterm><primary>pmull_p64</primary></indexterm><systemitem>pmull_p64</systemitem> or Intel's <indexterm><primary>_mm_clmulepi64_si128</primary></indexterm><systemitem>_mm_clmulepi64_si128</systemitem>.</para>
    <para>The source code for the three functions are shown below. Shifts and rotates are preferred over permutes or masks because shifts are generally faster and use fewer instructions.</para>
    <programlisting><?code-font-size 75% ?><indexterm><primary>VecGetHigh</primary></indexterm>uint64x2_p VecGetHigh(uint64x2_p val)
{
    return VecShiftRight&lt;8&gt;(val);
}

<indexterm><primary>VecGetLow</primary></indexterm>uint64x2_p VecGetLow(uint64x2_p val)
{
    return VecShiftRight&lt;8&gt;(VecShiftLeft&lt;8&gt;(val));
}

<indexterm><primary>VecSwapWords</primary></indexterm>template &lt;class T&gt;
VecSwapWords(const T val)
{
    return (T)VecRotateLeft&lt;8&gt;(val);
}</programlisting>
    <para>Using <systemitem>VecGetHigh</systemitem> and <systemitem>VecGetLow</systemitem> the function <systemitem>VecPolyMultiply</systemitem> can be implemented as follows using the <systemitem>vpmsumd</systemitem> builtin.</para>
    <programlisting><?code-font-size 75% ?><indexterm><primary>VecPolyMultiply</primary></indexterm>uint64x2_p
VecPolyMultiply(uint64x2_p a, uint64x2_p b)
{
#if defined(__xlc__) || defined(__xlC__)
#  if __BIG_ENDIAN__
     return __vpmsumd(VecGetLow(a), VecGetLow(b));
#  else
     return VecSwapWords(__vpmsumd(VecGetLow(a), VecGetLow(b)));
#  endif
#else
#  if __BIG_ENDIAN__
     return __builtin_crypto_vpmsumd(
                VecGetLow(a), VecGetLow(b));
#  else
     return VecSwapWords(__builtin_crypto_vpmsumd(
                VecGetLow(a), VecGetLow(b)));
#  endif
#endif
}</programlisting>
    <para>The code listed above provides Intel's <systemitem>_mm_clmulepi64_si128(a, b, 0x00)</systemitem> or ARM's <systemitem>pmull_p64</systemitem>. The function may be better named <systemitem>VecPolyMult00</systemitem> because it multiplies the two low 64-bit double words. The table below shows how to create the four variations needed for GCM mode.</para>
    <informaltable frame="all">
      <tgroup align="center" cols="3">
        <colspec colname="c1" colwidth="2.0in"/>
        <colspec colname="c2" colwidth="2.0in"/>
        <colspec colname="c3" colwidth="2.0in"/>
        <thead>
          <row>
            <entry align="center">Function</entry>
            <entry align="center">Parameter <systemitem>a</systemitem></entry>
            <entry align="center">Parameter <systemitem>b</systemitem></entry>
          </row>
        </thead>
        <tbody>
          <row>
            <entry align="center">
              <systemitem>VecPolyMult00</systemitem>
            </entry>
            <entry align="center">
              <systemitem>VecGetLow(a)</systemitem>
            </entry>
            <entry align="center">
              <systemitem>VecGetLow(b)</systemitem>
            </entry>
          </row>
          <row>
            <entry align="center">
              <systemitem>VecPolyMult01</systemitem>
            </entry>
            <entry align="center">
              <systemitem>VecGetLow(a)</systemitem>
            </entry>
            <entry align="center">
              <systemitem>VecGetHigh(b)</systemitem>
            </entry>
          </row>
          <row>
            <entry align="center">
              <systemitem>VecPolyMult10</systemitem>
            </entry>
            <entry align="center">
              <systemitem>VecGetHigh(a)</systemitem>
            </entry>
            <entry align="center">
              <systemitem>VecGetLow(b)</systemitem>
            </entry>
          </row>
          <row>
            <entry align="center">
              <systemitem>VecPolyMult11</systemitem>
            </entry>
            <entry align="center">
              <systemitem>VecGetHigh(a)</systemitem>
            </entry>
            <entry align="center">
              <systemitem>VecGetHigh(b)</systemitem>
            </entry>
          </row>
        </tbody>
      </tgroup>
    </informaltable>
    <para>The next steps are implement the GCM multiplication and reduction routines. The code for multiplication is easy and shown below. The code for reduction is the tricky one and left as an exercise to the reader.</para>
    <programlisting><?code-font-size 75% ?><indexterm><primary>GCM_Multiply</primary></indexterm><indexterm><primary>GCM_Reduce</primary></indexterm>uint64x2_p
GCM_Multiply(uint64x2_p x, uint64x2_p h)
{
    uint64x2_p c0 = VecPolyMult00(x, h);
    uint64x2_p c1 = VecXor(VecPolyMult01(x, h), VecPolyMult10(x, h));
    uint64x2_p c2 = VecPolyMult11(x, h);

    return GCM_Reduce(c0, c1, c2);
}</programlisting>
    <para>Implementing <systemitem>GCM_Multiply</systemitem> and <systemitem>GCM_Reduce</systemitem> will require some forethought. You will likely use a reflected algorithm, and nearly everything gets turned on its head. For example, GCM's vpmsum wrapper swaps words on big-endian systems and leaves words alone on little-endian systems. As another example GCM's vpmsum wrapper reads low words with <systemitem>VecGetHigh</systemitem> and high words with <systemitem>VecGetLow</systemitem> to ensure the correct words are presented to the function call.</para>
    <para>Also see the head notes for POWER8 in the Crypto++ implementation at <ulink url="https://github.com/weidai11/cryptopp/blob/master/gcm-simd.cpp"><systemitem>gcm-simd.cpp</systemitem></ulink>. If you are feeling adventurous then you can find the Cryptogams implementation for OpenSSL at <ulink url="https://github.com/openssl/openssl/blob/master/crypto/modes/asm/ghashp8-ppc.pl"><systemitem>ghashp8-ppc.pl
</systemitem></ulink>.</para>
  </section>
</chapter>