-
Notifications
You must be signed in to change notification settings - Fork 0
/
ch06.xml
155 lines (150 loc) · 9.84 KB
/
ch06.xml
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd">
<chapter id="ch06">
<title>Polynomial multiplication</title>
<para><indexterm><primary>Polynomial multiplication</primary></indexterm>This chapter discusses polynomial multiplication used with CRC codes and the GCM mode of operation for AES.</para>
<section id="poly_crc">
<title>CRC-32 and CRC-32C</title>
<para><indexterm><primary>CRC-32</primary></indexterm><indexterm><primary>Polynomial multiplication</primary><secondary>CRC-32</secondary></indexterm>CRC checksums on POWER8 are nothing like the SSE4 or Aarch64 CRC instructions. <indexterm><primary>Anton Blanchard</primary></indexterm>Please refer to Anton Blanchard's GitHub at <ulink url="https://github.com/antonblanchard/crc32-vpmsum">CRC32/vpmsum</ulink> for the discussion and sample code.</para>
</section>
<section id="poly_gcm">
<title>GCM mode</title>
<para><indexterm><primary>GCM mode</primary></indexterm><indexterm><primary>Polynomial multiplication</primary><secondary>GCM mode</secondary></indexterm>POWER8 GCM mode is implemented using the <systemitem>vpmsumd</systemitem> instruction. GCM uses the double-word variant and to perform 64x64 → 128-bit polynomial multiplies. The GCC builtin is <indexterm><primary>__builtin_crypto_vpmsumd</primary></indexterm><systemitem>__builtin_crypto_vpmsumd</systemitem>, and the XLC intrinsic is <indexterm><primary>__vpmsumd</primary></indexterm><systemitem>__vpmsumd</systemitem>.</para>
<para><systemitem>vpmsumd</systemitem> creates two 128-bit products and xor's them together. One product is from the multiplication of the low dwords, and the second product is from the multiplication of the high dwords. The trick to using <systemitem>vpmsumd</systemitem> is to ensure one of the products is 0. You can ensure one of the products is 0 by setting one of the 64-bit dwords to 0.</para>
<para>The output of <systemitem>vpmsumd</systemitem> is endian sensitive like the <systemitem>vec_sld</systemitem> instruction. On big-endian systems the result of a multiply is <systemitem>{a,b}</systemitem>, where <systemitem>a</systemitem> and <systemitem>b</systemitem> are double words in a vector. The same multiplication on a little-endian system produces <systemitem>{b,a}</systemitem>. The swapping can be seen with the test program below.</para>
<programlisting><?code-font-size 75% ?>$ cat test.cxx
#include <altivec.h>
#undef vector
typedef __vector unsigned long long uint64x2_p;
int main(int argc, char* argv[])
{
uint64x2_p a = {0, 04};
uint64x2_p b = {0, 64};
uint64x2_p c = __builtin_crypto_vpmsumd(a, b);
return 0;
}</programlisting>
<para>Running the program on <systemitem>gcc112</systemitem>, which is <systemitem>ppc64-le</systemitem>, results in the following output. Notice the output is <systemitem>{0x100, 0x0}</systemitem>.</para>
<programlisting><?code-font-size 75% ?>(gdb) r
Starting program: /home/test/test.exe
Breakpoint 1, main (argc=0x1, argv=0x3ffffffff4b8) at test.cxx:11
11 return 0;
(gdb) p c
$1 = {0x100, 0x0}</programlisting>
<para>And running the program on <systemitem>gcc119</systemitem>, which is <systemitem>ppc64-be</systemitem>, results in the following output. Notice the output is <systemitem>{0x0, 0x100}</systemitem>.</para>
<programlisting><?code-font-size 75% ?>(gdb) r
Starting program: /home/test/test.exe
Breakpoint 1, main (argc=0x1, argv=0x2ff22bb8) at test.cxx:11
11 return 0;
(gdb) p c
$1 = {0x0, 0x100}</programlisting>
<para>Three helper functions are needed for GCM mode. The first two are <systemitem>VecGetHigh</systemitem> and <systemitem>VecGetLow</systemitem>. The functions extract the high and low 64-bit double words and return them in a vector. The third is <systemitem>VecSwapWords</systemitem>. <systemitem>VecSwapWords</systemitem> exchanges 64-bit double words on little-endian systems after the multiplication.</para>
<para><systemitem>VecGetHigh</systemitem> and <systemitem>VecGetLow</systemitem> return a vector padded on the left with 0's. That means the return vector already has one of the terms set to 0 so <systemitem>vpmsumd</systemitem> will behave like ARM's <indexterm><primary>pmull_p64</primary></indexterm><systemitem>pmull_p64</systemitem> or Intel's <indexterm><primary>_mm_clmulepi64_si128</primary></indexterm><systemitem>_mm_clmulepi64_si128</systemitem>.</para>
<para>The source code for the three functions are shown below. Shifts and rotates are preferred over permutes or masks because shifts are generally faster and use fewer instructions.</para>
<programlisting><?code-font-size 75% ?><indexterm><primary>VecGetHigh</primary></indexterm>uint64x2_p VecGetHigh(uint64x2_p val)
{
return VecShiftRight<8>(val);
}
<indexterm><primary>VecGetLow</primary></indexterm>uint64x2_p VecGetLow(uint64x2_p val)
{
return VecShiftRight<8>(VecShiftLeft<8>(val));
}
<indexterm><primary>VecSwapWords</primary></indexterm>template <class T>
VecSwapWords(const T val)
{
return (T)VecRotateLeft<8>(val);
}</programlisting>
<para>Using <systemitem>VecGetHigh</systemitem> and <systemitem>VecGetLow</systemitem> the function <systemitem>VecPolyMultiply</systemitem> can be implemented as follows using the <systemitem>vpmsumd</systemitem> builtin.</para>
<programlisting><?code-font-size 75% ?><indexterm><primary>VecPolyMultiply</primary></indexterm>uint64x2_p
VecPolyMultiply(uint64x2_p a, uint64x2_p b)
{
#if defined(__xlc__) || defined(__xlC__)
# if __BIG_ENDIAN__
return __vpmsumd(VecGetLow(a), VecGetLow(b));
# else
return VecSwapWords(__vpmsumd(VecGetLow(a), VecGetLow(b)));
# endif
#else
# if __BIG_ENDIAN__
return __builtin_crypto_vpmsumd(
VecGetLow(a), VecGetLow(b));
# else
return VecSwapWords(__builtin_crypto_vpmsumd(
VecGetLow(a), VecGetLow(b)));
# endif
#endif
}</programlisting>
<para>The code listed above provides Intel's <systemitem>_mm_clmulepi64_si128(a, b, 0x00)</systemitem> or ARM's <systemitem>pmull_p64</systemitem>. The function may be better named <systemitem>VecPolyMult00</systemitem> because it multiplies the two low 64-bit double words. The table below shows how to create the four variations needed for GCM mode.</para>
<informaltable frame="all">
<tgroup align="center" cols="3">
<colspec colname="c1" colwidth="2.0in"/>
<colspec colname="c2" colwidth="2.0in"/>
<colspec colname="c3" colwidth="2.0in"/>
<thead>
<row>
<entry align="center">Function</entry>
<entry align="center">Parameter <systemitem>a</systemitem></entry>
<entry align="center">Parameter <systemitem>b</systemitem></entry>
</row>
</thead>
<tbody>
<row>
<entry align="center">
<systemitem>VecPolyMult00</systemitem>
</entry>
<entry align="center">
<systemitem>VecGetLow(a)</systemitem>
</entry>
<entry align="center">
<systemitem>VecGetLow(b)</systemitem>
</entry>
</row>
<row>
<entry align="center">
<systemitem>VecPolyMult01</systemitem>
</entry>
<entry align="center">
<systemitem>VecGetLow(a)</systemitem>
</entry>
<entry align="center">
<systemitem>VecGetHigh(b)</systemitem>
</entry>
</row>
<row>
<entry align="center">
<systemitem>VecPolyMult10</systemitem>
</entry>
<entry align="center">
<systemitem>VecGetHigh(a)</systemitem>
</entry>
<entry align="center">
<systemitem>VecGetLow(b)</systemitem>
</entry>
</row>
<row>
<entry align="center">
<systemitem>VecPolyMult11</systemitem>
</entry>
<entry align="center">
<systemitem>VecGetHigh(a)</systemitem>
</entry>
<entry align="center">
<systemitem>VecGetHigh(b)</systemitem>
</entry>
</row>
</tbody>
</tgroup>
</informaltable>
<para>The next steps are implement the GCM multiplication and reduction routines. The code for multiplication is easy and shown below. The code for reduction is the tricky one and left as an exercise to the reader.</para>
<programlisting><?code-font-size 75% ?><indexterm><primary>GCM_Multiply</primary></indexterm><indexterm><primary>GCM_Reduce</primary></indexterm>uint64x2_p
GCM_Multiply(uint64x2_p x, uint64x2_p h)
{
uint64x2_p c0 = VecPolyMult00(x, h);
uint64x2_p c1 = VecXor(VecPolyMult01(x, h), VecPolyMult10(x, h));
uint64x2_p c2 = VecPolyMult11(x, h);
return GCM_Reduce(c0, c1, c2);
}</programlisting>
<para>Implementing <systemitem>GCM_Multiply</systemitem> and <systemitem>GCM_Reduce</systemitem> will require some forethought. You will likely use a reflected algorithm, and nearly everything gets turned on its head. For example, GCM's vpmsum wrapper swaps words on big-endian systems and leaves words alone on little-endian systems. As another example GCM's vpmsum wrapper reads low words with <systemitem>VecGetHigh</systemitem> and high words with <systemitem>VecGetLow</systemitem> to ensure the correct words are presented to the function call.</para>
<para>Also see the head notes for POWER8 in the Crypto++ implementation at <ulink url="https://github.com/weidai11/cryptopp/blob/master/gcm-simd.cpp"><systemitem>gcm-simd.cpp</systemitem></ulink>. If you are feeling adventurous then you can find the Cryptogams implementation for OpenSSL at <ulink url="https://github.com/openssl/openssl/blob/master/crypto/modes/asm/ghashp8-ppc.pl"><systemitem>ghashp8-ppc.pl
</systemitem></ulink>.</para>
</section>
</chapter>