Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

InvMixColumn() optimization #159

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

dmitrystu
Copy link
Contributor

This will optimize GF(2^8) math in InvMixColumns().

With arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 9-2019-q4-major) 9.2.1 20191025 (release)

arm-none-eabi-gcc -mcpu=cortex-m3 --function-sections -Os -c aes.c -o aes.o
arm-none-eabi-nm -S aes.o
--- BEFORE (MULTIPLY_AS_A_FUNCTION = 0) ---
00000000 00000028 t AddRoundKey
00000000 00000082 T AES_CBC_decrypt_buffer
00000000 00000054 T AES_CBC_encrypt_buffer
00000000 0000007a T AES_CTR_xcrypt_buffer
00000000 00000014 T AES_ctx_set_iv
00000000 0000000a T AES_ECB_decrypt
00000000 0000000a T AES_ECB_encrypt
00000000 00000004 T AES_init_ctx
00000000 00000022 T AES_init_ctx_iv
00000000 00000100 t Cipher
00000000 000001ac t InvCipher
00000000 0000007c t KeyExpansion
00000100 0000000b r Rcon
0000010b 00000100 r rsbox
00000000 00000100 r sbox
00000000 00000012 t xtime
--- BEFORE (MULTIPLY_AS_A_FUNCTION = 1) ---
00000000 00000028 t AddRoundKey
00000000 00000082 T AES_CBC_decrypt_buffer
00000000 00000054 T AES_CBC_encrypt_buffer
00000000 0000007a T AES_CTR_xcrypt_buffer
00000000 00000014 T AES_ctx_set_iv
00000000 0000000a T AES_ECB_decrypt
00000000 0000000a T AES_ECB_encrypt
00000000 00000004 T AES_init_ctx
00000000 00000022 T AES_init_ctx_iv
00000000 00000100 t Cipher
00000000 00000164 t InvCipher
00000000 0000007c t KeyExpansion
00000000 00000032 t Multiply
00000100 0000000b r Rcon
0000010b 00000100 r rsbox
00000000 00000100 r sbox
00000000 00000012 t xtime
--- AFTER ---
00000000 00000028 t AddRoundKey
00000000 00000082 T AES_CBC_decrypt_buffer
00000000 00000054 T AES_CBC_encrypt_buffer
00000000 0000007a T AES_CTR_xcrypt_buffer
00000000 00000014 T AES_ctx_set_iv
00000000 0000000a T AES_ECB_decrypt
00000000 0000000a T AES_ECB_encrypt
00000000 00000004 T AES_init_ctx
00000000 00000022 T AES_init_ctx_iv
00000000 00000100 t Cipher
00000000 00000140 t InvCipher
00000000 0000007c t KeyExpansion
00000100 0000000b r Rcon
0000010b 00000100 r rsbox
00000000 00000100 r sbox
00000000 00000012 t xtime

@kokke
Copy link
Owner

kokke commented Dec 20, 2021

Hi @dmitrystu and thanks for contributing this optimization with great comments! Sorry for keeping you waiting so long.

I think the original purpose of the multiply-as-a-function was for 8-bit devices. But I can't remember anymore. @DamonHD do you have any insights into the implications for the 8-bit crowd?

@dmitrystu
Copy link
Contributor Author

dmitrystu commented Dec 21, 2021

That's not a problem. I'll try to explain this commit to you.
No matter what's the reason to use Multiply as a function or as a macro. This commit optimizes a number of the math operations.
Each Multiply(x,y) over GF8 generally uses 10 multiply ⊗2 and 4 ⊕. In some cases, the number of operations will be reduced by the compiler.
So, originally A' = (A ⊗ 0x0e) ⊕ (B ⊗ 0x0b) ⊕ (C ⊗ 0x0d) ⊕ (D ⊗ 0x09)
Since 0x0E = 8 ⊕ 4 ⊕ 2 -> A⊗0x0E = (A⊗8) ⊕ (A⊗4) ⊕ (A⊗2)
Therefore
A' = (A⊗8 ⊕ A⊗4 ⊕ A⊗2) ⊕ (B⊗8 ⊕ B⊗2 ⊕ B) ⊕ (C⊗8 ⊕ C⊗4 ⊕ C) ⊕ (D⊗8 ⊕ D)
B' = (A⊗8 ⊕ A) ⊕ (B⊗8 ⊕ B⊗4 ⊕ B⊗2) ⊕ (C⊗8 ⊕ C⊗2 ⊕ C) ⊕ (D⊗8 ⊕ D⊗4 ⊕ D)
C' = (A⊗8 ⊕ A⊗4 ⊕ A) ⊕ (B⊗8 ⊕ B) ⊕ (C⊗8 ⊕ C⊗4 ⊕ C⊗2) ⊕ (D⊗8 ⊕ D⊗2 ⊕ D)
D' = (A⊗8 ⊕ A⊗2 ⊕ A) ⊕ (B⊗8 ⊕ B⊗4 ⊕ B) ⊕ (C⊗8 ⊕ C) ⊕ (D⊗8 ⊕ D⊗4 ⊕ D⊗2)
You will get this if you expand a Multiply(x,y) macro for the A', B', C', D'
We have a 16 X⊗8, 8 X⊗4, 8 X⊗2, and 40 ⊕ ops.
Let's optimize something
Since (A ⊕ B) ⊕ C = A ⊕ (B ⊕ C) = A ⊕ B ⊕ C and A⊗(B ⊕ C) = (A⊗B) ⊕ (A⊗C) and A ⊕ A = 0
Let T = (A ⊕ B ⊕ C ⊕ D)⊗8 ⊕ (A ⊕ B ⊕ C ⊕ D)
Then
A' = T ⊕ (A ⊕ C)⊗4 ⊕ (A ⊕ B)⊗2 ⊕ A
B' = T ⊕ (B ⊕ D)⊗4 ⊕ (B ⊕ C)⊗2 ⊕ B
C' = T ⊕ (A ⊕ C)⊗4 ⊕ (C ⊕ D)⊗2 ⊕ C
D' = T ⊕ (B ⊕ D)⊗4 ⊕ (D ⊕ A)⊗2 ⊕ D
Let calculate T as
T = (A ⊕ B ⊕ C ⊕ D); T = T ⊕ T⊗8
And reuse (A ⊕ C)⊗4 and (B ⊕ D)⊗4

Num of ops Old New Remark
⊗8 16 1 xtime(xtime(xtime(X)))
⊗4 8 2 xtime(xtime(X))
⊗2 8 4 xtime(X)
40 18 XOR

Not bad, huh?

@kokke
Copy link
Owner

kokke commented Dec 21, 2021

Thanks for the explanation - I was also able to follow the comments you've added in the code :) Good job

Not bad, huh?

No it's quite brilliant - thank you for contributing :)

I would still like to hear from @DamonHD w.r.t. implications for the 8-bit guys, but I think this change will be an improvement for those chips as well.

@DamonHD
Copy link

DamonHD commented Dec 21, 2021

I'd say that whatever we have in our lib was what we found worked well enough*, but note that we adopted this code (the original is also checked in). I think that this is the relevant file from ours?

https://github.com/opentrv/OTAESGCM/blob/master/content/OTAESGCM/utility/OTAESGCM_OTAESGCM.cpp

Rgds

Damon

*Full packaging and encryption (or decode) of a ~63 byte frame on an ATmega328p running ~1MHz CPU is ~0.5s IIRC. We only needed to do one every couple of minutes in the most common case, and CPU was already only a tiny fraction of our energy budget, so we didn't fiddle further. Other systems' constraints will be different.

PS. I had a thought: be careful of optimising the computation if its is data- or key- dependent else you may leak bits by different run-times or power usage etc. I undid a few such reflexive optimisations once I'd thought about this issue... (Have just finished work and should give myself a break, so will not attempt to digest your code unless you shame me into doing so!)

@dmitrystu
Copy link
Contributor Author

IMO this code shouldn't be vulnerable against time-based or power consumption-based attacks because it contains no extra branches inside, just some precomputations.

@kokke
Copy link
Owner

kokke commented Dec 22, 2021

IMO this code shouldn't be vulnerable against time-based or power consumption-based attacks because it contains no extra branches inside, just some precomputations.

and

PS. I had a thought: be careful of optimising the computation if its is data- or key- dependent else you may leak bits by different run-times or power usage etc. I undid a few such reflexive optimisations once I'd thought about this issue... (Have just finished work and should give myself a break, so will not attempt to digest your code unless you shame me into doing so!)

That boat sailed long ago :P
Many microprocessors can't execute multiplication-instructions in constant time. On small ARM CPUs the execution time (number of cycles) for multiplication depends on how many bits are set in the operands/arguments.

From https://developer.arm.com/documentation/ddi0180/a/instruction-cycle-summary-and-interlocks/instruction-cycle-times/multiplier-cycle-counts

The number of cycles that a multiply instruction takes to complete depends on which instruction it is, and on the value of the multiplier-operand.

See https://www.bearssl.org/ctmul.html (look a bit down the page) for a more general discussion.

BUT I've read a lot on proof-of-concept timing attacks. To the best of my knowledge however, there haven't been (m)any practical attacks in the wild using this technique.

If your hardware/system mostly handles the crypto and not much else (e.g. imagine a MCU in a safe), then it could be vulnerable to having the timing and/or power measured.
If you do handle other stuff (e.g. field-bus communication etc.) it's my experience that you usually introduce enough jitter to make the jitter indistinguishable from white noise -- the timing differences drown in variability from other parts of the software.

It's a super valid point though! And on "real" hardware (e.g. x86 or ARM64) you should definitely use something else than this library to get constant time primitives.

@kokke
Copy link
Owner

kokke commented Dec 22, 2021

Hmm, I just tried comparing the size of the code compiled using this optimization.

It looks like the binary size increases on AVR and stays the same on ARM/Thumb.

So on AVR it doesn't look like a good optimization w.r.t. size (I would need to benchmark to test w.r.t. speed) and for ARM it doesn't make any difference. I will have to look more into this.

For AVR:

/tmp/tiny-AES-c git:(master) 
$ avr-gcc -c aes.c -DCTR=1 -DECB=0 -DCBC=0 -Os ; size aes.o
   text    data     bss     dec     hex filename
   1157       0       0    1157     485 aes.o

/tmp/tiny-AES-c git:(master) 
$ cd ../tiny-AES-c-opt/

/tmp/tiny-AES-c-opt git:(feat/optimize_inv_mix) 
$ avr-gcc -c aes.c -DCTR=1 -DECB=0 -DCBC=0 -Os ; size aes.o
   text    data     bss     dec     hex filename
   1193       0       0    1193     4a9 aes.o

For ARM/Thumb

/tmp/tiny-AES-c-opt git:(feat/optimize_inv_mix)
$ arm-none-eabi-gcc -c aes.c -DCTR=1 -DECB=0 -DCBC=0 -Os -mthumb ; size aes.o
   text    data     bss     dec     hex filename
    883       0       0     883     373 aes.o

/tmp/tiny-AES-c-opt git:(feat/optimize_inv_mix)
$ cd ../cd tiny-AES-c

/tmp/tiny-AES-c git:(master)
$ arm-none-eabi-gcc -c aes.c -DCTR=1 -DECB=0 -DCBC=0 -Os -mthumb ; size aes.o
   text    data     bss     dec     hex filename
    883       0       0     883     373 aes.o

I tested with these versions of gcc

$ avr-gcc --version
avr-gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ arm-none-eabi-gcc --version
arm-none-eabi-gcc (15:10.3-2021.07-1) 10.3.1 20210621 (release)
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

@dmitrystu
Copy link
Contributor Author

It's so strange you got a different code size for CTR on AVR. CTR doesn't use decipher. Perhaps you forgot to rebase a feature branch onto the latest master. Here is what I got with the latest master and ECB.

$ arm-none-eabi-gcc --version
rm-none-eabi-gcc (GNU Arm Embedded Toolchain 10.3-2021.07) 10.3.1 20210621 (release)
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ avr-gcc --version
avr-gcc (GCC) 5.4.0
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ arm-none-eabi-gcc -c aes.c -DCTR=0 -DECB=1 -DCBC=0 -Os -mthumb ; size aes.o
   text    data     bss     dec     hex filename
   1399       0       0    1399     577 aes.o

$ git checkout feat/optimize_inv_mix 
Switched to branch 'feat/optimize_inv_mix'
$ arm-none-eabi-gcc -c aes.c -DCTR=0 -DECB=1 -DCBC=0 -Os -mthumb ; size aes.o
   text    data     bss     dec     hex filename
   1299       0       0    1299     513 aes.o

$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ avr-gcc -c aes.c -DCTR=0 -DECB=1 -DCBC=0 -Os ; size aes.o
   text    data     bss     dec     hex filename
   1593       0       0    1593     639 aes.o

$ git checkout feat/optimize_inv_mix 
Switched to branch 'feat/optimize_inv_mix'
$ avr-gcc -c aes.c -DCTR=0 -DECB=1 -DCBC=0 -Os ; size aes.o
   text    data     bss     dec     hex filename
   1471       0       0    1471     5bf aes.o

@kokke
Copy link
Owner

kokke commented Dec 22, 2021

I meant to compare github.com/kokke/tiny-AES-c (branch=main) with github.com/dmitrystu/tiny-AES-c (branch=feat/optimize_inv_mix) but I might have messed up - I'll check when I get home from work ..

Thanks for checking up 👍

@dmitrystu
Copy link
Contributor Author

FYI. This commit slightly reduces code size for AVR, because size_t on AVR is equal to uint16_t.

@dmitrystu
Copy link
Contributor Author

dmitrystu commented Dec 22, 2021

BTW. Found another possible optimization for the xtime(x). This also guarantees constant execution time (no multiplication, nor branches).

static uint8_t xtime(uint8_t x)
{
  // extend MSB to other bits using arithmetical shift right
  int8_t mask = (int8_t)x >> 7;
  return ((x<<1) ^ ((uint8_t)mask & 0x1b));
}

AVR:

00000058 <xtime>:
  58:   98 2f           mov     r25, r24
  5a:   99 0f           add     r25, r25
  5c:   99 0b           sbc     r25, r25
  5e:   9b 71           andi    r25, 0x1B       ; 27
  60:   88 0f           add     r24, r24
  62:   89 27           eor     r24, r25
  64:   08 95           ret

cortex-m0:

00000034 <xtime>:
  34:   0003            movs    r3, r0
  36:   221b            movs    r2, #27
  38:   b240            sxtb    r0, r0
  3a:   11c0            asrs    r0, r0, #7
  3c:   4010            ands    r0, r2
  3e:   005b            lsls    r3, r3, #1
  40:   4058            eors    r0, r3
  42:   b2c0            uxtb    r0, r0
  44:   4770            bx      lr

It seems that the AVR compiler knows this template.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants