Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce size even more: generate s-box table? #228

Open
Midronome opened this issue Jun 10, 2024 · 4 comments
Open

Reduce size even more: generate s-box table? #228

Midronome opened this issue Jun 10, 2024 · 4 comments

Comments

@Midronome
Copy link

Midronome commented Jun 10, 2024

Hi :)

First a HUGE thanks for making this and releasing it publically this way, really awesomely useful :))

I'm implementing this on a tiny uC and I'm trying to reduce (compiled) code size as much as possible. As suggested in the readme, I'm only using CTR.

The aes.c file mentions this:

// The lookup-tables are marked const so they can be placed in read-only storage instead of RAM
// The numbers below can be computed dynamically trading ROM for RAM - 
// This can be useful in (embedded) bootloader applications, where ROM is often limited.
static const uint8_t sbox[256] = {

I've been looking around the net for a way to generate this tables (I'm not using rsbox since I'm only using CTR), the best I could find is this: https://crypto.stackexchange.com/questions/98396/how-does-this-code-calculating-aes-s-box-work
But I am not sure how to get the value of p from the polynomial 1 + x + x^3 + x^4 + x^8 in C code?

Thank you!

@Midronome
Copy link
Author

Nervermind I found it in the comments, p is set to the polynomial with x=2, i.e. p=283.

So putting sbox in RAM and adding this function reduces the code size by 212 bytes :)
(but obviously increases the RAM usage by 265 bytes)

void AES_generateSbox(void) {
	uint32_t p = 283;

	uint32_t t[256];
	uint32_t x = 1;
	for (int i = 0; i < 256; i ++) {
		t[i] = x;
		x ^= (x << 1) ^ ((x >> 7) * p);
	}

	sbox[0] = 0x63;
	for (int i = 0; i < 255; i ++) {
		x = t[255 - i];
		x |= x << 8;
		x ^= (x >> 4) ^ (x >> 5) ^ (x >> 6) ^ (x >> 7);
		sbox[t[i]] = (x ^ 0x63) & 0xFF;
	}
}

I checked and the generated sbox is correct.

If anybody has any ideas on how to reduce the function's code size, I'm happy to hear it :)

@Midronome
Copy link
Author

Midronome commented Jun 10, 2024

Just FYI I found a few more minor optimizations, when compiling for a 32-bit processor. All the operations are byte-based, while for a 32-bit processor doing operations as 32-bit words (when possible) is much more efficient, saving both space and computing time. Fortunately the compiler optimizes already most of the possible ones, and there are many which are not optimizable (for example when working on 4 non-adjacent bytes).

Replacing all the uint8_t i with uint32_t i saved a couple of bytes.

Since I'm using only CTR most of the functions benefit from being inlined, so I added a bunch of "inline" on these functions' declaration... Only to find out that the compiler was inlining them already :D

This last optimization, in KeyExpansion() saved a few more (about 30?):

	// The first round key is the key itself.
	for (i = 0; i < Nk; ++i)
	{
		((uint32_t*)roundKey)[i] = ((uint32_t*)AES_key)[i]; // optimized version of the commented 4 lines below
		/*roundKey[(i * 4) + 0] = AES_key[(i * 4) + 0];
		roundKey[(i * 4) + 1] = AES_key[(i * 4) + 1];
		roundKey[(i * 4) + 2] = AES_key[(i * 4) + 2];
		roundKey[(i * 4) + 3] = AES_key[(i * 4) + 3];*/
	}

With these, as well as removing the AES_ctx structure (fixed key stored in ROM), I reduced the code size down to 684 bytes (including the key in ROM).

@paulmenzel
Copy link

Nice. It’d be great if you sent a merge/pull request.

@algrobman
Copy link

there are a few more cases where the code could be optimized for speed and maybe code size, assuming the buffers are DW aligned. the block structure can be represented by array of 2 uint64_t. uint64_t will be emulated with 32 bit operations on 32 bit machine and utilize power of 64 machine.

  1. memcpy - all data movement are limited to 16 bytes ( AES_BLOCKLEN):
void aes_memcpy(uint8_t *to, const uint8_t *from, size_t len){
// always copies 16 bytes 
    uint64_t *to64, *from64;
    from64 = (uint64_t *)from;
    to64   = (uint64_t *)to;
    to64[0] = from64[0];
    to64[1] = from64[1];
}
  1. XOR with Iv can be recoded as :
static void XorWithIv(uint8_t* buf, const uint8_t* Iv){
    uint64_t *buf_ptr, *Iv_ptr;
    buf_ptr = (uint64_t *) buf;
    Iv_ptr = (uint64_t *) Iv;
    buf_ptr[0] ^= Iv_ptr[0];
    buf_ptr[1] ^= Iv_ptr[1];
}
  1. finally : (there is no need to worry about 8 bit counters overflow and again Xor'ing can be speed up)
void AES_CTR_xcrypt_buffer(struct AES_ctx* ctx, uint8_t* buf, size_t length){

  uint8_t buffer[AES_BLOCKLEN];
  uint64_t *buffer64, *buf64;
  
  size_t i;
  int bi;

  buffer64 = (uint64_t *) buffer;
  buf64 = (uint64_t *) buf;
  
  for (i = 0; i < length; i += AES_BLOCKLEN){
      
      aes_memcpy(buffer, ctx->Iv, AES_BLOCKLEN);
      Cipher((state_t*)buffer,ctx->RoundKey);

      /* Increment Iv and handle overflow */
      for (bi = (AES_BLOCKLEN - 1); bi >= 0; --bi){
        ctx->Iv[bi]++;
        if(ctx->Iv[bi]) { break; }  
      }
 
      buf64[0] ^= buffer64[0];
      buf64[1] ^= buffer64[1];
      buf64 += 2;
  }
}

Probably 4/8 byte operations could be enforced more elegantly having a union in AES_ctx buffers with byte and DW arrays.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants