Skip to content
This repository has been archived by the owner on Aug 18, 2023. It is now read-only.

Manually initialize GcBox contents post-allocation to reduce memory copying #2

Closed
wants to merge 2 commits into from

Conversation

adrian17
Copy link
Collaborator

Ideally, when calling

let data = GcCell::allocate(gc_context, data); // assume  struct Data { x: [u8; 1000] }

You'd assume data to be constructed in-place or moved into newly allocated memory. (the first one isn't really common as stable Rust lacks placement-new-like features). And with the struct being relatively big, you'd expect the compiler to generate a memcpy call to simply move the structure's bytes into place.

The issue currently is that due to either rustc not being smart enough or the gc-arena code not being optimizer friendly (or both), the compiler can memcpy your Data object several times before actually moving it into its final place.

For example here:

        let gc_box = GcBox {
            flags: GcFlags::new(),
            next: Cell::new(self.all.get()),
            value: UnsafeCell::new(t),
        };
        gc_box.flags.set_needs_trace(T::needs_trace());
        let ptr = NonNull::new_unchecked(Box::into_raw(Box::new(gc_box)));

The generated code will firstly do memcpy to move t into the gc_box object on stack, then allocate memory, and then do the second memcpy to move the gc_box object onto heap memory. For some reason, on wasm target the compiler is even worse at optimizing this; at the worst case, I've seen four memcpy calls for a single GC allocation. This can obviously cause unnecessary overhead.

My patch helps the compiler by simplifying the initialization - first we allocate the uninitialized memory, then we manually build the GcBox by moving its fields into place. This way the object t is moved straight into its final place without being moved into intermediate stack variable gc_box.

I was trying to show a comparison on godbolt, but as soon as I drop some layers of abstractions, rustc catches on and generates better code. This is my best attempt: https://godbolt.org/z/aaK75W . You can see that in old() there is one memcpy before allocation and one after, but in new() there is only one memcpy.

Here's a comparison on "production" code, with a decompiled wasm build of https://github.com/ruffle-rs/ruffle/ . In practice, I've seen this cause up to 15-20% speedups in some edge cases.

Before, 4x memcpy:

function gc_arena_context_MutationContext_allocate_h107c2f641d37f0af(a:int, b:int):int {
  var d:int;
  var e:int;
  var c:int = g_a - 368;
  g_a = c;
  memcpy(c + 16, b, 108);
  a[8]:int = (b = a[8]:int + 120);
  if (a[100]:ubyte == 3) {
    if (b <= a[10]:int) goto B_a;
    a[100]:byte = 0;
  }
  a[3]:double = a[3]:double + 120.0 + 120.0 / a[1]:double;
  label B_a:
  c[136]:byte = 0;
  c[16]:long = a[11]:long@4;
  memcpy(c + 140, c + 16, 108);
  gc_arena_types_GcFlags_set_needs_trace_h2ec31421f45c37eb(c + 136, 1); // No clue why it wasn't inlined
  memcpy(c + 248, c + 128, 120);
  b = rust_alloc(120, 4);
  if (b) {
    swf_string_SwfStr_as_core_convert_From_str_from_h2a696e0da16f901c( // <- this is actually `static_gc_box()`, compiler deduplicated functions. No clue why it wasn't inlined.
      c + 8,
      d = memcpy(b, c + 248, 120),
      1231000);
    b = c[2]:int;
    (a + 48)[0]:int = (e = c[3]:int);
    a[11]:int = b;
    if (a[100]:ubyte != 2) goto B_d;
    if (a[15]:int) goto B_d;
    a[16]:int = e;
    a[15]:int = b;
    label B_d:
    g_a = c + 368;
    return d;
  }
  alloc_alloc_handle_alloc_error_h9b35ff53f78b3b72(120, 4);
  return unreachable;
}

After, just two:

function gc_arena_context_MutationContext_allocate_h107c2f641d37f0af(a:int, b:long_ptr@4):int {
  var d:int;
  var c:long_ptr@4 = g_a - 240;
  g_a = c;
  memcpy(c + 8, b, 108);
  a[8]:int = (d = a[8]:int + 120);
  var e:int = 3;
  b = a[100]:ubyte;
  if (b == 3) {
    if (d <= a[10]:int) goto B_a;
    a[100]:byte = 0;
    b = 0;
  }
  a[3]:double = a[3]:double + 120.0 + 120.0 / a[1]:double;
  e = b;
  label B_a:
  b = rust_alloc(120, 4);
  if (b) {
    b[0] = c[30];
    d = b + 8;
    d[0]:int = (c + 128)[0]:int;
    d[0]:byte = 4;
    b[0] = a[11]:long@4;
    memcpy(b + 12, c + 8, 108);
    (a + 48)[0]:int = 1210328;
    a[11]:int = b;
    if (e != 2) goto B_d;
    if (a[15]:int) goto B_d;
    a[16]:int = 1210328;
    a[15]:int = b;
    label B_d:
    g_a = c + 240;
    return b;
  }
  alloc_alloc_handle_alloc_error_h9b35ff53f78b3b72(120, 4);
  return unreachable;
}

And when rust-lang/rust#82806 gets merged into Rustc , with my patch it'll become just one, how it's supposed to work :)

function gc_arena_context_MutationContext_allocate_h107c2f641d37f0af(a:int, b:int):int {
  var e:int;
  var c:int;
  a[8]:int = (e = a[8]:int + 120);
  var d:int = 3;
  c = a[100]:ubyte;
  if (c == 3) {
    if (e <= a[10]:int) goto B_a;
    a[100]:byte = 0;
    c = 0;
  }
  a[3]:double = a[3]:double + 120.0 + 120.0 / a[1]:double;
  d = c;
  label B_a:
  c = rust_alloc(120, 4);
  if (c) {
    c[8]:byte = 4;
    c[0]:long@4 = a[11]:long@4;
    memcpy(c + 12, b, 108);
    (a + 48)[0]:int = 1210328;
    a[11]:int = c;
    if (d != 2) goto B_d;
    if (a[15]:int) goto B_d;
    a[16]:int = 1210328;
    a[15]:int = c;
    label B_d:
    return c;
  }
  alloc_alloc_handle_alloc_error_h9b35ff53f78b3b72(120, 4);
  return unreachable;
}

I made sure the patch passes tests with miri.

@adrian17
Copy link
Collaborator Author

Damn, wanted to post this to the upstream repo :/

@adrian17 adrian17 closed this Mar 10, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant