Arc::new duplicates stack memory #111603

Closed
marc0246 opened this issue May 15, 2023 · 10 comments · Fixed by #112516
Labels
A-codegen Area: Code generation
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-bug Category: This is a bug.
I-heavy Issue: Problems and improvements with respect to binary size of generated code.
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@marc0246
Contributor

Running the following through Godbolt, using rustc 1.71.0-nightly (cba1407 2023-05-10) with -C opt-level=3:

use std::sync::Arc;

pub fn test1(x: u64) -> Arc<[u64]> {
    let array = [x; 1000];
    Arc::new(array)
}

I expected rustc to create one 8 KB stack allocation and a memcpy from it to the heap allocation; instead it creates two identical stack allocations and two memcpys to go along with them:

example::test1:
        push    rbx
        sub     rsp, 4096
        mov     qword ptr [rsp], 0
        sub     rsp, 4096
        mov     qword ptr [rsp], 0
        sub     rsp, 4096
        mov     qword ptr [rsp], 0
        sub     rsp, 3712
        movq    xmm0, rdi
        pshufd  xmm0, xmm0, 68
        mov     eax, 18
.LBB0_1:
        movdqu  xmmword ptr [rsp + 8*rax - 144], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 128], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 112], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 96], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 80], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 64], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 48], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 32], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 16], xmm0
        movdqu  xmmword ptr [rsp + 8*rax], xmm0
        add     rax, 20
        cmp     rax, 1018
        jne     .LBB0_1
        lea     rdi, [rsp + 8000]
        mov     rsi, rsp
        mov     edx, 8000
        call    qword ptr [rip + memcpy@GOTPCREL]
        mov     edi, 8016
        mov     esi, 8
        call    qword ptr [rip + __rust_alloc@GOTPCREL]
        test    rax, rax
        je      .LBB0_3
        mov     rbx, rax
        mov     qword ptr [rax], 1
        mov     qword ptr [rax + 8], 1
        mov     rdi, rax
        add     rdi, 16
        lea     rsi, [rsp + 8000]
        mov     edx, 8000
        call    qword ptr [rip + memcpy@GOTPCREL]
        mov     edx, 1000
        mov     rax, rbx
        add     rsp, 16000
        pop     rbx
        ret
.LBB0_3:
        mov     edi, 8
        mov     esi, 8016
        call    qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
        ud2

To narrow the problem down further, I eliminated the stack allocations entirely:

#![feature(get_mut_unchecked, new_uninit)]

use std::sync::Arc;

pub fn test2(x: u64) -> Arc<[u64]> {
    let mut arc = Arc::new_uninit_slice(1000);
    for elem in unsafe { Arc::get_mut_unchecked(&mut arc) } {
        elem.write(x);
    }
    unsafe { arc.assume_init() }
}

The code duplication persists, though here it is limited to the `arcinner_layout_for_value_layout` call:

example::test2:
        push    r15
        push    r14
        push    r12
        push    rbx
        push    rax
        mov     rbx, rdi
        mov     r12, qword ptr [rip + alloc::sync::arcinner_layout_for_value_layout@GOTPCREL]
        mov     edi, 8
        mov     esi, 8000
        call    r12
        mov     r14, rax
        mov     r15, rdx
        mov     edi, 8
        mov     esi, 8000
        call    r12
        test    rdx, rdx
        je      .LBB0_2
        mov     rdi, rdx
        mov     rsi, rax
        call    qword ptr [rip + __rust_alloc@GOTPCREL]
.LBB0_2:
        test    rax, rax
        je      .LBB0_6
        mov     qword ptr [rax], 1
        mov     qword ptr [rax + 8], 1
        movq    xmm0, rbx
        pshufd  xmm0, xmm0, 68
        mov     ecx, 20
.LBB0_4:
        movdqu  xmmword ptr [rax + 8*rcx - 144], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 128], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 112], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 96], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 80], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 64], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 48], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 32], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 16], xmm0
        movdqu  xmmword ptr [rax + 8*rcx], xmm0
        add     rcx, 20
        cmp     rcx, 1020
        jne     .LBB0_4
        mov     edx, 1000
        add     rsp, 8
        pop     rbx
        pop     r12
        pop     r14
        pop     r15
        ret
.LBB0_6:
        mov     rdi, r14
        mov     rsi, r15
        call    qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
        ud2
@marc0246 marc0246 added the C-bug Category: This is a bug. label May 15, 2023
@marc0246
Contributor Author

@rustbot label +A-codegen +I-slow

@rustbot rustbot added A-codegen Area: Code generation I-slow Issue: Problems and improvements with respect to performance of generated code. labels May 15, 2023
@nikic
Contributor

nikic commented May 15, 2023

Godbolt: https://rust.godbolt.org/z/GKWY8nT78

@nikic nikic added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label May 15, 2023
@workingjubilee workingjubilee added the I-heavy Issue: Problems and improvements with respect to binary size of generated code. label May 16, 2023
@marc0246
Contributor Author

It seems this is not unique to Arc::new; it happens with Rc::new as well.

It may be useful to note that without the let binding, rustc emits a single alloca: https://godbolt.org/z/9j7cevvjs
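
For reference, the no-binding variant from that link looks roughly like this (the function name is mine):

use std::sync::Arc;

pub fn test3(x: u64) -> Arc<[u64]> {
    // With the array expression passed directly, rustc emits a single alloca.
    Arc::new([x; 1000])
}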

bors added a commit to rust-lang-ci/rust that referenced this issue May 22, 2023
Fix duplicate `arcinner_layout_for_value_layout` calls when using the uninit `Arc` constructors

What this fixes is the duplicate calls to `arcinner_layout_for_value_layout` seen here: https://godbolt.org/z/jr5Gxozhj

The issue was discovered alongside rust-lang#111603 but is otherwise unrelated to the duplicate `alloca`s, which remain unsolved. Everything I have tried to solve that main issue has failed.

As for the duplicate layout calculations, I also tried slapping `#[inline]` and `#[inline(always)]` on everything in sight, but the only thing that worked in the end was to dedup the calls by hand.
@caojoshua

This applies to a lot of things that take an array as input, for example Box and Vec: https://rust.godbolt.org/z/rzoPneT3q
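
A Box analogue of test1 shows the same shape (a minimal sketch mirroring that Godbolt link; the function name is mine):

pub fn test_box(x: u64) -> Box<[u64]> {
    let array = [x; 1000];
    Box::new(array)
}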

I have a potential fix in https://reviews.llvm.org/D151638. TL;DR: marking the call as a tail call makes alias analysis smarter, which lets a memcpy be optimized away and a dead alloca be removed. I'll elaborate below.

  1. rustc copies by-ref arguments, which explains why there are two allocas. The generated code basically memcpys from alloca1 to alloca2, and then does another memcpy from alloca2 to the Arc/Vec/Box.
  2. TailCallElim fails to mark rust_alloc as a tail call. The tail mark means that the callee will not access the caller's local memory. The call cannot be marked because the function iterates over GEPs, and the loop latch condition uses the GEP in an icmp, escaping the alloca (see the sketch after this list). The patch I submitted makes sure such icmps are not treated as escaping.
  3. MemCpyOptimizer tries to optimize memcpys, but can only do so if the source alloca has not been written to since the first memcpy. BasicAA can determine that a callee cannot read/write an alloca if the call is a tail call. rust_alloc does not have the tail mark from the previous step, so BasicAA deduces that rust_alloc may read/write caller allocas.
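
As a rough Rust analogy of the loop shape that defeats TailCallElim (the real code is LLVM IR; the function here is illustrative):

// Fill a buffer by walking a pointer up to an end pointer. The loop latch
// compares two pointers (an icmp against a GEP in the IR), which LLVM
// conservatively treats as escaping the underlying alloca.
unsafe fn fill_by_ptr(dst: *mut u64, len: usize, x: u64) {
    let mut cur = dst;
    let end = dst.add(len);
    while cur != end {
        cur.write(x);
        cur = cur.add(1);
    }
}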

Interestingly, if you take the optimized code that @nikic posted and run it through -passes=memcpyopt,instcombine, the extra alloca is removed: https://rust.godbolt.org/z/WcK48WG9c. In the -O3 pipeline, LoopVectorize changes the loop from iterating over GEPs to iterating over indices, removing the icmp against a GEP. TailCallElim runs afterwards and marks rust_alloc as tail, but MemCpyOptimizer is not run again.

@caojoshua

The duplicate alloca could also have been removed if rust_alloc were set to only access inaccessible memory. It is set for libc malloc here. If it's correct, maybe rustc can set it for rust_alloc (big emphasis on if it's correct; I don't know the internals).

@caojoshua

I also want to note that this is the rustc code that emits code for the arrays. It intentionally iterates from the start pointer to the end pointer. It would probably not be hard to change it to iterate over array indices instead, which would help LLVM optimizations, but I'm not sure whether that is a desirable change.
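
A minimal sketch of the index-based alternative (again only an analogy; the actual code emits LLVM IR directly):

// Fill a buffer by index. The loop latch compares integers, so no pointer
// escapes through the comparison and alias analysis stays precise.
unsafe fn fill_by_index(dst: *mut u64, len: usize, x: u64) {
    for i in 0..len {
        dst.add(i).write(x);
    }
}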

caojoshua added a commit to caojoshua/llvm-project that referenced this issue Jun 6, 2023
Compares of the same object do not leak any bits.

This patch introduces getUnderlyingObjectLookThrough. It looks at the
output of getUnderlyingObject. If that is a PHI, it looks at all the
incoming underlying objects; if all of those objects are the same, or the
original PHI, we determine that there is a single underlying object. This
is similar to getUnderlyingObjects, but provides a more efficient way to
find a single underlying object.

This is an attempt at solving huge compile time regressions in
https://reviews.llvm.org/D152082. First, we only look through a single
PHI, not nested PHIs. Second, we only use one callsite. There are likely
other callsites that could take advantage of this over the vanilla
getUnderlyingObjects. We need to be careful about compile times. Adding
this to BasicAA::aliasCheck increases compile times by 3% on local
builds.

This can hopefully lead to improved rustc-generated code in
rust-lang/rust#111603. rustc generates pointer comparisons that this
patch can identify as non-capturing.

Differential Revision: https://reviews.llvm.org/D152241
@the8472
Member

the8472 commented Jun 11, 2023

The call cannot be marked because the function iterates over GEPs, and the loop latch condition uses the GEP in an icmp, escaping the alloca. The patch I submitted makes sure such icmps are not treated as escaping.

Huh, I was under the impression that iterating over GEPs (ptr = ptr.add(1)) and breaking on pointer equality is the preferred way to do these kinds of loops. We're doing that in several places.
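
For illustration, the pattern looks roughly like this pointer-pair iterator (a sketch, not the actual standard library code):

// Advance with add(1) and stop on pointer equality, i.e. an icmp against
// a GEP, which is exactly the pattern discussed above.
struct RawIter<T> {
    ptr: *const T,
    end: *const T,
}

impl<T> Iterator for RawIter<T> {
    type Item = *const T;

    fn next(&mut self) -> Option<*const T> {
        if self.ptr == self.end {
            None
        } else {
            let p = self.ptr;
            self.ptr = unsafe { self.ptr.add(1) };
            Some(p)
        }
    }
}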

@caojoshua

@the8472 I am not 100% sure which is better; maybe there are drawbacks to iterating over indices that I am not aware of.

Could you share some examples of where else there might be GEP iteration? I have a change in this LLVM PR which should make LLVM smarter about comparisons against GEP inductions, but I'm questioning whether we should merge it. Examples could help decide the next steps.

@the8472
Member

the8472 commented Jun 11, 2023

It's used for slices and vecs.

@the8472
Member

the8472 commented Jun 13, 2023

This led me to experiment with various loop shapes for slice::Iter::fold in #106343, and an index-based do-while loop plus a zero-length precheck yielded the largest wins out of the approaches I tried.
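
A minimal sketch of that loop shape (my reconstruction, not the actual code from #106343):

fn fold_indexed<T, B>(slice: &[T], mut acc: B, mut f: impl FnMut(B, &T) -> B) -> B {
    // Zero-length precheck so the do-while body can assume at least one element.
    if slice.is_empty() {
        return acc;
    }
    let mut i = 0;
    // Index-based do-while: the latch compares integers rather than pointers.
    loop {
        acc = f(acc, &slice[i]);
        i += 1;
        if i == slice.len() {
            break;
        }
    }
    acc
}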

@bors bors closed this as completed in 3c554f5 Jun 27, 2023
oli-obk pushed a commit to oli-obk/miri that referenced this issue Jun 28, 2023
cg_llvm: use index-based loop in write_operand_repeatedly

This should be easier for LLVM to analyze.

Fixes #111603

This needs a perf run.

cc @caojoshua (rust-lang/rust#111603 (comment))
caojoshua added a commit to caojoshua/llvm-project that referenced this issue Dec 3, 2023
Compares of the same object do not leak any bits.

This was inspired by the issues in rust-lang/rust#111603. rustc used to
generate pointer comparisons that this patch can identify as
non-capturing. The code emission in rustc was changed to avoid this
behavior, but this patch can help with other cases of pointer inductions.
lnicola pushed a commit to lnicola/rust-analyzer that referenced this issue Apr 7, 2024
cg_llvm: use index-based loop in write_operand_repeatedly

RalfJung pushed a commit to RalfJung/rust-analyzer that referenced this issue Apr 27, 2024
cg_llvm: use index-based loop in write_operand_repeatedly
