Arc::new duplicates stack memory #111603

Closed
marc0246 opened this issue May 15, 2023 · 10 comments · Fixed by #112516
Labels
A-codegen Area: Code generation
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
C-bug Category: This is a bug.
I-heavy Issue: Problems and improvements with respect to binary size of generated code.
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@marc0246
Contributor

Running the following through Godbolt, using rustc 1.71.0-nightly (cba1407 2023-05-10) with -C opt-level=3:

use std::sync::Arc;

pub fn test1(x: u64) -> Arc<[u64]> {
    let array = [x; 1000];
    Arc::new(array)
}

I expected rustc to create one 8 KB stack allocation and a memcpy from it to the heap allocation; instead it creates two identical stack allocations and two memcpys to go along with them:

example::test1:
        push    rbx
        sub     rsp, 4096
        mov     qword ptr [rsp], 0
        sub     rsp, 4096
        mov     qword ptr [rsp], 0
        sub     rsp, 4096
        mov     qword ptr [rsp], 0
        sub     rsp, 3712
        movq    xmm0, rdi
        pshufd  xmm0, xmm0, 68
        mov     eax, 18
.LBB0_1:
        movdqu  xmmword ptr [rsp + 8*rax - 144], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 128], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 112], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 96], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 80], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 64], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 48], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 32], xmm0
        movdqu  xmmword ptr [rsp + 8*rax - 16], xmm0
        movdqu  xmmword ptr [rsp + 8*rax], xmm0
        add     rax, 20
        cmp     rax, 1018
        jne     .LBB0_1
        lea     rdi, [rsp + 8000]
        mov     rsi, rsp
        mov     edx, 8000
        call    qword ptr [rip + memcpy@GOTPCREL]
        mov     edi, 8016
        mov     esi, 8
        call    qword ptr [rip + __rust_alloc@GOTPCREL]
        test    rax, rax
        je      .LBB0_3
        mov     rbx, rax
        mov     qword ptr [rax], 1
        mov     qword ptr [rax + 8], 1
        mov     rdi, rax
        add     rdi, 16
        lea     rsi, [rsp + 8000]
        mov     edx, 8000
        call    qword ptr [rip + memcpy@GOTPCREL]
        mov     edx, 1000
        mov     rax, rbx
        add     rsp, 16000
        pop     rbx
        ret
.LBB0_3:
        mov     edi, 8
        mov     esi, 8016
        call    qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
        ud2

To narrow the problem down further, I eliminated the stack allocations entirely:

#![feature(get_mut_unchecked, new_uninit)]

use std::sync::Arc;

pub fn test2(x: u64) -> Arc<[u64]> {
    let mut arc = Arc::new_uninit_slice(1000);
    for elem in unsafe { Arc::get_mut_unchecked(&mut arc) } {
        elem.write(x);
    }
    unsafe { arc.assume_init() }
}

The code duplication persists, though here it is limited to the `arcinner_layout_for_value_layout` call:

example::test2:
        push    r15
        push    r14
        push    r12
        push    rbx
        push    rax
        mov     rbx, rdi
        mov     r12, qword ptr [rip + alloc::sync::arcinner_layout_for_value_layout@GOTPCREL]
        mov     edi, 8
        mov     esi, 8000
        call    r12
        mov     r14, rax
        mov     r15, rdx
        mov     edi, 8
        mov     esi, 8000
        call    r12
        test    rdx, rdx
        je      .LBB0_2
        mov     rdi, rdx
        mov     rsi, rax
        call    qword ptr [rip + __rust_alloc@GOTPCREL]
.LBB0_2:
        test    rax, rax
        je      .LBB0_6
        mov     qword ptr [rax], 1
        mov     qword ptr [rax + 8], 1
        movq    xmm0, rbx
        pshufd  xmm0, xmm0, 68
        mov     ecx, 20
.LBB0_4:
        movdqu  xmmword ptr [rax + 8*rcx - 144], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 128], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 112], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 96], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 80], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 64], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 48], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 32], xmm0
        movdqu  xmmword ptr [rax + 8*rcx - 16], xmm0
        movdqu  xmmword ptr [rax + 8*rcx], xmm0
        add     rcx, 20
        cmp     rcx, 1020
        jne     .LBB0_4
        mov     edx, 1000
        add     rsp, 8
        pop     rbx
        pop     r12
        pop     r14
        pop     r15
        ret
.LBB0_6:
        mov     rdi, r14
        mov     rsi, r15
        call    qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
        ud2
@marc0246 marc0246 added the C-bug Category: This is a bug. label May 15, 2023
@marc0246
Contributor Author

@rustbot label +A-codegen +I-slow

@rustbot rustbot added A-codegen Area: Code generation I-slow Issue: Problems and improvements with respect to performance of generated code. labels May 15, 2023
@nikic
Contributor

nikic commented May 15, 2023

Godbolt: https://rust.godbolt.org/z/GKWY8nT78

@nikic nikic added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label May 15, 2023
@workingjubilee workingjubilee added the I-heavy Issue: Problems and improvements with respect to binary size of generated code. label May 16, 2023
@marc0246
Contributor Author

It seems this is not unique to Arc::new; it happens with Rc::new as well.

It may be useful to note that without the let binding, rustc emits a single alloca: https://godbolt.org/z/9j7cevvjs
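
For reference, the no-binding variant from that link looks roughly like this (the function name is mine):

use std::sync::Arc;

pub fn test3(x: u64) -> Arc<[u64]> {
    // With the array expression passed directly, rustc emits a single alloca.
    Arc::new([x; 1000])
}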

bors added a commit to rust-lang-ci/rust that referenced this issue May 22, 2023
Fix duplicate `arcinner_layout_for_value_layout` calls when using the uninit `Arc` constructors

What this fixes is the duplicate calls to `arcinner_layout_for_value_layout` seen here: https://godbolt.org/z/jr5Gxozhj

The issue was discovered alongside rust-lang#111603 but is otherwise unrelated to the duplicate `alloca`s, which remain unsolved. Everything I have tried to solve that main issue has failed.

As for the duplicate layout calculations, I also tried slapping `#[inline]` and `#[inline(always)]` on everything in sight, but the only thing that worked in the end was to dedup the calls by hand.
@caojoshua

This applies to a lot of things that take an array as input, for example Box and Vec: https://rust.godbolt.org/z/rzoPneT3q
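
A Box analogue of test1 shows the same shape (a minimal sketch mirroring that Godbolt link; the function name is mine):

pub fn test_box(x: u64) -> Box<[u64]> {
    let array = [x; 1000];
    Box::new(array)
}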

I have a potential fix in https://reviews.llvm.org/D151638. TL;DR: marking the call as a tail call makes alias analysis smarter, which lets a memcpy be optimized away and a dead alloca be removed. I'll elaborate below.

  1. rustc copies by-ref arguments, which explains why there are two allocas. The generated code basically memcpys from alloca1 to alloca2, and then does another memcpy from alloca2 to the Arc/Vec/Box.
  2. TailCallElim fails to mark rust_alloc as a tail call. The tail mark means that the callee will not access the caller's local memory. The call cannot be marked because the function iterates over GEPs, and the loop latch condition uses the GEP in an icmp, escaping the alloca (see the sketch after this list). The patch I submitted makes sure such icmps are not treated as escaping.
  3. MemCpyOptimizer tries to optimize memcpys, but can only do so if the source alloca has not been written to since the first memcpy. BasicAA can determine that a callee cannot read/write an alloca if the call is a tail call. rust_alloc does not have the tail mark from the previous step, so BasicAA deduces that rust_alloc may read/write caller allocas.
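
As a rough Rust analogy of the loop shape that defeats TailCallElim (the real code is LLVM IR; the function here is illustrative):

// Fill a buffer by walking a pointer up to an end pointer. The loop latch
// compares two pointers (an icmp against a GEP in the IR), which LLVM
// conservatively treats as escaping the underlying alloca.
unsafe fn fill_by_ptr(dst: *mut u64, len: usize, x: u64) {
    let mut cur = dst;
    let end = dst.add(len);
    while cur != end {
        cur.write(x);
        cur = cur.add(1);
    }
}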

Interestingly, if you take the optimized code that @nikic posted and run it through -passes=memcpyopt,instcombine, the extra alloca is removed: https://rust.godbolt.org/z/WcK48WG9c. In the -O3 pipeline, LoopVectorize changes the loop from iterating over GEPs to iterating over indices, removing the icmp against a GEP. TailCallElim runs afterwards and marks rust_alloc as tail, but MemCpyOptimizer is not run again.

@caojoshua

The duplicate alloca could also have been removed if rust_alloc were set to only access inaccessible memory. It is set for libc malloc here. If it's correct, maybe rustc can set it for rust_alloc (big emphasis on if it's correct; I don't know the internals).

@caojoshua

I also want to note that this is the rustc code that emits code for the arrays. It intentionally iterates from the start pointer to the end pointer. It would probably not be hard to change it to iterate over array indices instead, which would help LLVM optimizations, but I'm not sure whether that is a desirable change.
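
A minimal sketch of the index-based alternative (again only an analogy; the actual code emits LLVM IR directly):

// Fill a buffer by index. The loop latch compares integers, so no pointer
// escapes through the comparison and alias analysis stays precise.
unsafe fn fill_by_index(dst: *mut u64, len: usize, x: u64) {
    for i in 0..len {
        dst.add(i).write(x);
    }
}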

caojoshua added a commit to caojoshua/llvm-project that referenced this issue Jun 6, 2023
Compares of the same object do not leak any bits.

This patch introduces getUnderlyingObjectLookThrough. It looks at the
output of getUnderlyingObject. If that is a PHI, it looks at all the
incoming underlying objects; if all of those objects are the same, or the
original PHI, we determine that there is a single underlying object. This
is similar to getUnderlyingObjects, but provides a more efficient way to
find a single underlying object.

This is an attempt at solving huge compile time regressions in
https://reviews.llvm.org/D152082. First, we only look through a single
PHI, not nested PHIs. Second, we only use one callsite. There are likely
other callsites that could take advantage of this over the vanilla
getUnderlyingObjects. We need to be careful about compile times. Adding
this to BasicAA::aliasCheck increases compile times by 3% on local
builds.

This can hopefully lead to improved rustc-generated code in
rust-lang/rust#111603. rustc generates pointer comparisons that this
patch can identify as non-capturing.

Differential Revision: https://reviews.llvm.org/D152241
@the8472
Member

the8472 commented Jun 11, 2023

The call cannot be marked because the function iterates over GEPs, and the loop latch condition uses the GEP in an icmp, escaping the alloca. The patch I submitted makes sure such icmps are not treated as escaping.

Huh, I was under the impression that iterating over GEPs (ptr = ptr.add(1)) and breaking on pointer equality is the preferred way to do these kinds of loops. We're doing that in several places.
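
For illustration, the pattern looks roughly like this pointer-pair iterator (a sketch, not the actual standard library code):

// Advance with add(1) and stop on pointer equality, i.e. an icmp against
// a GEP, which is exactly the pattern discussed above.
struct RawIter<T> {
    ptr: *const T,
    end: *const T,
}

impl<T> Iterator for RawIter<T> {
    type Item = *const T;

    fn next(&mut self) -> Option<*const T> {
        if self.ptr == self.end {
            None
        } else {
            let p = self.ptr;
            self.ptr = unsafe { self.ptr.add(1) };
            Some(p)
        }
    }
}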

@caojoshua

@the8472 I am not 100% sure which is better; maybe there are drawbacks to iterating over indices that I am not aware of.

Could you share some examples of where else there might be GEP iteration? I have a change in this LLVM PR which should make LLVM smarter about comparisons against GEP inductions, but I'm questioning whether we should merge it. Examples could help decide the next steps.

@the8472
Member

the8472 commented Jun 11, 2023

It's used for slices and vecs.

@the8472
Member

the8472 commented Jun 13, 2023

This led me to experiment with various loop shapes for slice::Iter::fold in #106343, and an index-based do-while loop plus a zero-length precheck yielded the largest wins out of the approaches I tried.
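
A minimal sketch of that loop shape (my reconstruction, not the actual code from #106343):

fn fold_indexed<T, B>(slice: &[T], mut acc: B, mut f: impl FnMut(B, &T) -> B) -> B {
    // Zero-length precheck so the do-while body can assume at least one element.
    if slice.is_empty() {
        return acc;
    }
    let mut i = 0;
    // Index-based do-while: the latch compares integers rather than pointers.
    loop {
        acc = f(acc, &slice[i]);
        i += 1;
        if i == slice.len() {
            break;
        }
    }
    acc
}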

@bors bors closed this as completed in 3c554f5 Jun 27, 2023
oli-obk pushed a commit to oli-obk/miri that referenced this issue Jun 28, 2023
cg_llvm: use index-based loop in write_operand_repeatedly

This should be easier for LLVM to analyze.

Fixes #111603

This needs a perf run.

cc @caojoshua (rust-lang/rust#111603 (comment))
caojoshua added a commit to caojoshua/llvm-project that referenced this issue Dec 3, 2023
Compares of the same object do not leak any bits.

This was inspired by the issues in rust-lang/rust#111603. rustc used to
generate pointer comparisons that this patch can identify as
non-capturing. The code emission in rustc was changed to avoid this
behavior, but this patch can help with other cases of pointer inductions.
lnicola pushed a commit to lnicola/rust-analyzer that referenced this issue Apr 7, 2024
cg_llvm: use index-based loop in write_operand_repeatedly

RalfJung pushed a commit to RalfJung/rust-analyzer that referenced this issue Apr 27, 2024
cg_llvm: use index-based loop in write_operand_repeatedly
