
[TailDuplicator] Add maximum predecessors and successors to consider tail duplicating blocks #78582

Merged
merged 4 commits on Apr 17, 2024

Conversation

DianQK
Member

DianQK commented Jan 18, 2024

Fixes #78578.

Duplicating a BB which has both multiple predecessors and multiple successors results in a complex CFG and may also create a huge number of PHI nodes. See #78578 (comment) for a detailed description of the limit.

@llvmbot
Collaborator

llvmbot commented Jan 18, 2024

@llvm/pr-subscribers-backend-x86

Author: Quentin Dian (DianQK)

Changes

Fixes #78578.

We should add a check on the predecessor count to avoid code size explosion.

I also found a strange option during my investigation.

static cl::opt<unsigned> TailDupLimit("tail-dup-limit", cl::init(~0U),
                                      cl::Hidden);

We never actually use -tail-dup-limit, and its default (~0U) is a meaningless value.

Also, an issue with AsmPrinter may be causing this test case to print two line breaks, which makes the test fail. I haven't checked, but I don't think it affects the review at least.


Patch is 29.43 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/78582.diff

3 Files Affected:

  • (modified) llvm/lib/CodeGen/TailDuplicator.cpp (+7)
  • (modified) llvm/test/CodeGen/X86/mul-constant-result.ll (+135-188)
  • (added) llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll (+242)
diff --git a/llvm/lib/CodeGen/TailDuplicator.cpp b/llvm/lib/CodeGen/TailDuplicator.cpp
index 5ed67bd0a121ed..e76d63d3c0d66f 100644
--- a/llvm/lib/CodeGen/TailDuplicator.cpp
+++ b/llvm/lib/CodeGen/TailDuplicator.cpp
@@ -76,6 +76,11 @@ static cl::opt<bool>
 static cl::opt<unsigned> TailDupLimit("tail-dup-limit", cl::init(~0U),
                                       cl::Hidden);
 
+static cl::opt<unsigned> TailDupPredSizeLimit(
+    "tail-dup-pred-size-limit",
+    cl::desc("Maximum predecessors to consider tail duplicating."), cl::init(8),
+    cl::Hidden);
+
 void TailDuplicator::initMF(MachineFunction &MFin, bool PreRegAlloc,
                             const MachineBranchProbabilityInfo *MBPIin,
                             MBFIWrapper *MBFIin,
@@ -565,6 +570,8 @@ bool TailDuplicator::shouldTailDuplicate(bool IsSimple,
   if (TailBB.isSuccessor(&TailBB))
     return false;
 
+  if (TailDupPredSizeLimit < TailBB.pred_size())
+    return false;
   // Set the limit on the cost to duplicate. When optimizing for size,
   // duplicate only one, because one branch instruction can be eliminated to
   // compensate for the duplication.
diff --git a/llvm/test/CodeGen/X86/mul-constant-result.ll b/llvm/test/CodeGen/X86/mul-constant-result.ll
index 1f9e7a93ad0b90..73c764a3f53da1 100644
--- a/llvm/test/CodeGen/X86/mul-constant-result.ll
+++ b/llvm/test/CodeGen/X86/mul-constant-result.ll
@@ -28,162 +28,132 @@ define i32 @mult(i32, i32) local_unnamed_addr #0 {
 ; X86-NEXT:  .LBB0_4:
 ; X86-NEXT:    decl %ecx
 ; X86-NEXT:    cmpl $31, %ecx
-; X86-NEXT:    ja .LBB0_35
+; X86-NEXT:    ja .LBB0_31
 ; X86-NEXT:  # %bb.5:
 ; X86-NEXT:    jmpl *.LJTI0_0(,%ecx,4)
 ; X86-NEXT:  .LBB0_6:
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
+; X86-NEXT:    jmp .LBB0_40
 ; X86-NEXT:  .LBB0_7:
-; X86-NEXT:    .cfi_def_cfa_offset 8
 ; X86-NEXT:    leal (%eax,%eax,8), %ecx
 ; X86-NEXT:    leal (%ecx,%ecx,2), %ecx
-; X86-NEXT:    jmp .LBB0_9
+; X86-NEXT:    addl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
 ; X86-NEXT:  .LBB0_8:
 ; X86-NEXT:    movl %eax, %ecx
 ; X86-NEXT:    shll $4, %ecx
-; X86-NEXT:    jmp .LBB0_9
-; X86-NEXT:  .LBB0_10:
+; X86-NEXT:    addl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_9:
 ; X86-NEXT:    leal (%eax,%eax,4), %eax
-; X86-NEXT:    jmp .LBB0_18
-; X86-NEXT:  .LBB0_11:
+; X86-NEXT:    jmp .LBB0_39
+; X86-NEXT:  .LBB0_10:
 ; X86-NEXT:    shll $2, %eax
-; X86-NEXT:    jmp .LBB0_18
-; X86-NEXT:  .LBB0_13:
+; X86-NEXT:    jmp .LBB0_39
+; X86-NEXT:  .LBB0_11:
+; X86-NEXT:    leal (%eax,%eax,4), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_12:
 ; X86-NEXT:    leal (%eax,%eax,2), %ecx
-; X86-NEXT:    jmp .LBB0_14
-; X86-NEXT:  .LBB0_15:
+; X86-NEXT:    leal (%eax,%ecx,4), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_13:
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:    jmp .LBB0_12
-; X86-NEXT:  .LBB0_16:
+; X86-NEXT:    leal (%eax,%eax,4), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_14:
 ; X86-NEXT:    leal (%eax,%eax,4), %ecx
 ; X86-NEXT:    leal (%ecx,%ecx,4), %ecx
-; X86-NEXT:    jmp .LBB0_9
-; X86-NEXT:  .LBB0_17:
+; X86-NEXT:    addl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_15:
 ; X86-NEXT:    leal (%eax,%eax,4), %eax
-; X86-NEXT:    jmp .LBB0_12
-; X86-NEXT:  .LBB0_19:
+; X86-NEXT:    leal (%eax,%eax,4), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_17:
 ; X86-NEXT:    shll $4, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_20:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_18:
 ; X86-NEXT:    shll $2, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_21:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_19:
 ; X86-NEXT:    shll $3, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_22:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_20:
 ; X86-NEXT:    shll $5, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_23:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_21:
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:  .LBB0_33:
 ; X86-NEXT:    leal (%eax,%eax,8), %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_24:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_22:
 ; X86-NEXT:    leal (%eax,%eax,4), %ecx
-; X86-NEXT:  .LBB0_14:
 ; X86-NEXT:    leal (%eax,%ecx,4), %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_25:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_23:
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:    jmp .LBB0_18
-; X86-NEXT:  .LBB0_26:
+; X86-NEXT:    jmp .LBB0_39
+; X86-NEXT:  .LBB0_24:
 ; X86-NEXT:    leal (%eax,%eax,4), %ecx
 ; X86-NEXT:    leal (%eax,%ecx,4), %ecx
-; X86-NEXT:    jmp .LBB0_9
-; X86-NEXT:  .LBB0_27:
+; X86-NEXT:    addl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_25:
 ; X86-NEXT:    leal (%eax,%eax), %ecx
 ; X86-NEXT:    shll $4, %eax
-; X86-NEXT:    jmp .LBB0_28
-; X86-NEXT:  .LBB0_29:
+; X86-NEXT:    subl %ecx, %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_26:
 ; X86-NEXT:    leal (,%eax,8), %ecx
-; X86-NEXT:    jmp .LBB0_38
-; X86-NEXT:  .LBB0_30:
+; X86-NEXT:    jmp .LBB0_33
+; X86-NEXT:  .LBB0_27:
 ; X86-NEXT:    leal (%eax,%eax,8), %ecx
-; X86-NEXT:    jmp .LBB0_32
-; X86-NEXT:  .LBB0_31:
+; X86-NEXT:    leal (%eax,%ecx,2), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_28:
 ; X86-NEXT:    leal (%eax,%eax,4), %ecx
-; X86-NEXT:  .LBB0_32:
 ; X86-NEXT:    leal (%eax,%ecx,2), %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_34:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_29:
+; X86-NEXT:    leal (%eax,%eax,8), %eax
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_30:
 ; X86-NEXT:    movl %eax, %ecx
 ; X86-NEXT:    shll $5, %ecx
-; X86-NEXT:    jmp .LBB0_38
-; X86-NEXT:  .LBB0_35:
+; X86-NEXT:    jmp .LBB0_33
+; X86-NEXT:  .LBB0_31:
 ; X86-NEXT:    xorl %eax, %eax
-; X86-NEXT:  .LBB0_36:
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_37:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_32:
 ; X86-NEXT:    leal (%eax,%eax,2), %ecx
 ; X86-NEXT:    shll $3, %ecx
-; X86-NEXT:  .LBB0_38:
+; X86-NEXT:  .LBB0_33:
 ; X86-NEXT:    subl %eax, %ecx
 ; X86-NEXT:    movl %ecx, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_39:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_34:
 ; X86-NEXT:    shll $2, %eax
-; X86-NEXT:  .LBB0_12:
 ; X86-NEXT:    leal (%eax,%eax,4), %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_40:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_35:
 ; X86-NEXT:    shll $3, %eax
-; X86-NEXT:    jmp .LBB0_18
-; X86-NEXT:  .LBB0_41:
+; X86-NEXT:    jmp .LBB0_39
+; X86-NEXT:  .LBB0_36:
 ; X86-NEXT:    leal (%eax,%eax,8), %ecx
 ; X86-NEXT:    leal (%ecx,%ecx,2), %ecx
 ; X86-NEXT:    addl %eax, %eax
-; X86-NEXT:  .LBB0_9:
 ; X86-NEXT:    addl %ecx, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_42:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_37:
 ; X86-NEXT:    leal (%eax,%eax), %ecx
 ; X86-NEXT:    shll $5, %eax
-; X86-NEXT:  .LBB0_28:
 ; X86-NEXT:    subl %ecx, %eax
-; X86-NEXT:    popl %esi
-; X86-NEXT:    .cfi_def_cfa_offset 4
-; X86-NEXT:    retl
-; X86-NEXT:  .LBB0_43:
-; X86-NEXT:    .cfi_def_cfa_offset 8
+; X86-NEXT:    jmp .LBB0_40
+; X86-NEXT:  .LBB0_38:
 ; X86-NEXT:    leal (%eax,%eax,8), %eax
-; X86-NEXT:  .LBB0_18:
+; X86-NEXT:  .LBB0_39:
 ; X86-NEXT:    leal (%eax,%eax,2), %eax
+; X86-NEXT:  .LBB0_40:
 ; X86-NEXT:    popl %esi
 ; X86-NEXT:    .cfi_def_cfa_offset 4
 ; X86-NEXT:    retl
@@ -199,154 +169,131 @@ define i32 @mult(i32, i32) local_unnamed_addr #0 {
 ; X64-HSW-NEXT:    cmovel %ecx, %eax
 ; X64-HSW-NEXT:    decl %edi
 ; X64-HSW-NEXT:    cmpl $31, %edi
-; X64-HSW-NEXT:    ja .LBB0_31
+; X64-HSW-NEXT:    ja .LBB0_28
 ; X64-HSW-NEXT:  # %bb.1:
 ; X64-HSW-NEXT:    jmpq *.LJTI0_0(,%rdi,8)
 ; X64-HSW-NEXT:  .LBB0_2:
 ; X64-HSW-NEXT:    addl %eax, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
+; X64-HSW-NEXT:    jmp .LBB0_37
 ; X64-HSW-NEXT:  .LBB0_3:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %ecx
 ; X64-HSW-NEXT:    leal (%rcx,%rcx,2), %ecx
-; X64-HSW-NEXT:    jmp .LBB0_22
+; X64-HSW-NEXT:    jmp .LBB0_21
 ; X64-HSW-NEXT:  .LBB0_4:
 ; X64-HSW-NEXT:    movl %eax, %ecx
 ; X64-HSW-NEXT:    shll $4, %ecx
-; X64-HSW-NEXT:    jmp .LBB0_22
+; X64-HSW-NEXT:    jmp .LBB0_21
 ; X64-HSW-NEXT:  .LBB0_5:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT:  .LBB0_13:
-; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
+; X64-HSW-NEXT:    jmp .LBB0_36
 ; X64-HSW-NEXT:  .LBB0_6:
 ; X64-HSW-NEXT:    shll $2, %eax
-; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
+; X64-HSW-NEXT:    jmp .LBB0_36
+; X64-HSW-NEXT:  .LBB0_7:
+; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
+; X64-HSW-NEXT:    jmp .LBB0_37
 ; X64-HSW-NEXT:  .LBB0_8:
 ; X64-HSW-NEXT:    leal (%rax,%rax,2), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_10:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_9:
 ; X64-HSW-NEXT:    addl %eax, %eax
-; X64-HSW-NEXT:  .LBB0_7:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_11:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_10:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %ecx
 ; X64-HSW-NEXT:    leal (%rcx,%rcx,4), %ecx
-; X64-HSW-NEXT:    jmp .LBB0_22
-; X64-HSW-NEXT:  .LBB0_12:
+; X64-HSW-NEXT:    jmp .LBB0_21
+; X64-HSW-NEXT:  .LBB0_11:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_14:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_13:
 ; X64-HSW-NEXT:    shll $4, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_15:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_14:
 ; X64-HSW-NEXT:    shll $2, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_16:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_15:
 ; X64-HSW-NEXT:    shll $3, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_17:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_16:
 ; X64-HSW-NEXT:    shll $5, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_18:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_17:
 ; X64-HSW-NEXT:    addl %eax, %eax
-; X64-HSW-NEXT:  .LBB0_29:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_19:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_18:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_20:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_19:
 ; X64-HSW-NEXT:    addl %eax, %eax
-; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_21:
+; X64-HSW-NEXT:    jmp .LBB0_36
+; X64-HSW-NEXT:  .LBB0_20:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,4), %ecx
-; X64-HSW-NEXT:  .LBB0_22:
+; X64-HSW-NEXT:  .LBB0_21:
 ; X64-HSW-NEXT:    addl %eax, %ecx
 ; X64-HSW-NEXT:    movl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_23:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_22:
 ; X64-HSW-NEXT:    leal (%rax,%rax), %ecx
 ; X64-HSW-NEXT:    shll $4, %eax
 ; X64-HSW-NEXT:    subl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_25:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_23:
 ; X64-HSW-NEXT:    leal (,%rax,8), %ecx
-; X64-HSW-NEXT:    jmp .LBB0_34
-; X64-HSW-NEXT:  .LBB0_26:
+; X64-HSW-NEXT:    jmp .LBB0_30
+; X64-HSW-NEXT:  .LBB0_24:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_27:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_25:
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %ecx
 ; X64-HSW-NEXT:    leal (%rax,%rcx,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_30:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_26:
+; X64-HSW-NEXT:    leal (%rax,%rax,8), %eax
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_27:
 ; X64-HSW-NEXT:    movl %eax, %ecx
 ; X64-HSW-NEXT:    shll $5, %ecx
-; X64-HSW-NEXT:    jmp .LBB0_34
-; X64-HSW-NEXT:  .LBB0_31:
+; X64-HSW-NEXT:    jmp .LBB0_30
+; X64-HSW-NEXT:  .LBB0_28:
 ; X64-HSW-NEXT:    xorl %eax, %eax
-; X64-HSW-NEXT:  .LBB0_32:
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_33:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_29:
 ; X64-HSW-NEXT:    leal (%rax,%rax,2), %ecx
 ; X64-HSW-NEXT:    shll $3, %ecx
-; X64-HSW-NEXT:  .LBB0_34:
+; X64-HSW-NEXT:  .LBB0_30:
 ; X64-HSW-NEXT:    subl %eax, %ecx
 ; X64-HSW-NEXT:    movl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_36:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_31:
 ; X64-HSW-NEXT:    shll $2, %eax
 ; X64-HSW-NEXT:    leal (%rax,%rax,4), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_37:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_32:
 ; X64-HSW-NEXT:    shll $3, %eax
-; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_38:
+; X64-HSW-NEXT:    jmp .LBB0_36
+; X64-HSW-NEXT:  .LBB0_33:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %ecx
 ; X64-HSW-NEXT:    leal (%rcx,%rcx,2), %ecx
 ; X64-HSW-NEXT:    addl %eax, %eax
 ; X64-HSW-NEXT:    addl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_39:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_34:
 ; X64-HSW-NEXT:    leal (%rax,%rax), %ecx
 ; X64-HSW-NEXT:    shll $5, %eax
 ; X64-HSW-NEXT:    subl %ecx, %eax
-; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
-; X64-HSW-NEXT:    retq
-; X64-HSW-NEXT:  .LBB0_40:
+; X64-HSW-NEXT:    jmp .LBB0_37
+; X64-HSW-NEXT:  .LBB0_35:
 ; X64-HSW-NEXT:    leal (%rax,%rax,8), %eax
+; X64-HSW-NEXT:  .LBB0_36:
 ; X64-HSW-NEXT:    leal (%rax,%rax,2), %eax
+; X64-HSW-NEXT:  .LBB0_37:
 ; X64-HSW-NEXT:    # kill: def $eax killed $eax killed $rax
 ; X64-HSW-NEXT:    retq
   %3 = icmp eq i32 %1, 0
diff --git a/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll b/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll
new file mode 100644
index 00000000000000..47b9fcaa7d6c85
--- /dev/null
+++ b/llvm/test/CodeGen/X86/tail-dup-pred-size-limit.ll
@@ -0,0 +1,242 @@
+; NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py UTC_ARGS: --version 4
+; RUN: llc -mtriple=x86_64-unknown-linux-gnu -stop-after=early-tailduplication -tail-dup-pred-size-limit=3 < %s | FileCheck %s -check-prefix=LIMIT
+; RUN: llc -mtriple=x86_64-unknown-linux-gnu -stop-after=early-tailduplication -tail-dup-pred-size-limit=4 < %s | FileCheck %s -check-prefix=NOLIMIT
+
+define i32 @foo(ptr %0, i32 %1) {
+  ; LIMIT-LABEL: name: foo
+  ; LIMIT: bb.0 (%ir-block.2):
+  ; LIMIT-NEXT:   successors: %bb.1(0x20000000), %bb.2(0x20000000), %bb.3(0x20000000), %bb.4(0x20000000)
+  ; LIMIT-NEXT:   liveins: $rdi, $esi
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[COPY:%[0-9]+]]:gr32 = COPY $esi
+  ; LIMIT-NEXT:   [[COPY1:%[0-9]+]]:gr64 = COPY $rdi
+  ; LIMIT-NEXT:   [[SHR32ri:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 1, implicit-def dead $eflags
+  ; LIMIT-NEXT:   [[AND32ri:%[0-9]+]]:gr32 = AND32ri [[SHR32ri]], 7, implicit-def dead $eflags
+  ; LIMIT-NEXT:   [[SUBREG_TO_REG:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, killed [[AND32ri]], %subreg.sub_32bit
+  ; LIMIT-NEXT:   JMP64m $noreg, 8, [[SUBREG_TO_REG]], %jump-table.0, $noreg :: (load (s64) from jump-table)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.1 (%ir-block.5):
+  ; LIMIT-NEXT:   successors: %bb.6(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   JMP_1 %bb.6
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.2 (%ir-block.7):
+  ; LIMIT-NEXT:   successors: %bb.6(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm1:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   [[SHR32ri1:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm1]], 1, implicit-def dead $eflags
+  ; LIMIT-NEXT:   JMP_1 %bb.6
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.3 (%ir-block.10):
+  ; LIMIT-NEXT:   successors: %bb.6(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm2:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   [[SHR32ri2:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm2]], 2, implicit-def dead $eflags
+  ; LIMIT-NEXT:   JMP_1 %bb.6
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.4 (%ir-block.13):
+  ; LIMIT-NEXT:   successors: %bb.6(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm3:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   [[SHR32ri3:%[0-9]+]]:gr32 = SHR32ri [[MOV32rm3]], 3, implicit-def dead $eflags
+  ; LIMIT-NEXT:   JMP_1 %bb.6
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.5.default.unreachable2:
+  ; LIMIT-NEXT:   successors:
+  ; LIMIT: bb.6 (%ir-block.16):
+  ; LIMIT-NEXT:   successors: %bb.7(0x20000000), %bb.8(0x20000000), %bb.9(0x20000000), %bb.10(0x20000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[PHI:%[0-9]+]]:gr32 = PHI [[SHR32ri3]], %bb.4, [[SHR32ri2]], %bb.3, [[SHR32ri1]], %bb.2, [[MOV32rm]], %bb.1
+  ; LIMIT-NEXT:   [[SHR32ri4:%[0-9]+]]:gr32 = SHR32ri [[COPY]], 2, implicit-def dead $eflags
+  ; LIMIT-NEXT:   [[AND32ri1:%[0-9]+]]:gr32 = AND32ri [[SHR32ri4]], 7, implicit-def dead $eflags
+  ; LIMIT-NEXT:   [[SUBREG_TO_REG1:%[0-9]+]]:gr64_nosp = SUBREG_TO_REG 0, killed [[AND32ri1]], %subreg.sub_32bit
+  ; LIMIT-NEXT:   JMP64m $noreg, 8, [[SUBREG_TO_REG1]], %jump-table.1, $noreg :: (load (s64) from jump-table)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.7 (%ir-block.20):
+  ; LIMIT-NEXT:   successors: %bb.11(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm4:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   JMP_1 %bb.11
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT: bb.8 (%ir-block.22):
+  ; LIMIT-NEXT:   successors: %bb.11(0x80000000)
+  ; LIMIT-NEXT: {{  $}}
+  ; LIMIT-NEXT:   [[MOV32rm5:%[0-9]+]]:gr32 = MOV32rm [[COPY1]], 1, $noreg, 0, $noreg :: (load (s32) from %ir.0)
+  ; LIMIT-NEXT:   [[SHR32r...
[truncated]

@DianQK
Member Author

DianQK commented Jan 18, 2024

I can verify that the initial OOM case is back to normal.

Also, when the default branch is removed, the generated instructions improve somewhat.

With the default branch:

Iterations:        100
Instructions:      710800
Total Cycles:      166608
Total uOps:        710800

Dispatch Width:    6
uOps Per Cycle:    4.27
IPC:               4.27
Block RThroughput: 1216.5

Without the default branch:

Iterations:        100
Instructions:      709000
Total Cycles:      158311
Total uOps:        709000

Dispatch Width:    6
uOps Per Cycle:    4.48
IPC:               4.48
Block RThroughput: 1205.0

Text diff?

diff --git a/output.s b/output.s
index 322d0d0..6ca97d0 100644
--- a/output.s
+++ b/output.s
@@ -1,5 +1,5 @@
 	.text
-	.file	"oom_manual.c"
+	.file	"oom_manual2.c"
 	.globl	f1                              # -- Begin function f1
 	.p2align	4, 0x90
 	.type	f1,@function
@@ -33,12805 +33,12788 @@ f1:                                     # @f1
 	movl	%eax, %ecx
 	shrl	%ecx
 	andl	$127, %ecx
-	cmpl	$126, %ecx
-	ja	.LBB0_15
-# %bb.1:
 	jmpq	*.LJTI0_0(,%rcx,8)

I can see many cmpl instructions disappearing.

@dwblaikie
Collaborator

@aeubanks mind taking a look at this small patch? Is this limit reasonable/consistent with other similar limits, etc? Do we need more data to back up why this particular bound was chosen?

@aeubanks
Contributor

the test should be an MIR test right? then it's less prone to various changes affecting the exact codegen of the IR

@aeubanks
Contributor

do we actually know why the previous patch caused things to blow up in this pass? i.e. where the memory usage spike actually happened? was it just that we were doing too much tail duplication after the other change produced code that tended to be tail duplicated? or is there an underlying algorithmic problem that we can fix?

@DianQK
Member Author

DianQK commented Jan 19, 2024

the test should be an MIR test right? then it's less prone to various changes affecting the exact codegen of the IR

Yes, but wouldn't that make this test case harder to maintain? I'm not sure, but I can change it.

do we actually know why the previous patch caused things to blow up in this pass? i.e. where the memory usage spike actually happened?

It looks like the default branch of a switch creates an if statement (a compare instruction), which results in two successors.
The previous patch did one thing: replace the default branch with an unreachable instruction. Then TailDuplicator can duplicate blocks into their predecessors.
Since we reverted the previous patch, we can't reproduce it directly on the main branch. But it can be reproduced by using an unreachable default branch in the C source code, e.g.

    case 127:  out1 = b[2] >> 31; break; \
    default:  __builtin_unreachable(); break; \

was it just that we were doing too much tail duplication after the other change produced code that tended to be tail duplicated?

I think so.
I checked the time consumption of llc with -time-trace (llc -O1 oom.bc -time-trace -time-trace-file=trace.json -filetype=null). It takes 4s to complete on my device. About 1.3s is Early Tail Duplication, and another 1.3s is Eliminate PHI nodes for register allocation. (I tried perf, but perhaps it's my machine's environment that makes it unavailable, so I had to use TimeProfiler.)
tailDuplicateAndUpdate and tailDuplicateAndUpdate occupy half the time, respectively.

or is there an underlying algorithmic problem that we can fix?

I'm not sure, but I can try if needed. I don't know much about MIR or performance improvements, though.
I'm also a little concerned about the code size, which seems to increase from about 50K to about 200K for the initial example. :3

@aeubanks
Contributor

the test should be an MIR test right? then it's less prone to various changes affecting the exact codegen of the IR

Yes, but wouldn't that make this test case harder to maintain? I'm not sure, but I can change it.

What do you mean "poorly maintained"?

Given that the option works on number of MIR instructions, we should keep the input MIR instruction count consistent.

do we actually know why the previous patch caused things to blow up in this pass? i.e. where the memory usage spike actually happened?

It looks like the default branch of a switch creates an if statement (a compare instruction), which results in two successors. The previous patch did one thing: replace the default branch with an unreachable instruction. Then TailDuplicator can duplicate blocks into their predecessors. Since we reverted the previous patch, we can't reproduce it directly on the main branch. But it can be reproduced by using an unreachable default branch in the C source code, e.g.

    case 127:  out1 = b[2] >> 31; break; \
    default:  __builtin_unreachable(); break; \

was it just that we were doing too much tail duplication after the other change produced code that tended to be tail duplicated?

I think so. I checked the time consumption of llc with -time-trace (llc -O1 oom.bc -time-trace -time-trace-file=trace.json -filetype=null). It takes 4s to complete on my device. About 1.3s is Early Tail Duplication, and another 1.3s is Eliminate PHI nodes for register allocation. (I tried perf, but perhaps it's my machine's environment that makes it unavailable, so I had to use TimeProfiler.) tailDuplicateAndUpdate and tailDuplicateAndUpdate occupy half the time, respectively.

or is there an underlying algorithmic problem that we can fix?

I'm not sure, but I can try if I need to. But I don't know much about MIR and performance improvements. I'm also a little concerned about the code size, which seems to increase from about 50K to about 200K for the initial example. :3

if tail duplication is blowing up code size 4x, that definitely seems like a "we're doing too much tail duplication" issue. but hopefully somebody who actually understands the pass better can comment

@bzEq
Collaborator

bzEq commented Jan 19, 2024

The cause is that early-tailduplication transforms a normal switch into computed gotos, which generates a lot of indirect branches and adds a quadratic number of edges between the cases; that is, at the end of each case, control flow can jump directly to any other case. As a consequence, passes that run after early-tailduplication lag due to the more complex CFG.
You may be interested in https://reviews.llvm.org/D110613.

@weiguozhi
Contributor

If you want to limit the amount of tail duplication, I would like to see partial tail duplication based on profile information, similar to MachineBlockPlacement::maybeTailDuplicateBlock.

@DianQK
Member Author

DianQK commented Jan 21, 2024

the test should be an MIR test right? then it's less prone to various changes affecting the exact codegen of the IR

Yes, but wouldn't that make this test case harder to maintain? I'm not sure, but I can change it.

What do you mean "poorly maintained"?

Given that the option works on number of MIR instructions, we should keep the input MIR instruction count consistent.

@aeubanks
I have changed the test case. That was my mistake; I wrongly assumed the MIR file would be very large.

if tail duplication is blowing up code size 4x, that definitely seems like a "we're doing too much tail duplication" issue. but hopefully somebody who actually understands the pass better can comment

@aeubanks
I've updated the issue. I added all the details I could.

The cause is that early-tailduplication transforms a normal switch into computed gotos, which generates a lot of indirect branches and adds a quadratic number of edges between the cases; that is, at the end of each case, control flow can jump directly to any other case. As a consequence, passes that run after early-tailduplication lag due to the more complex CFG. You may be interested in https://reviews.llvm.org/D110613.

@bzEq
This looks like a very similar patch. But it looks like it's been revert for unknown reasons. What I've found so far is that the main time is early-tailduplication.

If you want to limit the number of tail duplication, I want to see a partial tail duplication base on profile information, similar to MachineBlockPlacement::maybeTailDuplicateBlock.

@weiguozhi
Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by the PGO?

@weiguozhi
Contributor

Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by the PGO?

Right. Disabling tail duplication for blocks with more than 8 predecessors may hurt the performance of some applications. Duplicating blocks into hot predecessors only can still give you the benefit of tail duplication while limiting the amount of duplication. MachineBlockPlacement does the same thing in its late tail duplication (embedded in the MBP).

@DianQK
Member Author

DianQK commented Jan 23, 2024

Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by the PGO?

Right. Disabling tail duplication for blocks with more than 8 predecessors may hurt the performance of some applications. Duplicating blocks into hot predecessors only can still give you the benefit of tail duplication while limiting the amount of duplication. MachineBlockPlacement does the same thing in its late tail duplication (embedded in the MBP).

But the given example does not contain profile information, so I don't think this is a solution to consider for the current problem.

@dwblaikie
Collaborator

Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by the PGO?

Right. Disabling tail duplication for blocks with 8 predecessors may hurt performance of some applications. Duplicating blocks into hot predecessors only can still give you the benefit of tail duplication while limiting the number of duplications. MachineBlockPlacement does the same thing in its late tail duplication (embedded in MBP).

But the currently given example does not contain profile information. I don't think this is a solution to the current problem to be considered.

I think the argument is: The currently proposed solution may harm performance in some cases - and that that loss can be mitigated at least in the presence of profile information. (& so with that mitigation, maybe the overall cost is low enough to be worth shipping) - also, even in the absence of real profile information I /think/ we have some codepaths that generate heuristic-based "profile" information (but I might be misremembering/misunderstanding) that might mean the mitigation fires even then.

@nikic
Contributor

nikic commented Jan 23, 2024

We should not implement profile-guided optimizations based on hypotheticals. Did you actually run benchmarks with this patch and saw regressions? (How large?) If not, we should do the straightforward thing until there is evidence that something more complex is justified.

@DianQK
Member Author

DianQK commented Jan 25, 2024

Hmm, CodeGen is not an area I know much about. Could you explain that? What does profile information refer to, the information provided by the PGO?

Right. Disabling tail duplication for blocks with 8 predecessors may hurt performance of some applications. Duplicating blocks into hot predecessors only can still give you the benefit of tail duplication while limiting the number of duplications. MachineBlockPlacement does the same thing in its late tail duplication (embedded in MBP).

But the currently given example does not contain profile information. I don't think this is a solution to the current problem to be considered.

I think the argument is: The currently proposed solution may harm performance in some cases - and that that loss can be mitigated at least in the presence of profile information. (& so with that mitigation, maybe the overall cost is low enough to be worth shipping) - also, even in the absence of real profile information I /think/ we have some codepaths that generate heuristic-based "profile" information (but I might be misremembering/misunderstanding) that might mean the mitigation fires even then.

I'm not sure if I want to use PGO, since most applications won't be built with PGO. But I think the limiting conditions here can be adjusted. Perhaps consider the number of instructions duplicated.

@DianQK
Member Author

DianQK commented Jan 25, 2024

We should not implement profile-guided optimizations based on hypotheticals. Did you actually run benchmarks with this patch and saw regressions? (How large?) If not, we should do the straightforward thing until there is evidence that something more complex is justified.

I've recorded the compile times in the issue. I also ran a simple runtime benchmark.

The oom_manual.c file replaces the default branch with __builtin_unreachable().

clang -v:

clang version 17.0.6
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /nix/store/j7mhyvnbbm4pk7vriilrc5wvh2kims5p-clang-17.0.6/bin

main.c:

int src(void) {
  return -1;
}

extern int f1(unsigned int *b);

int main(int argc, char **argv) {
  int r = argc;
  unsigned int b[] = { -1, -2, -3 };
  for (int i = 0; i < 1000000; i++) {
    r += f1(b);
  }
  return r;
}

build.sh:

clang -O1 oom_manual.c main.c -o oom_manual
clang -O1 oom_manual2.c main.c -o oom_manual2
ls -lh oom_manual oom_manual2

output:

-rwxr-xr-x 1 dianqk users  56K Jan 25 22:15 oom_manual
-rwxr-xr-x 1 dianqk users 192K Jan 25 22:15 oom_manual2

hyperfine.sh:

hyperfine -i -N --runs 200 --warmup 50 ./oom_manual ./oom_manual2

output:

Benchmark 1: ./oom_manual
  Time (mean ± σ):      12.4 ms ±   0.0 ms    [User: 12.2 ms, System: 0.2 ms]
  Range (min … max):    12.4 ms …  12.6 ms    200 runs

Benchmark 2: ./oom_manual2
  Time (mean ± σ):      14.9 ms ±   0.3 ms    [User: 14.7 ms, System: 0.2 ms]
  Range (min … max):    14.8 ms …  18.8 ms    200 runs

Summary
  ./oom_manual ran
    1.20 ± 0.02 times faster than ./oom_manual2

perf.sh:

function run_perf() {
  echo "perf stat $1"
  perf stat -x \; \
    -e instructions \
    -e instructions:u \
    -e cycles \
    -e task-clock \
    -e branches \
    -e branch-misses \
    $1
}

run_perf ./oom_manual
run_perf ./oom_manual2

output:

perf stat ./oom_manual
251193019;;instructions:u;12839070;100.00;3.63;insn per cycle
251193019;;instructions:u;12839070;100.00;3.63;insn per cycle
69202217;;cycles:u;12839070;100.00;5.390;GHz
12.84;msec;task-clock:u;12839070;100.00;0.981;CPUs utilized
56047969;;branches:u;12839070;100.00;4.365;G/sec
2876;;branch-misses:u;12839070;100.00;0.01;of all branches
perf stat ./oom_manual2
424193026;;instructions:u;15665683;100.00;5.12;insn per cycle
424193026;;instructions:u;15665683;100.00;5.12;insn per cycle
82892122;;cycles:u;15665683;100.00;5.291;GHz
15.67;msec;task-clock:u;15665683;100.00;0.981;CPUs utilized
32047975;;branches:u;15665683;100.00;2.046;G/sec
2827;;branch-misses:u;15665683;100.00;0.01;of all branches

I am trying to change the code to see the results of different scenarios.

@DianQK
Member Author

DianQK commented Feb 5, 2024

I made some progress.

Normally at most two instructions can be duplicated; for blocks ending in an indirect branch, the limit is raised to 20.

// If the target has hardware branch prediction that can handle indirect
// branches, duplicating them can often make them predictable when there
// are common paths through the code. The limit needs to be high enough
// to allow undoing the effects of tail merging and other optimizations
// that rearrange the predecessors of the indirect branch.
bool HasIndirectbr = false;
if (!TailBB.empty())
  HasIndirectbr = TailBB.back().isIndirectBranch();

if (HasIndirectbr && PreRegAlloc)
  MaxDuplicateCount = TailDupIndirectBranchSize;

I understand from the comment that this is meant to improve branch prediction accuracy. I wanted to know whether it is still appropriate when there are numerous indirect branches, so I did some experiments at https://github.com/DianQK/llvm-tail-dup-indirect-succ-size.

I'm using the instructions:u and branch-misses:u counts of a run that doesn't call f1 as a baseline. The dup configuration uses -tail-dup-indirect-size=20, and the nodup configuration uses -tail-dup-indirect-size=2. I think duplication may not make sense in this scenario.

One of my results is as follows:

==================
Iterations: 1
dup: instructions: 418, branches: 32, branch-misses: 64
nodup: instructions: 280, branches: 53, branch-misses: 1
==================
Iterations: 2
dup: instructions: 840, branches: 63, branch-misses: 60
nodup: instructions: 565, branches: 106, branch-misses: 30
==================
Iterations: 3
dup: instructions: 1260, branches: 93, branch-misses: 46
nodup: instructions: 857, branches: 159, branch-misses: 62
==================
Iterations: 4
dup: instructions: 1681, branches: 123, branch-misses: 48
nodup: instructions: 1135, branches: 212, branch-misses: 74
==================
Iterations: 5
dup: instructions: 2102, branches: 152, branch-misses: 46
nodup: instructions: 1434, branches: 265, branch-misses: 67
==================
Iterations: 6
dup: instructions: 2536, branches: 183, branch-misses: 55
nodup: instructions: 1705, branches: 318, branch-misses: 82
==================
Iterations: 7
dup: instructions: 2944, branches: 212, branch-misses: 60
nodup: instructions: 1990, branches: 371, branch-misses: 77
==================
Iterations: 8
dup: instructions: 3371, branches: 243, branch-misses: 45
nodup: instructions: 2279, branches: 424, branch-misses: 57
==================
Iterations: 9
dup: instructions: 3793, branches: 273, branch-misses: 48
nodup: instructions: 2565, branches: 477, branch-misses: 56
==================
Iterations: 10
dup: instructions: 4232, branches: 309, branch-misses: 61
nodup: instructions: 2870, branches: 536, branch-misses: 49
==================
Iterations: 100
dup: instructions: 42138, branches: 3015, branch-misses: 54
nodup: instructions: 28540, branches: 5312, branch-misses: 55
==================
Iterations: 1000
dup: instructions: 421057, branches: 30020, branch-misses: 42
nodup: instructions: 285055, branches: 53018, branch-misses: 84
==================
Iterations: 10000
dup: instructions: 4210078, branches: 300027, branch-misses: 51
nodup: instructions: 2850075, branches: 530024, branch-misses: 15
==================
Iterations: 100000
dup: instructions: 42100100, branches: 3000035, branch-misses: 60
nodup: instructions: 28500098, branches: 5300032, branch-misses: 52
==================
Iterations: 1000000
dup: instructions: 421000146, branches: 30000068, branch-misses: 157
nodup: instructions: 285000149, branches: 53000061, branch-misses: 117
==================
Iterations: 10000000
dup: instructions: 4210000439, branches: 300000341, branch-misses: 749
nodup: instructions: 2850000384, branches: 530000291, branch-misses: 621
==================
Iterations: 100000000
dup: instructions: 42100003182, branches: 3000003075, branch-misses: 6277
nodup: instructions: 28500002671, branches: 5300002564, branch-misses: 5186

If this is the right route, I'll continue to figure out the other two problems.

  • Why does dup execute more instructions? (Related to the increased PHI nodes?)
  • What is a suitable successor size?

@DianQK
Member Author

DianQK commented Feb 24, 2024

@efriedma-quic I've put most of the analysis into this comment.

In a nutshell, I suspect that duplicating critical BBs makes the CFG exceptionally complex, especially within loops, which may be the primary reason other passes take more time. A critical BB is one with both multiple predecessors and multiple successors.
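The limit this PR converges on can be sketched as follows (a simplified model, not the verbatim TailDuplicator.cpp change; the option names -tail-dup-pred-size and -tail-dup-succ-size and the default of 16 are taken from the later discussion in this thread):

```cpp
#include <cassert>

// Simplified model of the new limit: skip tail duplication only for a
// "critical" block, i.e. one with BOTH many predecessors AND many
// successors. Defaults of 16 are assumed per the discussion.
constexpr unsigned TailDupPredSize = 16;
constexpr unsigned TailDupSuccSize = 16;

bool shouldSkipTailDup(unsigned NumPreds, unsigned NumSuccs) {
  // A block that is large on only one side can still be duplicated.
  return NumPreds > TailDupPredSize && NumSuccs > TailDupSuccSize;
}
```

Requiring both counts to exceed the threshold keeps the check narrow: ordinary blocks with many predecessors but few successors (or vice versa) are unaffected.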

@DianQK DianQK changed the title [TailDuplicator] Add a limit on the number of indirect branch successors [TailDuplicator] Add maximum predecessors and successors to consider tail duplicating blocks Feb 24, 2024
@DianQK
Member Author

DianQK commented Feb 27, 2024

Ping. If the new changes are suitable, I hope to catch up on the final release of 18.1.0.

@dtcxzyw
Member

dtcxzyw commented Mar 22, 2024

Ping.

@DianQK
Member Author

DianQK commented Apr 15, 2024

Ping. I assume this is an uncommon scenario, since there has never been feedback on similar compile-time issues before.

Contributor

@nikic nikic left a comment


LGTM. The new check looks sufficiently narrow (requires many predecessors and many successors) to just do this.

@DianQK DianQK merged commit 86a7828 into llvm:main Apr 17, 2024
3 of 4 checks passed
@DianQK DianQK deleted the tail-dup-pred-size-limit branch April 17, 2024 12:27
@alexfh
Contributor

alexfh commented Apr 22, 2024

We started seeing a ~36% regression in the https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/Misc/evalloop.c benchmark on AArch64. Is this expected?

@alexfh
Contributor

alexfh commented Apr 25, 2024

We started seeing a ~36% regression in the https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/Misc/evalloop.c benchmark on AArch64. Is this expected?

@DianQK ^^

It's not blocking us in any way, but it would be nice to ensure nothing wrong is happening here.

@DianQK
Member Author

DianQK commented Apr 25, 2024

We started seeing a ~36% regression in the https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/Misc/evalloop.c benchmark on AArch64. Is this expected?

@DianQK ^^

It's not blocking us in any way, but it would be nice to ensure nothing wrong is happening here.

I apologize for my late reply. (*^^*)
I think we can adjust the values of tail-dup-succ-size and tail-dup-pred-size; based on my experiments, compile time is acceptable with values up to 100.

@DianQK
Member Author

DianQK commented Apr 25, 2024

I'll verify at the end of the week that increasing to 128 has no noticeable impact on compilation time.

@DianQK
Member Author

DianQK commented Apr 27, 2024

We started seeing a ~36% regression in the https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Benchmarks/Misc/evalloop.c benchmark on AArch64. Is this expected?

Could you try with -tail-dup-pred-size=64 -tail-dup-succ-size=64? It seems I can't reproduce it on x86-64.

@alexfh
Contributor

alexfh commented May 2, 2024

Sure, I'll run the comparison before and after this commit with -tail-dup-pred-size=64 -tail-dup-succ-size=64. And yes, this only reproduces on AArch64, but not on any flavor of x86-64 I have access to.

@alexfh
Contributor

alexfh commented May 2, 2024

Benchmark compiled with clang after this commit with -mllvm=-tail-dup-pred-size=64 -mllvm=-tail-dup-succ-size=64 has no regression compared to before this commit. I'm running with 32 just in case. Also note that this benchmark is in no way considered representative of any real workloads we care about.

@alexfh
Contributor

alexfh commented May 2, 2024

Actually, just -mllvm=-tail-dup-succ-size=32 (without touching -tail-dup-pred-size) is enough to fix the regression, while -mllvm=-tail-dup-succ-size=24 is not. Thus, the optimal value of -tail-dup-succ-size for this particular benchmark on AArch64 is somewhere between 24 and 32 (with the default value of -tail-dup-pred-size, 16).

@DianQK
Member Author

DianQK commented May 6, 2024

Thanks, I will continue to investigate this.

@DianQK
Member Author

DianQK commented May 12, 2024

Hmm, I tried LLVM 18 and the main (c8864bc) branch on a Raspberry Pi 4 (arm64), but I didn't find any performance issues:


 Performance counter stats for './llvm18':

          1,805.64 msec task-clock:u                     #    0.994 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                46      page-faults:u                    #   25.476 /sec
     2,653,746,964      cycles:u                         #    1.470 GHz
     2,656,672,132      instructions:u                   #    1.00  insn per cycle
   <not supported>      branches:u
           247,193      branch-misses:u

       1.815950443 seconds time elapsed

       1.800769000 seconds user
       0.004009000 seconds sys


Sum: 3273600000

 Performance counter stats for './main':

          1,784.68 msec task-clock:u                     #    0.998 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
                46      page-faults:u                    #   25.775 /sec
     2,653,681,171      cycles:u                         #    1.487 GHz
     2,656,672,132      instructions:u                   #    1.00  insn per cycle
   <not supported>      branches:u
           241,177      branch-misses:u

       1.788022549 seconds time elapsed

       1.780709000 seconds user
       0.003991000 seconds sys

dtcxzyw added a commit that referenced this pull request Jun 20, 2024
…96089)

This patch reverts #81585 as
#78582 has been landed.
Now clang works well with reproducer
#79993 (comment).
AlexisPerry pushed a commit to llvm-project-tlp/llvm-project that referenced this pull request Jul 9, 2024
…" (llvm#96089)

This patch reverts llvm#81585 as
llvm#78582 has been landed.
Now clang works well with reproducer
llvm#79993 (comment).
Successfully merging this pull request may close these issues.

[CodeGen][OOM] Switch statements without default branching result in long compile times or OOM