Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binary search performance regressed in 1.25 #53823

Closed
Emerentius opened this issue Aug 30, 2018 · 36 comments · Fixed by #128254
Closed

Binary search performance regressed in 1.25 #53823

Emerentius opened this issue Aug 30, 2018 · 36 comments · Fixed by #128254
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. P-low Low priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@Emerentius
Copy link
Contributor

Emerentius commented Aug 30, 2018

The speedup obtained by the optimizations from #45333 which hit stable in 1.23 were mostly lost again from 1.25 onwards, because the compiler started emitting a branch instead of a cmov.

1.24

14  cmp     qword ptr [rsi + 8*rax], rcx
15  cmova   rax, r8

1.25

18  cmp     qword ptr [rsi + 8*r9], rcx
19  ja      .LBB0_7
20  mov     r8, r9
21  .LBB0_7:

From godbolt: https://godbolt.org/z/DUDl-T

Bench comparison with 1.23 nightly (2017-11-20, 9 days after PR merge)

 name                        old_1.23 ns/iter  new_1.23 ns/iter  diff ns/iter   diff %  speedup 
 binary_search_l1            70                28                         -42  -60.00%   x 2.50 
 binary_search_l1_with_dups  46                28                         -18  -39.13%   x 1.64 
 binary_search_l2            94                42                         -52  -55.32%   x 2.24 
 binary_search_l2_with_dups  68                42                         -26  -38.24%   x 1.62 
 binary_search_l3            267               234                        -33  -12.36%   x 1.14 
 binary_search_l3_with_dups  198               236                         38   19.19%   x 0.84 

and status quo with 1.30 nightly (2018-08-15)

 name                        old_1.30 ns/iter  new_1.30 ns/iter  diff ns/iter   diff %  speedup 
 binary_search_l1            75                68                          -7   -9.33%   x 1.10 
 binary_search_l1_with_dups  46                61                          15   32.61%   x 0.75 
 binary_search_l2            99                89                         -10  -10.10%   x 1.11 
 binary_search_l2_with_dups  71                89                          18   25.35%   x 0.80 
 binary_search_l3            272               245                        -27   -9.93%   x 1.11 
 binary_search_l3_with_dups  212               256                         44   20.75%   x 0.83 
@estebank estebank added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Aug 30, 2018
@Emerentius
Copy link
Contributor Author

Emerentius commented Aug 31, 2018

The likely culprit is the upgrade from LLVM 4 to 6. The upgrade was merged on the 10th of Feb. The binary search was fast on the 9th and slow on the 12th.

@estebank The optimization did hit stable (if only for 2 releases), so does it not need the regression-from-stable-to-stable label?

@estebank
Copy link
Contributor

@Emerentius what do you mean for only 2 releases? Isn't the slowdown present in stable through nightly? Reading the assembly from godbolt it seems to me like there have been some changes but the code causing the slowdown presented in this issue is still there. If that is the case, then it would need that label, but if it was fixed in current stable, then we can close the ticket, as I do not believe we cut point releases for previous stable releases.

@Emerentius
Copy link
Contributor Author

@estebank I mean it was fast in 1.23 stable and slow from 1.25 stable on, so fast for 2 stable releases, then regressed.

@estebank estebank added the regression-from-stable-to-stable Performance or correctness regression from one stable version to another. label Aug 31, 2018
@pnkfelix pnkfelix self-assigned this Nov 29, 2018
@pnkfelix
Copy link
Member

triage: P-medium, since I don't think investigating this takes priority over other things currently being juggled. but also self-assigning since I'm curious and I think I can do some bisection as a background task, assuming I can reproduce the perf regression locally.

@pnkfelix
Copy link
Member

discussed in T-compiler meeting. Re-prioritizing as P-high just for evaluating the scope of the regression (though some good work on that is already evident in this comment thread).

@pnkfelix pnkfelix added P-high High priority and removed I-slow Issue: Problems and improvements with respect to performance of generated code. labels Nov 29, 2018
@pnkfelix
Copy link
Member

Not sure why I previously removed I-slow; I'm putting it back now.

@pnkfelix pnkfelix added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Dec 12, 2018
@pnkfelix
Copy link
Member

Hey @Emerentius, what is the distinction you are making in the columns with the prefix "old_" and "new_" in the tables you present?

That is, I would have thought that "old_" somehow corresponds to the state prior to the changes from PR #45333 and "new_" corresponds to after PR #45333 landed; but you already established that PR #45333 landed in between the two versions that each have "old_" and "new_" labels applied to them.

So I don't understand what difference has been applied to either the code or the benchmark in between the two columns there. (Hypothesis: Is "old_" testing the binary search implementation prior to PR #45333, and "new_" testing the binary search implementation that was added by PR #45333 ?)

@pnkfelix
Copy link
Member

pnkfelix commented Dec 12, 2018

(In any case I definitely am able to reproduce a regression on my Linux desktop on benches/slice between nightly-2018-02-09 and 2018-02-12, as anticipated/noted by #53823 (comment) ; there's a fair amount of noise between runs when looking at binary_search_l3 in my own machine, but binary_search_l1 is much more consistent in illustrating a 2x regression for me.)

@Emerentius
Copy link
Contributor Author

@pnkfelix

Hypothesis: Is "old_" testing the binary search implementation prior to PR #45333, and "new_" testing the binary search implementation that was added by PR #45333 ?

That is exactly it

@pnkfelix
Copy link
Member

pnkfelix commented Dec 13, 2018

Okay so I've done some preliminary analysis of the scope of the problem.

On my desktop linux machine, I built the benchmark suite in libcore/benches atop the nightly compilers from 2018-02-09 and 2018-02-12. Then I did three runs of the suite on each compiler (for a total of six runs).

Update: On the LLVM ticket associated with this issue that was subsequently filed, someone noted that I didn't indicate the processor model where I gathered this data. Here that is, in a details block.

% cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 1728.267
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 1545.735
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 1
cpu cores       : 4
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 1588.301
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 2
cpu cores       : 4
apicid          : 4
initial apicid  : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 2202.815
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 6
initial apicid  : 6
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 4
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 1990.709
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 1
initial apicid  : 1
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 5
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 2112.084
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 1
cpu cores       : 4
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 6
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 1789.977
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 2
cpu cores       : 4
apicid          : 5
initial apicid  : 5
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 60
model name      : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
stepping        : 3
microcode       : 0x25
cpu MHz         : 2148.265
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_goo\
d nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdra\
nd lahf_lm abm cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts flush_l\
1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7183.27
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

Here is the raw data from that exercise: https://gist.github.com/8f511af7012c750928a6bdbf8efa3e6f

Then I put it into a spreadsheet, took the min (of the three runs) of the reported ns/iter for each benchmark, and asked to see the ratio (i.e. slowdown, where 2.0 is a 2x slowdown, 1.0 is no change, and 0.5 is a 2x speedup), comparing the min for 2018-02-09 to 2018-02-12.

  • (One might argue I should use average instead of minimum... but for understanding compiler output, I go along with the argument presented here that usually the outliers are much slower than normal, rather than faster, and so taking the minimum is more likely to filter out transient noise rather than artificial improvement)

Finally, when looking at the slowdown ratio, I asked for the values > 1.2 to be highlighted in red (so that we'd pay special attention to a 1.2x slowdown) and the values < 0.9 to be highlighted in green (so that we'd also notice reasonable speedups).

Here is that spreadsheet: https://www.dropbox.com/s/qznzchqhibi5o67/three-runs.xlsx?dl=0

@pnkfelix
Copy link
Member

pnkfelix commented Dec 13, 2018

A number of benchmarks improved significantly (i.e. the slowdown ratio is <0.9) over the time period specified. I'm not going to list them here, but there are ~44 of them.

But here are the 9 cases that regressed:

test 2018-02-09 ns/iter (min of 3) 2018-02-12 ns/iter (min of 3) slowdown
char::methods::bench_to_digit_radix_10 16,208 20,078 1.24x
iter::bench_zip_then_skip 2,580 3,899 1.51x
num::bench_u32_from_str_radix_2 17,896 23,223 1.30x
slice::binary_search_l1 21 51 2.43x
slice::binary_search_l1_with_dups 21 46 2.19x
slice::binary_search_l2 31 67 2.16x
slice::binary_search_l2_with_dups 31 67 2.16x
slice::binary_search_l3 108 141 1.31x
slice::binary_search_l3_with_dups 109 142 1.30x

Obviously the latter six cases are the ones that were already being pointed out by this bug.

(The other three entries in the table may or may not have the same core cause.)

Obviously the libcore/benches benchmark should not be the sole decider about whether a regression is significant, but based on these results, the impact does seem to be somewhat restricted in scope...

@pnkfelix
Copy link
Member

pnkfelix commented Dec 14, 2018

Also, there are some pretty interesting discussions among LLVM developers about when they turn cmov into branches. It sounds there are a number of scenarios where this exhibits better performance due to (I think) either 1. the processor's branch prediction or 2. an expensive test condition that you'd rather evaluate just once

@pnkfelix
Copy link
Member

In particular, in the LLVM developer discussions linked above, I saw mention of the following flag: -x86-cmov-converter=false

I gave that a whirl, and the results seem promising:

% rustc +nightly-2018-02-12 -O -C llvm-args=-x86-cmov-converter=false --test lib.rs -o ../benches-2018-02-12.nocmovconv
% rustc +nightly-2018-02-12 -O                                        --test lib.rs -o ../benches-2018-02-12
% rustc +nightly-2018-02-09 -O                                        --test lib.rs -o ../benches-2018-02-09
% ../benches-2018-02-09 --bench slice

running 6 tests
test slice::binary_search_l1                               ... bench:          21 ns/iter (+/- 0)
test slice::binary_search_l1_with_dups                     ... bench:          21 ns/iter (+/- 0)
test slice::binary_search_l2                               ... bench:          31 ns/iter (+/- 0)
test slice::binary_search_l2_with_dups                     ... bench:          31 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench:         110 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups                     ... bench:         112 ns/iter (+/- 7)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 142 filtered out

% ../benches-2018-02-12 --bench slice

running 6 tests
test slice::binary_search_l1                               ... bench:          51 ns/iter (+/- 0)
test slice::binary_search_l1_with_dups                     ... bench:          46 ns/iter (+/- 0)
test slice::binary_search_l2                               ... bench:          67 ns/iter (+/- 1)
test slice::binary_search_l2_with_dups                     ... bench:          67 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench:         143 ns/iter (+/- 1)
test slice::binary_search_l3_with_dups                     ... bench:         142 ns/iter (+/- 0)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 142 filtered out

% ../benches-2018-02-12.nocmovconv --bench slice

running 6 tests
test slice::binary_search_l1                               ... bench:          21 ns/iter (+/- 0)
test slice::binary_search_l1_with_dups                     ... bench:          21 ns/iter (+/- 0)
test slice::binary_search_l2                               ... bench:          31 ns/iter (+/- 1)
test slice::binary_search_l2_with_dups                     ... bench:          31 ns/iter (+/- 0)
test slice::binary_search_l3                               ... bench:         114 ns/iter (+/- 2)
test slice::binary_search_l3_with_dups                     ... bench:         115 ns/iter (+/- 1)

test result: ok. 0 passed; 0 failed; 0 ignored; 6 measured; 142 filtered out

@pnkfelix
Copy link
Member

pnkfelix commented Dec 14, 2018

To be clear: I am not suggesting that we should turn the x86-cmov-converter pass off by default in rustc.

But:

  1. we should check whether the improved version from https://reviews.llvm.org/D36081 is in our LLVM (and if not, consider cherry-picking it over), and
  2. if we can reproduce this issue on the current LLVM version, then we should file an issue with them upstream.

@nikic
Copy link
Contributor

nikic commented Dec 14, 2018

Just checked, that patch is in our LLVM, and this still reproduces with current LLVM (running generated IR through llc).

@nikic
Copy link
Contributor

nikic commented Dec 14, 2018

I think the issue here might be that cmov conversion does not realize that it will not be possible to sink the cmovd chain into the created branches, because the condition also depends on them. I misunderstood what this pass is intended to do. It's not interested in hoisting anything into the branches, the only point is to remove the dependency on the branch condition through speculation.

@nikic
Copy link
Contributor

nikic commented Dec 14, 2018

I've filed an LLVM issue: https://bugs.llvm.org/show_bug.cgi?id=40027

@nikic nikic added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Dec 15, 2018
@pnkfelix
Copy link
Member

Okay at this point I think I've convinced myself that the scope of this problem is somewhat limited,

So I'm going to downgrade this to P-medium.

It also isn't really going to be addressed on our end (except perhaps via cherry-picking a patch for the LLVM issue).

So I'm tempted to close it, but I'll maybe wait to let other members of @rust-lang/compiler weigh in before I do that.

@pnkfelix pnkfelix added P-medium Medium priority and removed P-high High priority labels Dec 17, 2018
@pnkfelix
Copy link
Member

I'll leave myself assigned for now, but mostly as a reminder to close it in the future.

If someone else wants to take this issue and run with it, feel free to assign yourself and unassign me.

@nikic
Copy link
Contributor

nikic commented Dec 17, 2018

I think the main actionable point here is to try and change the binary search implementation in a way that does not run into this LLVM issue. Not sure how that would look like, but something that interchanges the order of the compare and the load (i.e. already load for the next iteration) might work, because this is a pattern that LLVM specifically checks for.

@arielb1 arielb1 added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Dec 17, 2018
@the8472
Copy link
Member

the8472 commented Jul 4, 2021

Putting this on T-libs if they can find a libs-level solution.

I have tested several older revisions of the binary_search and partition_point implementations. All of them had their conditional moves converted to cmp + branch due to this llvm change.

This LLVM issue comment suggests that the primary heuristic which turns off this optimization only looks for binary tree searches (unpredictable pointer chasing) but not binary searches (unpredictable pointer arithmetic).

Another heuristic related to cost modelling has been added. Modifying its threshold via -C llvm-args=-x86-cmov-converter-threshold=10 does revert the old binary search implementation (see previous comment) to conditional moves. This doesn't work for the current, short-circuiting implementation from #74024 (but running that one on rust 1.24 does produce cmovs).
I don't think there's anything we could do on the libs side to trigger that heuristic with its current default (4).

The options I currently see are:

  1. wait for an llvm fix (might require a new unpredictable intrinsic)
  2. special-case x86 with assembly
  3. tweak llvm-args

Only option 2 would be a pure libs thing. Some of those options would still require changes to the current implementation and benchmark runs. The old while size > 1 approach seems to be easier to turn into cmovs than current while left < right one. Or separate implementations partition_point and binary_search may be needed.

@joshtriplett joshtriplett added P-low Low priority T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. and removed P-medium Medium priority labels Jul 14, 2021
@yaahc
Copy link
Member

yaahc commented Jul 19, 2021

We discussed this issue in our last libs team meeting and concluded that there's nothing left for libs to do, and that we need to wait for the fix to happen upstream in https://bugs.llvm.org/show_bug.cgi?id=40027. Our plan is to hand this over to the compiler team for tracking as discussed in https://rust-lang.zulipchat.com/#narrow/stream/131828-t-compiler/topic/tracking.20an.20old.20regression. For now we've tagged both teams and nominated it so that the compiler team can confirm the handover in their next meeting.

@pnkfelix
Copy link
Member

@yaahc @joshtriplett I'm happy to confirm handover to T-compiler, but the prioritization of this (currently P-low; P-medium as of 8 days ago) essentially means that we're never going to prioritize it for discussion in a meeting. It will just be task hanging out there for interested people to look at and discuss.

So, I just wanted to confirm: Is that an acceptable outcome from the point of view of you all? Or do you think this should be assigned higher priority?

@joshtriplett
Copy link
Member

@pnkfelix That's roughly the outcome we were expecting.

We do feel that this would be worthwhile to have on the radar of a team whose primary purview is LLVM-related fixes that would benefit Rust, but we also recognize that that's a "somebody should", and we don't have a somebody to offer.

@yaahc yaahc removed the T-libs Relevant to the library team, which will review and decide on the PR/issue. label Jul 22, 2021
@yaahc
Copy link
Member

yaahc commented Jul 22, 2021

What Josh said, I've gone ahead and removed the T-libs label, thanks for the confirmation @pnkfelix ^_^.

@vlovich
Copy link

vlovich commented Sep 1, 2023

Could something like https://docs.rs/cmov/latest/cmov/ be used in the interim to force the usage of cmov to recover the performance?

@mohe2015
Copy link
Contributor

mohe2015 commented Sep 1, 2023

Is profile guided optimization able to do anything in this case maybe?

@nikic
Copy link
Contributor

nikic commented Sep 14, 2023

The x86 backend should now respect __builtin_unpredicate for cmov conversion, so if we can expose that in rustc in some way, we can use that.

@the8472
Copy link
Member

the8472 commented Sep 14, 2023

There's a discussion what to do about likely()/unlikely() on IRLO. Adding an unpredictable() was mentioned too.

Amanieu added a commit to Amanieu/rust that referenced this issue Jul 26, 2024
This restores the original binary search implementation from rust-lang#45333
which has the nice property of having a loop count that only depends on
the size of the slice. This, along with explicit conditional moves
from rust-lang#128250, means that the entire binary search loop can be perfectly
predicted by the branch predictor.

Additionally, LLVM is able to unroll the loop when the slice length is
known at compile-time. This results in a very compact code sequence of
3-4 instructions per binary search step and zero branches.

Fixes rust-lang#53823
@Amanieu
Copy link
Member

Amanieu commented Jul 26, 2024

See #128254 for a potential fix.

Amanieu added a commit to Amanieu/rust that referenced this issue Jul 26, 2024
This restores the original binary search implementation from rust-lang#45333
which has the nice property of having a loop count that only depends on
the size of the slice. This, along with explicit conditional moves
from rust-lang#128250, means that the entire binary search loop can be perfectly
predicted by the branch predictor.

Additionally, LLVM is able to unroll the loop when the slice length is
known at compile-time. This results in a very compact code sequence of
3-4 instructions per binary search step and zero branches.

Fixes rust-lang#53823
Amanieu added a commit to Amanieu/rust that referenced this issue Jul 27, 2024
This restores the original binary search implementation from rust-lang#45333
which has the nice property of having a loop count that only depends on
the size of the slice. This, along with explicit conditional moves
from rust-lang#128250, means that the entire binary search loop can be perfectly
predicted by the branch predictor.

Additionally, LLVM is able to unroll the loop when the slice length is
known at compile-time. This results in a very compact code sequence of
3-4 instructions per binary search step and zero branches.

Fixes rust-lang#53823
Amanieu added a commit to Amanieu/rust that referenced this issue Jul 27, 2024
This restores the original binary search implementation from rust-lang#45333
which has the nice property of having a loop count that only depends on
the size of the slice. This, along with explicit conditional moves
from rust-lang#128250, means that the entire binary search loop can be perfectly
predicted by the branch predictor.

Additionally, LLVM is able to unroll the loop when the slice length is
known at compile-time. This results in a very compact code sequence of
3-4 instructions per binary search step and zero branches.

Fixes rust-lang#53823
Amanieu added a commit to Amanieu/rust that referenced this issue Jul 27, 2024
This restores the original binary search implementation from rust-lang#45333
which has the nice property of having a loop count that only depends on
the size of the slice. This, along with explicit conditional moves
from rust-lang#128250, means that the entire binary search loop can be perfectly
predicted by the branch predictor.

Additionally, LLVM is able to unroll the loop when the slice length is
known at compile-time. This results in a very compact code sequence of
3-4 instructions per binary search step and zero branches.

Fixes rust-lang#53823
bors added a commit to rust-lang-ci/rust that referenced this issue Jul 30, 2024
Rewrite binary search implementation

This PR builds on top of rust-lang#128250, which should be merged first.

This restores the original binary search implementation from rust-lang#45333 which has the nice property of having a loop count that only depends on the size of the slice. This, along with explicit conditional moves from rust-lang#128250, means that the entire binary search loop can be perfectly predicted by the branch predictor.

Additionally, LLVM is able to unroll the loop when the slice length is known at compile-time. This results in a very compact code sequence of 3-4 instructions per binary search step and zero branches.

Fixes rust-lang#53823
Fixes rust-lang#115271
bors added a commit to rust-lang-ci/rust that referenced this issue Aug 2, 2024
Rewrite binary search implementation

This PR builds on top of rust-lang#128250, which should be merged first.

This restores the original binary search implementation from rust-lang#45333 which has the nice property of having a loop count that only depends on the size of the slice. This, along with explicit conditional moves from rust-lang#128250, means that the entire binary search loop can be perfectly predicted by the branch predictor.

Additionally, LLVM is able to unroll the loop when the slice length is known at compile-time. This results in a very compact code sequence of 3-4 instructions per binary search step and zero branches.

Fixes rust-lang#53823
Fixes rust-lang#115271
@bors bors closed this as completed in bb58488 Aug 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. P-low Low priority regression-from-stable-to-stable Performance or correctness regression from one stable version to another. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.