Experiment with LLVM BOLT binary optimizer #224

Closed
6 of 7 tasks
corona10 opened this issue Jan 14, 2022 · 14 comments

Comments

@corona10

corona10 commented Jan 14, 2022

Related discussion: #184
bpo: https://bugs.python.org/issue46378

Since these experiments are time-consuming, I am going to leave a record for each one.
If the experiments work out, my final goal is to provide a BOLT optimization option for the CPython project.

  • A: no PGO + no LTO vs no PGO + no LTO + BOLT
  • B: PGO vs PGO + BOLT
  • C: PGO + LTO vs PGO + LTO + BOLT
  • Investigate how Pyston handles BOLT
  • BOLT tuning
  • Add a configuration option for the BOLT build pipeline: https://github.com/corona10/cpython/tree/bolt
  • Profiling data decision: CPython unit tests (full, or partial as with PGO) vs. the pyperformance benchmark suite

cc @gvanrossum @vstinner

@corona10
Author

corona10 commented Jan 14, 2022

A: CPython BOLT experiment (No PGO + No LTO)

Instruction Order

  • ./configure --with-bolt # TODO: Provide auto-pipeline for BOLT optimization
  • make -j8
  • perf record -e cycles:u -j any,u -- ./python -m test
  • perf2bolt ./python -p perf.data -o cpython.fdata -w cpython.yaml
  • llvm-bolt ./python -o ./python.bolt -b cpython.yaml -reorder-blocks=cache+ -reorder-functions=hfsort+ -split-functions=3 -split-all-cold -dyno-stats -icf=1 -use-gnu-stack
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 4a4a652f34d00120867757fd19aac3c8d85d9451
BOLT-INFO: first alloc address is 0x400000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-INFO: pre-processing profile using YAML profile reader
BOLT-WARNING: Ignored 0 functions due to cold fragments.
BOLT-INFO: 3682 out of 5298 functions in the binary (69.5%) have non-empty execution profile
BOLT-INFO: 64 functions with profile could not be optimized
BOLT-INFO: the input contains 788 (dynamic count : 1142953) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 52068 instructions were shortened
BOLT-INFO: removed 17675 empty blocks
BOLT-INFO: ICF folded 135 out of 5646 functions in 3 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 15.58 KB of code space. Folded functions were called 34753 times based on profile.
BOLT-INFO: basic block reordering modified layout of 2226 (40.39%) functions
BOLT-INFO: UCE removed 0 blocks and 0 bytes of code.
BOLT-INFO: splitting separates 939338 hot bytes from 814106 cold bytes (53.57% of split functions is hot).
BOLT-INFO: 21 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 3612 to 905
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

            63813906 : executed forward branches
            17802321 : taken forward branches
            28371100 : executed backward branches
            22335912 : taken backward branches
             3470254 : executed unconditional branches
            17408959 : all function calls
             9925193 : indirect calls
              626761 : PLT calls
           554757067 : executed instructions
           146189948 : executed load instructions
            54930110 : executed store instructions
             1881419 : taken jump table branches
                   0 : taken unknown indirect branches
            95655260 : total branches
            43608487 : taken branches
            52046773 : non-taken conditional branches
            40138233 : taken conditional branches
            92185006 : all conditional branches

            60609655 : executed forward branches (-5.0%)
             7601580 : taken forward branches (-57.3%)
            31575351 : executed backward branches (+11.3%)
            20408417 : taken backward branches (-8.6%)
             2455500 : executed unconditional branches (-29.2%)
            17408959 : all function calls (=)
             9925193 : indirect calls (=)
              626761 : PLT calls (=)
           551364604 : executed instructions (-0.6%)
           146189948 : executed load instructions (=)
            54930110 : executed store instructions (=)
             1881419 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
            94640506 : total branches (-1.1%)
            30465497 : taken branches (-30.1%)
            64175009 : non-taken conditional branches (+23.3%)
            28009997 : taken conditional branches (-30.2%)
            92185006 : all conditional branches (=)

BOLT-INFO: SCTC: patched 41 tail calls (39 forward) tail calls (2 backward) from a total of 41 while removing 0 double jumps and removing 35 basic blocks totalling 175 bytes of code. CTCs total execution count is 10571 and the number of times CTCs are taken is 9799.
BOLT-INFO: padding code to 0xc00000 to accommodate hot text
BOLT-INFO: setting _end to 0x8e7e80
BOLT-INFO: setting _end to 0x8e7e80
BOLT-INFO: setting __hot_start to 0xa00000
BOLT-INFO: setting __hot_end to 0xaff028
BOLT-INFO: patched build-id (flipped last bit)
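For reference, the sample-based steps above can be collected into a single shell sketch. This is an assumption-laden convenience script, not an official pipeline: the `--with-bolt` configure flag only exists on the experimental branch linked in the task list, and the `perf`/`perf2bolt`/`llvm-bolt` invocations are copied from the step list above.

```shell
#!/bin/sh
set -e

# Build the interpreter with the extra flags BOLT needs
# (experimental branch only; see the task list above).
./configure --with-bolt
make -j8

# Collect branch samples with Linux perf (LBR) while running the
# training workload.
perf record -e cycles:u -j any,u -- ./python -m test

# Convert the perf samples into BOLT's profile format.
perf2bolt ./python -p perf.data -o cpython.fdata -w cpython.yaml

# Rewrite the binary using the collected profile.
llvm-bolt ./python -o ./python.bolt -b cpython.yaml \
    -reorder-blocks=cache+ -reorder-functions=hfsort+ \
    -split-functions=3 -split-all-cold -dyno-stats -icf=1 -use-gnu-stack
```

Note that the sample-based flow requires hardware LBR support (hence `-j any,u`), which is part of why the eventual upstream change switched to instrumentation.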

Benchmark

  • base: ./configure
  • bolt: ./configure --with-bolt + binary optimization
| Benchmark | base | bolt |
| --- | --- | --- |
| 2to3 | 319 ms | 305 ms: 1.05x faster |
| chaos | 87.9 ms | 84.3 ms: 1.04x faster |
| deltablue | 5.33 ms | 4.91 ms: 1.09x faster |
| fannkuch | 480 ms | 474 ms: 1.01x faster |
| hexiom | 7.84 ms | 7.54 ms: 1.04x faster |
| json_dumps | 14.9 ms | 14.5 ms: 1.03x faster |
| json_loads | 32.9 us | 30.7 us: 1.07x faster |
| logging_format | 8.48 us | 7.20 us: 1.18x faster |
| logging_silent | 141 ns | 131 ns: 1.07x faster |
| logging_simple | 7.70 us | 6.58 us: 1.17x faster |
| meteor_contest | 118 ms | 114 ms: 1.03x faster |
| nbody | 117 ms | 116 ms: 1.01x faster |
| nqueens | 104 ms | 98.8 ms: 1.05x faster |
| pathlib | 25.1 ms | 23.6 ms: 1.06x faster |
| pickle | 12.3 us | 11.8 us: 1.04x faster |
| pickle_dict | 32.9 us | 32.0 us: 1.03x faster |
| pickle_pure_python | 399 us | 381 us: 1.05x faster |
| pidigits | 218 ms | 215 ms: 1.01x faster |
| pyflate | 522 ms | 515 ms: 1.01x faster |
| python_startup | 11.0 ms | 10.6 ms: 1.04x faster |
| python_startup_no_site | 7.97 ms | 7.63 ms: 1.04x faster |
| raytrace | 390 ms | 374 ms: 1.04x faster |
| regex_compile | 162 ms | 157 ms: 1.03x faster |
| regex_dna | 215 ms | 214 ms: 1.00x faster |
| regex_effbot | 3.53 ms | 3.47 ms: 1.02x faster |
| regex_v8 | 28.5 ms | 27.9 ms: 1.02x faster |
| richards | 64.0 ms | 58.8 ms: 1.09x faster |
| scimark_fft | 409 ms | 388 ms: 1.06x faster |
| scimark_lu | 138 ms | 131 ms: 1.06x faster |
| scimark_monte_carlo | 81.9 ms | 78.2 ms: 1.05x faster |
| scimark_sor | 147 ms | 141 ms: 1.04x faster |
| scimark_sparse_mat_mult | 5.74 ms | 5.65 ms: 1.02x faster |
| spectral_norm | 133 ms | 129 ms: 1.02x faster |
| sqlite_synth | 4.09 us | 3.83 us: 1.07x faster |
| telco | 8.14 ms | 7.05 ms: 1.15x faster |
| unpack_sequence | 47.7 ns | 47.3 ns: 1.01x faster |
| unpickle | 18.8 us | 18.4 us: 1.02x faster |
| unpickle_list | 5.81 us | 5.63 us: 1.03x faster |
| unpickle_pure_python | 325 us | 306 us: 1.06x faster |
| xml_etree_iterparse | 119 ms | 115 ms: 1.03x faster |
| xml_etree_generate | 105 ms | 98.7 ms: 1.07x faster |
| xml_etree_process | 72.7 ms | 69.1 ms: 1.05x faster |
| Geometric mean | (ref) | 1.04x faster |

Benchmark hidden because not significant (4): float, go, pickle_list, xml_etree_parse
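As a sanity check on the reported geometric mean, the per-benchmark ratios can be aggregated like this. This is a minimal sketch using only a handful of ratios from the table above, so its result will differ from pyperf's figure, which is computed over all benchmarks (including the hidden, non-significant ones):

```python
import math

# Speedup ratios (base time / bolt time) for a few benchmarks
# taken from the table above.
ratios = [1.05, 1.04, 1.09, 1.01, 1.04, 1.18, 1.17, 1.15]

# The geometric mean is the n-th root of the product of the ratios;
# it is the right mean for multiplicative quantities like speedups.
geomean = math.prod(ratios) ** (1 / len(ratios))
print(f"{geomean:.2f}x faster")
```

The arithmetic mean would overstate the aggregate speedup here, which is why pyperf reports the geometric mean.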

Binary Size

  • Baseline (./configure): 26M
  • ./configure --with-bolt, before BOLT optimization: 48M (larger because of the extra compile/link options the BOLT build requires)
  • ./configure --with-bolt, after BOLT optimization: 9.7M (the optimized binary also has its debug info stripped, per the BOLT warning above)

Heatmap

ICache Miss

$ perf stat -e instructions,L1-icache-misses -- python -m pyperformance run
| Experiment | instructions | L1-icache-misses | ratio |
| --- | --- | --- | --- |
| Base | 7,587,709,171,226 | 57,490,869,440 | 0.7% |
| BOLT | 7,388,373,509,871 | 28,787,449,397 | 0.3% |
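The ratio column is just L1-icache-misses divided by instructions; a quick sketch using the counts above (the source table appears to truncate rather than round the percentages):

```python
# perf stat counts from the table above.
instructions = {"Base": 7_587_709_171_226, "BOLT": 7_388_373_509_871}
misses = {"Base": 57_490_869_440, "BOLT": 28_787_449_397}

for name in instructions:
    # Fraction of executed instructions that missed L1 icache.
    ratio = misses[name] / instructions[name]
    print(f"{name}: {ratio:.2%} L1-icache miss ratio")
```

The roughly halved miss ratio is the main mechanism by which BOLT speeds up a large binary like CPython: hot code is packed together so more of it stays resident in the instruction cache.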

@corona10
Author

corona10 commented Jan 16, 2022

B: CPython BOLT experiment (PGO vs PGO + BOLT)

Environment

Binary Size

  • ./configure --enable-optimizations: 25M
  • ./configure --enable-optimizations --with-bolt: 52M
  • ./configure --enable-optimizations --with-bolt, after BOLT optimization: 9.8M

Benchmark

| Benchmark | pgo | pgo_bolt |
| --- | --- | --- |
| 2to3 | 323 ms | 297 ms: 1.09x faster |
| chaos | 81.7 ms | 80.5 ms: 1.02x faster |
| deltablue | 4.70 ms | 4.65 ms: 1.01x faster |
| fannkuch | 439 ms | 452 ms: 1.03x slower |
| float | 94.8 ms | 94.0 ms: 1.01x faster |
| go | 160 ms | 156 ms: 1.02x faster |
| hexiom | 7.48 ms | 7.23 ms: 1.03x faster |
| json_dumps | 13.9 ms | 14.1 ms: 1.01x slower |
| json_loads | 28.6 us | 28.8 us: 1.01x slower |
| logging_format | 6.95 us | 6.65 us: 1.04x faster |
| logging_silent | 125 ns | 124 ns: 1.01x faster |
| logging_simple | 6.36 us | 6.16 us: 1.03x faster |
| meteor_contest | 115 ms | 111 ms: 1.04x faster |
| nbody | 116 ms | 118 ms: 1.01x slower |
| nqueens | 95.0 ms | 95.7 ms: 1.01x slower |
| pickle | 11.7 us | 11.2 us: 1.05x faster |
| pickle_dict | 31.4 us | 29.4 us: 1.07x faster |
| pickle_list | 4.70 us | 4.41 us: 1.07x faster |
| pickle_pure_python | 364 us | 356 us: 1.02x faster |
| pidigits | 206 ms | 201 ms: 1.03x faster |
| pyflate | 491 ms | 488 ms: 1.01x faster |
| raytrace | 363 ms | 360 ms: 1.01x faster |
| regex_compile | 151 ms | 151 ms: 1.00x faster |
| regex_dna | 217 ms | 207 ms: 1.05x faster |
| regex_effbot | 3.39 ms | 3.27 ms: 1.04x faster |
| regex_v8 | 26.4 ms | 25.7 ms: 1.03x faster |
| scimark_fft | 374 ms | 377 ms: 1.01x slower |
| scimark_lu | 132 ms | 131 ms: 1.01x faster |
| scimark_monte_carlo | 76.1 ms | 75.7 ms: 1.01x faster |
| scimark_sor | 141 ms | 140 ms: 1.00x faster |
| scimark_sparse_mat_mult | 5.27 ms | 5.32 ms: 1.01x slower |
| spectral_norm | 124 ms | 122 ms: 1.02x faster |
| sqlite_synth | 3.91 us | 3.70 us: 1.06x faster |
| telco | 7.23 ms | 7.10 ms: 1.02x faster |
| unpack_sequence | 51.8 ns | 48.5 ns: 1.07x faster |
| unpickle | 16.0 us | 15.6 us: 1.03x faster |
| unpickle_list | 5.32 us | 5.35 us: 1.01x slower |
| unpickle_pure_python | 289 us | 286 us: 1.01x faster |
| xml_etree_parse | 159 ms | 156 ms: 1.02x faster |
| xml_etree_iterparse | 113 ms | 114 ms: 1.01x slower |
| xml_etree_generate | 95.3 ms | 95.1 ms: 1.00x faster |
| xml_etree_process | 67.7 ms | 67.1 ms: 1.01x faster |
| Geometric mean | (ref) | 1.02x faster |

Benchmark hidden because not significant (4): pathlib, python_startup, python_startup_no_site, richards

ICache Miss

| Experiment | instructions | L1-icache-misses | ratio |
| --- | --- | --- | --- |
| PGO | 7,477,886,123,191 | 46,286,113,460 | 0.6% |
| PGO + BOLT | 7,070,332,908,269 | 32,415,421,302 | 0.4% |

Heatmap

@corona10
Author

corona10 commented Jan 18, 2022

C: CPython BOLT experiment (PGO + LTO vs PGO + LTO + BOLT)

Environment

Binary Size

  • ./configure --enable-optimizations --with-lto: 28M
  • ./configure --enable-optimizations --with-lto --with-bolt: 80M
  • ./configure --enable-optimizations --with-lto --with-bolt, after BOLT optimization: 9.9M

ICache miss

| Experiment | instructions | L1-icache-misses | ratio |
| --- | --- | --- | --- |
| PGO + LTO | 6,685,195,147,835 | 49,473,656,139 | 0.7% |
| PGO + LTO + BOLT | 7,070,332,908,269 | 32,415,421,302 | 0.4% |

Benchmark

| Benchmark | pgo_lto | pgo_lto_bolt |
| --- | --- | --- |
| 2to3 | 300 ms | 302 ms: 1.01x slower |
| chaos | 74.3 ms | 82.9 ms: 1.12x slower |
| deltablue | 4.22 ms | 5.02 ms: 1.19x slower |
| fannkuch | 396 ms | 455 ms: 1.15x slower |
| float | 88.4 ms | 98.0 ms: 1.11x slower |
| go | 151 ms | 177 ms: 1.17x slower |
| hexiom | 6.71 ms | 7.86 ms: 1.17x slower |
| json_dumps | 12.9 ms | 13.3 ms: 1.03x slower |
| json_loads | 27.3 us | 30.6 us: 1.12x slower |
| logging_format | 6.47 us | 6.98 us: 1.08x slower |
| logging_silent | 113 ns | 136 ns: 1.21x slower |
| logging_simple | 5.92 us | 6.30 us: 1.06x slower |
| meteor_contest | 110 ms | 116 ms: 1.05x slower |
| nbody | 108 ms | 109 ms: 1.01x slower |
| nqueens | 86.7 ms | 99.0 ms: 1.14x slower |
| pathlib | 21.4 ms | 22.4 ms: 1.05x slower |
| pickle | 11.0 us | 11.7 us: 1.06x slower |
| pickle_dict | 30.9 us | 32.5 us: 1.05x slower |
| pickle_pure_python | 337 us | 385 us: 1.14x slower |
| pidigits | 208 ms | 231 ms: 1.11x slower |
| pyflate | 455 ms | 529 ms: 1.16x slower |
| python_startup | 10.1 ms | 10.5 ms: 1.04x slower |
| python_startup_no_site | 7.36 ms | 7.60 ms: 1.03x slower |
| raytrace | 326 ms | 371 ms: 1.14x slower |
| regex_compile | 137 ms | 156 ms: 1.13x slower |
| regex_dna | 221 ms | 203 ms: 1.09x faster |
| regex_effbot | 3.17 ms | 3.29 ms: 1.04x slower |
| regex_v8 | 25.6 ms | 26.0 ms: 1.01x slower |
| richards | 52.8 ms | 62.1 ms: 1.18x slower |
| scimark_fft | 324 ms | 373 ms: 1.15x slower |
| scimark_lu | 117 ms | 133 ms: 1.14x slower |
| scimark_monte_carlo | 73.4 ms | 76.9 ms: 1.05x slower |
| scimark_sor | 127 ms | 140 ms: 1.10x slower |
| scimark_sparse_mat_mult | 4.54 ms | 5.17 ms: 1.14x slower |
| spectral_norm | 107 ms | 118 ms: 1.10x slower |
| sqlite_synth | 3.61 us | 3.70 us: 1.02x slower |
| unpack_sequence | 44.4 ns | 47.6 ns: 1.07x slower |
| unpickle | 14.7 us | 17.4 us: 1.18x slower |
| unpickle_list | 5.04 us | 4.96 us: 1.01x faster |
| unpickle_pure_python | 256 us | 301 us: 1.18x slower |
| xml_etree_parse | 156 ms | 169 ms: 1.09x slower |
| xml_etree_iterparse | 106 ms | 112 ms: 1.06x slower |
| xml_etree_generate | 86.9 ms | 95.6 ms: 1.10x slower |
| xml_etree_process | 60.9 ms | 67.2 ms: 1.10x slower |
| Geometric mean | (ref) | 1.09x slower |

Benchmark hidden because not significant (2): pickle_list, telco

@corona10
Author

Note

  • PGO + LTO + BOLT is slower than PGO + BOLT (geometric mean: 1.03x slower).
  • The L1 icache miss ratio is clearly reduced.

@gvanrossum
Collaborator

So which do you recommend?

@corona10
Author

@gvanrossum

> So which do you recommend?

I am investigating why PGO + LTO + BOLT turned out slower than PGO + BOLT, looking at the following:

  • How does Pyston handle BOLT?
  • Do we need compiler option changes for BOLT?
  • Do we need to change the BOLT configuration? (The current configuration is tuned for gcc, not for CPython specifically.)

If we can get better results by tuning these things, and if PGO + BOLT or PGO + LTO + BOLT can beat PGO + LTO,
I will adopt the BOLT optimization pass; if not, it will not be worth adding.

I am going to keep leaving experiment notes on this issue. Do you think that is too noisy?

@gvanrossum
Collaborator

I love reading your notes, that’s what I do.

@corona10
Author

corona10 commented Jan 21, 2022

After investigating BOLT itself and Pyston's use case through several experiments, I have decided to gather profiles from pyperformance.

The reasons are as follows:

  1. BOLT is a post-link optimizer, originally designed to apply profile data collected from real-world services; this is why BOLT's documentation has a section on service profiling. For those who do not want a real-world workload, using PGO and collecting data from the unit tests is the easier path. Pyston, in fact, collects its BOLT profile data from https://github.com/pyston/python-macrobenchmarks.
  2. The binary generated from standard unit-test profiling fails to pass all unit tests, while the binary from pyperformance profiling passes them all. This might be a bug in BOLT and may need further investigation.

I am going to compare benchmark results using pyperformance profile data. Due to vendoring concerns, I may need to provide a tool for the BOLT optimizer under the Tools/ directory.

@corona10
Author

C (revisited): CPython BOLT experiment (PGO + LTO vs PGO + LTO + BOLT, profiled with pyperformance)

Environment

Benchmark

| Benchmark | pgo_lto_base | pgo_lto_bolt_pyperformance |
| --- | --- | --- |
| 2to3 | 304 ms | 280 ms: 1.09x faster |
| chaos | 73.2 ms | 73.8 ms: 1.01x slower |
| float | 92.1 ms | 89.1 ms: 1.03x faster |
| hexiom | 6.75 ms | 6.58 ms: 1.03x faster |
| json_loads | 27.4 us | 27.2 us: 1.01x faster |
| logging_format | 6.60 us | 6.50 us: 1.02x faster |
| logging_silent | 116 ns | 114 ns: 1.01x faster |
| logging_simple | 6.02 us | 5.96 us: 1.01x faster |
| meteor_contest | 111 ms | 109 ms: 1.01x faster |
| nbody | 108 ms | 113 ms: 1.04x slower |
| pathlib | 21.1 ms | 21.5 ms: 1.02x slower |
| pickle | 11.4 us | 11.0 us: 1.03x faster |
| pickle_dict | 31.2 us | 30.1 us: 1.04x faster |
| pickle_list | 4.75 us | 4.51 us: 1.05x faster |
| pickle_pure_python | 338 us | 342 us: 1.01x slower |
| pidigits | 199 ms | 203 ms: 1.02x slower |
| pyflate | 463 ms | 457 ms: 1.01x faster |
| raytrace | 327 ms | 325 ms: 1.01x faster |
| regex_compile | 138 ms | 138 ms: 1.00x faster |
| regex_dna | 218 ms | 216 ms: 1.01x faster |
| regex_effbot | 3.11 ms | 3.22 ms: 1.04x slower |
| regex_v8 | 25.3 ms | 26.1 ms: 1.03x slower |
| richards | 53.3 ms | 53.6 ms: 1.01x slower |
| scimark_fft | 327 ms | 323 ms: 1.01x faster |
| scimark_monte_carlo | 72.4 ms | 71.1 ms: 1.02x faster |
| scimark_sor | 128 ms | 126 ms: 1.01x faster |
| scimark_sparse_mat_mult | 4.41 ms | 4.36 ms: 1.01x faster |
| spectral_norm | 107 ms | 106 ms: 1.00x faster |
| sqlite_synth | 3.57 us | 3.63 us: 1.02x slower |
| telco | 6.66 ms | 6.76 ms: 1.01x slower |
| unpack_sequence | 49.9 ns | 46.5 ns: 1.07x faster |
| unpickle | 14.3 us | 14.8 us: 1.03x slower |
| unpickle_pure_python | 258 us | 255 us: 1.01x faster |
| xml_etree_parse | 154 ms | 153 ms: 1.01x faster |
| xml_etree_generate | 86.5 ms | 85.6 ms: 1.01x faster |
| xml_etree_process | 61.0 ms | 60.3 ms: 1.01x faster |
| Geometric mean | (ref) | 1.01x faster |

Benchmark hidden because not significant (10): deltablue, fannkuch, go, json_dumps, nqueens, python_startup, python_startup_no_site, scimark_lu, unpickle_list, xml_etree_iterparse

@corona10
Author

> I am investigating why PGO + LTO + BOLT became slower than PGO + BOLT

Because I had passed the wrong optimization option (--with-optimizations) :(
The results in #224 (comment) were collected with valid optimization options.

@corona10
Author

D: CPython BOLT experiment (PGO + LTO + BOLT + profiling with -m test vs PGO + LTO + BOLT + profiling with pyperformance)

Environment

Benchmark

| Benchmark | pgo_lto_bolt_std | pgo_lto_bolt_pyperformance |
| --- | --- | --- |
| 2to3 | 281 ms | 280 ms: 1.00x faster |
| chaos | 74.5 ms | 73.8 ms: 1.01x faster |
| go | 152 ms | 152 ms: 1.00x faster |
| hexiom | 6.59 ms | 6.58 ms: 1.00x faster |
| json_dumps | 13.2 ms | 13.2 ms: 1.00x faster |
| json_loads | 27.4 us | 27.2 us: 1.01x faster |
| logging_format | 6.47 us | 6.50 us: 1.00x slower |
| logging_simple | 5.93 us | 5.96 us: 1.01x slower |
| nqueens | 87.2 ms | 86.7 ms: 1.01x faster |
| pathlib | 21.6 ms | 21.5 ms: 1.00x faster |
| pickle_list | 4.55 us | 4.51 us: 1.01x faster |
| python_startup | 10.3 ms | 10.2 ms: 1.00x faster |
| python_startup_no_site | 7.43 ms | 7.41 ms: 1.00x faster |
| regex_compile | 138 ms | 138 ms: 1.00x faster |
| regex_v8 | 25.4 ms | 26.1 ms: 1.03x slower |
| scimark_monte_carlo | 71.5 ms | 71.1 ms: 1.01x faster |
| Geometric mean | (ref) | 1.00x faster |

Benchmark hidden because not significant (30): deltablue, fannkuch, float, logging_silent, meteor_contest, nbody, pickle, pickle_dict, pickle_pure_python, pidigits, pyflate, raytrace, regex_dna, regex_effbot, richards, scimark_fft, scimark_lu, scimark_sor, scimark_sparse_mat_mult, spectral_norm, sqlite_synth, telco, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

@corona10
Author

@gvanrossum cc @vstinner

I have gathered all the benchmark data for BOLT, and I am not sure it is worth providing a BOLT optimization pass for a 1% performance gain.
IMHO, providing the BOLT optimization option will make more sense once BOLT becomes more stable, or once the CPython binary grows larger than it is now.
WDYT?

@gvanrossum
Collaborator

Thanks for running these extensive tests. It does look like it's not worth making the build process even more complex.

I wonder if a more fruitful approach would be to come up with a better set of training code for PGO? (That belongs in a different issue. :-)

@corona10
Author

I think so too :) Let's close the issue; if someone wants to reopen it, they are welcome to do so.

kmod added a commit to kmod/cpython that referenced this issue Aug 11, 2022
Using [bolt](https://github.com/llvm/llvm-project/tree/main/bolt)
provides a fairly large speedup without any code or functionality
changes. It provides roughly a 1% speedup on pyperformance, and a
4% improvement on the Pyston web macrobenchmarks.

It is gated behind an `--enable-bolt` configure arg because not all
toolchains and environments are supported. It has been tested on a
Linux x86_64 toolchain, using llvm-bolt from LLVM 14.0.6.

Compared to [a previous attempt](faster-cpython/ideas#224),
this commit uses bolt's preferred "instrumentation" approach, as well as adds some non-PIE
flags which enable much better optimizations from bolt.

The effects of this change are a bit more dependent on CPU microarchitecture
than other changes, since it optimizes i-cache behavior which seems
to be a bit more variable between architectures. The 1%/4% numbers
were collected on an Intel Skylake CPU, and on an AMD Zen 3 CPU I
got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance
I got a slightly lower speedup (1%/3%).

The low speedup on pyperformance is not entirely unexpected, because
BOLT improves i-cache behavior which is typically not tested by the small
benchmarks in the pyperformance suite.

This change uses the existing pgo profiling task (`python -m test --pgo`),
though I was able to measure about a 1% macrobenchmark improvement by
using the macrobenchmarks as the training task. I personally think that
both the PGO and BOLT tasks should be updated to use macrobenchmarks,
but for the sake of splitting up the work this PR uses the existing pgo task.
corona10 added a commit to python/cpython that referenced this issue Aug 18, 2022
* Add support for the BOLT post-link binary optimizer

Using [bolt](https://github.com/llvm/llvm-project/tree/main/bolt)
provides a fairly large speedup without any code or functionality
changes. It provides roughly a 1% speedup on pyperformance, and a
4% improvement on the Pyston web macrobenchmarks.

It is gated behind an `--enable-bolt` configure arg because not all
toolchains and environments are supported. It has been tested on a
Linux x86_64 toolchain, using llvm-bolt built from the LLVM 14.0.6
sources (their binary distribution of this version did not include bolt).

Compared to [a previous attempt](faster-cpython/ideas#224),
this commit uses bolt's preferred "instrumentation" approach, as well as adds some non-PIE
flags which enable much better optimizations from bolt.

The effects of this change are a bit more dependent on CPU microarchitecture
than other changes, since it optimizes i-cache behavior which seems
to be a bit more variable between architectures. The 1%/4% numbers
were collected on an Intel Skylake CPU, and on an AMD Zen 3 CPU I
got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance
I got a slightly lower speedup (1%/3%).

The low speedup on pyperformance is not entirely unexpected, because
BOLT improves i-cache behavior, and the benchmarks in the pyperformance
suite are small and tend to fit in i-cache.

This change uses the existing pgo profiling task (`python -m test --pgo`),
though I was able to measure about a 1% macrobenchmark improvement by
using the macrobenchmarks as the training task. I personally think that
both the PGO and BOLT tasks should be updated to use macrobenchmarks,
but for the sake of splitting up the work this PR uses the existing pgo task.

* Simplify the build flags

* Add a NEWS entry

* Update Makefile.pre.in

Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>

* Update configure.ac

Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>

* Add myself to ACKS

* Add docs

* Other review comments

* fix tab/space issue

* Make it more clear that --enable-bolt is experimental

* Add link to bolt's github page

Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>
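The merged change uses BOLT's instrumentation mode rather than perf sampling, which avoids the LBR hardware requirement. A hedched sketch of what that flow looks like with the standalone BOLT tools follows; the tool names and flags (`-instrument`, `-data`) are real LLVM BOLT options, but the exact invocation and flag set CPython's Makefile ended up using may differ:

```shell
#!/bin/sh
set -e

# 1. Produce an instrumented copy of the binary with profiling
#    counters inserted.
llvm-bolt ./python -o ./python.instrumented -instrument

# 2. Run the training workload; the instrumented binary writes its
#    profile to /tmp/prof.fdata by default.
./python.instrumented -m test --pgo

# 3. Rewrite the original binary using the collected profile.
llvm-bolt ./python -o ./python.bolt -data=/tmp/prof.fdata \
    -reorder-blocks=cache+ -reorder-functions=hfsort+ \
    -split-functions=3 -split-all-cold -icf=1 -use-gnu-stack
```

Instrumentation runs the training workload more slowly than a sampled run, but it works on any CPU and produces a complete profile, which is presumably part of why the merged `--enable-bolt` support chose it.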