Experiment with LLVM BOLT binary optimizer #224

Closed
6 of 7 tasks
corona10 opened this issue Jan 14, 2022 · 14 comments

Comments

@corona10

corona10 commented Jan 14, 2022

Related discussion: #184
bpo: https://bugs.python.org/issue46378

Since these experiments are time-consuming, I am going to leave a record for each one.
If the experiments work out, my final goal is to provide a BOLT optimization option for the CPython project.

  • A: no PGO + no LTO vs no PGO + no LTO + BOLT
  • B: PGO vs PGO + BOLT
  • C: PGO + LTO vs PGO + LTO + BOLT
  • Investigate how Pyston handles BOLT
  • BOLT tuning
  • Add a configuration option for the BOLT build pipeline: https://github.com/corona10/cpython/tree/bolt
  • Profiling data decision: CPython unit tests (full, or partial as with PGO) vs. the pyperformance benchmark suite

cc @gvanrossum @vstinner

@corona10
Author

corona10 commented Jan 14, 2022

A: CPython BOLT experiment (No PGO + No LTO)

Instruction Order

  • ./configure --with-bolt # TODO: Provide auto-pipeline for BOLT optimization
  • make -j8
  • perf record -e cycles:u -j any,u -- ./python -m test
  • perf2bolt ./python -p perf.data -o cpython.fdata -w cpython.yaml
  • llvm-bolt ./python -o ./python.bolt -b cpython.yaml -reorder-blocks=cache+ -reorder-functions=hfsort+ -split-functions=3 -split-all-cold -dyno-stats -icf=1 -use-gnu-stack
BOLT-INFO: Target architecture: x86_64
BOLT-INFO: BOLT version: 4a4a652f34d00120867757fd19aac3c8d85d9451
BOLT-INFO: first alloc address is 0x400000
BOLT-WARNING: debug info will be stripped from the binary. Use -update-debug-sections to keep it.
BOLT-INFO: enabling relocation mode
BOLT-INFO: enabling lite mode
BOLT-INFO: pre-processing profile using YAML profile reader
BOLT-WARNING: Ignored 0 functions due to cold fragments.
BOLT-INFO: 3682 out of 5298 functions in the binary (69.5%) have non-empty execution profile
BOLT-INFO: 64 functions with profile could not be optimized
BOLT-INFO: the input contains 788 (dynamic count : 1142953) opportunities for macro-fusion optimization. Will fix instances on a hot path.
BOLT-INFO: 52068 instructions were shortened
BOLT-INFO: removed 17675 empty blocks
BOLT-INFO: ICF folded 135 out of 5646 functions in 3 passes. 0 functions had jump tables.
BOLT-INFO: Removing all identical functions will save 15.58 KB of code space. Folded functions were called 34753 times based on profile.
BOLT-INFO: basic block reordering modified layout of 2226 (40.39%) functions
BOLT-INFO: UCE removed 0 blocks and 0 bytes of code.
BOLT-INFO: splitting separates 939338 hot bytes from 814106 cold bytes (53.57% of split functions is hot).
BOLT-INFO: 21 Functions were reordered by LoopInversionPass
BOLT-INFO: hfsort+ reduced the number of chains from 3612 to 905
BOLT-INFO: program-wide dynostats after all optimizations before SCTC and FOP:

            63813906 : executed forward branches
            17802321 : taken forward branches
            28371100 : executed backward branches
            22335912 : taken backward branches
             3470254 : executed unconditional branches
            17408959 : all function calls
             9925193 : indirect calls
              626761 : PLT calls
           554757067 : executed instructions
           146189948 : executed load instructions
            54930110 : executed store instructions
             1881419 : taken jump table branches
                   0 : taken unknown indirect branches
            95655260 : total branches
            43608487 : taken branches
            52046773 : non-taken conditional branches
            40138233 : taken conditional branches
            92185006 : all conditional branches

            60609655 : executed forward branches (-5.0%)
             7601580 : taken forward branches (-57.3%)
            31575351 : executed backward branches (+11.3%)
            20408417 : taken backward branches (-8.6%)
             2455500 : executed unconditional branches (-29.2%)
            17408959 : all function calls (=)
             9925193 : indirect calls (=)
              626761 : PLT calls (=)
           551364604 : executed instructions (-0.6%)
           146189948 : executed load instructions (=)
            54930110 : executed store instructions (=)
             1881419 : taken jump table branches (=)
                   0 : taken unknown indirect branches (=)
            94640506 : total branches (-1.1%)
            30465497 : taken branches (-30.1%)
            64175009 : non-taken conditional branches (+23.3%)
            28009997 : taken conditional branches (-30.2%)
            92185006 : all conditional branches (=)

BOLT-INFO: SCTC: patched 41 tail calls (39 forward) tail calls (2 backward) from a total of 41 while removing 0 double jumps and removing 35 basic blocks totalling 175 bytes of code. CTCs total execution count is 10571 and the number of times CTCs are taken is 9799.
BOLT-INFO: padding code to 0xc00000 to accommodate hot text
BOLT-INFO: setting _end to 0x8e7e80
BOLT-INFO: setting _end to 0x8e7e80
BOLT-INFO: setting __hot_start to 0xa00000
BOLT-INFO: setting __hot_end to 0xaff028
BOLT-INFO: patched build-id (flipped last bit)
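For reference, the sample-based steps above can be collected into a single shell sketch. This is an assumption-laden convenience script, not an official pipeline: the `--with-bolt` configure flag only exists on the experimental branch linked in the task list, and the `perf`/`perf2bolt`/`llvm-bolt` invocations are copied from the step list above.

```shell
#!/bin/sh
set -e

# Build the interpreter with the extra flags BOLT needs
# (experimental branch only; see the task list above).
./configure --with-bolt
make -j8

# Collect branch samples with Linux perf (LBR) while running the
# training workload.
perf record -e cycles:u -j any,u -- ./python -m test

# Convert the perf samples into BOLT's profile format.
perf2bolt ./python -p perf.data -o cpython.fdata -w cpython.yaml

# Rewrite the binary using the collected profile.
llvm-bolt ./python -o ./python.bolt -b cpython.yaml \
    -reorder-blocks=cache+ -reorder-functions=hfsort+ \
    -split-functions=3 -split-all-cold -dyno-stats -icf=1 -use-gnu-stack
```

Note that the sample-based flow requires hardware LBR support (hence `-j any,u`), which is part of why the eventual upstream change switched to instrumentation.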

Benchmark

  • base: ./configure
  • bolt: ./configure --with-bolt + binary optimization
| Benchmark | base | bolt |
| --- | --- | --- |
| 2to3 | 319 ms | 305 ms: 1.05x faster |
| chaos | 87.9 ms | 84.3 ms: 1.04x faster |
| deltablue | 5.33 ms | 4.91 ms: 1.09x faster |
| fannkuch | 480 ms | 474 ms: 1.01x faster |
| hexiom | 7.84 ms | 7.54 ms: 1.04x faster |
| json_dumps | 14.9 ms | 14.5 ms: 1.03x faster |
| json_loads | 32.9 us | 30.7 us: 1.07x faster |
| logging_format | 8.48 us | 7.20 us: 1.18x faster |
| logging_silent | 141 ns | 131 ns: 1.07x faster |
| logging_simple | 7.70 us | 6.58 us: 1.17x faster |
| meteor_contest | 118 ms | 114 ms: 1.03x faster |
| nbody | 117 ms | 116 ms: 1.01x faster |
| nqueens | 104 ms | 98.8 ms: 1.05x faster |
| pathlib | 25.1 ms | 23.6 ms: 1.06x faster |
| pickle | 12.3 us | 11.8 us: 1.04x faster |
| pickle_dict | 32.9 us | 32.0 us: 1.03x faster |
| pickle_pure_python | 399 us | 381 us: 1.05x faster |
| pidigits | 218 ms | 215 ms: 1.01x faster |
| pyflate | 522 ms | 515 ms: 1.01x faster |
| python_startup | 11.0 ms | 10.6 ms: 1.04x faster |
| python_startup_no_site | 7.97 ms | 7.63 ms: 1.04x faster |
| raytrace | 390 ms | 374 ms: 1.04x faster |
| regex_compile | 162 ms | 157 ms: 1.03x faster |
| regex_dna | 215 ms | 214 ms: 1.00x faster |
| regex_effbot | 3.53 ms | 3.47 ms: 1.02x faster |
| regex_v8 | 28.5 ms | 27.9 ms: 1.02x faster |
| richards | 64.0 ms | 58.8 ms: 1.09x faster |
| scimark_fft | 409 ms | 388 ms: 1.06x faster |
| scimark_lu | 138 ms | 131 ms: 1.06x faster |
| scimark_monte_carlo | 81.9 ms | 78.2 ms: 1.05x faster |
| scimark_sor | 147 ms | 141 ms: 1.04x faster |
| scimark_sparse_mat_mult | 5.74 ms | 5.65 ms: 1.02x faster |
| spectral_norm | 133 ms | 129 ms: 1.02x faster |
| sqlite_synth | 4.09 us | 3.83 us: 1.07x faster |
| telco | 8.14 ms | 7.05 ms: 1.15x faster |
| unpack_sequence | 47.7 ns | 47.3 ns: 1.01x faster |
| unpickle | 18.8 us | 18.4 us: 1.02x faster |
| unpickle_list | 5.81 us | 5.63 us: 1.03x faster |
| unpickle_pure_python | 325 us | 306 us: 1.06x faster |
| xml_etree_iterparse | 119 ms | 115 ms: 1.03x faster |
| xml_etree_generate | 105 ms | 98.7 ms: 1.07x faster |
| xml_etree_process | 72.7 ms | 69.1 ms: 1.05x faster |
| Geometric mean | (ref) | 1.04x faster |

Benchmark hidden because not significant (4): float, go, pickle_list, xml_etree_parse
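As a sanity check on the reported geometric mean, the per-benchmark ratios can be aggregated like this. This is a minimal sketch using only a handful of ratios from the table above, so its result will differ from pyperf's figure, which is computed over all benchmarks (including the hidden, non-significant ones):

```python
import math

# Speedup ratios (base time / bolt time) for a few benchmarks
# taken from the table above.
ratios = [1.05, 1.04, 1.09, 1.01, 1.04, 1.18, 1.17, 1.15]

# The geometric mean is the n-th root of the product of the ratios;
# it is the right mean for multiplicative quantities like speedups.
geomean = math.prod(ratios) ** (1 / len(ratios))
print(f"{geomean:.2f}x faster")
```

The arithmetic mean would overstate the aggregate speedup here, which is why pyperf reports the geometric mean.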

Binary Size

  • Baseline (./configure): 26M
  • ./configure --with-bolt, before BOLT optimization: 48M (larger because of the extra compile/link options the BOLT build requires)
  • ./configure --with-bolt, after BOLT optimization: 9.7M (the optimized binary also has its debug info stripped, per the BOLT warning above)

Heatmap

ICache Miss

$ perf stat -e instructions,L1-icache-misses -- python -m pyperformance run
| Experiment | instructions | L1-icache-misses | ratio |
| --- | --- | --- | --- |
| Base | 7,587,709,171,226 | 57,490,869,440 | 0.7% |
| BOLT | 7,388,373,509,871 | 28,787,449,397 | 0.3% |
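The ratio column is just L1-icache-misses divided by instructions; a quick sketch using the counts above (the source table appears to truncate rather than round the percentages):

```python
# perf stat counts from the table above.
instructions = {"Base": 7_587_709_171_226, "BOLT": 7_388_373_509_871}
misses = {"Base": 57_490_869_440, "BOLT": 28_787_449_397}

for name in instructions:
    # Fraction of executed instructions that missed L1 icache.
    ratio = misses[name] / instructions[name]
    print(f"{name}: {ratio:.2%} L1-icache miss ratio")
```

The roughly halved miss ratio is the main mechanism by which BOLT speeds up a large binary like CPython: hot code is packed together so more of it stays resident in the instruction cache.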

@corona10
Author

corona10 commented Jan 16, 2022

B: CPython BOLT experiment (PGO vs PGO + BOLT)

Environment

Binary Size

  • ./configure --enable-optimizations: 25M
  • ./configure --enable-optimizations --with-bolt: 52M
  • ./configure --enable-optimizations --with-bolt, after BOLT optimization: 9.8M

Benchmark

| Benchmark | pgo | pgo_bolt |
| --- | --- | --- |
| 2to3 | 323 ms | 297 ms: 1.09x faster |
| chaos | 81.7 ms | 80.5 ms: 1.02x faster |
| deltablue | 4.70 ms | 4.65 ms: 1.01x faster |
| fannkuch | 439 ms | 452 ms: 1.03x slower |
| float | 94.8 ms | 94.0 ms: 1.01x faster |
| go | 160 ms | 156 ms: 1.02x faster |
| hexiom | 7.48 ms | 7.23 ms: 1.03x faster |
| json_dumps | 13.9 ms | 14.1 ms: 1.01x slower |
| json_loads | 28.6 us | 28.8 us: 1.01x slower |
| logging_format | 6.95 us | 6.65 us: 1.04x faster |
| logging_silent | 125 ns | 124 ns: 1.01x faster |
| logging_simple | 6.36 us | 6.16 us: 1.03x faster |
| meteor_contest | 115 ms | 111 ms: 1.04x faster |
| nbody | 116 ms | 118 ms: 1.01x slower |
| nqueens | 95.0 ms | 95.7 ms: 1.01x slower |
| pickle | 11.7 us | 11.2 us: 1.05x faster |
| pickle_dict | 31.4 us | 29.4 us: 1.07x faster |
| pickle_list | 4.70 us | 4.41 us: 1.07x faster |
| pickle_pure_python | 364 us | 356 us: 1.02x faster |
| pidigits | 206 ms | 201 ms: 1.03x faster |
| pyflate | 491 ms | 488 ms: 1.01x faster |
| raytrace | 363 ms | 360 ms: 1.01x faster |
| regex_compile | 151 ms | 151 ms: 1.00x faster |
| regex_dna | 217 ms | 207 ms: 1.05x faster |
| regex_effbot | 3.39 ms | 3.27 ms: 1.04x faster |
| regex_v8 | 26.4 ms | 25.7 ms: 1.03x faster |
| scimark_fft | 374 ms | 377 ms: 1.01x slower |
| scimark_lu | 132 ms | 131 ms: 1.01x faster |
| scimark_monte_carlo | 76.1 ms | 75.7 ms: 1.01x faster |
| scimark_sor | 141 ms | 140 ms: 1.00x faster |
| scimark_sparse_mat_mult | 5.27 ms | 5.32 ms: 1.01x slower |
| spectral_norm | 124 ms | 122 ms: 1.02x faster |
| sqlite_synth | 3.91 us | 3.70 us: 1.06x faster |
| telco | 7.23 ms | 7.10 ms: 1.02x faster |
| unpack_sequence | 51.8 ns | 48.5 ns: 1.07x faster |
| unpickle | 16.0 us | 15.6 us: 1.03x faster |
| unpickle_list | 5.32 us | 5.35 us: 1.01x slower |
| unpickle_pure_python | 289 us | 286 us: 1.01x faster |
| xml_etree_parse | 159 ms | 156 ms: 1.02x faster |
| xml_etree_iterparse | 113 ms | 114 ms: 1.01x slower |
| xml_etree_generate | 95.3 ms | 95.1 ms: 1.00x faster |
| xml_etree_process | 67.7 ms | 67.1 ms: 1.01x faster |
| Geometric mean | (ref) | 1.02x faster |

Benchmark hidden because not significant (4): pathlib, python_startup, python_startup_no_site, richards

ICache Miss

| Experiment | instructions | L1-icache-misses | ratio |
| --- | --- | --- | --- |
| PGO | 7,477,886,123,191 | 46,286,113,460 | 0.6% |
| PGO + BOLT | 7,070,332,908,269 | 32,415,421,302 | 0.4% |

Heatmap

@corona10
Author

corona10 commented Jan 18, 2022

C: CPython BOLT experiment (PGO + LTO vs PGO + LTO + BOLT)

Environment

Binary Size

  • ./configure --enable-optimizations --with-lto: 28M
  • ./configure --enable-optimizations --with-lto --with-bolt: 80M
  • ./configure --enable-optimizations --with-lto --with-bolt, after BOLT optimization: 9.9M

ICache miss

| Experiment | instructions | L1-icache-misses | ratio |
| --- | --- | --- | --- |
| PGO + LTO | 6,685,195,147,835 | 49,473,656,139 | 0.7% |
| PGO + LTO + BOLT | 7,070,332,908,269 | 32,415,421,302 | 0.4% |

Benchmark

| Benchmark | pgo_lto | pgo_lto_bolt |
| --- | --- | --- |
| 2to3 | 300 ms | 302 ms: 1.01x slower |
| chaos | 74.3 ms | 82.9 ms: 1.12x slower |
| deltablue | 4.22 ms | 5.02 ms: 1.19x slower |
| fannkuch | 396 ms | 455 ms: 1.15x slower |
| float | 88.4 ms | 98.0 ms: 1.11x slower |
| go | 151 ms | 177 ms: 1.17x slower |
| hexiom | 6.71 ms | 7.86 ms: 1.17x slower |
| json_dumps | 12.9 ms | 13.3 ms: 1.03x slower |
| json_loads | 27.3 us | 30.6 us: 1.12x slower |
| logging_format | 6.47 us | 6.98 us: 1.08x slower |
| logging_silent | 113 ns | 136 ns: 1.21x slower |
| logging_simple | 5.92 us | 6.30 us: 1.06x slower |
| meteor_contest | 110 ms | 116 ms: 1.05x slower |
| nbody | 108 ms | 109 ms: 1.01x slower |
| nqueens | 86.7 ms | 99.0 ms: 1.14x slower |
| pathlib | 21.4 ms | 22.4 ms: 1.05x slower |
| pickle | 11.0 us | 11.7 us: 1.06x slower |
| pickle_dict | 30.9 us | 32.5 us: 1.05x slower |
| pickle_pure_python | 337 us | 385 us: 1.14x slower |
| pidigits | 208 ms | 231 ms: 1.11x slower |
| pyflate | 455 ms | 529 ms: 1.16x slower |
| python_startup | 10.1 ms | 10.5 ms: 1.04x slower |
| python_startup_no_site | 7.36 ms | 7.60 ms: 1.03x slower |
| raytrace | 326 ms | 371 ms: 1.14x slower |
| regex_compile | 137 ms | 156 ms: 1.13x slower |
| regex_dna | 221 ms | 203 ms: 1.09x faster |
| regex_effbot | 3.17 ms | 3.29 ms: 1.04x slower |
| regex_v8 | 25.6 ms | 26.0 ms: 1.01x slower |
| richards | 52.8 ms | 62.1 ms: 1.18x slower |
| scimark_fft | 324 ms | 373 ms: 1.15x slower |
| scimark_lu | 117 ms | 133 ms: 1.14x slower |
| scimark_monte_carlo | 73.4 ms | 76.9 ms: 1.05x slower |
| scimark_sor | 127 ms | 140 ms: 1.10x slower |
| scimark_sparse_mat_mult | 4.54 ms | 5.17 ms: 1.14x slower |
| spectral_norm | 107 ms | 118 ms: 1.10x slower |
| sqlite_synth | 3.61 us | 3.70 us: 1.02x slower |
| unpack_sequence | 44.4 ns | 47.6 ns: 1.07x slower |
| unpickle | 14.7 us | 17.4 us: 1.18x slower |
| unpickle_list | 5.04 us | 4.96 us: 1.01x faster |
| unpickle_pure_python | 256 us | 301 us: 1.18x slower |
| xml_etree_parse | 156 ms | 169 ms: 1.09x slower |
| xml_etree_iterparse | 106 ms | 112 ms: 1.06x slower |
| xml_etree_generate | 86.9 ms | 95.6 ms: 1.10x slower |
| xml_etree_process | 60.9 ms | 67.2 ms: 1.10x slower |
| Geometric mean | (ref) | 1.09x slower |

Benchmark hidden because not significant (2): pickle_list, telco

@corona10
Author

Note

  • PGO + LTO + BOLT is slower than PGO + BOLT (geometric mean: 1.03x slower).
  • The L1 icache miss ratio is clearly reduced.

@gvanrossum
Collaborator

So which do you recommend?

@corona10
Author

@gvanrossum

> So which do you recommend?

I am investigating why PGO + LTO + BOLT turned out slower than PGO + BOLT, looking at the following:

  • How does Pyston handle BOLT?
  • Do we need compiler option changes for BOLT?
  • Do we need to change the BOLT configuration? (The current configuration is tuned for gcc, not for CPython specifically.)

If we can get better results by tuning these things, and if PGO + BOLT or PGO + LTO + BOLT can beat PGO + LTO,
I will adopt the BOLT optimization pass; if not, it will not be worth adding.

I am going to keep leaving experiment notes on this issue. Do you think that is too noisy?

@gvanrossum
Collaborator

I love reading your notes, that’s what I do.

@corona10
Author

corona10 commented Jan 21, 2022

After investigating BOLT itself and Pyston's use case through several experiments, I have decided to gather profiles from pyperformance.

The reasons are as follows:

  1. BOLT is a post-link optimizer, originally designed to apply profile data collected from real-world services; this is why BOLT's documentation has a section on service profiling. For those who do not want a real-world workload, using PGO and collecting data from the unit tests is the easier path. Pyston, in fact, collects its BOLT profile data from https://github.com/pyston/python-macrobenchmarks.
  2. The binary generated from standard unit-test profiling fails to pass all unit tests, while the binary from pyperformance profiling passes them all. This might be a bug in BOLT and may need further investigation.

I am going to compare benchmark results using pyperformance profile data. Due to vendoring concerns, I may need to provide a tool for the BOLT optimizer under the Tools/ directory.

@corona10
Author

C (revisited): CPython BOLT experiment (PGO + LTO vs PGO + LTO + BOLT, profiled with pyperformance)

Environment

Benchmark

| Benchmark | pgo_lto_base | pgo_lto_bolt_pyperformance |
| --- | --- | --- |
| 2to3 | 304 ms | 280 ms: 1.09x faster |
| chaos | 73.2 ms | 73.8 ms: 1.01x slower |
| float | 92.1 ms | 89.1 ms: 1.03x faster |
| hexiom | 6.75 ms | 6.58 ms: 1.03x faster |
| json_loads | 27.4 us | 27.2 us: 1.01x faster |
| logging_format | 6.60 us | 6.50 us: 1.02x faster |
| logging_silent | 116 ns | 114 ns: 1.01x faster |
| logging_simple | 6.02 us | 5.96 us: 1.01x faster |
| meteor_contest | 111 ms | 109 ms: 1.01x faster |
| nbody | 108 ms | 113 ms: 1.04x slower |
| pathlib | 21.1 ms | 21.5 ms: 1.02x slower |
| pickle | 11.4 us | 11.0 us: 1.03x faster |
| pickle_dict | 31.2 us | 30.1 us: 1.04x faster |
| pickle_list | 4.75 us | 4.51 us: 1.05x faster |
| pickle_pure_python | 338 us | 342 us: 1.01x slower |
| pidigits | 199 ms | 203 ms: 1.02x slower |
| pyflate | 463 ms | 457 ms: 1.01x faster |
| raytrace | 327 ms | 325 ms: 1.01x faster |
| regex_compile | 138 ms | 138 ms: 1.00x faster |
| regex_dna | 218 ms | 216 ms: 1.01x faster |
| regex_effbot | 3.11 ms | 3.22 ms: 1.04x slower |
| regex_v8 | 25.3 ms | 26.1 ms: 1.03x slower |
| richards | 53.3 ms | 53.6 ms: 1.01x slower |
| scimark_fft | 327 ms | 323 ms: 1.01x faster |
| scimark_monte_carlo | 72.4 ms | 71.1 ms: 1.02x faster |
| scimark_sor | 128 ms | 126 ms: 1.01x faster |
| scimark_sparse_mat_mult | 4.41 ms | 4.36 ms: 1.01x faster |
| spectral_norm | 107 ms | 106 ms: 1.00x faster |
| sqlite_synth | 3.57 us | 3.63 us: 1.02x slower |
| telco | 6.66 ms | 6.76 ms: 1.01x slower |
| unpack_sequence | 49.9 ns | 46.5 ns: 1.07x faster |
| unpickle | 14.3 us | 14.8 us: 1.03x slower |
| unpickle_pure_python | 258 us | 255 us: 1.01x faster |
| xml_etree_parse | 154 ms | 153 ms: 1.01x faster |
| xml_etree_generate | 86.5 ms | 85.6 ms: 1.01x faster |
| xml_etree_process | 61.0 ms | 60.3 ms: 1.01x faster |
| Geometric mean | (ref) | 1.01x faster |

Benchmark hidden because not significant (10): deltablue, fannkuch, go, json_dumps, nqueens, python_startup, python_startup_no_site, scimark_lu, unpickle_list, xml_etree_iterparse

@corona10
Author

> I am investigating why PGO + LTO + BOLT became slower than PGO + BOLT

Because I had passed the wrong optimization option (--with-optimizations) :(
The results in #224 (comment) were collected with valid optimization options.

@corona10
Author

D: CPython BOLT experiment (PGO + LTO + BOLT + profiling with -m test vs PGO + LTO + BOLT + profiling with pyperformance)

Environment

Benchmark

| Benchmark | pgo_lto_bolt_std | pgo_lto_bolt_pyperformance |
| --- | --- | --- |
| 2to3 | 281 ms | 280 ms: 1.00x faster |
| chaos | 74.5 ms | 73.8 ms: 1.01x faster |
| go | 152 ms | 152 ms: 1.00x faster |
| hexiom | 6.59 ms | 6.58 ms: 1.00x faster |
| json_dumps | 13.2 ms | 13.2 ms: 1.00x faster |
| json_loads | 27.4 us | 27.2 us: 1.01x faster |
| logging_format | 6.47 us | 6.50 us: 1.00x slower |
| logging_simple | 5.93 us | 5.96 us: 1.01x slower |
| nqueens | 87.2 ms | 86.7 ms: 1.01x faster |
| pathlib | 21.6 ms | 21.5 ms: 1.00x faster |
| pickle_list | 4.55 us | 4.51 us: 1.01x faster |
| python_startup | 10.3 ms | 10.2 ms: 1.00x faster |
| python_startup_no_site | 7.43 ms | 7.41 ms: 1.00x faster |
| regex_compile | 138 ms | 138 ms: 1.00x faster |
| regex_v8 | 25.4 ms | 26.1 ms: 1.03x slower |
| scimark_monte_carlo | 71.5 ms | 71.1 ms: 1.01x faster |
| Geometric mean | (ref) | 1.00x faster |

Benchmark hidden because not significant (30): deltablue, fannkuch, float, logging_silent, meteor_contest, nbody, pickle, pickle_dict, pickle_pure_python, pidigits, pyflate, raytrace, regex_dna, regex_effbot, richards, scimark_fft, scimark_lu, scimark_sor, scimark_sparse_mat_mult, spectral_norm, sqlite_synth, telco, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_iterparse, xml_etree_generate, xml_etree_process

@corona10
Author

@gvanrossum cc @vstinner

I have gathered all the benchmark data for BOLT, and I am not sure it is worth providing a BOLT optimization pass for a 1% performance gain.
IMHO, providing the BOLT optimization option will make more sense once BOLT becomes more stable, or once the CPython binary grows larger than it is now.
WDYT?

@gvanrossum
Collaborator

Thanks for running these extensive tests. It does look like it's not worth making the build process even more complex.

I wonder if a more fruitful approach would be to come up with a better set of training code for PGO? (That belongs in a different issue. :-)

@corona10
Author

I think so too :) Let's close the issue; if someone wants to reopen it, they are welcome to do so.

kmod added a commit to kmod/cpython that referenced this issue Aug 11, 2022
Using [bolt](https://github.com/llvm/llvm-project/tree/main/bolt)
provides a fairly large speedup without any code or functionality
changes. It provides roughly a 1% speedup on pyperformance, and a
4% improvement on the Pyston web macrobenchmarks.

It is gated behind an `--enable-bolt` configure arg because not all
toolchains and environments are supported. It has been tested on a
Linux x86_64 toolchain, using llvm-bolt from LLVM 14.0.6.

Compared to [a previous attempt](faster-cpython/ideas#224),
this commit uses bolt's preferred "instrumentation" approach, as well as adds some non-PIE
flags which enable much better optimizations from bolt.

The effects of this change are a bit more dependent on CPU microarchitecture
than other changes, since it optimizes i-cache behavior which seems
to be a bit more variable between architectures. The 1%/4% numbers
were collected on an Intel Skylake CPU, and on an AMD Zen 3 CPU I
got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance
I got a slightly lower speedup (1%/3%).

The low speedup on pyperformance is not entirely unexpected, because
BOLT improves i-cache behavior which is typically not tested by the small
benchmarks in the pyperformance suite.

This change uses the existing pgo profiling task (`python -m test --pgo`),
though I was able to measure about a 1% macrobenchmark improvement by
using the macrobenchmarks as the training task. I personally think that
both the PGO and BOLT tasks should be updated to use macrobenchmarks,
but for the sake of splitting up the work this PR uses the existing pgo task.
corona10 added a commit to python/cpython that referenced this issue Aug 18, 2022
* Add support for the BOLT post-link binary optimizer

Using [bolt](https://github.com/llvm/llvm-project/tree/main/bolt)
provides a fairly large speedup without any code or functionality
changes. It provides roughly a 1% speedup on pyperformance, and a
4% improvement on the Pyston web macrobenchmarks.

It is gated behind an `--enable-bolt` configure arg because not all
toolchains and environments are supported. It has been tested on a
Linux x86_64 toolchain, using llvm-bolt built from the LLVM 14.0.6
sources (their binary distribution of this version did not include bolt).

Compared to [a previous attempt](faster-cpython/ideas#224),
this commit uses bolt's preferred "instrumentation" approach, as well as adds some non-PIE
flags which enable much better optimizations from bolt.

The effects of this change are a bit more dependent on CPU microarchitecture
than other changes, since it optimizes i-cache behavior which seems
to be a bit more variable between architectures. The 1%/4% numbers
were collected on an Intel Skylake CPU, and on an AMD Zen 3 CPU I
got a slightly larger speedup (2%/4%), and on a c6i.xlarge EC2 instance
I got a slightly lower speedup (1%/3%).

The low speedup on pyperformance is not entirely unexpected, because
BOLT improves i-cache behavior, and the benchmarks in the pyperformance
suite are small and tend to fit in i-cache.

This change uses the existing pgo profiling task (`python -m test --pgo`),
though I was able to measure about a 1% macrobenchmark improvement by
using the macrobenchmarks as the training task. I personally think that
both the PGO and BOLT tasks should be updated to use macrobenchmarks,
but for the sake of splitting up the work this PR uses the existing pgo task.

* Simplify the build flags

* Add a NEWS entry

* Update Makefile.pre.in

Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>

* Update configure.ac

Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>

* Add myself to ACKS

* Add docs

* Other review comments

* fix tab/space issue

* Make it more clear that --enable-bolt is experimental

* Add link to bolt's github page

Co-authored-by: Dong-hee Na <donghee.na92@gmail.com>
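The merged change uses BOLT's instrumentation mode rather than perf sampling, which avoids the LBR hardware requirement. A hedched sketch of what that flow looks like with the standalone BOLT tools follows; the tool names and flags (`-instrument`, `-data`) are real LLVM BOLT options, but the exact invocation and flag set CPython's Makefile ended up using may differ:

```shell
#!/bin/sh
set -e

# 1. Produce an instrumented copy of the binary with profiling
#    counters inserted.
llvm-bolt ./python -o ./python.instrumented -instrument

# 2. Run the training workload; the instrumented binary writes its
#    profile to /tmp/prof.fdata by default.
./python.instrumented -m test --pgo

# 3. Rewrite the original binary using the collected profile.
llvm-bolt ./python -o ./python.bolt -data=/tmp/prof.fdata \
    -reorder-blocks=cache+ -reorder-functions=hfsort+ \
    -split-functions=3 -split-all-cold -icf=1 -use-gnu-stack
```

Instrumentation runs the training workload more slowly than a sampled run, but it works on any CPU and produces a complete profile, which is presumably part of why the merged `--enable-bolt` support chose it.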