
Add benchmark suite #542

Merged: 9 commits, Jan 25, 2024

Conversation

lapp0 (Collaborator) commented Jan 16, 2024

  • Benchmark numba function compilation in outlines/fsm/regex.py
  • Benchmark JSON schema to regex conversion (build_regex_from_object) in outlines/fsm/json_schema.py
  • Benchmark regex-to-FSM conversion in outlines/fsm/regex.py
  • Document usage

There is no benchmark coverage for generation yet. I have some code written for it, but I'd like this merged as a baseline before continuing.
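
For context, these benchmarks follow the standard pytest-benchmark pattern. A minimal sketch of what such a test can look like is below; the import path, the exact signature of build_regex_from_object, and the sample schema are assumptions for illustration, not code from this PR.

import json

import pytest

from outlines.fsm.json_schema import build_regex_from_object  # import path assumed

# Illustrative schema; the PR's simple_schema/complex_schema fixtures are not reproduced here.
schemas = {
    "simple_schema": json.dumps(
        {"type": "object", "properties": {"name": {"type": "string"}}}
    ),
}

@pytest.mark.parametrize("schema_name", list(schemas))
def test_benchmark_json_schema_to_regex(benchmark, schema_name):
    """Benchmark converting a JSON schema into its equivalent regex."""
    schema = schemas[schema_name]
    benchmark(build_regex_from_object, schema)  # pytest-benchmark times this call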

Generate benchmarks

pytest --benchmark-only --benchmark-columns=mean,max

------------------------------------------ benchmark: 14 tests -------------------------------------------
Name (time in us)                                                 Mean                       Max          
----------------------------------------------------------------------------------------------------------
test_benchmark_json_schema_to_regex[simple_schema]             59.3640 (1.0)            172.8980 (1.49)   
test_benchmark_json_schema_to_regex[complex_schema]            59.3866 (1.00)           116.3650 (1.0)    
test_benchmark_regex_to_fsm[ip]                            59,500.5427 (>1000.0)     66,531.2210 (571.75) 
test_benchmark_regex_to_fsm[time]                          59,145.1800 (996.31)      65,897.8390 (566.30) 
test_benchmark_regex_to_fsm[date]                          60,256.4941 (>1000.0)     63,790.9300 (548.20) 
test_benchmark_regex_to_fsm[simple_phone]                  60,757.2710 (>1000.0)     63,508.6560 (545.77) 
test_benchmark_regex_to_fsm[ssn]                           63,519.3974 (>1000.0)     77,955.7490 (669.92) 
test_benchmark_regex_to_fsm[complex_phone]                 64,853.8790 (>1000.0)     86,702.4480 (745.09) 
test_benchmark_regex_to_fsm[email]                         64,185.6854 (>1000.0)     70,712.7130 (607.68) 
test_benchmark_regex_to_fsm[url]                           76,687.0116 (>1000.0)     85,416.9130 (734.04) 
test_benchmark_json_schema_to_fsm[simple_schema]           92,248.5355 (>1000.0)    105,986.8340 (910.81) 
test_benchmark_regex_to_fsm[quite_complex]                 90,073.5474 (>1000.0)     94,483.2810 (811.96) 
test_benchmark_json_schema_to_fsm[complex_schema]          96,999.0614 (>1000.0)    102,476.1180 (880.64) 
test_benchmark_compile_numba                            6,087,309.6160 (>1000.0)  9,463,025.1640 (>1000.0)
----------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

Comparing two branches

Generate initial results:

git checkout benchmark-test-suite
pip install .
pytest -s tests/benchmark/test_benchmark_regex_fsm.py --benchmark-only --benchmark-columns=mean,max --benchmark-json=main-profile_regex_fsm.json
pytest -s tests/benchmark/test_benchmark_numba_compile.py --benchmark-only --benchmark-columns=mean,max --benchmark-json=main-profile_compile_numba.json

git checkout fsm-with-trie
pip install .
pytest -s tests/benchmark/test_benchmark_regex_fsm.py --benchmark-only --benchmark-columns=mean,max --benchmark-json=trie-profile_regex_fsm.json
pytest -s tests/benchmark/test_benchmark_numba_compile.py --benchmark-only --benchmark-columns=mean,max --benchmark-json=trie-profile_compile_numba.json

Generate comparisons:

py.test-benchmark compare --sort=fullname --columns=mean,max main-profile_regex_fsm.json trie-profile_regex_fsm.json

-------------------------------------- benchmark: 18 tests ---------------------------------------
Name (time in s)                                                 Mean                Max          
--------------------------------------------------------------------------------------------------
test_benchmark_regex_to_fsm[complex_phone] (main-profile)      1.2403 (7.45)      1.4116 (8.14)   
test_benchmark_regex_to_fsm[complex_phone] (trie-profile)      0.3848 (2.31)      0.4245 (2.45)   
test_benchmark_regex_to_fsm[date] (main-profile)               0.7324 (4.40)      0.7493 (4.32)   
test_benchmark_regex_to_fsm[date] (trie-profile)               0.2405 (1.44)      0.3268 (1.88)   
test_benchmark_regex_to_fsm[email] (main-profile)              1.7789 (10.69)    10.3718 (59.78)  
test_benchmark_regex_to_fsm[email] (trie-profile)              2.2854 (13.73)    13.0081 (74.98)  
test_benchmark_regex_to_fsm[ip] (main-profile)                 0.5345 (3.21)      0.5558 (3.20)   
test_benchmark_regex_to_fsm[ip] (trie-profile)                 0.1739 (1.04)      0.1956 (1.13)   
test_benchmark_regex_to_fsm[quite_complex] (main-profile)     10.0635 (60.46)    10.1402 (58.45)  
test_benchmark_regex_to_fsm[quite_complex] (trie-profile)      3.8243 (22.98)     3.9292 (22.65)  
test_benchmark_regex_to_fsm[simple_phone] (main-profile)       0.3994 (2.40)      0.4185 (2.41)   
test_benchmark_regex_to_fsm[simple_phone] (trie-profile)       0.2163 (1.30)      0.2294 (1.32)   
test_benchmark_regex_to_fsm[ssn] (main-profile)                0.2920 (1.75)      0.3024 (1.74)   
test_benchmark_regex_to_fsm[ssn] (trie-profile)                0.1737 (1.04)      0.1788 (1.03)   
test_benchmark_regex_to_fsm[time] (main-profile)               0.2660 (1.60)      0.2833 (1.63)   
test_benchmark_regex_to_fsm[time] (trie-profile)               0.1664 (1.0)       0.1735 (1.0)    
test_benchmark_regex_to_fsm[url] (main-profile)                0.7149 (4.29)      0.7527 (4.34)   
test_benchmark_regex_to_fsm[url] (trie-profile)                0.9758 (5.86)      1.0571 (6.09)   
--------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

py.test-benchmark compare --sort=fullname --columns=mean,max main-profile_compile_numba.json trie-profile_compile_numba.json

-------------------------------- benchmark: 2 tests -------------------------------
Name (time in s)                                  Mean                Max          
-----------------------------------------------------------------------------------
test_benchmark_compile_numba (main-profile)     6.3623 (1.0)       9.9192 (1.0)    
test_benchmark_compile_numba (trie-profile)     8.4004 (1.32)     13.4900 (1.36)   
-----------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean

@brandonwillard brandonwillard linked an issue Jan 16, 2024 that may be closed by this pull request
@rlouf rlouf marked this pull request as draft January 16, 2024 19:43
@rlouf rlouf changed the title from "Add Base Benchmark Suite" to "Add benchmark suite" Jan 16, 2024
@rlouf rlouf added the enhancement and tests labels Jan 16, 2024
@lapp0 lapp0 marked this pull request as ready for review January 17, 2024 12:02
):
    """Benchmark converting regex to FSM"""
    regex_str = regex_samples[regex_name]
    benchmark.pedantic(

rlouf (Member) commented Jan 18, 2024

Is there any particular reason why you are using the pedantic mode here and in the other benchmarks?

lapp0 (Collaborator, Author) commented Jan 18, 2024

IMO, it's cleaner than

create_rfsm = lambda: RegexFSM(regex_str, tokenizer)
benchmark(create_rfsm)

Additionally, it allows fine-grained control over the number of runs. A hedged sketch of how the pedantic call reads with explicit run counts is below; the rounds/iterations values are illustrative rather than the PR's actual settings, and the tokenizer/regex_samples fixtures plus the RegexFSM import path are assumed from the PR's test setup.
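
import pytest

from outlines.fsm.fsm import RegexFSM  # import path assumed

@pytest.mark.parametrize("regex_name", ["ip", "email", "url"])  # illustrative subset
def test_benchmark_regex_to_fsm(benchmark, tokenizer, regex_samples, regex_name):
    """Benchmark converting regex to FSM with an explicitly pinned number of runs."""
    regex_str = regex_samples[regex_name]
    benchmark.pedantic(
        RegexFSM,                 # the constructor itself is the benchmarked target
        args=(regex_str, tokenizer),
        rounds=8,                 # illustrative values; pedantic mode pins these explicitly
        iterations=1,
        warmup_rounds=1,
    )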


rlouf commented Jan 18, 2024

Thank you for opening a PR! Don't you think it would be best to always benchmark the end-to-end index computation? This is the quantity we care about.


lapp0 commented Jan 18, 2024

Thank you for opening a PR! Don't you think it would be best to always benchmark the end-to-end index computation? This is the quantity we care about.

Could you please clarify? We are benchmarking the computation of the RegexFSM index in this PR, which includes create_fsm_index_end_to_end. Which component would you like me to include?


rlouf commented Jan 18, 2024

I mean not separating the Numba compilation from the rest of the index compilation. Total time is what we care about. Does that make sense?


lapp0 commented Jan 18, 2024

Numba initial compilation is a one-time occurrence and takes ~9,400 ms per the posted benchmark. Specifically, it's benchmarking the generation of

outlines/fsm/__pycache__/regex.state_scan_tokens-461.py311.nbi
outlines/fsm/__pycache__/regex.create_vocab_trie-471.py311.nbi
outlines/fsm/__pycache__/regex.create_fsm_index_end_to_end-494.py311.nbi
outlines/fsm/__pycache__/regex._walk_fsm-225.py311.nbi
outlines/fsm/__pycache__/regex.create_fsm_info-95.py311.nbi
outlines/fsm/__pycache__/regex.state_scan_tokens-502.py311.nbi

After compilation, generating a RegexFSM takes on average tens to hundreds of ms. I think it makes sense to separate compilation from execution benchmarks.

Otherwise, optimizations (or performance degradations) of RegexFSM construction would be smaller than the variance in numba compilation time.
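
For reference, one way a compile benchmark can keep measuring a cold compile is to clear numba's on-disk cache between rounds. A hypothetical helper (not part of this PR; the directory and filename pattern are assumptions based on the cache files listed above) might look like:

import pathlib

def clear_numba_cache(fsm_dir: str = "outlines/fsm") -> None:
    """Delete numba's on-disk cache so the next call triggers a fresh compilation.

    Hypothetical helper for illustration only: numba writes cached compilations
    as .nbi/.nbc files under the module's __pycache__ directory, as listed above.
    """
    cache_dir = pathlib.Path(fsm_dir) / "__pycache__"
    for cache_file in cache_dir.glob("regex.*.nb[ic]"):
        cache_file.unlink()

In pedantic mode, such a helper could be passed as the setup callback so every round starts from a cold cache.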


rlouf commented Jan 23, 2024

I can get on board with that. Would you mind rebasing on main and fixing any merge conflicts?

@rlouf rlouf merged commit d534c2f into outlines-dev:main Jan 25, 2024
5 checks passed

rlouf commented Jan 25, 2024

Thank you!

Labels: enhancement, tests (Linked to library tests)
Successfully merging this pull request may close: Add performance benchmarks to test suite