Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core: matchGrammar now uses a linked list #1909

Merged
merged 6 commits into from
Mar 25, 2020

Conversation

RunDevelopment
Copy link
Member

@RunDevelopment RunDevelopment commented May 25, 2019

This changes the implementation of matchGrammar to use a linked list instead of an array to replace this O(n) (worst case) operation with an O(1) operation.

Prism.tokenize performance on the minified and dev source of Prism with all languages from the download page:

Code Size Array Linked list Gain
Prism JS lang 3.5kB 0.232ms +-0.304µs
10k samples
0.179ms +-0.262µs
10k samples
1.3x faster
Prism Core min 6kB 1.26ms +-3.08µs
1000 samples
0.818ms +-3.45ms
1000 samples
1.54x faster
Prism Core 13kB 2ms +-7µs
500 samples
1.45ms +-5.62µs
500 samples
1.38x faster
Minified
(Core + all languages)
317kB 98.2ms +-0.177ms
50 samples
73.9ms +-0.198ms
50 samples
1.33x faster
Dev
(Core + all languages)
403kB 111.5ms +-4.4ms
50 samples
66.5ms +-0.476ms
50 samples
1.68x faster
Dev² 806kB 3913ms +-53ms
10 samples
193ms +-0.432ms
50 samples
20.3x faster
Some big JSON
using JS grammar
39MB 142s
1 sample
1440ms +-33.5ms
5 samples
98.6x faster
Same JSON
using JSON grammar
39MB 214s
1 sample
876ms +-4.45ms
5 samples
244x faster

(Dev² is the code of dev concatenated with the code of dev)

Test code
const fs = require('fs');
const { performance } = require('perf_hooks');
const Prism = require('./components/prism-core');
const loadLanguages = require('./components/index');

loadLanguages(['javascript']);
const js = Prism.languages['js'];


// load the test code
const code = fs.readFileSync('./testCode.txt', 'utf-8');
console.log(`${code.length} characters`);


function measure(func, sampleCount = 20) {
	var samples = [];
	for (let i = 0; i < sampleCount; i++) {
		const start = performance.now();
		func();
		samples.push(performance.now() - start);
	}
	const e = samples.reduce((y, x) => x + y, 0) / samples.length;
	const std = Math.sqrt(samples.reduce((sum, x) => sum + (x - e) ** 2, 0)) / (samples.length - 1);

	return `${e}ms +-${std}ms`
}


console.log('Warm up');
measure(() => Prism.tokenize(code, js));


console.log('measure');
console.log(`tokenize: ${measure(() => Prism.tokenize(code, js), 50)}`);

Why it isn't always hugely faster

Before we start: Please note that n is not the number of characters in the text but the current number of tokens in the token array/list. This number increases over time and depends on both the text being highlighted and the grammar used.

The gain in performance is significant but less than I expected for switching from an O(n) operation to an O(1) operation.

The main reason for this, I think, is that splice will operate quite often in O(1) which is the case when as many tokens are deleted as inserted.
The easiest example of this is when a regex matches the whole string (not the whole text but a whole token string), in which case the entire string will be replaced with the created token making splice O(1). The more token there already are in the array, the more likely this is to happen because the strings get shorter and shorter.
This means that splice is very likely to operate in O(n) when n is small (there are few tokens compared to the overall text length) and it is very likely to operate in O(1) when n is large.

This is most likely the reason why the array implementation is faster for the minified code than for the dev code. Because all spaces are removed, the last regexes (which are punctuation and operators) will almost always match the entire string. This advantage is enough to combat the increase in greedy rematching which can be seen in the extra milliseconds of the linked list implementation.
Greedy rematching is also the main reason why the gain isn't as big for the minified code.

That being said. For a text large enough, the linked list implementation will always be significantly faster.

So, what now?

The performance gain is only really noticeable for large amounts of text and comes with some extras bytes (~700bytes minified). So this PR won't help the average user which just wants to highlight his/her ~200 lines of code. Half a millisecond faster or not just doesn't make a difference.

Another caveat is that this is technically a breaking change because I changed the parameters of matchGrammar. It's not even possible to use this function from the outside anymore because I don't expose the LinkedList class. Sure enough, the function isn't documented anywhere and could be seen as undefined behavior but this PR might still break someone's code.
(Why do we expose this function in the first place?)

The average user won't noticeable profit from this PR and it might even break some people's code (because matchGrammar is exposed (for some reason) and therefore technically public) but the performance gain for people who highlight large amounts of text is very real.

Please tell me what you think.

@RunDevelopment
Copy link
Member Author

RunDevelopment commented Mar 20, 2020

Here's a current benchmark:

Benchmark
C:\Users\micha\Git\prism>npm run benchmark -- --maxTime=30 --remotesOnly

> prismjs@1.19.0 benchmark C:\Users\micha\Git\prism
> node benchmark/benchmark.js "--maxTime=30" "--remotesOnly"

Found 6 cases with 14 files in total.
Test 2 candidates with Prism.tokenize
Estimated duration: 14m 0s

------------------------------------------------------------

css

  ../../style.css (7 kB)
  | PrismJS@master                      1.18ms ±  0%  538smp 1.38x
  | RunDevelopment@core-linked-list     0.85ms ±  0%  541smp      

------------------------------------------------------------

css!+css-extras (css)

  ../../style.css (7 kB)
  | PrismJS@master                      1.88ms ±  0%  529smp 1.38x
  | RunDevelopment@core-linked-list     1.37ms ±  0%  538smp      

------------------------------------------------------------

javascript

  ../../components.json (26 kB)
  | PrismJS@master                      3.64ms ±  0%  493smp 2.13x
  | RunDevelopment@core-linked-list     1.71ms ±  0%  523smp      
  ../../package-lock.json (196 kB)
  | PrismJS@master                     81.60ms ± 18%  131smp 7.19x
  | RunDevelopment@core-linked-list    11.35ms ±  2%  429smp      
  ../../prism.js (27 kB)
  | PrismJS@master                      4.82ms ±  0%  498smp 1.54x
  | RunDevelopment@core-linked-list     3.14ms ±  0%  509smp      
  ../../scripts/utopia.js (11 kB)
  | PrismJS@master                      1.80ms ±  0%  531smp 1.47x
  | RunDevelopment@core-linked-list     1.22ms ±  0%  533smp      
  https://code.jquery.com/jquery-3.4.1.js (274 kB)
  | PrismJS@master                    590.68ms ± 34%   26smp 20.13x
  | RunDevelopment@core-linked-list    29.35ms ±  0%  331smp      
  prism.min.js (13 kB)
  | PrismJS@master                      2.91ms ±  0%  521smp 1.62x
  | RunDevelopment@core-linked-list     1.79ms ±  0%  528smp      

------------------------------------------------------------

json

  ../../components.json (26 kB)
  | PrismJS@master                      3.15ms ±  0%  506smp 2.77x
  | RunDevelopment@core-linked-list     1.14ms ±  0%  517smp      
  ../../package-lock.json (196 kB)
  | PrismJS@master                     88.24ms ± 15%  115smp 12.42x
  | RunDevelopment@core-linked-list     7.11ms ±  0%  455smp      

------------------------------------------------------------

markup

  ../../download.html (4 kB)
  | PrismJS@master                      0.51ms ±  0%  549smp 1.52x
  | RunDevelopment@core-linked-list     0.34ms ±  0%  549smp      
  ../../index.html (19 kB)
  | PrismJS@master                      2.58ms ±  0%  531smp 1.46x
  | RunDevelopment@core-linked-list     1.77ms ±  0%  526smp      

------------------------------------------------------------

markup!+css+javascript (markup)

  ../../download.html (4 kB)
  | PrismJS@master                      0.84ms ±  0%  541smp 1.44x
  | RunDevelopment@core-linked-list     0.58ms ±  0%  551smp      
  ../../index.html (19 kB)
  | PrismJS@master                      3.23ms ±  0%  524smp 1.35x
  | RunDevelopment@core-linked-list     2.39ms ±  0%  524smp      

------------------------------------------------------------

summary
                                   best  worst  relative
  PrismJS@master                      0     14     4.13x
  RunDevelopment@core-linked-list    14      0     1.00x

@mAAdhaTTah
Copy link
Member

My initial thoughts are here. @RunDevelopment responded here.

I'd love to hear @Golmote's or @LeaVerou's thoughts on the trade-off between bundle size & performance and whether this trade-off is worth it in this case.

This also reduces the minified size
@RunDevelopment
Copy link
Member Author

Regarding bundle size:
Right now, this will increase the size of prism-core.min.js by ~500Bytes (6446B -> 6962B).
(I can probably save another ~100Bytes by using gulp to rename the internal properties of the linked list implementation which uglify doesn't replace with single-character names. Kinda hacky and about the last thing I can do.)

@Golmote
Copy link
Contributor

Golmote commented Mar 24, 2020

This is pretty smart, I like this idea. This doesn't even add a whole lot of code, really.

Regarding BC, the function was never listed on https://prismjs.com/extending.html so I'd say it should be quite safe to break it. Or this PR can join the others for the 2.0 milestone.

@mAAdhaTTah
Copy link
Member

Fair enough, let's do it.

@RunDevelopment
Copy link
Member Author

Then I'll also make matchGrammar a private function since this will break it anyway.

@RunDevelopment RunDevelopment merged commit 2d4c94c into PrismJS:master Mar 25, 2020
@RunDevelopment RunDevelopment deleted the core-linked-list branch March 25, 2020 12:12
quentinvernot pushed a commit to TankerHQ/prismjs that referenced this pull request Sep 11, 2020
The token streams in `matchGrammar` are now backed by a linked list instead of an array. This guarantees O(1) time for all operations.

The `matchGrammar` is now private.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants