Improve performance of parsing #126

mogsie · 2018-09-26T14:42:51Z

Second half of #120:

Looping through strings in Javascript land is a lot slower than having
the JS engine do it natively. String's indexOf does this faster than
looping through each character.

This change deals with parsed data (i.e. the raw text between the
tags), attribute values (what's in the quotes) and inside XML
comments. These three types of data account for a very large portion
of characters in any XML document, leaving behind mainly names of tags
and of attributes.

It might be worth it to rewrite the switch statement, or move the
optimisations into the switch statement itself, but I focused on what
gave most bang for the buck within reason.

sonnyp · 2018-09-29T10:30:56Z

parsers suite:

before: ltx x 817,222 ops/sec ±1.94% (88 runs sampled)
after: ltx x 827,444 ops/sec ±1.26% (93 runs sampled)

Node.js v10.11.0 - Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz

sonnyp · 2018-09-29T10:32:10Z

@mogsie this looks good but the benchmark does not report a significant increase in performance. Do you have data that suggests otherwise?

mogsie · 2018-09-29T14:18:26Z

Hmm, that's interesting. It obviously only has an effect on the test if

There is a fair amount of text (including whitespace) between the tags
There are any comments
There are attributes with relatively large values (e.g. 100 characters, typially URLs)

Then again, I did my initial tests on node 6 and node 8, nodejs might have optimized their way out of the problem, but if the document you're testing is <foo bar="baz">quux</foo> then it's not a surprise that there is no change, as it has no comments, almost no text and attributes.

I tried replacing the document with the example atom document from Wikipedia:

before: ltx x 28,432 ops/sec ±0.83% (91 runs sampled)
after: ltx x 33,382 ops/sec ±0.86% (89 runs sampled)

after:

suite XML parsers comparison
sax x 8,447 ops/sec ±1.25% (92 runs sampled)
node-xml x 7,407 ops/sec ±0.59% (89 runs sampled)
libxmljs x 7,865 ops/sec ±0.65% (89 runs sampled)
node-expat x 9,205 ops/sec ±0.49% (92 runs sampled)
ltx x 33,382 ops/sec ±0.86% (89 runs sampled)

Still more than three times as fast as node-expat.

mogsie · 2018-09-30T08:02:30Z

I made a simpler change to show how master doesn't scale well with longer texts.

master, unchanged, on my machine (i7-2670QM @ 2.20GHz) w/node 9.11.2

suite XML parsers comparison
sax x 158,052 ops/sec ±1.39% (85 runs sampled)
node-xml x 130,975 ops/sec ±0.70% (91 runs sampled)
libxmljs x 131,653 ops/sec ±0.73% (93 runs sampled)
node-expat x 227,446 ops/sec ±0.42% (92 runs sampled)
ltx x 538,775 ops/sec ±0.99% (91 runs sampled)
Fastest is "ltx"

If I change the benchmark as follows, just lengthening the attribute and text value, the characteristics change:

diff --git a/benchmarks/parsers.js b/benchmarks/parsers.js
index 4ebee36..0a21b8a 100644
--- a/benchmarks/parsers.js
+++ b/benchmarks/parsers.js
@@ -64,7 +64,7 @@ var suite = new benchmark.Suite('XML parsers comparison')
 parsers.forEach(function (parser) {
   parser.parse('<r>')
   suite.add(parser.name, function () {
-    parser.parse('<foo bar="baz">quux</foo>')
+    parser.parse('<foo bar="urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6">urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6 urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6</foo>')
   })
 })

suite XML parsers comparison
sax x 79,411 ops/sec ±0.58% (93 runs sampled)
node-xml x 106,060 ops/sec ±0.88% (90 runs sampled)
libxmljs x 118,203 ops/sec ±0.42% (94 runs sampled)
node-expat x 179,483 ops/sec ±0.37% (94 runs sampled)
ltx x 283,425 ops/sec ±0.84% (92 runs sampled)
Fastest is "ltx"

All tests slow down, some more significantly than others. ltx from 538k to 283k. This PR causes the slowdown to go from 538k to 489k

original
ltx x 538,775 ops/sec ±0.99% (91 runs sampled)
master w/ longer texts
ltx x 283,425 ops/sec ±0.84% (92 runs sampled)
pr 126 w/longer texts
ltx x 489,294 ops/sec ±0.37% (89 runs sampled)

Do you want me to add the longer texts to the benchmark in this PR?

sonnyp · 2018-09-30T10:01:57Z

Do you want me to add the longer texts to the benchmark in this PR?

Yes please.

👍 good job!

This is to ensure that the benchmarking isn't skewed towards small attribute values, and documents without significant portions of text, which it was before xmppjs#126

mogsie · 2018-09-30T18:18:25Z

Done!

Looping through strings in Javascript land is a lot slower than having the JS engine do it natively. String's indexOf does this faster than looping through each character. This change deals with parsed data (i.e. the raw text between the tags), attribute values (what's in the quotes) and inside XML comments. These three types of data account for a very large portion of characters in any XML document, leaving behind mainly names of tags and of attributes. It might be worth it to rewrite the switch statement, or move the optimisations into the switch statement itself.

This is to ensure that the benchmarking isn't skewed towards small attribute values, and documents without significant portions of text, which it was before xmppjs#126

mogsie · 2018-09-30T18:20:34Z

Rebased off master too.

sonnyp · 2018-10-03T20:51:37Z

https://github.com/xmppjs/ltx/releases/tag/v2.8.0

mogsie force-pushed the performance-parser branch from 5ca5cdd to b3c32c1 Compare September 28, 2018 13:19

Erik Mogensen added 2 commits September 30, 2018 20:20

Changed benchmark to reflect larger text/attributes

dc90ab2

This is to ensure that the benchmarking isn't skewed towards small attribute values, and documents without significant portions of text, which it was before xmppjs#126

mogsie force-pushed the performance-parser branch from 01da15d to dc90ab2 Compare September 30, 2018 18:20

sonnyp self-requested a review September 30, 2018 20:34

sonnyp approved these changes Sep 30, 2018

View reviewed changes

sonnyp merged commit 768df16 into xmppjs:master Sep 30, 2018

mogsie deleted the performance-parser branch October 1, 2018 14:36

tm1000 mentioned this pull request Oct 16, 2018

PR #126 should have been in 3.0 #129

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve performance of parsing #126

Improve performance of parsing #126

mogsie commented Sep 26, 2018 •

edited

Loading

sonnyp commented Sep 29, 2018

sonnyp commented Sep 29, 2018

mogsie commented Sep 29, 2018

mogsie commented Sep 30, 2018

sonnyp commented Sep 30, 2018

mogsie commented Sep 30, 2018

mogsie commented Sep 30, 2018

sonnyp commented Oct 3, 2018

Improve performance of parsing #126

Improve performance of parsing #126

Conversation

mogsie commented Sep 26, 2018 • edited Loading

sonnyp commented Sep 29, 2018

sonnyp commented Sep 29, 2018

mogsie commented Sep 29, 2018

mogsie commented Sep 30, 2018

sonnyp commented Sep 30, 2018

mogsie commented Sep 30, 2018

mogsie commented Sep 30, 2018

sonnyp commented Oct 3, 2018

mogsie commented Sep 26, 2018 •

edited

Loading