Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of parsing #126

Merged
merged 2 commits into from
Sep 30, 2018
Merged

Conversation

mogsie
Copy link
Contributor

@mogsie mogsie commented Sep 26, 2018

Second half of #120:

Looping through strings in Javascript land is a lot slower than having
the JS engine do it natively. String's indexOf does this faster than
looping through each character.

This change deals with parsed data (i.e. the raw text between the
tags), attribute values (what's in the quotes) and inside XML
comments. These three types of data account for a very large portion
of characters in any XML document, leaving behind mainly names of tags
and of attributes.

It might be worth it to rewrite the switch statement, or move the
optimisations into the switch statement itself, but I focused on what
gave most bang for the buck within reason.

@sonnyp
Copy link
Member

sonnyp commented Sep 29, 2018

parsers suite:

before: ltx x 817,222 ops/sec ±1.94% (88 runs sampled)
after: ltx x 827,444 ops/sec ±1.26% (93 runs sampled)

Node.js v10.11.0 - Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz

@sonnyp
Copy link
Member

sonnyp commented Sep 29, 2018

@mogsie this looks good but the benchmark does not report a significant increase in performance. Do you have data that suggests otherwise?

@mogsie
Copy link
Contributor Author

mogsie commented Sep 29, 2018

Hmm, that's interesting. It obviously only has an effect on the test if

  • There is a fair amount of text (including whitespace) between the tags
  • There are any comments
  • There are attributes with relatively large values (e.g. 100 characters, typially URLs)

Then again, I did my initial tests on node 6 and node 8, nodejs might have optimized their way out of the problem, but if the document you're testing is <foo bar="baz">quux</foo> then it's not a surprise that there is no change, as it has no comments, almost no text and attributes.

I tried replacing the document with the example atom document from Wikipedia:

before: ltx x 28,432 ops/sec ±0.83% (91 runs sampled)
after: ltx x 33,382 ops/sec ±0.86% (89 runs sampled)

after:

suite XML parsers comparison
sax x 8,447 ops/sec ±1.25% (92 runs sampled)
node-xml x 7,407 ops/sec ±0.59% (89 runs sampled)
libxmljs x 7,865 ops/sec ±0.65% (89 runs sampled)
node-expat x 9,205 ops/sec ±0.49% (92 runs sampled)
ltx x 33,382 ops/sec ±0.86% (89 runs sampled)

Still more than three times as fast as node-expat.

@mogsie
Copy link
Contributor Author

mogsie commented Sep 30, 2018

I made a simpler change to show how master doesn't scale well with longer texts.

master, unchanged, on my machine (i7-2670QM @ 2.20GHz) w/node 9.11.2

suite XML parsers comparison
sax x 158,052 ops/sec ±1.39% (85 runs sampled)
node-xml x 130,975 ops/sec ±0.70% (91 runs sampled)
libxmljs x 131,653 ops/sec ±0.73% (93 runs sampled)
node-expat x 227,446 ops/sec ±0.42% (92 runs sampled)
ltx x 538,775 ops/sec ±0.99% (91 runs sampled)
Fastest is "ltx"

If I change the benchmark as follows, just lengthening the attribute and text value, the characteristics change:

diff --git a/benchmarks/parsers.js b/benchmarks/parsers.js
index 4ebee36..0a21b8a 100644
--- a/benchmarks/parsers.js
+++ b/benchmarks/parsers.js
@@ -64,7 +64,7 @@ var suite = new benchmark.Suite('XML parsers comparison')
 parsers.forEach(function (parser) {
   parser.parse('<r>')
   suite.add(parser.name, function () {
-    parser.parse('<foo bar="baz">quux</foo>')
+    parser.parse('<foo bar="urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6">urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6 urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6</foo>')
   })
 })

suite XML parsers comparison
sax x 79,411 ops/sec ±0.58% (93 runs sampled)
node-xml x 106,060 ops/sec ±0.88% (90 runs sampled)
libxmljs x 118,203 ops/sec ±0.42% (94 runs sampled)
node-expat x 179,483 ops/sec ±0.37% (94 runs sampled)
ltx x 283,425 ops/sec ±0.84% (92 runs sampled)
Fastest is "ltx"

All tests slow down, some more significantly than others. ltx from 538k to 283k. This PR causes the slowdown to go from 538k to 489k

  • original
    ltx x 538,775 ops/sec ±0.99% (91 runs sampled)
  • master w/ longer texts
    ltx x 283,425 ops/sec ±0.84% (92 runs sampled)
  • pr 126 w/longer texts
    ltx x 489,294 ops/sec ±0.37% (89 runs sampled)

Do you want me to add the longer texts to the benchmark in this PR?

@sonnyp
Copy link
Member

sonnyp commented Sep 30, 2018

Do you want me to add the longer texts to the benchmark in this PR?

Yes please.

👍 good job!

mogsie pushed a commit to mogsie/ltx that referenced this pull request Sep 30, 2018
This is to ensure that the benchmarking isn't skewed towards small
attribute values, and documents without significant portions of text,
which it was before xmppjs#126
@mogsie
Copy link
Contributor Author

mogsie commented Sep 30, 2018

Done!

Erik Mogensen added 2 commits September 30, 2018 20:20
Looping through strings in Javascript land is a lot slower than having
the JS engine do it natively.  String's indexOf does this faster than
looping through each character.

This change deals with parsed data (i.e. the raw text between the
tags), attribute values (what's in the quotes) and inside XML
comments.  These three types of data account for a very large portion
of characters in any XML document, leaving behind mainly names of tags
and of attributes.

It might be worth it to rewrite the switch statement, or move the
optimisations into the switch statement itself.
This is to ensure that the benchmarking isn't skewed towards small
attribute values, and documents without significant portions of text,
which it was before xmppjs#126
@mogsie
Copy link
Contributor Author

mogsie commented Sep 30, 2018

Rebased off master too.

@sonnyp sonnyp self-requested a review September 30, 2018 20:34
@sonnyp sonnyp merged commit 768df16 into xmppjs:master Sep 30, 2018
@mogsie mogsie deleted the performance-parser branch October 1, 2018 14:36
@sonnyp
Copy link
Member

sonnyp commented Oct 3, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

2 participants