[bug] Nokogiri::XML::Searchable#at_css is 3-4 times slower on 2 MB HTML files compared to 640 KB files #2133
Comments
👋 Thanks for opening this up, and for looking at #1861 and identifying key differences. My assumption based on a brief scan of what you've posted here is that this is a performance issue with libxml2 and will need to be addressed upstream. I'll probably spend some time tonight reproducing some of your perf analysis to make sure I understand what's going on. The next few weeks of my time are going to be spent focused on getting v1.11.0 shipped, but once I've done that I can probably help put together a reproduction that uses the libxml2 C API (and not Nokogiri).
@ilyazub Hi, I'm looking into this and I'm a little confused. Generally, I'm seeing that a document that is about three times larger takes about three times longer to search. That doesn't seem unexpected to me; can you help me understand what I'm missing? I'll show my work and hope that you can tell me what we're doing differently. First, here's the benchmark script I'm running (against the two files you attached in the description):

```ruby
#! /usr/bin/env ruby

require "nokogiri"
require "yaml"

gem "benchmark-ips"
require "benchmark/ips"

puts Nokogiri::VERSION_INFO.to_yaml

big_doc = Nokogiri::HTML.parse(File.read("pc_game_big_shopping.html"))
small_doc = Nokogiri::HTML.parse(File.read("pc_game_small_shopping.html"))

Benchmark.ips do |x|
  x.report("big doc css") do
    big_doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each do |sh_r|
      sh_r.at_css(".na4ICd:not(.rOwove):not(.TxCHDf):nth-of-type(1), .hBUZL:not(.Rv2Cae):nth-of-type(2), .hBUZL:not(.Rv2Cae):not(.Fxxvzc):not(:has(span)):not(:has(div)), .dWRflb, .p7n7Ze:not(:has(a))")
    end
  end
  x.report("small doc css") do
    small_doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each do |sh_r|
      sh_r.at_css(".na4ICd:not(.rOwove):not(.TxCHDf):nth-of-type(1), .hBUZL:not(.Rv2Cae):nth-of-type(2), .hBUZL:not(.Rv2Cae):not(.Fxxvzc):not(:has(span)):not(:has(div)), .dWRflb, .p7n7Ze:not(:has(a))")
    end
  end
end
```

Here's what it emits:
Running that same script with gperftools and generating a report with `pprof` doesn't show the bottleneck you described.
Sorry, just to clarify: I see that…
@flavorjones Yes, you're right. I've chosen the wrong words. I thought that amount of calls of I'm adapting your Gist from #1861 (comment) to my situation: #include <string.h>
#include <stdio.h>
#include <time.h>
#include <libxml/HTMLtree.h>
#include <libxml/xpath.h>
double time_in_microsec()
{
struct timespec tv;
clock_gettime(CLOCK_MONOTONIC_RAW, &tv);
return (double) (tv.tv_sec * 1000000 + tv.tv_nsec / 1000.0);
}
double elapsed_ms_since(double start)
{
return time_in_microsec() - start;
}
void main()
{
htmlDocPtr doc;
xmlXPathContextPtr ctx;
xmlChar *queries[] = {
".//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dlr__list-result ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dgr__grid-result ')]",
".//*[contains(concat(' ', normalize-space(@class), ' '), ' eIuuYe ')]//a | .//a[contains(concat(' ', normalize-space(@class), ' '), ' EI11Pd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' AGVhpb ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' GyDBsd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' VQN8fd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' VZTCjd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' REX1ub ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' sHaywe ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' eaGTj ')]",
".//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ') and not(contains(concat(' ', normalize-space(@class), ' '), ' rOwove ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' TxCHDf '))][position() = 1] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae '))][position() = 2] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' Fxxvzc ')) and not(.//span) and not(.//div)] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' dWRflb ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' p7n7Ze ') and not(.//a)]",
".//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ')][position() = 3]//span | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' Y6Fvnc '))][position() = 3]//span | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and contains(concat(' ', normalize-space(@class), ' '), ' Fxxvzc ')]//span | .//*[contains(concat(' ', normalize-space(@class), ' '), ' p7n7Ze ')]//*[contains(concat(' ', normalize-space(@class), ' '), ' F7Kwhf ')]//span"
};
size_t query_len = 4;
double start_time ;
doc = htmlReadFile("pc_game_big_shopping.html", NULL, 0);
// ctx = XPathContext.new(node)
xmlXPathInit();
ctx = xmlXPathNewContext(doc);
for (int j = 0; j < query_len; j++) {
start_time = time_in_microsec();
// 100 iterations to emulate bigger amount of real-world searches in document
for (int n = 0; n < 100; n++) {
// ctx.evaluate(path, handler)
xmlXPathEvalExpression(queries[j], ctx);
}
fprintf(stderr, "%-85.85s %f μs\n", queries[j], elapsed_ms_since(start_time));
}
} This results in the not that big difference between big and small document comparing to pc_game_small_shopping.html
.//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dlr__list-result ')] | 377.167007 μs
.//*[contains(concat(' ', normalize-space(@class), ' '), ' eIuuYe ')]//a | .//a[conta 1838.653992 μs
.//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ') and not(contains 2054.356003 μs
.//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ')][position() = 3] 1539.781998 μs
pc_game_big_shopping.html
.//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dlr__list-result ')] | 356.565002 μs
.//*[contains(concat(' ', normalize-space(@class), ' '), ' eIuuYe ')]//a | .//a[conta 2671.336014 μs
.//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ') and not(contains 2323.917999 μs
.//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ')][position() = 3] 1704.442993 μs I guess this issue could be closed as the problem is in libxml2. |
@flavorjones I've found a way to describe what's wrong.

TL;DR: In the big document, `xmlStrlen` is called 15K times for the same string. In the small document, it's called 5K times. A three-times-bigger document means three times more calls, but the strings are the same. My assumption, as a person who doesn't know C, is that a repeated string pointer address means the function was called with the same argument.

Here's a small Ruby script that searches for an element in the container:

```ruby
require "nokogiri"

html = File.read(ARGV[0])
doc = Nokogiri::HTML.parse(html)

doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each { |sh_r| sh_r.at_css(".na4ICd:not(.rOwove):not(.TxCHDf):nth-of-type(1), .hBUZL:not(.Rv2Cae):nth-of-type(2), .hBUZL:not(.Rv2Cae):not(.Fxxvzc):not(:has(span)):not(:has(div)), .dWRflb, .p7n7Ze:not(:has(a))") }
```

I recompiled Nokogiri with debug information:

```
CFLAGS="-O -ggdb3 -pipe -fno-omit-frame-pointer" gem install nokogiri
```

Then set a breakpoint on `xmlStrlen` via gdb to find the source line to instrument:

```
bundle exec gdb -q -ex 'set breakpoint pending on' --ex 'thread apply all bt full' --ex 'b xmlStrlen' --ex run --args ruby ./tmp/slow_search_parse.rb tmp/pc_game_big_shopping.html
```

Then attached a `dprintf` that logs the string pointer on every call:

```
bundle exec gdb -q -ex 'set breakpoint pending on' --ex 'thread apply all bt full' --ex 'dprintf xmlstring.c:425, "str: %p\n", str' --ex run --ex quit --args ruby ./tmp/slow_search_parse.rb tmp/pc_game_big_shopping.html |& tee xmlStrlen_big.log
bundle exec gdb -q -ex 'set breakpoint pending on' --ex 'thread apply all bt full' --ex 'dprintf xmlstring.c:425, "str: %p\n", str' --ex run --ex quit --args ruby ./tmp/slow_search_parse.rb tmp/pc_game_small_shopping.html |& tee xmlStrlen_small.log
```

Then examined the logs:

```
$ grep str: xmlStrlen_big.log | sort | uniq -c | sort -rn | head -20
  15446 str: 0x112e760
  15405 str: 0x112e6d0
  13405 str: 0x15ddb80
   4006 str: 0x7fffe6c615ad
   1962 str: 0x112e740
    381 str: 0x7fffe69de010
    374 str: 0x15b5520
    321 str: 0x1649d60
    306 str: 0x174c7a0
    306 str: 0x1734430
    306 str: 0x15db410
    306 str: 0x15b63b0
    303 str: 0x174c830
    303 str: 0x15db4a0
    303 str: 0x15b64a0
    298 str: 0x161a8b0
    297 str: 0x16e3fa0
    291 str: 0x160fd50
    287 str: 0x163c320
    285 str: 0x16feef0

$ grep str: xmlStrlen_small.log | sort | uniq -c | sort -rn | head -20
   5046 str: 0xf3fbc0
   5038 str: 0xf3fcb0
   1687 str: 0xf40440
    906 str: 0xf3fc30
    806 str: 0x7fffe6c615ad
    431 str: 0xfe0460
    420 str: 0x11b53e0
    384 str: 0x1067bb0
    353 str: 0xfdc760
    328 str: 0x11e5ab0
    322 str: 0x11e5a40
    306 str: 0x11ccea0
    303 str: 0x11ccf30
    303 str: 0x11bc040
    301 str: 0x11a57d0
    291 str: 0x11e7740
    289 str: 0x1200a70
    288 str: 0x11d5660
    284 str: 0x11e6d80
    284 str: 0x11c3fd0
```

I have a simple idea about caching the length of strings in memory.
@ilyazub I'm looking at the queries you're making, and my intuition is telling me that the XPath string functions are going to end up calling `xmlStrlen` repeatedly. I'd like to close this -- it seems like libxml2 behavior regardless -- though you can continue to post here if it's helpful or if you think there's something we can do inside Nokogiri to impact this. One performance-related thing occurred to me while reviewing your comments, which is that Nokogiri could register a C function that will more efficiently look for CSS class matches. The combination of `contains()`, `concat()`, and `normalize-space()` generated for every class selector does a lot of string work that a single purpose-built function could avoid.
See #2135 for followup performance work.
Just an update from #2135 and the resulting PR at #2137. I'm seeing a big improvement in performance. With v1.11.0.rc3:
But on the feature branch it's ~2x faster:
This optimization reduces calls to `xmlStrlen`. Anyway, thanks again for poking at this -- I would not have seen the opportunity for a 2x performance improvement without it!
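To see in plain Ruby why the optimization cuts down string work, here's a hedged sketch contrasting the two matching strategies (both method names and the semantics mapping are illustrative, not Nokogiri's actual code): the XPath-string approach rebuilds and substring-scans a padded class string on every check, while a tokenized check compares tokens directly.

```ruby
# Mimics contains(concat(' ', normalize-space(@class), ' '), ' token '):
# normalize whitespace, pad with spaces, then substring-scan.
def xpath_style_match?(class_attr, token)
  padded = " #{class_attr.to_s.split(/\s+/).join(' ')} "
  padded.include?(" #{token} ")
end

# Token-style check: split the attribute once and compare tokens.
def token_style_match?(class_attr, token)
  class_attr.to_s.split(/\s+/).include?(token)
end
```

Both return the same answers for the same inputs; the difference is that the first allocates and scans intermediate strings per check, which is where calls like `xmlStrlen` pile up at the C level.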
@flavorjones Thanks for your work! That's the second issue I've opened, and I really enjoyed the results and the communication with you.
Just for reference, I've opened an issue in the upstream `libxml2` repository as well.
small_html.zip
big_html.zip
Help us reproduce what you're seeing

`Nokogiri::XML::Searchable#at_css` is 3-4 times slower on 2 MB HTML files compared to 640 KB files. I thought it was related to #1861, but in my case, `xmlStrlen` of `libxml2` is the bottleneck. Below is a small reproduction script with `pprof` output on top of it. I'd happily open an issue in the `libxml2` repository, but I wasn't able to create a reproducible example for `libxml2` specifically, so in the example I've used Nokogiri.

`xmlStrlen` is called too many times for the big HTML file. For the small HTML file, `xmlStrlen` is not a bottleneck.

I've also tried to count the number of times `xmlStrlen` is called with unique strings. It seems that only 60 of 1.7M calls were with unique strings.

The rough experiment with counting unique `xmlStrlen` calls.

I also prepared a benchmark against the big and small HTML files that shows that `libxml2` is 3-4 times slower on big files.

Results of the benchmark.
Expected behavior
Environment
Additional context