Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use glibc strlen to speed up xmlStrlen #2144

Merged
merged 1 commit into from
Dec 28, 2020

Conversation

ilyazub
Copy link
Contributor

@ilyazub ilyazub commented Dec 23, 2020

What problem is this PR intended to solve?

This PR speeds up Nokogiri by speeding up xmlStrlen. It'd be better to reduce the amount of xmlStrlen calls from Nokogiri, but this patch is an easy win.

Our parsers sped up from 1.4 seconds to 1 second on big HTML documents with this patch.

xmlStrlen benchmark results:

xmlStrlen (entire HTML file): 926171.936981 μs
glibc_xmlStrlen (entire HTML file): 36905.903992 μs
delta (xmlStrlen ÷ glibc_xmlStrlen): 25.094584 times

xmlStrlen (average string): 57479.204010 μs
glibc_xmlStrlen (average string): 5802.069000 μs
delta (xmlStrlen ÷ glibc_xmlStrlen): 9.905937 times

xmlStrlen (bigger string): 388056.315979 μs
glibc_xmlStrlen (bigger string): 12797.856995 μs
delta (xmlStrlen ÷ glibc_xmlStrlen): 30.318382 times

xmlStrlen (smallest string): 15870.046021 μs
glibc_xmlStrlen (smallest string): 6282.208984 μs
delta (xmlStrlen ÷ glibc_xmlStrlen): 2.527903 times
Benchmark code

big_html.zip

#include <string.h>
#include <stdio.h>
#include <time.h>

#include "libxml/HTMLtree.h"
#include "libxml/xpath.h"

double time_in_microsec()
{
  struct timespec tv;
  clock_gettime(CLOCK_MONOTONIC_RAW, &tv);

  return (double) (tv.tv_sec * 1000000 + tv.tv_nsec / 1000.0);
}

double elapsed_ms_since(double start)
{
  return time_in_microsec() - start;
}

int glibc_xmlStrlen(const xmlChar *str) {
    if (str == NULL) return(0);

    return strlen((const char*)str);
}

int main(int argc, char *argv[])
{
  FILE *fp = fopen ("big_html.html", "rb");
  char *buffer = NULL;
  size_t len;
  ssize_t bytes_read = getdelim(&buffer, &len, '\0', fp);

  const int N = 1000000;

  double start_time, prev_start_time;

  prev_start_time = start_time;
  start_time = time_in_microsec();

  for (int j = 0; j < 1000; j++) {
    xmlStrlen(buffer);
  }

  prev_start_time = start_time;

  start_time = time_in_microsec();

  for (int j = 0; j < 1000; j++) {
    glibc_xmlStrlen(buffer);
  }

  fprintf(stderr, "xmlStrlen (entire HTML file): %f μs\n", elapsed_ms_since(prev_start_time));
  fprintf(stderr, "glibc_xmlStrlen (entire HTML file): %f μs\n", elapsed_ms_since(start_time));
  fprintf(stderr, "delta (xmlStrlen ÷ glibc_xmlStrlen): %f times\n\n", elapsed_ms_since(prev_start_time) / elapsed_ms_since(start_time));

  xmlChar *str = ".//*[contains(concat(' ', normalize-space(@class), ' '), ' p7n7Ze ') and not(.//a)]";

  prev_start_time = start_time;
  start_time = time_in_microsec();

  for (int j = 0; j < N; j++) {
    xmlStrlen(str);
  }

  prev_start_time = start_time;

  start_time = time_in_microsec();

  for (int j = 0; j < N; j++) {
    glibc_xmlStrlen(str);
  }

  fprintf(stderr, "xmlStrlen (average string): %f μs\n", elapsed_ms_since(prev_start_time));
  fprintf(stderr, "glibc_xmlStrlen (average string): %f μs\n", elapsed_ms_since(start_time));
  fprintf(stderr, "delta (xmlStrlen ÷ glibc_xmlStrlen): %f times\n\n", elapsed_ms_since(prev_start_time) / elapsed_ms_since(start_time));

  str = ".//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ') and not(contains(concat(' ', normalize-space(@class), ' '), ' rOwove ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' TxCHDf '))][position() = 1] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae '))][position() = 2] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' Fxxvzc ')) and not(.//span) and not(.//div)] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' dWRflb ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' p7n7Ze ') and not(.//a)]";

  prev_start_time = start_time;
  start_time = time_in_microsec();

  for (int j = 0; j < N; j++) {
    xmlStrlen(str);
  }

  prev_start_time = start_time;
  start_time = time_in_microsec();

  for (int j = 0; j < N; j++) {
    glibc_xmlStrlen(str);
  }

  fprintf(stderr, "xmlStrlen (bigger string): %f μs\n", elapsed_ms_since(prev_start_time));
  fprintf(stderr, "glibc_xmlStrlen (bigger string): %f μs\n", elapsed_ms_since(start_time));
  fprintf(stderr, "delta (xmlStrlen ÷ glibc_xmlStrlen): %f times\n\n", elapsed_ms_since(prev_start_time) / elapsed_ms_since(start_time));

  str = ".//*[not(.//a)]";

  prev_start_time = start_time;
  start_time = time_in_microsec();

  for (int j = 0; j < N; j++) {
    xmlStrlen(str);
  }

  prev_start_time = start_time;
  start_time = time_in_microsec();

  for (int j = 0; j < N; j++) {
    glibc_xmlStrlen(str);
  }

  fprintf(stderr, "xmlStrlen (smallest string): %f μs\n", elapsed_ms_since(prev_start_time));
  fprintf(stderr, "glibc_xmlStrlen (smallest string): %f μs\n", elapsed_ms_since(start_time));
  fprintf(stderr, "delta (xmlStrlen ÷ glibc_xmlStrlen): %f times\n\n", elapsed_ms_since(prev_start_time) / elapsed_ms_since(start_time));
}
Makefile
CFLAGS=$(shell xml2-config --cflags)
LDFLAGS=$(shell xml2-config --libs)

slow_parsing_benchmark: slow_parsing_benchmark.o
	$(CC) -O2 -o slow_parsing_benchmark slow_parsing_benchmark.o $(LDFLAGS)

slow_parsing_benchmark.o: slow_parsing_benchmark.c
	$(CC) $(CFLAGS) -c -o slow_parsing_benchmark.o slow_parsing_benchmark.c

clean:
	rm -f slow_parsing_benchmark.o slow_parsing_benchmark slow_parsing_benchmark_v2
Sample program that leads to a lot of xmlStrlen calls and summary of CPU profiles of its execution

I can't share the entire parser because it's not open source. Our usage is similar except that instead of ten same searches we have multiple doc.css and doc.at_css calls with different selectors. CSS selectors in doc.css and doc.at_css are translated to XPath by Nokogiri internally.

big_html.zip

require "nokogiri"

html = File.read("big_html.html")

doc = Nokogiri::HTML.parse(html)

10.times do
  doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each do |sh_r|
    10.times do
      sh_r.at_css(".na4ICd:not(.rOwove):not(.TxCHDf):nth-of-type(1), .hBUZL:not(.Rv2Cae):nth-of-type(2), .hBUZL:not(.Rv2Cae):not(.Fxxvzc):not(:has(span)):not(:has(div)), .dWRflb, .p7n7Ze:not(:has(a))")
    end
  end
end

Command to execute the sample program with gperftools.

env LD_PRELOAD=/usr/local/lib/libprofiler.so CPUPROFILE=tmp/big_html_nokogiri.prof ./tmp/slow_search_parse.rb

CPU profile before usage of strlen in xmlStrlen.

$ pprof -top tmp/big_html_nokogiri.prof* | head -30
Showing nodes accounting for 1530ms, 58.40% of 2620ms total
Dropped 139 nodes (cum <= 13.10ms)
Showing top 10 nodes out of 180
      flat  flat%   sum%        cum   cum%
     410ms 15.65% 15.65%      410ms 15.65%  atomic_sub_nounderflow (inline)
     360ms 13.74% 29.39%      370ms 14.12%  objspace_malloc_increase.constprop.0
     150ms  5.73% 35.11%      150ms  5.73%  xmlStrlen
     130ms  4.96% 40.08%     1880ms 71.76%  xmlXPathCompOpEval
     130ms  4.96% 45.04%     1870ms 71.37%  xmlXPathNodeCollectAndTest
     110ms  4.20% 49.24%      120ms  4.58%  _int_free
      70ms  2.67% 51.91%      120ms  4.58%  _int_malloc
      60ms  2.29% 54.20%      100ms  3.82%  musable (inline)
      60ms  2.29% 56.49%      610ms 23.28%  ruby_xmalloc
      50ms  1.91% 58.40%       50ms  1.91%  xmlStrdup

CPU profile after usage of strlen in xmlStrlen.

$ pprof -top tmp/big_html_nokogiri.prof* | head -30
Showing nodes accounting for 2300ms, 70.55% of 3260ms total
Dropped 158 nodes (cum <= 16.30ms)
Showing top 20 nodes out of 200
      flat  flat%   sum%        cum   cum%
     490ms 15.03% 15.03%      490ms 15.03%  atomic_sub_nounderflow (inline)
     400ms 12.27% 27.30%      400ms 12.27%  objspace_malloc_increase.constprop.0
     210ms  6.44% 33.74%      250ms  7.67%  _int_free
     140ms  4.29% 38.04%     2610ms 80.06%  xmlXPathNodeCollectAndTest
     130ms  3.99% 42.02%     2620ms 80.37%  xmlXPathCompOpEval
     110ms  3.37% 45.40%      110ms  3.37%  _init
      90ms  2.76% 48.16%      120ms  3.68%  musable (inline)
      80ms  2.45% 50.61%      140ms  4.29%  _int_malloc
      80ms  2.45% 53.07%      790ms 24.23%  ruby_xmalloc
      70ms  2.15% 55.21%      190ms  5.83%  malloc
      70ms  2.15% 57.36%       70ms  2.15%  valuePush
      60ms  1.84% 59.20%       60ms  1.84%  xmlStrdup
      50ms  1.53% 60.74%       50ms  1.53%  tcache_get (inline)
      50ms  1.53% 62.27%       50ms  1.53%  tcache_put (inline)
      50ms  1.53% 63.80%       50ms  1.53%  valuePop 
      50ms  1.53% 65.34%      250ms  7.67%  xmlStrndup
      50ms  1.53% 66.87%       50ms  1.53%  xmlXPathNextAttribute
      40ms  1.23% 68.10%       40ms  1.23%  free
      40ms  1.23% 69.33%       40ms  1.23%  xmlXPathNextDescendant
      40ms  1.23% 70.55%      180ms  5.52%  xmlXPathNodeSetAddUnique

Here are callgrind-style graphs of CPU profiles of the sample program before and after xmlStrlen patch: before.svg, after.svg.

Have you included adequate test coverage?

No.

Does this change affect the behavior of either the C or the Java implementations?

No. Only the performance of C implementation.


See https://gitlab.gnome.org/GNOME/libxml2/-/issues/212 and #2133 for reference.

@flavorjones
Copy link
Member

@ilyazub Thanks for submitting this! The CI failures aren't because of this PR, it's because your base commit is from before I merged some packaging changes. I'm updating CI now and it should go green shortly.

@flavorjones
Copy link
Member

@ilyazub OK, there's something strange going on with Concourse's PR resource and I think the easiest thing to do will be to have you rebase this PR onto master. (I'd do it myself by don't have permissions on your branch.) Would you mind rebasing?

@flavorjones
Copy link
Member

Also please see my note at https://gitlab.gnome.org/GNOME/libxml2/-/issues/212 -- I'm not detecting any noticeable performance improvement by introducing this change. Can you help me write a benchmark that will demonstrate that?

Here's a benchmark skeleton:

# Makefile
cflags="-I/usr/include/libxml2"
ldflags="-lxml2"

all: xpath-benchmark

xpath-benchmark: xpath-benchmark.c
	gcc xpath-benchmark.c $(cflags) $(ldflags) -o xpath-benchmark

clean:
	rm -f *.o xpath-benchmark

PHONY: clean
// xpath-benchmark.c
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

#include "libxml/parserInternals.h"
#include "libxml/xpath.h"
#include "libxml/xpathInternals.h"

long time_in_ms()
{
  struct timeval  tv;
  gettimeofday(&tv, NULL);

  return (tv.tv_sec) * 1000 + (tv.tv_usec) / 1000.0 ;
}

long elapsed_ms_since(long start)
{
  return time_in_ms() - start ;
}

int main(int argc, char** argv)
{
  long start_time ;
  int loop = 10000 ;

  if (argc != 3) {
    fprintf(stderr, "USAGE: %s <html_file> <class_name>\n", argv[0]);
    exit(2);
  }

  char* filename = argv[1];
  char* query = argv[2];

  htmlDocPtr doc = htmlReadFile(filename, NULL, 0);
  if (doc == NULL) {
    exit(1);
  }

  xmlXPathInit();
  xmlXPathContextPtr ctx = xmlXPathNewContext(doc);

  start_time = time_in_ms();
  fprintf(stdout, "searching '%s' with '%s' %d times\n", filename, query, loop);
  for (int count = 0; count < loop; ++count) {
    xmlXPathObjectPtr xpath = xmlXPathEvalExpression(query, ctx);
    if (xpath == NULL) {
      fprintf(stderr, "xpath null for query '%s'\n", query);
      exit(1);
    }
    if (count == 0 && xpath->type == XPATH_NODESET) {
      fprintf(stdout, "NODESET with %d results\n", xpath->nodesetval->nodeNr);
    }
  }

  fprintf(stdout, "%ld ms\n", elapsed_ms_since(start_time));
}

and you run it like:

xpath-benchmark /path/to/file.html "//*[@attr='this is your xpath query']"

and it will output something like:

searching '../2133-deleteme-search-is-slow/pc_game_big_shopping.html' with '//*[contains(concat(' ',@class,' '), ' JdXsAe ')]' 10000 times
NODESET with 1 results
50627 ms

Is that helpful for you to generate some benchmark results about this change?

@codeclimate
Copy link

codeclimate bot commented Dec 24, 2020

Code Climate has analyzed commit 7ffd48a and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (80% is the threshold).

This pull request will bring the total coverage in the repository to 94.1% (0.0% change).

View more on Code Climate.

@ilyazub
Copy link
Contributor Author

ilyazub commented Dec 24, 2020

@flavorjones Yes, the benchmark skeleton is helpful for me. I have only added CFLAGS=-O2 and changed the number of iterations from 10,000 to 1,000 because the execution time of 10,000 iterations for my query was about 10 minutes.

Benchmark results

Before the patch

$ ./slow_parsing_benchmark ./pc_game_big_shopping.html "//*[contains(concat(' ',@class,' '), ' sh-dlr__list-result ')]//*[contains(concat(' ', @class, ' '), ' hBUZL ')]"
searching './pc_game_big_shopping.html' with '//*[contains(concat(' ',@class,' '), ' sh-dlr__list-result ')]//*[contains(concat(' ', @class, ' '), ' hBUZL ')]' 1000 times
NODESET with 260 results
67855 ms

After the patch

$ ./slow_parsing_benchmark ./pc_game_big_shopping.html "//*[contains(concat(' ',@class,' '), ' sh-dlr__list-result ')]//*[contains(concat(' ', @class, ' '), ' hBUZL ')]"
searching './pc_game_big_shopping.html' with '//*[contains(concat(' ',@class,' '), ' sh-dlr__list-result ')]//*[contains(concat(' ', @class, ' '), ' hBUZL ')]' 1000 times
NODESET with 260 results
59767 ms

10% speedup with a patch. I'm not sure why it runs faster with the patch. The flame graph of the sample program from the original post looks the same for nokogiri on master and on the branch of this PR. xmlXPathCompOpEval is dominating before and after the patch.

Flame graph of the Nokogiri program

Flame graph focused on evaluate

image

Flame graph of the entire program

image

Flamegraph of libxml2 benchmark in C with 1 iteration

Before the patch, xmlStrlen was taking more time, but the overall difference not that big.

Overall program

image

Focused on the first xmlXPathCompOpEval call

image

In both cases, xmlXPathNodeCollectAndTest and xmlXPathNodeSetMergeAndClear are taking most of the time of xmlXPathCompOpEval.

image

I'm looking into xmlXPathCompOpEval to understand what it does. The performance of xmlXPathNodeSetMergeAndClear was discussed in #1861.

@flavorjones
Copy link
Member

Thanks for the further research.

Let's talk for a moment about how to pitch this change upstream -- making this change in libxml2 is the ideal outcome I'd like to drive towards. I think it's confusing to talk to Nick about Nokogiri and CSS; but I think if we emphasize your real-world use case and the XPath queries being implemented in C, your data above (10% speedup) demonstrates a good case.

Can I ask that you take the C benchmark and the timing results to that upstream issue to continue the conversation with Nick?

@ilyazub
Copy link
Contributor Author

ilyazub commented Dec 24, 2020

Can I ask that you take the C benchmark and the timing results to that upstream issue to continue the conversation with Nick?

Yes, that's what I'm working on now.

Edit. Added benchmark, timing results, and flame graphs to the upstream issue.

Benchmark results:

xmlStrlen (entire HTML file): 926171.936981 μs
glibc_xmlStrlen (entire HTML file): 36905.903992 μs
delta (xmlStrlen ÷ glibc_xmlStrlen): 25.094584 times

xmlStrlen (average string): 57479.204010 μs
glibc_xmlStrlen (average string): 5802.069000 μs
delta (xmlStrlen ÷ glibc_xmlStrlen): 9.905937 times

xmlStrlen (bigger string): 388056.315979 μs
glibc_xmlStrlen (bigger string): 12797.856995 μs
delta (xmlStrlen ÷ glibc_xmlStrlen): 30.318382 times

xmlStrlen (smallest string): 15870.046021 μs
glibc_xmlStrlen (smallest string): 6282.208984 μs
delta (xmlStrlen ÷ glibc_xmlStrlen): 2.527903 times

See https://gitlab.gnome.org/GNOME/libxml2/-/issues/212 for reference.
@flavorjones flavorjones merged commit deb1b62 into sparklemotion:master Dec 28, 2020
flavorjones added a commit that referenced this pull request Dec 28, 2020
@flavorjones
Copy link
Member

@ilyazub Merged! Will be in v1.11.0.rc4. Thanks for your contribution!

@flavorjones flavorjones added this to the v1.11.0 milestone Dec 28, 2020
@ilyazub ilyazub deleted the xmlStrlen-patch branch December 29, 2020 09:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants