Use glibc strlen to speed up xmlStrlen #2144

ilyazub · 2020-12-23T15:10:34Z

What problem is this PR intended to solve?

This PR speeds up Nokogiri by speeding up xmlStrlen. It'd be better to reduce the amount of xmlStrlen calls from Nokogiri, but this patch is an easy win.

Our parsers sped up from 1.4 seconds to 1 second on big HTML documents with this patch.

xmlStrlen benchmark results:

xmlStrlen (entire HTML file): 926171.936981 μs
glibc_xmlStrlen (entire HTML file): 36905.903992 μs
delta (xmlStrlen ÷ glibc_xmlStrlen): 25.094584 times

xmlStrlen (average string): 57479.204010 μs
glibc_xmlStrlen (average string): 5802.069000 μs
delta (xmlStrlen ÷ glibc_xmlStrlen): 9.905937 times

xmlStrlen (bigger string): 388056.315979 μs
glibc_xmlStrlen (bigger string): 12797.856995 μs
delta (xmlStrlen ÷ glibc_xmlStrlen): 30.318382 times

xmlStrlen (smallest string): 15870.046021 μs
glibc_xmlStrlen (smallest string): 6282.208984 μs
delta (xmlStrlen ÷ glibc_xmlStrlen): 2.527903 times

Benchmark code

big_html.zip

#include <string.h>
#include <stdio.h>
#include <time.h>

#include "libxml/HTMLtree.h"
#include "libxml/xpath.h"

double time_in_microsec()
{
  struct timespec tv;
  clock_gettime(CLOCK_MONOTONIC_RAW, &tv);

  return (double) (tv.tv_sec * 1000000 + tv.tv_nsec / 1000.0);
}

double elapsed_ms_since(double start)
{
  return time_in_microsec() - start;
}

int glibc_xmlStrlen(const xmlChar *str) {
    if (str == NULL) return(0);

    return strlen((const char*)str);
}

int main(int argc, char *argv[])
{
  FILE *fp = fopen ("big_html.html", "rb");
  char *buffer = NULL;
  size_t len;
  ssize_t bytes_read = getdelim(&buffer, &len, '\0', fp);

  const int N = 1000000;

  double start_time, prev_start_time;

  prev_start_time = start_time;
  start_time = time_in_microsec();

  for (int j = 0; j < 1000; j++) {
    xmlStrlen(buffer);
  }

  prev_start_time = start_time;

  start_time = time_in_microsec();

  for (int j = 0; j < 1000; j++) {
    glibc_xmlStrlen(buffer);
  }

  fprintf(stderr, "xmlStrlen (entire HTML file): %f μs\n", elapsed_ms_since(prev_start_time));
  fprintf(stderr, "glibc_xmlStrlen (entire HTML file): %f μs\n", elapsed_ms_since(start_time));
  fprintf(stderr, "delta (xmlStrlen ÷ glibc_xmlStrlen): %f times\n\n", elapsed_ms_since(prev_start_time) / elapsed_ms_since(start_time));

  xmlChar *str = ".//*[contains(concat(' ', normalize-space(@class), ' '), ' p7n7Ze ') and not(.//a)]";

  prev_start_time = start_time;
  start_time = time_in_microsec();

  for (int j = 0; j < N; j++) {
    xmlStrlen(str);
  }

  prev_start_time = start_time;

  start_time = time_in_microsec();

  for (int j = 0; j < N; j++) {
    glibc_xmlStrlen(str);
  }

  fprintf(stderr, "xmlStrlen (average string): %f μs\n", elapsed_ms_since(prev_start_time));
  fprintf(stderr, "glibc_xmlStrlen (average string): %f μs\n", elapsed_ms_since(start_time));
  fprintf(stderr, "delta (xmlStrlen ÷ glibc_xmlStrlen): %f times\n\n", elapsed_ms_since(prev_start_time) / elapsed_ms_since(start_time));

  str = ".//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ') and not(contains(concat(' ', normalize-space(@class), ' '), ' rOwove ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' TxCHDf '))][position() = 1] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae '))][position() = 2] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' Fxxvzc ')) and not(.//span) and not(.//div)] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' dWRflb ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' p7n7Ze ') and not(.//a)]";

  prev_start_time = start_time;
  start_time = time_in_microsec();

  for (int j = 0; j < N; j++) {
    xmlStrlen(str);
  }

  prev_start_time = start_time;
  start_time = time_in_microsec();

  for (int j = 0; j < N; j++) {
    glibc_xmlStrlen(str);
  }

  fprintf(stderr, "xmlStrlen (bigger string): %f μs\n", elapsed_ms_since(prev_start_time));
  fprintf(stderr, "glibc_xmlStrlen (bigger string): %f μs\n", elapsed_ms_since(start_time));
  fprintf(stderr, "delta (xmlStrlen ÷ glibc_xmlStrlen): %f times\n\n", elapsed_ms_since(prev_start_time) / elapsed_ms_since(start_time));

  str = ".//*[not(.//a)]";

  prev_start_time = start_time;
  start_time = time_in_microsec();

  for (int j = 0; j < N; j++) {
    xmlStrlen(str);
  }

  prev_start_time = start_time;
  start_time = time_in_microsec();

  for (int j = 0; j < N; j++) {
    glibc_xmlStrlen(str);
  }

  fprintf(stderr, "xmlStrlen (smallest string): %f μs\n", elapsed_ms_since(prev_start_time));
  fprintf(stderr, "glibc_xmlStrlen (smallest string): %f μs\n", elapsed_ms_since(start_time));
  fprintf(stderr, "delta (xmlStrlen ÷ glibc_xmlStrlen): %f times\n\n", elapsed_ms_since(prev_start_time) / elapsed_ms_since(start_time));
}

Makefile

CFLAGS=$(shell xml2-config --cflags)
LDFLAGS=$(shell xml2-config --libs)

slow_parsing_benchmark: slow_parsing_benchmark.o
	$(CC) -O2 -o slow_parsing_benchmark slow_parsing_benchmark.o $(LDFLAGS)

slow_parsing_benchmark.o: slow_parsing_benchmark.c
	$(CC) $(CFLAGS) -c -o slow_parsing_benchmark.o slow_parsing_benchmark.c

clean:
	rm -f slow_parsing_benchmark.o slow_parsing_benchmark slow_parsing_benchmark_v2

Sample program that leads to a lot of xmlStrlen calls and summary of CPU profiles of its execution

I can't share the entire parser because it's not open source. Our usage is similar except that instead of ten same searches we have multiple doc.css and doc.at_css calls with different selectors. CSS selectors in doc.css and doc.at_css are translated to XPath by Nokogiri internally.

big_html.zip

require "nokogiri"

html = File.read("big_html.html")

doc = Nokogiri::HTML.parse(html)

10.times do
  doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each do |sh_r|
    10.times do
      sh_r.at_css(".na4ICd:not(.rOwove):not(.TxCHDf):nth-of-type(1), .hBUZL:not(.Rv2Cae):nth-of-type(2), .hBUZL:not(.Rv2Cae):not(.Fxxvzc):not(:has(span)):not(:has(div)), .dWRflb, .p7n7Ze:not(:has(a))")
    end
  end
end

Command to execute the sample program with gperftools.

env LD_PRELOAD=/usr/local/lib/libprofiler.so CPUPROFILE=tmp/big_html_nokogiri.prof ./tmp/slow_search_parse.rb

CPU profile before usage of strlen in xmlStrlen.

$ pprof -top tmp/big_html_nokogiri.prof* | head -30
Showing nodes accounting for 1530ms, 58.40% of 2620ms total
Dropped 139 nodes (cum <= 13.10ms)
Showing top 10 nodes out of 180
      flat  flat%   sum%        cum   cum%
     410ms 15.65% 15.65%      410ms 15.65%  atomic_sub_nounderflow (inline)
     360ms 13.74% 29.39%      370ms 14.12%  objspace_malloc_increase.constprop.0
     150ms  5.73% 35.11%      150ms  5.73%  xmlStrlen
     130ms  4.96% 40.08%     1880ms 71.76%  xmlXPathCompOpEval
     130ms  4.96% 45.04%     1870ms 71.37%  xmlXPathNodeCollectAndTest
     110ms  4.20% 49.24%      120ms  4.58%  _int_free
      70ms  2.67% 51.91%      120ms  4.58%  _int_malloc
      60ms  2.29% 54.20%      100ms  3.82%  musable (inline)
      60ms  2.29% 56.49%      610ms 23.28%  ruby_xmalloc
      50ms  1.91% 58.40%       50ms  1.91%  xmlStrdup

CPU profile after usage of strlen in xmlStrlen.

$ pprof -top tmp/big_html_nokogiri.prof* | head -30
Showing nodes accounting for 2300ms, 70.55% of 3260ms total
Dropped 158 nodes (cum <= 16.30ms)
Showing top 20 nodes out of 200
      flat  flat%   sum%        cum   cum%
     490ms 15.03% 15.03%      490ms 15.03%  atomic_sub_nounderflow (inline)
     400ms 12.27% 27.30%      400ms 12.27%  objspace_malloc_increase.constprop.0
     210ms  6.44% 33.74%      250ms  7.67%  _int_free
     140ms  4.29% 38.04%     2610ms 80.06%  xmlXPathNodeCollectAndTest
     130ms  3.99% 42.02%     2620ms 80.37%  xmlXPathCompOpEval
     110ms  3.37% 45.40%      110ms  3.37%  _init
      90ms  2.76% 48.16%      120ms  3.68%  musable (inline)
      80ms  2.45% 50.61%      140ms  4.29%  _int_malloc
      80ms  2.45% 53.07%      790ms 24.23%  ruby_xmalloc
      70ms  2.15% 55.21%      190ms  5.83%  malloc
      70ms  2.15% 57.36%       70ms  2.15%  valuePush
      60ms  1.84% 59.20%       60ms  1.84%  xmlStrdup
      50ms  1.53% 60.74%       50ms  1.53%  tcache_get (inline)
      50ms  1.53% 62.27%       50ms  1.53%  tcache_put (inline)
      50ms  1.53% 63.80%       50ms  1.53%  valuePop 
      50ms  1.53% 65.34%      250ms  7.67%  xmlStrndup
      50ms  1.53% 66.87%       50ms  1.53%  xmlXPathNextAttribute
      40ms  1.23% 68.10%       40ms  1.23%  free
      40ms  1.23% 69.33%       40ms  1.23%  xmlXPathNextDescendant
      40ms  1.23% 70.55%      180ms  5.52%  xmlXPathNodeSetAddUnique

Here are callgrind-style graphs of CPU profiles of the sample program before and after xmlStrlen patch: before.svg, after.svg.

Have you included adequate test coverage?

No.

Does this change affect the behavior of either the C or the Java implementations?

No. Only the performance of C implementation.

See https://gitlab.gnome.org/GNOME/libxml2/-/issues/212 and #2133 for reference.

flavorjones · 2020-12-23T16:45:09Z

@ilyazub Thanks for submitting this! The CI failures aren't because of this PR, it's because your base commit is from before I merged some packaging changes. I'm updating CI now and it should go green shortly.

flavorjones · 2020-12-23T17:09:29Z

@ilyazub OK, there's something strange going on with Concourse's PR resource and I think the easiest thing to do will be to have you rebase this PR onto master. (I'd do it myself by don't have permissions on your branch.) Would you mind rebasing?

flavorjones · 2020-12-23T17:57:42Z

Also please see my note at https://gitlab.gnome.org/GNOME/libxml2/-/issues/212 -- I'm not detecting any noticeable performance improvement by introducing this change. Can you help me write a benchmark that will demonstrate that?

Here's a benchmark skeleton:

# Makefile
cflags="-I/usr/include/libxml2"
ldflags="-lxml2"

all: xpath-benchmark

xpath-benchmark: xpath-benchmark.c
	gcc xpath-benchmark.c $(cflags) $(ldflags) -o xpath-benchmark

clean:
	rm -f *.o xpath-benchmark

PHONY: clean

// xpath-benchmark.c
#include <stdlib.h>
#include <stdio.h>
#include <sys/time.h>

#include "libxml/parserInternals.h"
#include "libxml/xpath.h"
#include "libxml/xpathInternals.h"

long time_in_ms()
{
  struct timeval  tv;
  gettimeofday(&tv, NULL);

  return (tv.tv_sec) * 1000 + (tv.tv_usec) / 1000.0 ;
}

long elapsed_ms_since(long start)
{
  return time_in_ms() - start ;
}

int main(int argc, char** argv)
{
  long start_time ;
  int loop = 10000 ;

  if (argc != 3) {
    fprintf(stderr, "USAGE: %s <html_file> <class_name>\n", argv[0]);
    exit(2);
  }

  char* filename = argv[1];
  char* query = argv[2];

  htmlDocPtr doc = htmlReadFile(filename, NULL, 0);
  if (doc == NULL) {
    exit(1);
  }

  xmlXPathInit();
  xmlXPathContextPtr ctx = xmlXPathNewContext(doc);

  start_time = time_in_ms();
  fprintf(stdout, "searching '%s' with '%s' %d times\n", filename, query, loop);
  for (int count = 0; count < loop; ++count) {
    xmlXPathObjectPtr xpath = xmlXPathEvalExpression(query, ctx);
    if (xpath == NULL) {
      fprintf(stderr, "xpath null for query '%s'\n", query);
      exit(1);
    }
    if (count == 0 && xpath->type == XPATH_NODESET) {
      fprintf(stdout, "NODESET with %d results\n", xpath->nodesetval->nodeNr);
    }
  }

  fprintf(stdout, "%ld ms\n", elapsed_ms_since(start_time));
}

and you run it like:

xpath-benchmark /path/to/file.html "//*[@attr='this is your xpath query']"

and it will output something like:

searching '../2133-deleteme-search-is-slow/pc_game_big_shopping.html' with '//*[contains(concat(' ',@class,' '), ' JdXsAe ')]' 10000 times
NODESET with 1 results
50627 ms

Is that helpful for you to generate some benchmark results about this change?

codeclimate · 2020-12-24T09:24:30Z

Code Climate has analyzed commit 7ffd48a and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (80% is the threshold).

This pull request will bring the total coverage in the repository to 94.1% (0.0% change).

View more on Code Climate.

ilyazub · 2020-12-24T14:59:44Z

@flavorjones Yes, the benchmark skeleton is helpful for me. I have only added CFLAGS=-O2 and changed the number of iterations from 10,000 to 1,000 because the execution time of 10,000 iterations for my query was about 10 minutes.

Benchmark results

Before the patch

$ ./slow_parsing_benchmark ./pc_game_big_shopping.html "//*[contains(concat(' ',@class,' '), ' sh-dlr__list-result ')]//*[contains(concat(' ', @class, ' '), ' hBUZL ')]"
searching './pc_game_big_shopping.html' with '//*[contains(concat(' ',@class,' '), ' sh-dlr__list-result ')]//*[contains(concat(' ', @class, ' '), ' hBUZL ')]' 1000 times
NODESET with 260 results
67855 ms

After the patch

$ ./slow_parsing_benchmark ./pc_game_big_shopping.html "//*[contains(concat(' ',@class,' '), ' sh-dlr__list-result ')]//*[contains(concat(' ', @class, ' '), ' hBUZL ')]"
searching './pc_game_big_shopping.html' with '//*[contains(concat(' ',@class,' '), ' sh-dlr__list-result ')]//*[contains(concat(' ', @class, ' '), ' hBUZL ')]' 1000 times
NODESET with 260 results
59767 ms

10% speedup with a patch. I'm not sure why it runs faster with the patch. The flame graph of the sample program from the original post looks the same for nokogiri on master and on the branch of this PR. xmlXPathCompOpEval is dominating before and after the patch.

Flame graph of the Nokogiri program

Flame graph focused on evaluate

Flame graph of the entire program

Flamegraph of libxml2 benchmark in C with 1 iteration

Before the patch, xmlStrlen was taking more time, but the overall difference not that big.

Overall program

Focused on the first xmlXPathCompOpEval call

In both cases, xmlXPathNodeCollectAndTest and xmlXPathNodeSetMergeAndClear are taking most of the time of xmlXPathCompOpEval.

I'm looking into xmlXPathCompOpEval to understand what it does. The performance of xmlXPathNodeSetMergeAndClear was discussed in #1861.

flavorjones · 2020-12-24T15:15:57Z

Thanks for the further research.

Let's talk for a moment about how to pitch this change upstream -- making this change in libxml2 is the ideal outcome I'd like to drive towards. I think it's confusing to talk to Nick about Nokogiri and CSS; but I think if we emphasize your real-world use case and the XPath queries being implemented in C, your data above (10% speedup) demonstrates a good case.

Can I ask that you take the C benchmark and the timing results to that upstream issue to continue the conversation with Nick?

ilyazub · 2020-12-24T15:24:03Z

Can I ask that you take the C benchmark and the timing results to that upstream issue to continue the conversation with Nick?

~~Yes, that's what I'm working on now.~~

Edit. Added benchmark, timing results, and flame graphs to the upstream issue.

Benchmark results: xmlStrlen (entire HTML file): 926171.936981 μs glibc_xmlStrlen (entire HTML file): 36905.903992 μs delta (xmlStrlen ÷ glibc_xmlStrlen): 25.094584 times xmlStrlen (average string): 57479.204010 μs glibc_xmlStrlen (average string): 5802.069000 μs delta (xmlStrlen ÷ glibc_xmlStrlen): 9.905937 times xmlStrlen (bigger string): 388056.315979 μs glibc_xmlStrlen (bigger string): 12797.856995 μs delta (xmlStrlen ÷ glibc_xmlStrlen): 30.318382 times xmlStrlen (smallest string): 15870.046021 μs glibc_xmlStrlen (smallest string): 6282.208984 μs delta (xmlStrlen ÷ glibc_xmlStrlen): 2.527903 times See https://gitlab.gnome.org/GNOME/libxml2/-/issues/212 for reference.

flavorjones · 2020-12-28T19:52:56Z

@ilyazub Merged! Will be in v1.11.0.rc4. Thanks for your contribution!

flavorjones added topic/performance needs/more-info vendored/libxml2 labels Dec 23, 2020

ilyazub force-pushed the xmlStrlen-patch branch from 8cc9381 to 34ce1ae Compare December 24, 2020 09:20

ilyazub force-pushed the xmlStrlen-patch branch from 34ce1ae to 7ffd48a Compare December 25, 2020 13:55

flavorjones merged commit deb1b62 into sparklemotion:master Dec 28, 2020

flavorjones added a commit that referenced this pull request Dec 28, 2020

changelog: note the strlen patch from #2144

8ec04bb

flavorjones added this to the v1.11.0 milestone Dec 28, 2020

flavorjones removed the needs/more-info label Dec 28, 2020

ilyazub deleted the xmlStrlen-patch branch December 29, 2020 09:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use glibc strlen to speed up xmlStrlen #2144

Use glibc strlen to speed up xmlStrlen #2144

ilyazub commented Dec 23, 2020

flavorjones commented Dec 23, 2020

flavorjones commented Dec 23, 2020

flavorjones commented Dec 23, 2020

codeclimate bot commented Dec 24, 2020 •

edited

Loading

ilyazub commented Dec 24, 2020 •

edited

Loading

flavorjones commented Dec 24, 2020

ilyazub commented Dec 24, 2020 •

edited

Loading

flavorjones commented Dec 28, 2020

Use glibc strlen to speed up xmlStrlen #2144

Use glibc strlen to speed up xmlStrlen #2144

Conversation

ilyazub commented Dec 23, 2020

flavorjones commented Dec 23, 2020

flavorjones commented Dec 23, 2020

flavorjones commented Dec 23, 2020

codeclimate bot commented Dec 24, 2020 • edited Loading

ilyazub commented Dec 24, 2020 • edited Loading

Benchmark results

Flame graph of the Nokogiri program

Flamegraph of libxml2 benchmark in C with 1 iteration

flavorjones commented Dec 24, 2020

ilyazub commented Dec 24, 2020 • edited Loading

flavorjones commented Dec 28, 2020

codeclimate bot commented Dec 24, 2020 •

edited

Loading

ilyazub commented Dec 24, 2020 •

edited

Loading

ilyazub commented Dec 24, 2020 •

edited

Loading