Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug] Nokogiri::XML::Searchable#at_css is 3-4 times slower on 2 MB HTML files comparing to 640 KB files #2133

Closed
ilyazub opened this issue Dec 15, 2020 · 10 comments

Comments

@ilyazub
Copy link
Contributor

ilyazub commented Dec 15, 2020

small_html.zip
big_html.zip

Help us reproduce what you're seeing

Nokogiri::XML::Searchable#at_css is 3-4 times slower on 2 MB HTML files comparing to 640 KB files.

I thought it was related to #1861, but in my case, xmlStrlen of libxml2 is being a bottleneck.

Below is a small reproduction script with pprof output on top of it. I'd happily open an issue in the libxml2 repository but I wasn't able to create a reproducible example for libxml2 specifically, so in the example, I've used Nokogiri.

# -ggdb3 or -gdwarf was needed to properly show 'libxml2' function names in 'perf report' 
CFLAGS="-O2 -ggdb3 -gdwarf -pipe -fno-omit-frame-pointer" gem install nokogiri

# Record perf events for big HTML file
perf record -F99 -e cycles:u -g -o perf_slow.data -- ruby -e 'require "nokogiri"; doc = Nokogiri::HTML.parse(File.read("pc_game_big_shopping.html")); doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each { |sh_r| sh_r.at_css(".na4ICd:not(.rOwove):not(.TxCHDf):nth-of-type(1), .hBUZL:not(.Rv2Cae):nth-of-type(2), .hBUZL:not(.Rv2Cae):not(.Fxxvzc):not(:has(span)):not(:has(div)), .dWRflb, .p7n7Ze:not(:has(a))") }'

# Record perf events for small HTML file
perf record -F99 -e cycles:u -g -o perf_fast.data -- ruby -e 'require "nokogiri"; doc = Nokogiri::HTML.parse(File.read("pc_game_big_shopping.html")); doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each { |sh_r| sh_r.at_css(".na4ICd:not(.rOwove):not(.TxCHDf):nth-of-type(1), .hBUZL:not(.Rv2Cae):nth-of-type(2), .hBUZL:not(.Rv2Cae):not(.Fxxvzc):not(:has(span)):not(:has(div)), .dWRflb, .p7n7Ze:not(:has(a))") }'
xmlStrlen is being called too much times for big HTML file.
pprof -text -lines perf_slow.data
File: env
perf-version:5.9.9
perf-command:/usr/bin/perf record -F99 -e cycles:u -g -- ruby -e require "nokogiri"; doc = Nokogiri::HTML.parse(File.read("tmp/pc_game_big_shopping.html")); doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each { |sh_r| sh_r.at_css(".na4ICd:not(.rOwove):not(.TxCHDf):nth-of-type(1), .hBUZL:not(.Rv2Cae):nth-of-type(2), .hBUZL:not(.Rv2Cae):not(.Fxxvzc):not(:has(span)):not(:has(div)), .dWRflb, .p7n7Ze:not(:has(a))") }
Type: cycles:u_event
Showing nodes accounting for 2493412844, 100% of 2493794837 total
Dropped 3 nodes (cum <= 12468974)
      flat  flat%   sum%        cum   cum%
 378400913 15.17% 15.17%  378400913 15.17%  xmlStrlen /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/xmlstring.c:427 (inline)
 296023164 11.87% 27.04%  296023164 11.87%  test_command ??
 207221433  8.31% 35.35%  222280493  8.91%  ??
 139343283  5.59% 40.94%  139343283  5.59%  get_next_env /usr/src/debug/glibc-2.32-20-g5c36293f06/elf/dl-tunables.c:73 (inline)
 136491127  5.47% 46.41%  136491127  5.47%  <unknown>
 136302538  5.47% 51.88%  136302538  5.47%  _dl_sysdep_start /usr/src/debug/glibc-2.32-20-g5c36293f06/elf/dl-sysdep.c:178
 130162253  5.22% 57.10%  130162253  5.22%  ptmalloc_init.part.0 /usr/src/debug/glibc-2.32-20-g5c36293f06/malloc/arena.c:303
 124458341  4.99% 62.09%  124458341  4.99%  __strcpy_avx2 /usr/src/debug/glibc-2.32-20-g5c36293f06/sysdeps/x86_64/multiarch/strcpy-avx2.S:73
 118872306  4.77% 66.86%  118872306  4.77%  tunable_is_name /usr/src/debug/glibc-2.32-20-g5c36293f06/elf/dl-tunables.h:121 (inline)
 101239943  4.06% 70.92%  101239943  4.06%  hash_table_raw_insert /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/id_table.c:163 (inline)
  97952346  3.93% 74.84%   97952346  3.93%  yyparse ??
  94091134  3.77% 78.62%   94091134  3.77%  xmlStrlen /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/xmlstring.c:426 (inline)
  93936814  3.77% 82.38%   93936814  3.77%  objspace_malloc_increase.constprop.0 /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/gc.c:9725
  93707988  3.76% 86.14%   93707988  3.76%  atomic_sub_nounderflow /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/gc.c:9701 (inline)
  31778965  1.27% 87.42%   31778965  1.27%  htmlParseScript /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:2976
  31367583  1.26% 88.67%   31367583  1.26%  malloc /usr/src/debug/glibc-2.32-20-g5c36293f06/malloc/malloc.c:3043
  30292982  1.21% 89.89%   30292982  1.21%  rb_ary_entry /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/array.c:1515
  30277698  1.21% 91.10%   30277698  1.21%  compile_data_alloc_with_arena /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/compile.c:859 (inline)
  28942555  1.16% 92.26%   28942555  1.16%  xmlIsMainThread /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/threads.c:796
  27964370  1.12% 93.38%   27964370  1.12%  rb_keyword_given_p /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/eval.c:913
  27674657  1.11% 94.49%   27674657  1.11%  xmlNewText /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/tree.c:2445
  27607031  1.11% 95.60%   27607031  1.11%  ruby_yyparse /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/parse.c:11947
  24954360  1.00% 96.60%   24954360  1.00%  ruby_yyparse /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/parse.c:5730
  24640271  0.99% 97.59%   24640271  0.99%  xmlNextChar /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/parserInternals.c:429
  23592142  0.95% 98.54%   23592142  0.95%  new_insn_body.constprop.2 /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/compile.c:1174
  23586726  0.95% 99.48%   23586726  0.95%  enc_compatible_latter /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/encoding.c:908
  12504664   0.5%   100%   12504664   0.5%  __GI___tunables_init /usr/src/debug/glibc-2.32-20-g5c36293f06/elf/dl-tunables.c:316
     25257 0.001%   100%  119118202  4.78%  ??
         0     0%   100%  139343283  5.59%  __GI___tunables_init /usr/src/debug/glibc-2.32-20-g5c36293f06/elf/dl-tunables.c:295
         0     0%   100%  118872306  4.77%  __GI___tunables_init /usr/src/debug/glibc-2.32-20-g5c36293f06/elf/dl-tunables.c:322
         0     0%   100%   28942555  1.16%  __xmlLastError /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/globals.c:729
         0     0%   100%   28942555  1.16%  __xmlRaiseError /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/error.c:600
         0     0%   100%  119092945  4.78%  _dl_start /usr/src/debug/glibc-2.32-20-g5c36293f06/elf/rtld.c:578
         0     0%   100%   93707988  3.76%  atomic_sub_nounderflow /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/gc.c:9694 (inline)
         0     0%   100%   30277698  1.21%  compile_data_alloc /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/compile.c:898 (inline)
         0     0%   100%   30277698  1.21%  compile_data_alloc_trace /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/compile.c:930 (inline)
         0     0%   100%  119092945  4.78%  elf_dynamic_do_Rela /usr/src/debug/glibc-2.32-20-g5c36293f06/elf/do-rel.h:111 (inline)
         0     0%   100%  585528495 23.48%  htmlDoRead /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:6798
         0     0%   100%   24640271  0.99%  htmlParseAttValue /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:2751 (inline)
         0     0%   100%   24640271  0.99%  htmlParseAttribute /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:3561 (inline)
         0     0%   100%  504271012 20.22%  htmlParseContentInternal /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:4626
         0     0%   100%   81257483  3.26%  htmlParseContentInternal /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:4661
         0     0%   100%  585528495 23.48%  htmlParseDocument /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:4823
         0     0%   100%   52314928  2.10%  htmlParseElementInternal /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:4476 (inline)
         0     0%   100%   28942555  1.16%  htmlParseElementInternal /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:4489 (inline)
         0     0%   100%   28942555  1.16%  htmlParseErr /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:117
         0     0%   100%   24640271  0.99%  htmlParseHTMLAttribute /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:2676
         0     0%   100%  472492047 18.95%  htmlParseScript /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:2971
         0     0%   100%   24640271  0.99%  htmlParseStartTag /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:3844
         0     0%   100%   27674657  1.11%  htmlParseStartTag /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:3930
         0     0%   100%   30277698  1.21%  iseq_compile_each0 /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/compile.c:7139
         0     0%   100%   30277698  1.21%  new_trace_body /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/compile.c:1099 (inline)
         0     0%   100%   31394338  1.26%  objspace_malloc_increase /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/gc.c:9729 (inline)
         0     0%   100%   62313650  2.50%  objspace_malloc_increase /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/gc.c:9731 (inline)
         0     0%   100%   93707988  3.76%  objspace_xfree /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/gc.c:10005 (inline)
         0     0%   100%   93707988  3.76%  objspace_xfree /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/gc.c:10075 (inline)
         0     0%   100%  101239943  4.06%  rb_id_table_insert /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/id_table.c:258
         0     0%   100%  101239943  4.06%  rb_id_table_insert_key /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/id_table.c:250 (inline)
         0     0%   100%  585528495 23.48%  read_memory /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/html_document.c:116
         0     0%   100%   93707988  3.76%  ruby_sized_xfree /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/gc.c:10162 (inline)
         0     0%   100%   93707988  3.76%  ruby_sized_xfree /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/gc.c:10165 (inline)
         0     0%   100%   93707988  3.76%  ruby_xfree /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/gc.c:10172
         0     0%   100%  585528495 23.48%  vm_call_cfunc /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/vm_insnhelper.c:2539
         0     0%   100%  585528495 23.48%  vm_call_cfunc_with_frame /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/vm_insnhelper.c:2514 (inline)
         0     0%   100%   27674657  1.11%  xmlNewDocText /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/tree.c:2656
         0     0%   100%   27674657  1.11%  xmlSAX2AttributeInternal /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/SAX2.c:1347
         0     0%   100%   27674657  1.11%  xmlSAX2StartElement /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/SAX2.c:1755
         0     0%   100%  472492047 18.95%  xmlStrlen /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/xmlstring.c:422
         0     0%   100%  472492047 18.95%  xmlTextConcat /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/tree.c:7014
For small HTML file xmlStrlen is not a bottleneck
pprof -text -lines perf_fast.data
File: env
perf-version:5.9.9
perf-command:/usr/bin/perf record -F99 -e cycles:u -g -- ruby -e require "nokogiri"; doc = Nokogiri::HTML.parse(File.read("tmp/pc_game_small_shopping.html")); doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each { |sh_r| sh_r.at_css(".na4ICd:not(.rOwove):not(.TxCHDf):nth-of-type(1), .hBUZL:not(.Rv2Cae):nth-of-type(2), .hBUZL:not(.Rv2Cae):not(.Fxxvzc):not(:has(span)):not(:has(div)), .dWRflb, .p7n7Ze:not(:has(a))") }
Type: cycles:u_event
Showing nodes accounting for 1408616219, 100% of 1408758631 total
Dropped 3 nodes (cum <= 7043793)
      flat  flat%   sum%        cum   cum%
 150657603 10.69% 10.69%  150657603 10.69%  __GI___tunables_init /usr/src/debug/glibc-2.32-20-g5c36293f06/elf/dl-tunables.c:316
 140765453  9.99% 20.69%  140765453  9.99%  map_over_funcs ??
 119820255  8.51% 29.19%  119820255  8.51%  get_common_cache_info /usr/src/debug/glibc-2.32-20-g5c36293f06/sysdeps/x86/cacheinfo.c:604
 118906136  8.44% 37.63%  118906136  8.44%  __strlen_avx2 /usr/src/debug/glibc-2.32-20-g5c36293f06/sysdeps/x86_64/multiarch/strlen-avx2.S:354
 109923727  7.80% 45.44%  109923727  7.80%  ??
 108197998  7.68% 53.12%  108197998  7.68%  __strcpy_avx2 /usr/src/debug/glibc-2.32-20-g5c36293f06/sysdeps/x86_64/multiarch/strcpy-avx2.S:302
  90182399  6.40% 59.52%   90182399  6.40%  xmlStrlen /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/xmlstring.c:427 (inline)
  89818041  6.38% 65.89%   89818041  6.38%  __gconv_transform_utf8_internal /usr/src/debug/glibc-2.32-20-g5c36293f06/iconv/skeleton.c:513
  68485121  4.86% 70.75%   68485121  4.86%  elf_dynamic_do_Rela /usr/src/debug/glibc-2.32-20-g5c36293f06/elf/do-rel.h:137 (inline)
  35310965  2.51% 73.26%   35310965  2.51%  __strcpy_avx2 /usr/src/debug/glibc-2.32-20-g5c36293f06/sysdeps/x86_64/multiarch/strcpy-avx2.S:612
  31722033  2.25% 75.51%   31722033  2.25%  _int_free /usr/src/debug/glibc-2.32-20-g5c36293f06/malloc/malloc.c:4183
  31455066  2.23% 77.75%   31455066  2.23%  xmlFreePropList /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/tree.c:2072 (inline)
  31307181  2.22% 79.97%   31307181  2.22%  parser_precise_mbclen /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/parse.y:5622 (inline)
  31207582  2.22% 82.18%   31207582  2.22%  hash_search ??
  29605073  2.10% 84.28%   29605073  2.10%  set_shellopts ??
  28412181  2.02% 86.30%   28412181  2.02%  reserved_word /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/defs/keywords:70
  27878391  1.98% 88.28%   27878391  1.98%  vm_getivar /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/vm_insnhelper.c:1088 (inline)
  26624790  1.89% 90.17%   26624790  1.89%  ruby_yyparse /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/parse.c:5853
  25746505  1.83% 92.00%   25746505  1.83%  htmlTagLookup /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:1260
  25234034  1.79% 93.79%   25234034  1.79%  vm_opt_regexpmatch2 /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/vm_insnhelper.c:4723 (inline)
  24534698  1.74% 95.53%   24534698  1.74%  ruby_yyparse /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/parse.c:5841
  23631937  1.68% 97.21%   23631937  1.68%  iseq_set_sequence /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/compile.c:2078 (inline)
  21013430  1.49% 98.70%   21013430  1.49%  _int_malloc /usr/src/debug/glibc-2.32-20-g5c36293f06/malloc/malloc.c:3666
  18175620  1.29%   100%   18175620  1.29%  <unknown>
         0     0%   100%   68485121  4.86%  _dl_relocate_object /usr/src/debug/glibc-2.32-20-g5c36293f06/elf/dl-reloc.c:274
         0     0%   100%   68485121  4.86%  _dl_sysdep_start /usr/src/debug/glibc-2.32-20-g5c36293f06/elf/dl-sysdep.c:252
         0     0%   100%   16454271  1.17%  command_substitute ??
         0     0%   100%   68485121  4.86%  dl_main /usr/src/debug/glibc-2.32-20-g5c36293f06/elf/rtld.c:2341
         0     0%   100%   31455066  2.23%  finalize_list /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/gc.c:3378
         0     0%   100%  115928904  8.23%  htmlDoRead /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:6798
         0     0%   100%   90182399  6.40%  htmlParseContentInternal /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:4626
         0     0%   100%   25746505  1.83%  htmlParseContentInternal /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:4661
         0     0%   100%  115928904  8.23%  htmlParseDocument /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:4823
         0     0%   100%   25746505  1.83%  htmlParseElementInternal /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:4487 (inline)
         0     0%   100%   90182399  6.40%  htmlParseScript /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/HTMLparser.c:2971
         0     0%   100%   23631937  1.68%  iseq_setup /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/compile.c:1367
         0     0%   100%  115928904  8.23%  read_memory /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/html_document.c:116
         0     0%   100%   31455066  2.23%  run_final /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/gc.c:3359 (inline)
         0     0%   100%   31307181  2.22%  tokadd_mbchar /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/parse.y:6739
         0     0%   100%  115928904  8.23%  vm_call_cfunc /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/vm_insnhelper.c:2539
         0     0%   100%  115928904  8.23%  vm_call_cfunc_with_frame /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/vm_insnhelper.c:2514 (inline)
         0     0%   100%   25234034  1.79%  vm_exec_core /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/insns.def:1449
         0     0%   100%   27878391  1.98%  vm_exec_core /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/insns.def:216
         0     0%   100%   27878391  1.98%  vm_getinstancevariable /tmp/ruby-build.20201120193557.320837.FHSKmn/ruby-2.7.2/vm_insnhelper.c:1161 (inline)
         0     0%   100%   31455066  2.23%  xmlFreeDoc /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/tree.c:1253
         0     0%   100%   31455066  2.23%  xmlFreeNodeList /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/tree.c:3665
         0     0%   100%   31455066  2.23%  xmlFreeNodeList /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/tree.c:3704 (inline)
         0     0%   100%   31455066  2.23%  xmlFreePropList /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/tree.c:2067
         0     0%   100%   90182399  6.40%  xmlStrlen /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/xmlstring.c:422
         0     0%   100%   90182399  6.40%  xmlTextConcat /home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ext/nokogiri/tmp/x86_64-pc-linux-gnu/ports/libxml2/2.9.10/libxml2-2.9.10/tree.c:7014

I've also tried to count the number of times xmlStrlen is being called with unique strings. It seems that not that only 60 of 1.7M calls we with unique strings.

# ./tmp/slow_search_parse.rb
require_relative File.expand_path("#{__dir__}/../config/environment")

# I can't share sources, but parse! is basically a bunch of doc.css and at_css calls
Search.new(q: "pc game", tbm: "shop", file_path: "tmp/pc_game_big_shopping.html").parse!

The rough experiment with counting unique xmlStrlen calls.

bundle exec gdb -q -ex 'set breakpoint pending on' --ex 'dprintf xmlstring.c:423, "str: %p\n", (void *)&str' --ex 'run' --args ruby ./tmp/slow_search_parse.rb > xmlStrlen_str_values.txt

grep str: xmlStrlen_str_values.txt | wc -l
1739731

grep str: xmlStrlen_str_values.txt | sort | uniq -c | sort -rn | head -20
326970 str: 0x7fffffffb0e8
194454 str: 0x7fffffffae88
186880 str: 0x7fffffffafb8
163548 str: 0x7fffffffb5f8
112260 str: 0x7fffffffad58
  97227 str: 0x7fffffffb398
  93440 str: 0x7fffffffb4c8
  77858 str: 0x7fffffffa9c8
  56130 str: 0x7fffffffb268
  55658 str: 0x7fffffffb218
  46710 str: 0x7fffffffac28
  38929 str: 0x7fffffffaed8
  27974 str: 0x7fffffffb728
  23355 str: 0x7fffffffb138
  16016 str: 0x7fffffffaaf8
  7785 str: 0x7fffffffb008
  6200 str: 0x7fffffffa228
  5060 str: 0x7fffffffa358
  5026 str: 0x7fffffffa0f8
  3594 str: 0x7fffffffa488

I also prepared a benchmark against big and small HTML that shows that libxml2 is 3-4 times slower on big files.

# frozen_string_literal: true

html = File.read("tmp/pc_game_big_shopping.html")
small_html = File.read("tmp/pc_game_small_shopping.html")

doc = Nokogiri::HTML.parse(html)
small_doc = Nokogiri::HTML.parse(small_html)

puts Nokogiri::VERSION_INFO.to_yaml
puts "Searching #{doc.css("*").length} nodes"
puts "Searching #{small_doc.css("*").length} nodes"

Benchmark.bm(40) do |x|
  x.report("CSS big_html shopping_results") { doc.css(".sh-dlr__list-result, .sh-dgr__grid-result") }
  x.report("CSS big_html link") { doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each { |sh_r| sh_r.at_css(".eIuuYe a, a.EI11Pd, a.AGVhpb, a.GyDBsd, a.VQN8fd, a.VZTCjd, a.REX1ub, a.sHaywe, a.eaGTj") } }
  x.report("CSS big_html snippet") { doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each { |sh_r| sh_r.at_css(".na4ICd:not(.rOwove):not(.TxCHDf):nth-of-type(1), .hBUZL:not(.Rv2Cae):nth-of-type(2), .hBUZL:not(.Rv2Cae):not(.Fxxvzc):not(:has(span)):not(:has(div)), .dWRflb, .p7n7Ze:not(:has(a))") } }
  x.report("CSS big_html extensions") { doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each { |sh_r| sh_r.at_css(".na4ICd:nth-of-type(3) span, .hBUZL:not(.Rv2Cae):not(.Y6Fvnc):nth-of-type(3) span, .hBUZL.Fxxvzc span, .p7n7Ze .F7Kwhf span") } }

  x.report("CSS small_html shopping_results") { small_doc.css(".sh-dlr__list-result, .sh-dgr__grid-result") }
  x.report("CSS small_html link") { small_doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each { |sh_r| sh_r.at_css(".eIuuYe a, a.EI11Pd, a.AGVhpb, a.GyDBsd, a.VQN8fd, a.VZTCjd, a.REX1ub, a.sHaywe, a.eaGTj") } }
  x.report("CSS small_html snippet") { small_doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each { |sh_r| sh_r.at_css(".na4ICd:not(.rOwove):not(.TxCHDf):nth-of-type(1), .hBUZL:not(.Rv2Cae):nth-of-type(2), .hBUZL:not(.Rv2Cae):not(.Fxxvzc):not(:has(span)):not(:has(div)), .dWRflb, .p7n7Ze:not(:has(a))") } }
  x.report("CSS small_html extensions") { small_doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each { |sh_r| sh_r.at_css(".na4ICd:nth-of-type(3) span, .hBUZL:not(.Rv2Cae):not(.Y6Fvnc):nth-of-type(3) span, .hBUZL.Fxxvzc span, .p7n7Ze .F7Kwhf span") } }
end

Benchmark.bm(40) do |x|
  x.report("XPath big_html shopping_results") { doc.xpath(".//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dlr__list-result ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dgr__grid-result ')]") }
  x.report("XPath big_html link") { doc.xpath(".//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dlr__list-result ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dgr__grid-result ')]").each { |sh_r| sh_r.at_xpath(".//*[contains(concat(' ', normalize-space(@class), ' '), ' eIuuYe ')]//a | .//a[contains(concat(' ', normalize-space(@class), ' '), ' EI11Pd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' AGVhpb ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' GyDBsd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' VQN8fd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' VZTCjd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' REX1ub ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' sHaywe ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' eaGTj ')]") } }
  x.report("XPath big_html snippet") { doc.xpath(".//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dlr__list-result ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dgr__grid-result ')]").each { |sh_r| sh_r.at_xpath(".//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ') and not(contains(concat(' ', normalize-space(@class), ' '), ' rOwove ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' TxCHDf '))][position() = 1] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae '))][position() = 2] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' Fxxvzc ')) and not(.//span) and not(.//div)] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' dWRflb ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' p7n7Ze ') and not(.//a)]") } }
  x.report("XPath big_html extensions") { doc.xpath(".//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dlr__list-result ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dgr__grid-result ')]").each { |sh_r| sh_r.at_xpath(".//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ')][position() = 3]//span | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' Y6Fvnc '))][position() = 3]//span | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and contains(concat(' ', normalize-space(@class), ' '), ' Fxxvzc ')]//span | .//*[contains(concat(' ', normalize-space(@class), ' '), ' p7n7Ze ')]//*[contains(concat(' ', normalize-space(@class), ' '), ' F7Kwhf ')]//span") } }

  x.report("XPath small_html shopping_results") { small_doc.xpath(".//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dlr__list-result ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dgr__grid-result ')]") }
  x.report("XPath small_html link") { small_doc.xpath(".//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dlr__list-result ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dgr__grid-result ')]").each { |sh_r| sh_r.at_xpath(".//*[contains(concat(' ', normalize-space(@class), ' '), ' eIuuYe ')]//a | .//a[contains(concat(' ', normalize-space(@class), ' '), ' EI11Pd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' AGVhpb ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' GyDBsd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' VQN8fd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' VZTCjd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' REX1ub ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' sHaywe ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' eaGTj ')]") } }
  x.report("XPath small_html snippet") { small_doc.xpath(".//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dlr__list-result ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dgr__grid-result ')]").each { |sh_r| sh_r.at_xpath(".//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ') and not(contains(concat(' ', normalize-space(@class), ' '), ' rOwove ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' TxCHDf '))][position() = 1] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae '))][position() = 2] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' Fxxvzc ')) and not(.//span) and not(.//div)] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' dWRflb ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' p7n7Ze ') and not(.//a)]") } }
  x.report("XPath small_html extensions") { small_doc.xpath(".//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dlr__list-result ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dgr__grid-result ')]").each { |sh_r| sh_r.at_xpath(".//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ')][position() = 3]//span | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' Y6Fvnc '))][position() = 3]//span | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and contains(concat(' ', normalize-space(@class), ' '), ' Fxxvzc ')]//span | .//*[contains(concat(' ', normalize-space(@class), ' '), ' p7n7Ze ')]//*[contains(concat(' ', normalize-space(@class), ' '), ' F7Kwhf ')]//span") } }
end
Results of the benchmark.
---
warnings: []
nokogiri: 1.10.10
ruby:
  version: 2.7.2
  platform: x86_64-linux
  description: ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ports/x86_64-pc-linux-gnu/libxml2/2.9.10"
  libxslt_path: "/home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ports/x86_64-pc-linux-gnu/libxslt/1.1.34"
  libxml2_patches:
  - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
  - 0002-Remove-script-macro-support.patch
  - 0003-Update-entities-to-remove-handling-of-ssi.patch
  - 0004-libxml2.la-is-in-top_builddir.patch
  - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
  libxslt_patches: []
  compiled: 2.9.10
  loaded: 2.9.10
Searching 7785 nodes
Searching 2544 nodes
                                              user     system      total        real
CSS big_html container                     0.033532   0.000000   0.033532 (  0.033206)
CSS big_html selector1                     0.061557   0.000000   0.061557 (  0.062280)
CSS big_html selector2                     0.084810   0.000000   0.084810 (  0.085594)
CSS big_html selector3                     0.072296   0.000916   0.073212 (  0.073693)
CSS small_html container                   0.009254   0.000029   0.009283 (  0.009238)
CSS small_html selector1                   0.017749   0.000000   0.017749 (  0.018098)
CSS small_html selector2                   0.022495   0.000169   0.022664 (  0.022848)
CSS small_html selector3                   0.020638   0.000000   0.020638 (  0.020951)
                                              user     system      total        real
XPath big_html container                   0.029754   0.000000   0.029754 (  0.029956)
XPath big_html selector1                   0.085694   0.001074   0.086768 (  0.087289)
XPath big_html selector2                   0.104368   0.000000   0.104368 (  0.105062)
XPath big_html selector3                   0.085284   0.000000   0.085284 (  0.085783)
XPath small_html container                 0.009373   0.000835   0.010208 (  0.010034)
XPath small_html selector1                 0.017261   0.000029   0.017290 (  0.017089)
XPath small_html selector2                 0.021228   0.000027   0.021255 (  0.021252)
XPath small_html selector3                 0.020391   0.000000   0.020391 (  0.020350)

Expected behavior

Environment

# Nokogiri (1.10.10)
    ---
    warnings: []
    nokogiri: 1.10.10
    ruby:
      version: 2.7.2
      platform: x86_64-linux
      description: ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ports/x86_64-pc-linux-gnu/libxml2/2.9.10"
      libxslt_path: "/home/ilyazub/.rbenv/versions/2.7.2/lib/ruby/gems/2.7.0/gems/nokogiri-1.10.10/ports/x86_64-pc-linux-gnu/libxslt/1.1.34"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      - 0004-libxml2.la-is-in-top_builddir.patch
      - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
      libxslt_patches: []
      compiled: 2.9.10
      loaded: 2.9.10

Additional context

@ilyazub ilyazub added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Dec 15, 2020
@flavorjones
Copy link
Member

👋 Thanks for opening this up, and for looking at #1861 and identifying key differences.

My assumption based on a brief scan of what you've posted here is that this is a performance issue with libxml2 and will need to be addressed upstream. I'll probably spend some time tonight reproducing some of your perf analysis to make sure I understand what's going on.

The next few weeks of my time are going to be spent focused on getting v1.11.0 shipped, but once I've done that I can probably help put together a reproduction that uses the libxml2 C API (and not Nokogiri).

@flavorjones
Copy link
Member

@ilyazub Hi, I'm looking into this and I'm a little confused. Generally, I'm seeing that a document that is about three times larger takes about three times longer to search. That doesn't seem unexpected to me, can you help me understand what I'm missing?

I'll show my work, and hope that you can tell me what we're doing differently.

First, here's the benchmark script I'm running (against the two files you attached in the description):

#! /usr/bin/env ruby

require "nokogiri"
require "yaml"

gem "benchmark-ips"
require "benchmark/ips"

puts Nokogiri::VERSION_INFO.to_yaml

big_doc = Nokogiri::HTML.parse(File.read("pc_game_big_shopping.html"))
small_doc = Nokogiri::HTML.parse(File.read("pc_game_small_shopping.html"))

Benchmark.ips do |x|
  x.report("big doc css") do
    big_doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each do |sh_r|
      sh_r.at_css(".na4ICd:not(.rOwove):not(.TxCHDf):nth-of-type(1), .hBUZL:not(.Rv2Cae):nth-of-type(2), .hBUZL:not(.Rv2Cae):not(.Fxxvzc):not(:has(span)):not(:has(div)), .dWRflb, .p7n7Ze:not(:has(a))")
    end
  end

  x.report("small doc css") do
    small_doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each do |sh_r|
      sh_r.at_css(".na4ICd:not(.rOwove):not(.TxCHDf):nth-of-type(1), .hBUZL:not(.Rv2Cae):nth-of-type(2), .hBUZL:not(.Rv2Cae):not(.Fxxvzc):not(:has(span)):not(:has(div)), .dWRflb, .p7n7Ze:not(:has(a))")
    end
  end
end

Here's what it emits:

---
warnings: []
nokogiri: 1.10.10
ruby:
  version: 2.7.2
  platform: x86_64-linux
  description: ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
  engine: ruby
libxml:
  binding: extension
  source: packaged
  libxml2_path: "/home/flavorjones/code/oss/nokogiri/ports/x86_64-pc-linux-gnu/libxml2/2.9.10"
  libxslt_path: "/home/flavorjones/code/oss/nokogiri/ports/x86_64-pc-linux-gnu/libxslt/1.1.34"
  libxml2_patches:
  - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
  - 0002-Remove-script-macro-support.patch
  - 0003-Update-entities-to-remove-handling-of-ssi.patch
  - 0004-libxml2.la-is-in-top_builddir.patch
  - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
  libxslt_patches: []
  compiled: 2.9.10
  loaded: 2.9.10
Warming up --------------------------------------
         big doc css     1.000  i/100ms
       small doc css     4.000  i/100ms
Calculating -------------------------------------
         big doc css      9.925  (±10.1%) i/s -     50.000  in   5.081904s
       small doc css     38.003  (±10.5%) i/s -    188.000  in   5.050401s

Running that same script with gperftools and generating a report with pprof results in:

2133 prof

which doesn't show xmlStrlen as a bottleneck.

@flavorjones flavorjones added needs/more-info and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Dec 16, 2020
@flavorjones
Copy link
Member

Sorry, just to clarify: I see that xmlStrlen is showing up as ~15% of cpu time on the big doc, and less than that on the small doc; but I come back to the fact that this search executes about three times slower on a doc that's about three times bigger, and I'm not sure I understand why this is unexpected performance.

@ilyazub
Copy link
Contributor Author

ilyazub commented Dec 16, 2020

@flavorjones Yes, you're right. I've chosen the wrong words.

I thought that amount of calls of xmlXPathCompOpEval and xmlStrlen while searching through 2 MB document is more than four times more (non-linear growth) compared to 640 KB document. But it seems that algorithm is not optimal and now I'm trying to formulate my findings to open an issue in the libxml2 repository.

I'm adapting your Gist from #1861 (comment) to my situation:

#include <string.h>
#include <stdio.h>
#include <time.h>
#include <libxml/HTMLtree.h>
#include <libxml/xpath.h>

double time_in_microsec()
{
  struct timespec tv;
  clock_gettime(CLOCK_MONOTONIC_RAW, &tv);

  return (double) (tv.tv_sec * 1000000 + tv.tv_nsec / 1000.0);
}

double elapsed_ms_since(double start)
{
  return time_in_microsec() - start;
}

void main()
{
  htmlDocPtr doc;
  xmlXPathContextPtr ctx;
  xmlChar *queries[] = {
    ".//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dlr__list-result ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dgr__grid-result ')]",
    ".//*[contains(concat(' ', normalize-space(@class), ' '), ' eIuuYe ')]//a | .//a[contains(concat(' ', normalize-space(@class), ' '), ' EI11Pd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' AGVhpb ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' GyDBsd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' VQN8fd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' VZTCjd ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' REX1ub ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' sHaywe ')] | .//a[contains(concat(' ', normalize-space(@class), ' '), ' eaGTj ')]",
    ".//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ') and not(contains(concat(' ', normalize-space(@class), ' '), ' rOwove ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' TxCHDf '))][position() = 1] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae '))][position() = 2] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' Fxxvzc ')) and not(.//span) and not(.//div)] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' dWRflb ')] | .//*[contains(concat(' ', normalize-space(@class), ' '), ' p7n7Ze ') and not(.//a)]",
    ".//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ')][position() = 3]//span | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and not(contains(concat(' ', normalize-space(@class), ' '), ' Rv2Cae ')) and not(contains(concat(' ', normalize-space(@class), ' '), ' Y6Fvnc '))][position() = 3]//span | .//*[contains(concat(' ', normalize-space(@class), ' '), ' hBUZL ') and contains(concat(' ', normalize-space(@class), ' '), ' Fxxvzc ')]//span | .//*[contains(concat(' ', normalize-space(@class), ' '), ' p7n7Ze ')]//*[contains(concat(' ', normalize-space(@class), ' '), ' F7Kwhf ')]//span"
  };
  size_t query_len = 4;
  double start_time ;

  doc = htmlReadFile("pc_game_big_shopping.html", NULL, 0);

  // ctx = XPathContext.new(node)
  xmlXPathInit();
  ctx = xmlXPathNewContext(doc);

  for (int j = 0; j < query_len; j++) {
    start_time = time_in_microsec();

    // 100 iterations to emulate bigger amount of real-world searches in document
    for (int n = 0; n < 100; n++) {
      // ctx.evaluate(path, handler)
      xmlXPathEvalExpression(queries[j], ctx);
    }

    fprintf(stderr, "%-85.85s %f μs\n", queries[j], elapsed_ms_since(start_time));
  }
}

This results in the not that big difference between big and small document comparing to at_css calls.

pc_game_small_shopping.html

.//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dlr__list-result ')] |  377.167007 μs
.//*[contains(concat(' ', normalize-space(@class), ' '), ' eIuuYe ')]//a | .//a[conta 1838.653992 μs
.//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ') and not(contains 2054.356003 μs
.//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ')][position() = 3] 1539.781998 μs

pc_game_big_shopping.html

.//*[contains(concat(' ', normalize-space(@class), ' '), ' sh-dlr__list-result ')] |  356.565002 μs
.//*[contains(concat(' ', normalize-space(@class), ' '), ' eIuuYe ')]//a | .//a[conta 2671.336014 μs
.//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ') and not(contains 2323.917999 μs
.//*[contains(concat(' ', normalize-space(@class), ' '), ' na4ICd ')][position() = 3] 1704.442993 μs

I guess this issue could be closed as the problem is in libxml2.

@ilyazub
Copy link
Contributor Author

ilyazub commented Dec 17, 2020

@flavorjones I've found a way to describe what's wrong. xmlStrlen is being called hundreds and even thousands of times for the same strings.

TL;DR. In the big document, xmlStrlen is called 15K times for the same string. In the small document, it's being called for 5K times. Three times bigger document and three times more calls. But the strings are the same. My assumption as a person who doesn't know C: the unique string pointer address means that the function was called with the same argument.

Here's a small Ruby script that searches an element in the container.

require "nokogiri"

html = File.read(ARGV[0])

doc = Nokogiri::HTML.parse(html)

doc.css(".sh-dlr__list-result, .sh-dgr__grid-result").each { |sh_r| sh_r.at_css(".na4ICd:not(.rOwove):not(.TxCHDf):nth-of-type(1), .hBUZL:not(.Rv2Cae):nth-of-type(2), .hBUZL:not(.Rv2Cae):not(.Fxxvzc):not(:has(span)):not(:has(div)), .dWRflb, .p7n7Ze:not(:has(a))") }

I recompiled nokogiri with debug information.

CFLAGS="-O -ggdb3 -pipe -fno-omit-frame-pointer" gem install nokogiri

Then attached the breakpoint via gdb to get the line to attach dprintf later on.

bundle exec gdb -q -ex 'set breakpoint pending on' --ex 'thread apply all bt full' --ex 'b xmlStrlen' --ex run --args ruby ./tmp/slow_search_parse.rb tmp/pc_game_big_shopping.html

Then attached dprintf via GDB to the xmlStrlen function of libxml2 to print pointer of str variable and log output to the file. I execute this script with a big and small HTML document.

bundle exec gdb -q -ex 'set breakpoint pending on' --ex 'thread apply all bt full' --ex 'dprintf xmlstring.c:425, "str: %p\n", str' --ex run --ex quit --args ruby ./tmp/slow_search_parse.rb tmp/pc_game_big_shopping.html |& tee xmlStrlen_big.log

bundle exec gdb -q -ex 'set breakpoint pending on' --ex 'thread apply all bt full' --ex 'dprintf xmlstring.c:425, "str: %p\n", str' --ex run --ex quit --args ruby ./tmp/slow_search_parse.rb tmp/pc_game_small_shopping.html |& tee xmlStrlen_small.log

Then examined logs.

$ grep str: xmlStrlen_big.log | sort | uniq -c | sort -rn | head -20
  15446 str: 0x112e760
  15405 str: 0x112e6d0
  13405 str: 0x15ddb80
   4006 str: 0x7fffe6c615ad
   1962 str: 0x112e740
    381 str: 0x7fffe69de010
    374 str: 0x15b5520
    321 str: 0x1649d60
    306 str: 0x174c7a0
    306 str: 0x1734430
    306 str: 0x15db410
    306 str: 0x15b63b0
    303 str: 0x174c830
    303 str: 0x15db4a0
    303 str: 0x15b64a0
    298 str: 0x161a8b0
    297 str: 0x16e3fa0
    291 str: 0x160fd50
    287 str: 0x163c320
    285 str: 0x16feef0

$ grep str: xmlStrlen_small.log | sort | uniq -c | sort -rn | head -20
   5046 str: 0xf3fbc0
   5038 str: 0xf3fcb0
   1687 str: 0xf40440
    906 str: 0xf3fc30
    806 str: 0x7fffe6c615ad
    431 str: 0xfe0460
    420 str: 0x11b53e0
    384 str: 0x1067bb0
    353 str: 0xfdc760
    328 str: 0x11e5ab0
    322 str: 0x11e5a40
    306 str: 0x11ccea0
    303 str: 0x11ccf30
    303 str: 0x11bc040
    301 str: 0x11a57d0
    291 str: 0x11e7740
    289 str: 0x1200a70
    288 str: 0x11d5660
    284 str: 0x11e6d80
    284 str: 0x11c3fd0

I have a simple idea about caching the length of strings in memory in xmlStrlen and compare the performance. I'll post the results of the experiment here.

@flavorjones
Copy link
Member

@ilyazub I'm looking at the queries you're making, and my intuition is telling me that the XPath functions are going to end up calling xmlStrlen a lot of times. Also, inside libxml2, strings and buffers are repeatedly being allocated and freed, so I don't think it's safe to cache string lengths across xmlStrlen calls.

I'd like to close this -- it seems like libxml2 behavior regardless -- though you can continue to post here if it's helpful or if you think there's something we can do inside Nokogiri to impact this.

One performance-related thing occurred to me while reviewing your comments, which is that Nokogiri could register a C function that will more efficiently look for CSS class matches. The combination of contains(concat(' ', normalize-space(@class), ' ')) seems like something we could optimize. I'll open an issue to research doing that.

@flavorjones
Copy link
Member

See #2135 for followup performance work.

@flavorjones
Copy link
Member

Just an update from #2135 and the resulting PR at #2137.

I'm seeing a big improvement in performance. With v1.11.0.rc3:

Warming up --------------------------------------
         big doc css     1.000  i/100ms
       small doc css     6.000  i/100ms
Calculating -------------------------------------
         big doc css     15.478  (± 6.5%) i/s -     78.000  in   5.051894s
       small doc css     61.406  (± 4.9%) i/s -    312.000  in   5.095560s

But on the feature branch it's ~2x faster:

Warming up --------------------------------------
         big doc css     3.000  i/100ms
       small doc css    12.000  i/100ms
Calculating -------------------------------------
         big doc css     30.484  (± 3.3%) i/s -    153.000  in   5.024492s
       small doc css    126.396  (± 4.0%) i/s -    636.000  in   5.042040s

This optimization reduces calls to xmlStrlen as expected. Here's an updated pprof graph:

2133 prof

Anyway, thanks again for poking at this -- I would not have seen the opportunity for a 2x performance improvement without it!

flavorjones added a commit that referenced this issue Dec 18, 2020
@ilyazub
Copy link
Contributor Author

ilyazub commented Dec 21, 2020

@flavorjones Thanks for your work! That's the second issue I've opened up and I really enjoy the results and communication with you.

@ilyazub
Copy link
Contributor Author

ilyazub commented Dec 21, 2020

Just for reference, I've opened an issue in the libxml2 repository about xmlStrlen speed up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants