[bug] The `strict` parsing mode does not raise for invalid HTML documents #2130

dgollahon · 2020-12-08T20:33:30Z

Please describe the bug

Hi,

According to these docs it should be possible to enter a strict parsing mode that raises on malformed XML. This works for XML. It seems implied that strict can be set the same as with Nokogiri::XML, here. Unfortunately, I can't figure out anything that will cause Nokogiri::HTML with strict config to raise.

Hopefully I have not missed any documentation or open issues related to this, but all I ran across was #1806 which was closed for lack of detail. If this is the intended behavior, I think the differences between XML and HTML parse options should be documented more clearly.

Help us reproduce what you're seeing

#! /usr/bin/env ruby

require 'nokogiri'

# OK
Nokogiri::XML('<') # => #(Document:0x3fc297862fc4 { name = "document" })

# OK, there's the error I would expect
begin
  Nokogiri::XML('<', &:strict)
rescue => error
 pp error # => #<Nokogiri::XML::SyntaxError: 1:2: FATAL: StartTag: invalid element name>
end

# OK
 Nokogiri::HTML('<') # => #(Document:0x3fc29751450c { name = "document", children = [ #(DTD:0x3fc2975188dc { name = "html" })]})

# What? Why doesn't this raise?
Nokogiri::HTML('<', &:strict) # => #(Document:0x3fc29752c378 { name = "document", children = [ #(DTD:0x3fc297534780 { name = "html" })]})

# I have tried various other wonky bits of partial HTML and I can't figure out anything that makes a difference / causes `Nokogiri::HTML` to raise.

# NOTE: The `#errors` method does show errors in this case:
Nokogiri::HTML('<').errors # => [#<Nokogiri::XML::SyntaxError: 1:2: ERROR: htmlParseStartTag: invalid element name>]
Nokogiri::XML('<').errors # => [#<Nokogiri::XML::SyntaxError: 1:2: FATAL: StartTag: invalid element name>]

Expected behavior

I would expect Nokogiri to raise a syntax error for HTML documents but nothing is raised, unlike with the XML parsing mode. The #errors array seems to be populated in either case.

Environment

# Nokogiri (1.10.10)
    ---
    warnings: []
    nokogiri: 1.10.10
    ruby:
      version: 2.6.6
      platform: x86_64-darwin19
      description: ruby 2.6.6p146 (2020-03-31 revision 67876) [x86_64-darwin19]
      engine: ruby
    libxml:
      binding: extension
      source: packaged
      libxml2_path: "/Users/gollahon/.rvm/gems/ruby-2.6.6/gems/nokogiri-1.10.10/ports/x86_64-apple-darwin19.6.0/libxml2/2.9.10"
      libxslt_path: "/Users/gollahon/.rvm/gems/ruby-2.6.6/gems/nokogiri-1.10.10/ports/x86_64-apple-darwin19.6.0/libxslt/1.1.34"
      libxml2_patches:
      - 0001-Revert-Do-not-URI-escape-in-server-side-includes.patch
      - 0002-Remove-script-macro-support.patch
      - 0003-Update-entities-to-remove-handling-of-ssi.patch
      - 0004-libxml2.la-is-in-top_builddir.patch
      - 0005-Fix-infinite-loop-in-xmlStringLenDecodeEntities.patch
      libxslt_patches: []
      compiled: 2.9.10
      loaded: 2.9.10

Additional context

I am writing a small scraper and I would like to bail if pages I am trying to parse are corrupt/invalid.

Thanks!

The text was updated successfully, but these errors were encountered:

flavorjones · 2020-12-09T12:52:47Z

@dgollahon Thank you for submitting such a clear issue for this! And I'm sorry that you're having trouble.

I'll get some time later today to reproduce what you're seeing and dig in. My gut is that this is libxml2 acting inconsistently, but I will definitely dig in and tell you what I find.

flavorjones · 2020-12-16T06:52:47Z

@dgollahon Thanks for your patience. I spent some time tonight looking into this, and I'll try to explain what's going on.

TL;DR: Yeah, this is a bug in the CRuby implementation (but not the JRuby one!).

CRuby / libxml2 HTMLparser.c

The logic used in the CRuby/libxml2 implementation for raising an exception on HTML parse errors is as follows:

call the parsing function
if no document is returned, raise an exception
otherwise assume all is well: save all (presumably nonfatal?) errors on the @errors instance variable and return the document

Libxml2's HTML parser considers almost nothing to be fatal and nearly always recovers -- regardless of what parse options are passed in. The RECOVER parse flag is used very sparingly to control specific behavior around CDATA strings and element names, and does not control whether the parse function returns a successfully-parsed document.

In contrast, if we look at libxml2's XML parser in parser.c, the following code is in the top-level wrapper function xmlDoRead:

    xmlParseDocument(ctxt);
    if ((ctxt->wellFormed) || ctxt->recovery)
        ret = ctxt->myDoc;
    else {
        ret = NULL;
        /* ... some code elided ... */
    }
    /* ... some code elided ... */
    return (ret);

That is, unless the parser found zero errors (wellFormed is truthy) or the RECOVER parse option is set, no document is returned.

JRuby / NekoHTML

Here's where it gets messy. The JRuby implementation does not rely on the return value from the parse function to determine success, and instead uses different error handlers (strict or non-strict) to control whether to raise an exception. As a result, it behaves consistently across both XML and HTML parsers, and matches what your expectations are in this issue.

Why has nobody reported this before?

Why has nobody reported this before? I think the honest truth is that there is so much broken HTML out there in the world that the overwhelming majority of use cases need the default behavior -- to act like a browser and try to recover from parse errors as best we can. Put another way, strict parsing of an arbitrary HTML document from the internet is highly likely to fail.

How do we resolve this?

I think the right thing to do here is to change the behavior of the HTML::Document parse functions to raise an exception if any parse error is encountered; but this is likely to be a breaking change for some folks and so I want to think hard about putting it into a patch release.

Here are some tests that all pass on JRuby:

require "helper"

class Test2130 < Nokogiri::TestCase
  describe "HTML::Document.parse" do
    describe "read memory" do
      let(:input) { "<html><body><div" }

      describe "strict parsing" do
        let(:parse_options) { Nokogiri::XML::ParseOptions.new.strict }

        it "raises exception on parse error" do
          assert_raises Nokogiri::SyntaxError do
            Nokogiri::HTML.parse(input, nil, nil, parse_options)
          end
        end
      end

      describe "default options" do
        it "does not raise exception on parse error" do
          doc = Nokogiri::HTML.parse(input)
          assert_operator(doc.errors.length, :>, 0)
        end
      end
    end

    describe "read io" do
      let(:input) { StringIO.new("<html><body><div") }

      describe "strict parsing" do
        let(:parse_options) { Nokogiri::XML::ParseOptions.new.strict }

        it "raises exception on parse error" do
          assert_raises Nokogiri::SyntaxError do
            Nokogiri::HTML.parse(input, nil, nil, parse_options)
          end
        end
      end

      describe "default options" do
        it "does not raise exception on parse error" do
          doc = Nokogiri::HTML.parse(input)
          assert_operator(doc.errors.length, :>, 0)
        end
      end
    end
  end
end

But fail on CRuby in embarrasingly bad ways -- including read_io not trapping errors correctly.

$ be ruby -Ilib:test 2130-html-strict-behavior/test.rb 
/home/flavorjones/code/oss/nokogiri/test/helper.rb:30: version info:
---
warnings: []
nokogiri: 1.11.0.rc3
ruby:
  version: 2.7.2
  platform: x86_64-linux
  gem_platform: x86_64-linux
  description: ruby 2.7.2p137 (2020-10-01 revision 5445e04352) [x86_64-linux]
  engine: ruby
libxml:
  source: system
  iconv_enabled: true
  compiled: 2.9.10
  loaded: 2.9.10
libxslt:
  source: system
  compiled: 1.1.34
  loaded: 1.1.34
Run options: --seed 37272

# Running:

FHTML parser error : Couldn't find end of Start Tag div
<html><body><div
                ^
FF.

Fabulous run in 0.001873s, 2135.2260 runs/s, 2135.2260 assertions/s.

  1) Failure:
HTML::Document.parse::read io::default options#test_0001_does not raise exception on parse error [2130-html-strict-behavior/test.rb:42]:
Expected 0 to be > 0.

  2) Failure:
HTML::Document.parse::read io::strict parsing#test_0001_raises exception on parse error [2130-html-strict-behavior/test.rb:33]:
Nokogiri::SyntaxError expected but nothing was raised.

  3) Failure:
HTML::Document.parse::read memory::strict parsing#test_0001_raises exception on parse error [2130-html-strict-behavior/test.rb:12]:
Nokogiri::SyntaxError expected but nothing was raised.

4 runs, 4 assertions, 3 failures, 0 errors, 0 skips

dgollahon · 2020-12-17T00:02:28Z

Thanks for the update! Interesting that it works on JRuby. Regardless of the fix path, it seems to me that MRI and JRuby should have the same behavior.

Understood re: broken HTML and the risk of breaking users. I am sure some people like me have strict enabled but do not realize it would fail if this bug were resolved. It does seem, however, like something that should be fixed, imo. I wonder how widespread the use is? Also, is this a regression or has it never worked? I suppose you could always wait for a major release if the risk of breakage was bad enough but I am guessing that is not on the horizon.

flavorjones · 2020-12-17T19:25:13Z

It does seem, however, like something that should be fixed

I agree and I'll try to get this into the next release.

is this a regression or has it never worked?

Neither libxml2 nor Nokogiri has ever behaved any differently. It's been like this since the first release of Nokogiri.

This results in (different) failing tests in JRuby and CRuby. Related to #2130

The JRuby HTML parser was inconsistent with the CRuby implementation, and didn't use the common-sense interpretation of "norecover". Related to #2130

If the RECOVER parse option is not set, then the HTML.parse (and friends) will now raise an exception corresponding to the first error encountered. We do this in read_io and read_memory because libxml2's HTML parser does not obey RECOVER/NORECOVER. Closes #2130

flavorjones · 2020-12-28T19:24:11Z

PR #2150 will address this.

This results in (different) failing tests in JRuby and CRuby. Related to #2130

The JRuby HTML parser was inconsistent with the CRuby implementation, and didn't use the common-sense interpretation of "norecover". Related to #2130

…mode HTML "strict" mode parsing --- **What problem is this PR intended to solve?** There are quite a few things being addressed in this PR, but the most significant bit is consistently obeying the `recover` parse option for HTML documents. As #2130 points out, the CRuby/libxml2 implementation was completely ignoring the `recover` parse option (also know as the `strict` option) for HTML documents. This PR makes the `recover` option behave identically for HTML documents as it does for XML documents. Related, though, the JRuby implementation was incorrectly applying the `recover` parse option in HTML documents, instead using `noerror` and `nowarning` to determine whether to raise an exception. This has been brought in line with the behavior described above. Also related, the `EncodingReader` class which detects encoding in HTML documents was silently swallowing document syntax errors if they occurred in the first "chunk" read from an IO object. This has also been fixed. This PR also introduces some smaller changes: - make JRuby and CRuby implementations consistent in how they handle comparing `XML::Node` to an `XML::Document` using the `<=>` operator - introduce minitest-reporters to the main test suite - skip some irrelevant tests, and restructure others to be faster - fix the annoying JRuby encoding test that fails on some Java JDKs - eating away at the edges of code formatting as I touched some of these rarely-touched files **Have you included adequate test coverage?** Yes. **Does this change affect the behavior of either the C or the Java implementations?** Yes, this changes the behavior of the `strict` (a.k.a. `norecover`) parse option in parsing HTML documents. I've chosen to classify this as a "bugfix that happens to change behavior that people may be depending upon" and so am introducing potentially breaking behavior into a minor release. Apologies to anyone who's inconvenienced by that.

dgollahon added the state/needs-triage Inbox for non-installation-related bug reports or help requests label Dec 8, 2020

flavorjones added topic/fixing-broken-markup and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels Dec 16, 2020

flavorjones added this to the v1.11.0 milestone Dec 16, 2020

flavorjones added the release-blocker Blocking a milestone release label Dec 16, 2020

flavorjones added a commit that referenced this issue Dec 28, 2020

test(feat): establish desired behavior of "strict" HTML parsing

cd05745

This results in (different) failing tests in JRuby and CRuby. Related to #2130

flavorjones added a commit that referenced this issue Dec 28, 2020

fix(jruby): raise exception from strict HTML parsing on err/warn

90a105e

The JRuby HTML parser was inconsistent with the CRuby implementation, and didn't use the common-sense interpretation of "norecover". Related to #2130

flavorjones mentioned this issue Dec 28, 2020

HTML "strict" mode parsing #2150

Merged

flavorjones added a commit that referenced this issue Dec 28, 2020

test(feat): establish desired behavior of "strict" HTML parsing

d26707b

This results in (different) failing tests in JRuby and CRuby. Related to #2130

flavorjones added a commit that referenced this issue Dec 28, 2020

fix(jruby): raise exception from strict HTML parsing on err/warn

421a647

The JRuby HTML parser was inconsistent with the CRuby implementation, and didn't use the common-sense interpretation of "norecover". Related to #2130

flavorjones closed this as completed in 1976b97 Dec 29, 2020

flavorjones added the topic/error-handling label Jan 13, 2021

flavorjones mentioned this issue May 13, 2021

[help] Nokogiri::XML::Reader does not recover from parsing errors #2229

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bug] The `strict` parsing mode does not raise for invalid HTML documents #2130

[bug] The `strict` parsing mode does not raise for invalid HTML documents #2130

dgollahon commented Dec 8, 2020

flavorjones commented Dec 9, 2020

flavorjones commented Dec 16, 2020

dgollahon commented Dec 17, 2020

flavorjones commented Dec 17, 2020

flavorjones commented Dec 28, 2020

[bug] The strict parsing mode does not raise for invalid HTML documents #2130

[bug] The strict parsing mode does not raise for invalid HTML documents #2130

Comments

dgollahon commented Dec 8, 2020

flavorjones commented Dec 9, 2020

flavorjones commented Dec 16, 2020

CRuby / libxml2 HTMLparser.c

JRuby / NekoHTML

Why has nobody reported this before?

How do we resolve this?

dgollahon commented Dec 17, 2020

flavorjones commented Dec 17, 2020

flavorjones commented Dec 28, 2020

[bug] The `strict` parsing mode does not raise for invalid HTML documents #2130

[bug] The `strict` parsing mode does not raise for invalid HTML documents #2130