default encoding changed from UTF-8 to ASCII-8BIT (1.7.2 - 1.8.0) #1659

Closed
rabbitt opened this issue Jun 26, 2017 · 2 comments

rabbitt commented Jun 26, 2017

If this isn't an installation issue ...

What problems are you experiencing?

The default document encoding seems to have switched from UTF-8 to ASCII-8BIT between version 1.7.2 and 1.8.0 (using jruby 9.1.12.0).

What's the output from nokogiri -v?

These were run on OS X, but the same problem occurs on CentOS 7, where nokogiri -v reports the following description:

description: jruby 9.1.12.0 (2.3.3) 2017-06-23 a053617 OpenJDK 64-Bit Server VM 24.141-b02 on 1.7.0_141-mockbuild_2017_05_09_15_35-b00 +jit [linux-x86_64]

Nokogiri (1.7.2)

---
warnings: []
nokogiri: 1.7.2
ruby:
  version: 2.3.3
  platform: java
  description: jruby 9.1.12.0 (2.3.3) 2017-06-15 33c6439 Java HotSpot(TM) 64-Bit Server
    VM 25.131-b11 on 1.8.0_131-b11 +jit [darwin-x86_64]
  engine: jruby
  jruby: 9.1.12.0
xerces: Xerces-J 2.11.0
nekohtml: NekoHTML 1.9.21

and

Nokogiri (1.8.0)

---
warnings: []
nokogiri: 1.8.0
ruby:
  version: 2.3.3
  platform: java
  description: jruby 9.1.12.0 (2.3.3) 2017-06-15 33c6439 Java HotSpot(TM) 64-Bit Server
    VM 25.131-b11 on 1.8.0_131-b11 +jit [darwin-x86_64]
  engine: jruby
  jruby: 9.1.12.0
xerces: Xerces-J 2.11.0
nekohtml: NekoHTML 1.9.21

Can you provide a self-contained script that reproduces what you're seeing?

jruby -e 'gem "nokogiri", "= 1.7.2"; require "nokogiri"; puts "encoding: #{Nokogiri::HTML::Document.new.to_s.encoding}"'

Which outputs: encoding: UTF-8

jruby -e 'gem "nokogiri", "= 1.8.0"; require "nokogiri"; puts "encoding: #{Nokogiri::HTML::Document.new.to_s.encoding}"'

Which outputs: encoding: ASCII-8BIT


rabbitt commented Jun 27, 2017

After more digging (by a coworker of mine, Jason He), it looks like the difference that (cough) makes all the difference is the use of String.new vs. "" in Nokogiri::XML::Node#serialize:

1.7.2 Nokogiri::XML::Node#serialize vs 1.8.0 Nokogiri::XML::Node#serialize

For reference (in both MRI Ruby and JRuby):

String.new.encoding #=> #<Encoding:ASCII-8BIT>
"".encoding #=> #<Encoding:UTF-8>

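To make the difference concrete, here is a minimal plain-Ruby sketch (no Nokogiri involved) of the ways an empty output buffer can be created and the encoding each one carries. The variable names are illustrative only, and the `encoding:` keyword assumes Ruby 2.3 or later.

```ruby
# Four ways to create an empty buffer, and the encoding each one carries.
no_arg   = String.new                            # no argument: binary (ASCII-8BIT)
literal  = ""                                    # literal: the source encoding
explicit = String.new(encoding: Encoding::UTF_8) # explicit label (Ruby 2.3+)
copied   = String.new("")                        # copies the argument's encoding

puts "String.new is binary:    #{no_arg.encoding == Encoding::BINARY}"
puts "literal is binary:       #{literal.encoding == Encoding::BINARY}"
puts "explicit is UTF-8:       #{explicit.encoding == Encoding::UTF_8}"
puts "copy matches the source: #{copied.encoding == literal.encoding}"
```

Only the no-argument form yields a binary-labelled string, which matches the behaviour change described above.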

iMacTia commented Sep 15, 2017

Same issue also with MRI Ruby 2.4.1

Nokogiri::XML("<?xml version=\"1.0\"?><root><aliens><alien><name>Alf</name></alien></aliens></root>").to_s.encoding
 => #<Encoding:ASCII-8BIT> 
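Until a fixed release is available, one possible workaround is to relabel the serialized string, assuming its bytes are in fact valid UTF-8. The sketch below uses a hand-built byte string as a stand-in for the output of `to_s`; the helper name `as_utf8` is hypothetical and not part of Nokogiri.

```ruby
# Stand-in for a serialized document: UTF-8 bytes that carry a binary label,
# as the report describes for to_s on 1.8.0.
raw = [0x41, 0x6C, 0x66, 0x20, 0xC3, 0xA9].pack("C*") # bytes of "Alf " plus U+00E9

# Hypothetical helper (not a Nokogiri API): relabel the bytes as UTF-8
# without transcoding, and refuse input that is not valid UTF-8.
def as_utf8(str)
  return str if str.encoding == Encoding::UTF_8

  relabeled = str.dup.force_encoding(Encoding::UTF_8)
  raise ArgumentError, "bytes are not valid UTF-8" unless relabeled.valid_encoding?

  relabeled
end

fixed = as_utf8(raw)
puts fixed.encoding == Encoding::UTF_8 # relabeled copy
puts raw.encoding == Encoding::BINARY  # original is left untouched
```

With Nokogiri this would presumably be applied to the result of `doc.to_s`. `force_encoding` only changes the label, so it is safe only when the underlying bytes really are UTF-8.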

larskanis added a commit to larskanis/nokogiri that referenced this issue Sep 20, 2017
Node#serialize used to return UTF-8 if no encoding was given.
However this got broken in commit 53f9b66.

Since UTF-8 is the default in XML and HTML5 specs, it makes sense to
use UTF-8 in serialize as well and enforce this in the tests.

Fixes sparklemotion#1659
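The idea in this commit message, falling back to UTF-8 when no encoding is given, can be sketched in plain Ruby as follows. This is not the actual patch; `new_output_buffer` is a hypothetical helper used only to illustrate the default.

```ruby
# Hypothetical helper (not Nokogiri's code): create an output buffer that is
# labelled with the requested encoding, or UTF-8 when none is given.
def new_output_buffer(encoding = nil)
  String.new.force_encoding(Encoding.find(encoding || "UTF-8"))
end

default_buf = new_output_buffer
latin1_buf  = new_output_buffer("ISO-8859-1")

puts default_buf.encoding == Encoding::UTF_8
puts latin1_buf.encoding == Encoding::ISO_8859_1
```

A regression test along these lines would assert the encoding of the serialized output directly instead of relying on the buffer's implicit default.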
@flavorjones flavorjones added this to the 1.8.2 milestone Nov 13, 2017
mvz added a commit to mvz/happymapper that referenced this issue Dec 31, 2017
Confusion pushed a commit to Confusion/happymapper that referenced this issue Nov 29, 2019