Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Emit warnings when dumping binary strings #643

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

casperisfine
Copy link

@casperisfine casperisfine commented Oct 24, 2024

Fix: #642

Because of it's Ruby 1.8 heritage, the C extension doesn't care much about strings encoding. We should get stricter over time.

Because of it's Ruby 1.8 heritage, the C extension doesn't care
much about strings encoding. We should get stricter over time.
Comment on lines +693 to +698
VALUE utf8_string = rb_enc_associate_index(rb_str_dup(source), utf8_encindex);
switch (rb_enc_str_coderange(utf8_string)) {
case ENC_CODERANGE_7BIT:
case ENC_CODERANGE_VALID:
return utf8_string;
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is #138. I'm having second thoughts about deprecating / removing it, as I fear lots of people depend on it, and looking at other parsers they all seem to accept it. So maybe we should just support it in the Java parser?

WDYT @headius @eregon ?

Copy link
Contributor

@headius headius Oct 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, I commented too early and thought this was a YAML issue.

After reexamining the code, it does appear that the json implementation for JRuby works directly with bytes via the ragel parser. It would probably be possible to make it accept invalid characters.

My position remains the same, however. I'm not sure what other parsers you tested, but all json specifications I looked at explicitly require content to be in one of the main unicode UTF encodings. That says to me that providing invalid characters is off spec and should produce an error.

The original RFC from 2006 even says:

JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.

https://www.ietf.org/rfc/rfc4627.txt

The character stream cannot be parsed as Unicode, then it is not up to spec.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what other parsers you tested,

oj and Python's stdlib.

I totally get your argument, and generally agree, the concern is really about breaking existing (misbehaving) code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm crazy, but I would break it in a minor release so it's easy to revert with another minor release, rather than break it in a major release and have to go back. I doubt the impact will be large. If we put out a minor release that starts erroring on bad input and it doesn't get reported for a few months, it's probably fine.

I think it's also important to remember that Ruby has become much more strict about character encodings over the years so the chance of getting bad content has reduced significantly since the original issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Emit a deprecation warning when dealing with incorrectly encoded strings
3 participants