Remove [whitespace character] and use [unicode whitespace character] instead #343

Closed
zudov opened this issue Jul 3, 2015 · 7 comments

@zudov
Contributor

zudov commented Jul 3, 2015

At the moment we have

  • whitespace character -- a space (U+0020), tab (U+0009), newline (U+000A), line tabulation (U+000B), form feed (U+000C), or carriage return (U+000D).
  • unicode whitespace character -- is any code point in the unicode Zs class, or a tab (U+0009), carriage return (U+000D), newline (U+000A), or form feed (U+000C).
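
As a rough sketch, the two definitions quoted above can be written as JavaScript regular expressions. The constant and function names are illustrative assumptions, not identifiers from the spec or from any implementation, and `\p{Zs}` needs an engine with Unicode property escapes (ES2018 or later).

```javascript
// Sketch only: names are hypothetical; the definitions are the ones quoted above.

// [whitespace character]: space, tab, newline, line tabulation, form feed, carriage return.
const WHITESPACE_CHAR = /^[\u0020\u0009\u000A\u000B\u000C\u000D]$/;

// [unicode whitespace character]: any Zs code point, or tab, carriage return, newline, form feed.
const UNICODE_WHITESPACE_CHAR = /^[\p{Zs}\u0009\u000D\u000A\u000C]$/u;

const isWhitespaceChar = (ch) => WHITESPACE_CHAR.test(ch);
const isUnicodeWhitespaceChar = (ch) => UNICODE_WHITESPACE_CHAR.test(ch);

// A few code points where the two definitions can disagree.
const samples = {
  "U+0020 space": "\u0020",
  "U+000B line tabulation": "\u000B",
  "U+00A0 no-break space": "\u00A0",
  "U+2009 thin space": "\u2009",
};

for (const [name, ch] of Object.entries(samples)) {
  console.log(name, isWhitespaceChar(ch), isUnicodeWhitespaceChar(ch));
}
```

Under these definitions the classes differ in both directions: U+00A0 and U+2009 belong only to the second, while U+000B belongs only to the first.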

These two are very close to each other; is there a need to have both? As I understand it, the primary function of [whitespace character] is to exclude various kinds of spaces (e.g. the non-breaking space, U+00A0). I looked through the places where [whitespace character] is used but didn't understand how something like a non-breaking space would cause harm there.

If we don't actually need this distinction, I propose removing [whitespace character] and using [unicode whitespace character] in those places. Or, better, go further: drop the name [unicode whitespace character] and change the definition of [whitespace character] to the one that [unicode whitespace character] has at the moment.

@jgm
Member

jgm commented Jul 3, 2015

I think it might be useful to be able to insert unicode
nonbreaking spaces in contexts where a space would normally
have a Markdown meaning, but you really just want a space.
But I'm not sure.

@zudov
Contributor Author

zudov commented Jul 3, 2015

@jgm I see. Although that usage might be very confusing, it fits the semantics of a non-breaking space, and I don't see any other workaround for such cases. Allowing backslash-escaped spaces might also be confusing.

The question is whether we want to allow such workarounds. No doubt they can be useful in some cases (e.g. multiple spaces in code spans), but in other cases we might want to enforce a "there should be no whitespace here" rule.
A link label is one example. That is a separate issue, but at the moment it's possible to create link labels consisting of only non-breaking spaces.
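
To make the link-label concern concrete, here is a minimal sketch of a "contains at least one non-whitespace character" check written against each definition. The function names are hypothetical and this is not the spec's full link-label rule, only an illustration of how the choice of whitespace class changes the outcome.

```javascript
// Hypothetical helpers: each tests for at least one character outside the given class.
const hasNonWhitespace = (s) => /[^ \t\n\v\f\r]/.test(s);
const hasNonUnicodeWhitespace = (s) => /[^\p{Zs}\t\r\n\f]/u.test(s);

const nbspOnly = "\u00A0\u00A0";

console.log(hasNonWhitespace(nbspOnly));        // true: NBSP counts as content
console.log(hasNonUnicodeWhitespace(nbspOnly)); // false: NBSP counts as whitespace
```

With the narrower class, a string made only of non-breaking spaces passes the check; with the wider class it is rejected.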

@jgm
Member

jgm commented Jul 3, 2015

I'm open to being persuaded. There are also things like
zero-width spaces and thin-spaces (useful in formatting
math, for example). I'd have to examine the whole spec with
this in mind before feeling comfortable about the change.

@jgm
Member

jgm commented Jul 4, 2015

I see that at least some of the places where we refer to space characters are in definitions of HTML elements. And here's what the HTML5 spec says:

The space characters, for the purposes of this specification, are U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).

The White_Space characters are those that have the Unicode property "White_Space" in the Unicode PropList.txt data file. [UNICODE]

So, they make a similar distinction, and we're going to need it at least for HTML.
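
The same split can be sketched for the two HTML5 definitions quoted above. The names below are illustrative assumptions, and `\p{White_Space}` relies on the engine's Unicode property escape support (ES2018 or later) rather than reading PropList.txt directly.

```javascript
// HTML5 "space characters" as quoted above: SPACE, tab, LF, FF, CR.
const HTML_SPACE_CHAR = /^[\u0020\u0009\u000A\u000C\u000D]$/;

// Characters with the Unicode White_Space property, via a property escape.
const UNICODE_WHITE_SPACE = /^\p{White_Space}$/u;

const isHtmlSpaceChar = (ch) => HTML_SPACE_CHAR.test(ch);
const hasWhiteSpaceProperty = (ch) => UNICODE_WHITE_SPACE.test(ch);

// No-break space: not an HTML space character, but has White_Space.
console.log(isHtmlSpaceChar("\u00A0"), hasWhiteSpaceProperty("\u00A0")); // false true

// Line tabulation: also outside the HTML5 list, but has White_Space.
console.log(isHtmlSpaceChar("\u000B"), hasWhiteSpaceProperty("\u000B")); // false true
```

Note that the HTML5 list omits U+000B, which the spec's [whitespace character] definition includes.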

@zudov
Contributor Author

zudov commented Jul 5, 2015

@jgm That makes sense; perhaps we can close this issue for now.

@jackdouglas

@jgm Is "line tabulation (U+000B)" deliberately missing from the "unicode whitespace character" list? If it's a mistake, perhaps it would be simpler to follow if the spec defined "unicode whitespace character" as follows:

A unicode whitespace character is any whitespace character or any code point in the unicode Zs class.

@jackdouglas

or maybe:

A unicode whitespace character is any code point in the unicode Zs class or any other whitespace character.
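
One way to read the proposed wording is as a union of the two classes. The sketch below uses hypothetical names and assumes Unicode property escape support; it only illustrates that the union would bring U+000B into [unicode whitespace character].

```javascript
// [whitespace character] as defined in the spec text quoted at the top of this thread.
const isWhitespaceChar = (ch) => /^[ \t\n\v\f\r]$/.test(ch);

// Proposed reading: any whitespace character, or any code point in the Zs class.
const isUnicodeWhitespaceCharProposed = (ch) =>
  isWhitespaceChar(ch) || /^\p{Zs}$/u.test(ch);

console.log(isUnicodeWhitespaceCharProposed("\u000B")); // true
console.log(isUnicodeWhitespaceCharProposed("\u00A0")); // true
console.log(isUnicodeWhitespaceCharProposed("a"));      // false
```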

@jgm jgm closed this as completed in b994be4 Jul 10, 2015
TimothyGu added a commit to TimothyGu/commonmark.js that referenced this issue Aug 5, 2016
The spec makes a distinction between "[whitespace]" and "[Unicode
whitespace]": whereas the latter includes many additional whitespace
characters, particularly the non-breaking space (U+00A0), the former
does not.

Per ECMA-262 6th Edition ("ECMAScript 2015") §21.2.2.12
[CharacterClassEscape], the JavaScript `\s` escape character matches the
characters specified by "Unicode whitespace," but not "whitespace."

To fix this issue, create and use a new regular expression variable that
only matches the limited set of "whitespace" characters.

For additional information, the distinction in the spec was challenged
and reaffirmed by commonmark/commonmark-spec#343.

[whitespace]: http://spec.commonmark.org/0.26/#whitespace-character
[Unicode whitespace]: http://spec.commonmark.org/0.26/#unicode-whitespace-character
[CharacterClassEscape]: http://www.ecma-international.org/ecma-262/6.0/#sec-characterclassescape
TimothyGu added a commit to TimothyGu/commonmark.js that referenced this issue Aug 5, 2016
The spec makes a distinction between "[whitespace]" and "[Unicode
whitespace]": whereas the latter includes many additional whitespace
characters, particularly the non-breaking space (U+00A0), the former
does not.

Per ECMA-262 6th Edition ("ECMAScript 2015") §21.2.2.12
[CharacterClassEscape], the JavaScript `\s` escape character matches the
characters specified by "Unicode whitespace," but not "whitespace."

To fix this issue, rename the existing regular expression variable to
`UnicodeWhitespace`, and create and use a new regular expression
variable that only matches the limited set of "whitespace" characters.

For additional information, the distinction in the spec was challenged
and reaffirmed by commonmark/commonmark-spec#343.

[whitespace]: http://spec.commonmark.org/0.26/#whitespace-character
[Unicode whitespace]: http://spec.commonmark.org/0.26/#unicode-whitespace-character
[CharacterClassEscape]: http://www.ecma-international.org/ecma-262/6.0/#sec-characterclassescape
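
A quick sketch of the behaviour the commit messages describe. The variable names are hypothetical and are not the ones used in commonmark.js. Note that `\s` in JavaScript also matches a few code points outside both spec definitions (for example U+FEFF and U+2028), so it only approximates [Unicode whitespace].

```javascript
// Hypothetical names, for illustration only.
const reJsWhitespace = /\s/;                    // built-in character class escape
const reSpecWhitespace = /[ \t\n\x0B\x0C\x0D]/; // the six "whitespace" characters

const nbsp = "\u00A0";

console.log(reJsWhitespace.test(nbsp));     // true: \s treats NBSP as whitespace
console.log(reSpecWhitespace.test(nbsp));   // false: NBSP is not "whitespace"
console.log(reJsWhitespace.test("\uFEFF")); // true: \s goes beyond the spec's classes
```

Using an explicit character class instead of `\s` keeps the match limited to the six code points the spec lists.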
colinodell added a commit to thephpleague/commonmark that referenced this issue Nov 22, 2016
@commonmark commonmark deleted a comment Apr 5, 2018