Remove [whitespace character] and use [unicode whitespace character] instead #343

zudov · 2015-07-03T03:00:23Z

At the moment we have

whitespace character -- a space (U+0020), tab (U+0009), newline (U+000A), line tabulation (U+000B), form feed (U+000C), or carriage return (U+000D).
unicode whitespace character -- is any code point in the unicode Zs class, or a tab (U+0009), carriage return (U+000D), newline (U+000A), or form feed (U+000C).
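
For illustration only (not part of the spec text), the two definitions above can be sketched as predicates. The function names are hypothetical, and the list of Zs code points is an assumption based on recent Unicode versions:

```typescript
// "whitespace character": space, tab, newline, line tabulation,
// form feed, carriage return.
function isWhitespaceChar(ch: string): boolean {
  const c = ch.charCodeAt(0);
  return c === 0x20 || c === 0x09 || c === 0x0a ||
    c === 0x0b || c === 0x0c || c === 0x0d;
}

// Unicode general category Zs (assumption: membership as in recent
// Unicode versions; all of these are in the Basic Multilingual Plane).
function isZs(c: number): boolean {
  return c === 0x20 || c === 0xa0 || c === 0x1680 ||
    (c >= 0x2000 && c <= 0x200a) ||
    c === 0x202f || c === 0x205f || c === 0x3000;
}

// "unicode whitespace character": Zs, or tab, carriage return,
// newline, form feed (as quoted above, without U+000B).
function isUnicodeWhitespaceChar(ch: string): boolean {
  const c = ch.charCodeAt(0);
  return isZs(c) || c === 0x09 || c === 0x0d || c === 0x0a || c === 0x0c;
}
```

Under these definitions the two predicates disagree on exactly two kinds of input: the non-Zs-ASCII spaces such as U+00A0 (Unicode whitespace only) and U+000B (whitespace only).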

These two are very close to each other. Is there a need to have both of them? As I understand it, the primary function of [whitespace character] is to exclude various other kinds of spaces (e.g. the non-breaking space, U+00A0). I looked through the places where [whitespace character] is used but didn't understand how something like a non-breaking space would do harm there.

If we don't actually need this distinction, I propose to remove [whitespace character] and use [unicode whitespace character] in those places. Or, better, go further: remove the name [unicode whitespace character] and change the definition of [whitespace character] to the one that [unicode whitespace character] has at the moment.

jgm · 2015-07-03T04:07:37Z

I think it might be useful to be able to insert unicode
nonbreaking spaces in contexts where a space would normally
have a Markdown meaning, but you really just want a space.
But I'm not sure.

zudov · 2015-07-03T04:19:25Z

@jgm I see. Although that usage might be very confusing, it fits the semantics of a non-breaking space, and I don't see any other workarounds for such cases. Allowing backslash-escaped spaces might also be confusing.

The question is whether we want to allow such workarounds. No doubt they can be useful in some cases (e.g. multiple spaces in code spans), but in other cases we might want to enforce a "there should be no whitespace here" rule.
A link label is one example. That is a separate issue, but at the moment it's possible to create link labels consisting only of non-breaking spaces.

jgm · 2015-07-03T18:46:13Z

I'm open to being persuaded. There are also things like
zero-width spaces and thin-spaces (useful in formatting
math, for example). I'd have to examine the whole spec with
this in mind before feeling comfortable about the change.

jgm · 2015-07-04T04:33:13Z

I see that at least some of the places where we refer to space characters are in definitions of HTML elements. And here's what the HTML5 spec says:

The space characters, for the purposes of this specification, are U+0020 SPACE, U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), and U+000D CARRIAGE RETURN (CR).

The White_Space characters are those that have the Unicode property "White_Space" in the Unicode PropList.txt data file. [UNICODE]

So, they make a similar distinction, and we're going to need it at least for HTML.

zudov · 2015-07-05T15:30:56Z

@jgm That makes sense. Perhaps we can close this issue for now.

jackdouglas · 2015-07-10T11:50:38Z

@jgm Is "line tabulation (U+000B)" deliberately missing from the "unicode whitespace character" list? If it's a mistake, perhaps it would be simpler to follow if the spec defines "unicode whitespace character" as follows:

A unicode whitespace character is any whitespace character or any code point in the unicode Zs class.

jackdouglas · 2015-07-10T12:13:13Z

or maybe:

A unicode whitespace character is any code point in the unicode Zs class or any other whitespace character.

The spec makes a distinction between "[whitespace]" and "[Unicode whitespace]": whereas the latter includes many additional whitespace characters, particularly the non-breaking space (U+00A0), the former does not. Per ECMA-262 6th Edition ("ECMAScript 2015") §21.2.2.12 [CharacterClassEscape], the JavaScript `\s` character class escape matches the characters specified by "Unicode whitespace," but not "whitespace." To fix this issue, create and use a new regular expression variable that only matches the limited set of "whitespace" characters. For additional information, the distinction in the spec was challenged and reaffirmed by commonmark/commonmark-spec#343.

[whitespace]: http://spec.commonmark.org/0.26/#whitespace-character
[Unicode whitespace]: http://spec.commonmark.org/0.26/#unicode-whitespace-character
[CharacterClassEscape]: http://www.ecma-international.org/ecma-262/6.0/#sec-characterclassescape

The spec makes a distinction between "[whitespace]" and "[Unicode whitespace]": whereas the latter includes many additional whitespace characters, particularly the non-breaking space (U+00A0), the former does not. Per ECMA-262 6th Edition ("ECMAScript 2015") §21.2.2.12 [CharacterClassEscape], the JavaScript `\s` character class escape matches the characters specified by "Unicode whitespace," but not "whitespace." To fix this issue, rename the existing regular expression variable to `UnicodeWhitespace`, and create and use a new regular expression variable that only matches the limited set of "whitespace" characters. For additional information, the distinction in the spec was challenged and reaffirmed by commonmark/commonmark-spec#343.

[whitespace]: http://spec.commonmark.org/0.26/#whitespace-character
[Unicode whitespace]: http://spec.commonmark.org/0.26/#unicode-whitespace-character
[CharacterClassEscape]: http://www.ecma-international.org/ecma-262/6.0/#sec-characterclassescape
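
A minimal sketch of the difference described above. The variable names are hypothetical and are not taken from the actual commonmark.js change; the narrower character class is simply the spec's "whitespace character" list written out:

```typescript
// \s covers the "Unicode whitespace" characters (and a few further
// code points, such as U+FEFF), so it accepts the non-breaking space.
const reUnicodeWhitespace = /\s/;

// Hypothetical narrower pattern limited to the spec's
// "whitespace character" set: space, tab, newline, line tabulation,
// form feed, carriage return.
const reWhitespace = /[ \t\n\v\f\r]/;

const nbsp = "\u00a0";
console.log(reUnicodeWhitespace.test(nbsp)); // true
console.log(reWhitespace.test(nbsp));        // false
console.log(reWhitespace.test("\t"));        // true
```

Spelling the class out, rather than subtracting characters from `\s`, keeps the pattern independent of how a given engine defines `\s`.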

jgm closed this as completed in b994be4 Jul 10, 2015

TimothyGu mentioned this issue Aug 5, 2016

Make an distinction between Unicode whitespace and regular whitespace commonmark/commonmark.js#107

Merged

colinodell added a commit to thephpleague/commonmark that referenced this issue Nov 22, 2016

Enforce spec's distinction between Unicode whitespace and regular whi…

d122c0c

…tespace (see commonmark/commonmark-spec#343)

commonmark deleted a comment Apr 5, 2018
