Should line breaks be width 0 or 1? #60

Manishearth · 2024-06-07T16:02:27Z

control characters are not given the Default_Ignorable_Code_Point property. To avoid security problems, such characters [...], when not interpreted and not displayable by normal rendering, should be displayed in fallback rendering with a fallback glyph, so that there is a visible indication of their presence in the text.

Newline control characters (and other break characters) are typically "displayable by normal rendering", however they do not have a rendering that has anything that could be termed as a "width", so there's no single answer as to what this crate should provide here according to the spec.

Applications using unicode-width ought to be using a higher level protocol to split text on newlines (and other similar things) before feeding it to this crate (and perhaps we should document that). It isn't really possible for this crate to provide such a protocol itself since there are nuances to how different contexts handle newlines (Windows vs Unix being the most obvious one, but wrapping behavior also participates).

However, we should pick some consistent, sensible default for what newlines are treated as. My guess is that we should treat them as width 0 (for both CR and LF, and any other line breaks that exist).

h/t @decathorpe for noticing

cc @Jules-Bertholet

Jules-Bertholet · 2024-06-07T16:21:31Z

The case for treating as width 1 is that, in a context where for whatever reason the text is forced to be displayed on a single line, most software will replace line breaks with spaces.

decathorpe · 2024-06-07T16:31:50Z

For reference, I noticed this behaviour change when investigating test failures caused by the v0.1.12 → v0.1.13 update in the bwrap crate: https://github.com/micl2e2/bwrap

For example, this test is now failing:
https://github.com/micl2e2/bwrap/blob/master/tests/maybrk.rs#L64

As far as I can tell, the test tries line-wrapping this string into 3 columns:

hel
lo 
wor
ld

The expected result seems to be that there is no change when line-wrapping this to width 3 (since all lines already are width 3), but since newline characters now contribute an additional +1 to the length, the result is completely mangled.

EDIT: Maybe this logic for line wrapping was fishy all along, but it worked with how unicode-width determined the displayed widths. When counting \n as having width of 1, the logic would need to be adapted to have a special case for an existing \n character immediately after a candidate line-wrapping point.

m-hilgendorf · 2024-06-11T15:27:48Z

Coming here with the same issue, the update from 0.1.12 -> 0.1.13 broke code that relied on newline characters having width = 0. It should at least have been a minor version update, since the API contract was broken.

in a context where for whatever reason the text is forced to be displayed on a single line, most software will replace line breaks with spaces.

Then surely it's a different string with different width, no?

Manishearth · 2024-06-11T15:33:59Z

The case for treating as width 1 is that, in a context where for whatever reason the text is forced to be displayed on a single line, most software will replace line breaks with spaces.

That's exceedingly rare, though.

Personally I think the case for the math getting easier especially near existing line break opportunities is compelling enough for it to be 0.

@m-hilgendorf

Then surely it's a different string with different width, no?

No, this is about rendering. When an emoji gets represented as a bunch of tofu that is not a "different string", that is just a rendering choice, and this crate cannot predict all rendering choices. It just makes a best effort.

@decathorpe

EDIT: Maybe this logic for line wrapping was fishy all along, but it worked with how unicode-width determined the displayed widths. When counting \n as having width of 1, the logic would need to be adapted to have a special case for an existing \n character immediately after a candidate line-wrapping point.

Yes, I think this logic is not using this crate correctly and just happened to work. However I think that this kind of pattern should mostly work pleasantly, ideally.

Jules-Bertholet · 2024-06-11T15:46:12Z

That's exceedingly rare, though.

Not really, no? Most any sort of search box will behave like this, for example.

That being said, if the breakage really is widespread, it may be worth reverting to width 0 for 0.1.x and only doing it for 0.2.x.

m-hilgendorf · 2024-06-11T15:48:15Z

My point is about code structure, not rendering. If you have let s: String, you can call s.replace('\n', " ") and then choose which value to call .width() on to get the desired answer (either treating control characters as zero width, or treating them as non-zero after replacement). Removing that distinction forces the caller to handle it with a special case.

Concretely, I have to change code from

let width = string.width();

to

// Not exactly this, but you get the idea
let width = if string == "\n" { 0 } else { string.width() };

as a result of the change. And this is in line-wrapping logic, as you mention.
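
A minimal sketch of the point above, assuming a width function that treats '\n' as width 0. The name width_newline_zero is hypothetical and is only an ASCII stand-in, not the crate's API. With such a function the caller can get either answer by choosing which string to measure:

```rust
/// Hypothetical stand-in for a width function that treats '\n' as width 0
/// and every other character as width 1 (ASCII-only assumption).
fn width_newline_zero(s: &str) -> usize {
    s.chars().filter(|&c| c != '\n').count()
}

/// Width when line breaks should occupy one column: replace them with
/// spaces first, then measure the resulting (different) string.
fn width_newline_as_space(s: &str) -> usize {
    width_newline_zero(&s.replace('\n', " "))
}
```

If the underlying function already counts '\n' as 1, the zero-width answer is no longer available without a special case in the caller.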

Manishearth · 2024-06-11T15:48:24Z

Not really, no? Most any sort of search box will behave like this, for example.

Not a terminal, though? And that's an actual text stream transform that happens: it is turned into a space, which makes it the "different string" thing that @m-hilgendorf talks about.

Manishearth · 2024-06-11T15:52:35Z

@m-hilgendorf I assumed @Jules-Bertholet was alluding to the cases where some software will render an unwrapped linebreak as a space or a control character. That is rare but does happen, and it is about rendering. However, he's actually talking about cases where the text stream is replaced, and I agree that should be treated as a replacement.

Jules-Bertholet · 2024-06-11T16:28:36Z

Not a terminal, though?

Is this library intended to be terminal-specific? If it is, there are many other things we should be doing differently. (And in any case, a TUI could perfectly well have such an interface)

And that's an actual text stream transform that happens: it is turned into a space

That's one possible implementation, but not the only one. Merely changing how the line-breaking character is rendered would also be perfectly reasonable. (Rendering it as zero-width would not be reasonable, however.)

As for line-wrapping, AIUI the extra logic needed to handle newlines is just an extension of the work needed to handle end-of-line spaces. For example, wrapping hello world into lines of maximum width 5 should result in hello/world, not hello/ worl/d; the space character immediately following the break point is discarded, just as a newline at that position would be.
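
The hello/world example can be sketched with a small greedy wrapper. This is an illustration under simplifying assumptions, not code from any crate in this thread: the function name wrap is hypothetical, every character is assumed to be one column wide (ASCII only), any run of whitespace (spaces or newlines) is treated as a break opportunity, and existing hard line breaks are not preserved.

```rust
/// Greedy word wrap for ASCII text (assumption: one column per character).
/// Whitespace at a break point, whether a space or a newline, is dropped.
/// Words longer than `max_width` are placed on their own line, unbroken.
fn wrap(text: &str, max_width: usize) -> Vec<String> {
    let mut lines = Vec::new();
    let mut current = String::new();
    for word in text.split_whitespace() {
        // Columns needed if `word` is appended to the current line.
        let needed = if current.is_empty() {
            word.len()
        } else {
            current.len() + 1 + word.len()
        };
        // Start a new line when the word does not fit on this one.
        if needed > max_width && !current.is_empty() {
            lines.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(word);
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}
```

Because the separator is consumed before any width is measured, the same code path handles "hello world" and "hello\nworld".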

Manishearth · 2024-06-11T16:38:15Z

Is this library intended to be terminal-specific?

Not exactly, but it's mostly used that way.

That's one possible implementation, but not the only one.

Yes, but this behavior is most commonly seen in the wild as a text stream replacement. The choice of implementation has other implications; it's not purely internal.

I have seen the "treat linebreak as a 1-width character" behavior before without it being a text stream replacement but it's incredibly rare IME.

Jules-Bertholet · 2024-06-18T18:14:13Z

I have seen the "treat linebreak as a 1-width character" behavior before without it being a text stream replacement but it's incredibly rare IME.

In my testing, GTK and Qt both seem to do it this way, so not that rare I would think?

Manishearth · 2024-06-18T19:52:47Z

Are they actually doing that or performing a text stream replacement on pasting?

Jules-Bertholet · 2024-06-18T20:02:09Z

Are they actually doing that or performing a text stream replacement on pasting?

The text stream is not replaced, no; copying back out of the field preserves the newline. (GTK shows a visible arrow glyph, Qt a blank space)

Manishearth · 2024-06-18T20:59:47Z

Hmm, interesting. That does present a legitimate dilemma here.

decathorpe · 2024-06-19T19:51:31Z

Are they actually doing that or performing a text stream replacement on pasting?

The text stream is not replaced, no; copying back out of the field preserves the newline. (GTK shows a visible arrow glyph, Qt a blank space)

This replacement only happens in single-line input fields though, right? So the line break needs to be replaced with something, or things will look broken. That's not really a fair comparison to the line-wrapping use case IMO.

Jules-Bertholet · 2024-06-19T20:30:32Z

One more point in favor of line breaks being width 1 is that nearly all applications will display them as having advance width when they are selected.

It's true that, for the purpose of determining line widths for line wrapping, line break characters should be treated as having width 0, not 1, and therefore line-wrapping logic needs to special-case them if we keep the present assignments. But, as I mention above, they are not the only kind of character that line-wrapping logic might need to special case. I mention word-separating whitespace above; the soft hyphen is another example. Programs performing line-wrapping will differ in how exactly they handle these characters, just as they will differ in which line-breaks they recognize (out of \n \r\n \r etc); unicode-width's current behavior gives the right fallback for when the application does not define any special handling.
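
As one illustration of such an application-level choice, the sketch below recognizes "\r\n", "\n", and a lone "\r" as line breaks. The function name split_lines is hypothetical, and the set of recognized sequences is an assumption made for the example; other programs may reasonably pick a different set.

```rust
/// Splits `text` at "\r\n", "\n", and lone "\r", returning the lines without
/// their terminators. A trailing terminator yields a final empty line.
fn split_lines(text: &str) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut start = 0;
    let mut chars = text.char_indices().peekable();
    while let Some((i, c)) = chars.next() {
        if c == '\n' || c == '\r' {
            lines.push(&text[start..i]);
            // Treat "\r\n" as a single break by consuming the '\n'.
            if c == '\r' {
                if let Some(&(_, '\n')) = chars.peek() {
                    chars.next();
                }
            }
            // The next line starts at the next unread character, if any.
            start = chars.peek().map_or(text.len(), |&(j, _)| j);
        }
    }
    lines.push(&text[start..]);
    lines
}
```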

Manishearth · 2024-06-20T01:31:59Z

That's relatively convincing.

Perhaps we should document this explicitly: applications shouldn't feed this crate strings with newlines unless they are OK with the newlines being treated as being on the same line with width 1.

decathorpe · 2024-06-20T15:31:32Z

Ok, so what should crates that do line-wrapping do instead? Replace newlines with spaces before wrapping at a specified width, and treat input string as a single-line string? What about line-wrapping methods that preserve existing line breaks? Both of these will likely require more complicated logic and / or data copying than before.

Manishearth · 2024-06-20T16:59:36Z

@decathorpe Process text one line at a time, splitting into lines first. I wouldn't actually replace anything; the idea is to never ask this crate to compute the width of anything containing a newline. The statefulness of this crate does not extend past newlines, so if you still need to add the widths up afterwards, you can. I don't think this would incur an extra allocation, perf cost, or copying, though it will complicate the logic a little bit.

Manishearth · 2024-06-20T17:00:48Z

But as far as I can tell, for line wrapping you need to pay attention to the newlines anyway.
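
The "split into lines first" advice could look roughly like the sketch below. To keep it self-contained, display_width is a hypothetical ASCII-only stand-in for the crate's UnicodeWidthStr::width; real code would call the crate on each line instead. The standard library's str::lines splits on "\n" and "\r\n" without allocating.

```rust
/// Hypothetical stand-in for `UnicodeWidthStr::width`: counts one column per
/// character, which is only correct for simple ASCII text.
fn display_width(line: &str) -> usize {
    line.chars().count()
}

/// Measures each line separately, so the width function never sees a newline.
fn line_widths(text: &str) -> Vec<usize> {
    text.lines().map(display_width).collect()
}
```

With this shape, the width assigned to '\n' by the underlying function no longer affects the result, because no measured string contains one.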

Manishearth mentioned this issue Jun 7, 2024

v0.1.13 breaks semver #55

Closed

RoloEdits mentioned this issue Jun 14, 2024

Updating unicode-width from 0.1.12 to 0.1.13 broke newline rendering helix-editor/helix#10950

Closed

This was referenced Sep 19, 2024

please reconsider re-publishing 0.1.13 as 0.2.0 #66

Closed

Treat newlines as width 0 in the 0.1 stream, publish 0.1.14 #67

Merged

Manishearth closed this as completed in #67 Sep 19, 2024

Minoru mentioned this issue Sep 20, 2024

Chore(deps): bump unicode-width from 0.1.13 to 0.2.0 newsboat/newsboat#2864

Open
