
Tracking issue for Read::chars #27802

Closed
alexcrichton opened this issue Aug 13, 2015 · 88 comments
Labels
B-unstable Blocker: Implemented in the nightly compiler and unstable. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@alexcrichton
Member

alexcrichton commented Aug 13, 2015

This is a tracking issue for the deprecated std::io::Read::chars API.

@alexcrichton alexcrichton added A-io T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. B-unstable Blocker: Implemented in the nightly compiler and unstable. labels Aug 13, 2015
@eminence
Contributor

It would be nice if std::io::Chars was an Iterator<Item=char>, just like std::str::Chars. I don't have a proposal for how decoding errors should be handled, though.

@gyscos

gyscos commented Sep 10, 2015

If we want to make std::io::Chars a simple iterator on char, a solution is to have it return None on a UTF-8 error, and set an error flag in the Chars itself (with an associated was_valid_utf8() method or something). An empty sequence is considered valid UTF-8.
It does make error detection slightly less convenient, as writing for c in reader.chars() does not keep the Chars for later verification.
Here is an example use-case where we try to recover after an invalid sub-sequence:

let mut chars = reader.chars();
loop {
    // `by_ref()` keeps ownership of `chars` so it can be inspected afterwards.
    for c in chars.by_ref() {
        // ...
    }
    if chars.was_valid_utf8() { break; }
    println!("Encountered invalid byte sequence.");
}

Or it could provide a more informative message, similar to the current CharsError.
Maybe this could apply to the other adapters as well? Or is this a bad pattern?

On the other hand, it is not so difficult to treat the Result item explicitly, or to wrap the current Chars to get the behavior I described (an unwrapper wrapping, interesting notion), so maybe the current situation is acceptable as is.

@alexcrichton
Member Author

Nominating for 1.6 discussion

@alexcrichton
Member Author

🔔 This issue is now entering its cycle-long final comment period for deprecation 🔔

These functions seem like excellent candidates to move out-of-tree into an ioutil crate or something like that. This deprecation won't actually happen until that crate exists, however.

@alexcrichton alexcrichton added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed I-nominated labels Nov 5, 2015
@abonander
Contributor

.chars() is a nice operation to have; it'd be a shame to have to pull in a separate crate just to get such a useful iterator. The error type doesn't seem problematic; you'd have to handle an io::Error anyways.

Perhaps we could add a .chars_lossy() iterator that yields the UTF-8 replacement character on a UTF-8 error and stops at the first io::Error.
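
A minimal sketch of what such an adapter could look like, assuming it sits on top of BufRead (CharsLossy is a made-up name, and a char split across two fill_buf() calls is deliberately not handled):

use std::io::BufRead;
use std::str;

/// Hypothetical lossy adapter: yields chars, substituting U+FFFD for invalid
/// bytes and stopping at EOF or the first I/O error.
struct CharsLossy<R: BufRead> {
    inner: R,
}

impl<R: BufRead> Iterator for CharsLossy<R> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        let (c, len) = {
            let buf = match self.inner.fill_buf() {
                Ok(buf) if !buf.is_empty() => buf,
                _ => return None, // EOF or I/O error ends the iteration
            };
            // A char is at most 4 bytes, so inspecting that much is enough.
            let head = &buf[..buf.len().min(4)];
            match str::from_utf8(head) {
                Ok(s) => {
                    let c = s.chars().next().unwrap();
                    (c, c.len_utf8())
                }
                Err(e) if e.valid_up_to() > 0 => {
                    // The first char is valid even if later bytes are not.
                    let c = str::from_utf8(&head[..e.valid_up_to()])
                        .unwrap()
                        .chars()
                        .next()
                        .unwrap();
                    (c, c.len_utf8())
                }
                // Invalid leading byte: emit U+FFFD and skip one byte.
                Err(_) => ('\u{FFFD}', 1),
            }
        };
        self.inner.consume(len);
        Some(c)
    }
}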

@SimonSapin
Contributor

I’m in favor of stabilizing Read::chars eventually, but it’s not ready yet:

It’s unstable because

the semantics of a partial read/write of where errors happen is currently unclear and may change

(The same would apply to chars_lossy.)

This behavior should be per Unicode Standard §5.22 "Best Practice for U+FFFD Substitution" http://www.unicode.org/versions/Unicode8.0.0/ch05.pdf#G40630

Roughly, that means stopping at the first unexpected byte. This is not the behavior currently implemented, which reads as many bytes as indicated by the first byte and then checks them. This is a problem as, with only Read (as opposed to, say, BufRead), you can’t put a byte "back" in the stream after reading it.

Here are some failing tests.

        let mut buf = Cursor::new(&b"\xf0\x9fabc"[..]);
        let mut chars = buf.chars();
        assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
        assert_eq!(chars.next().unwrap().unwrap(), 'a');
        assert_eq!(chars.next().unwrap().unwrap(), 'b');
        assert_eq!(chars.next().unwrap().unwrap(), 'c');
        assert!(chars.next().is_none());

        let mut buf = Cursor::new(&b"\xed\xa0\x80a"[..]);
        let mut chars = buf.chars();
        assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
        assert_eq!(chars.next().unwrap().unwrap(), 'a');
        assert!(chars.next().is_none());

        let mut buf = Cursor::new(&b"\xed\xa0a"[..]);
        let mut chars = buf.chars();
        assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
        assert_eq!(chars.next().unwrap().unwrap(), 'a');
        assert!(chars.next().is_none());

I’ve looked at fixing this, but it basically involves duplicating all of the UTF-8 decoding logic from str::from_utf8, which I’m not really happy with. (That many more tests would need to be added.) I’ll try to think of some way to have a more basic decoder that can be used by both.

@sfackler
Member

sfackler commented Nov 5, 2015

Moving chars to BufRead does not seem unreasonable.

@abonander
Contributor

@sfackler I concur. I was thinking the exact same thing. In fact, it is unfortunate that Read::bytes() is already stabilized because, like chars(), it is almost always preferable to have it on a buffered source. A lot of the Read types really do not tolerate small, frequent reads well (#28073)

@alexcrichton
Member Author

While I agree that chars is useful, questions like those brought up by @SimonSapin make me think that it's too weighty/controversial to be in the standard library. I'm not sure that this is really ever going to be seriously used either due to the performance concerns, and I agree that bytes is a bit of a misnomer in that it's super non-performant much of the time as well.

@SimonSapin
Contributor

How does bytes perform when used with BufReader<_>?

@abonander
Contributor

@SimonSapin there were some perf numbers reading from a file with bytes in #28073. The difference is pretty significant.

@abonander
Contributor

I think chars would work pretty well if implemented on top of a buffer. It doesn't have to consume any bytes it doesn't need to, so no unnecessary loss of data.

@softprops

I may be doing something hugely inefficient here, but I really needed Read::chars for a crate I was working on that attempted to produce an iterator of rustc-serialize JSON items from a read stream. I had to fall back on vendoring most of the source around Read::chars as a workaround to get it working on stable a while back. It was almost a requirement, as the rustc-serialize builder for JSON requires an iterable of chars. It would be really nice to have this stabilized, even if in some form of method name that implied lossiness.

@alexcrichton
Member Author

@SimonSapin

I would expect it to perform much better than on a raw File because the number of syscalls issued is far less, but on the other hand it is likely much slower than another equivalent operation such as mmap + iterate the bytes via a slice.


@cybergeek94

Yeah I suspect the performance isn't abysmal, but when compared to str::chars it's likely still much slower (just more stuff that has to be checked during I/O)


@softprops

Yeah that's actually a case where I believe chars is inappropriate because it's just going to be inherently slower than any equivalent operation of "iterate over the characters of this string". For example a json deserializer may want to take an iterator of characters, but the actual iterator does something like BufReader where it reads a huge string, then literally uses str::chars, only reading a new string once it's exhausted (handling the boundary).

@gyscos

gyscos commented Nov 6, 2015

@alexcrichton
Wouldn't a chars method on a BufReader-backed Read serve this exact purpose? It reads a big slice of bytes, hopefully valid UTF-8, which is exactly what a string would be; shouldn't iterating chars over those bytes then perform about the same as str::chars?

@softprops

Yeah that's actually a case where I believe chars is inappropriate because it's just going to be inherently slower than any equivalent operation of "iterate over the characters of this string"

In this case it's literally a stream of JSON I'm working with. My use case was interfacing with Docker's streaming JSON endpoints, where JSON objects are pushed through a streamed response. I'm not sure how I'd accomplish that with a string.

@SimonSapin
Contributor

@alexcrichton Handling the boundary is tricky. It’s handled in https://github.com/SimonSapin/rust-utf8/blob/master/lib.rs

@alexcrichton
Member Author

@gyscos hm yes, I guess it would! That would definitely mean that chars could only be on BufReader.

@softprops in theory you can transform an iterator of &str slices into an iterator of char values, which is kinda what BufReader would be doing
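
For illustration, the &str-to-char transformation itself is just a flat_map; the chunking and char-boundary handling that would feed it is the hard part and is omitted here:

fn main() {
    // Suppose the buffered reader hands us complete, already-validated &str chunks.
    let chunks = vec!["héllo ", "wörld"];

    // Flattening each chunk's `str::chars()` yields the same stream of chars
    // that `Read::chars` would, but decoded a whole slice at a time.
    let decoded: String = chunks.iter().flat_map(|s| s.chars()).collect();
    assert_eq!(decoded, "héllo wörld");
}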

@gyscos

gyscos commented Nov 6, 2015

So a solution is to move chars to BufRead, as @sfackler mentioned. I was worried this would mean needlessly wrapping some Readers, but I just noticed Cursor<Vec<u8>> and Stdin already provide BufRead access.
But all of that is about the performance aspect.
Concerning error handling, BufRead::lines returns an iterator over io::Result<String> to properly propagate errors, including invalid UTF-8 (as an ErrorKind::InvalidData error). Couldn't chars do the same, instead of using a custom error?
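
A sketch of that error mapping, mirroring what lines does when a line is not valid UTF-8 (the helper below is illustrative, not an existing API):

use std::io::{self, ErrorKind};
use std::str;

// Hypothetical: produce the Item of such an iterator from the bytes of one char,
// converting a UTF-8 failure into an io::Error the way `lines` does.
fn to_io_result(bytes: &[u8]) -> io::Result<char> {
    match str::from_utf8(bytes) {
        Ok(s) => Ok(s.chars().next().expect("non-empty input")),
        Err(_) => Err(io::Error::new(
            ErrorKind::InvalidData,
            "stream did not contain valid UTF-8",
        )),
    }
}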

@alexcrichton
Member Author

Yeah, I have a feeling that would be sufficient. There's more fine-grained error information we can give (such as the bytes that were read, if any), but iterators like lines are already lossy in that you lose access to the underlying data if there's an error, so it's probably not too bad.

It does raise a question, though: if chars exists only on BufRead for performance reasons, why doesn't bytes exist only on BufRead as well? Just a minor inconsistency.

@sfackler
Member

sfackler commented Nov 6, 2015

I think the reasoning to move chars to BufRead is for correctness wrt @SimonSapin's comment - to be able to peek at the next byte without consuming it.
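
For reference, BufRead already lets you look at buffered bytes without consuming them via fill_buf/consume; a minimal sketch (peek_byte is just an illustration, not an existing API):

use std::io::{self, BufRead};

// Look at the next byte of a BufRead without consuming it.
fn peek_byte<R: BufRead>(reader: &mut R) -> io::Result<Option<u8>> {
    let buf = reader.fill_buf()?; // does not advance the stream
    Ok(buf.first().copied())
    // Only a later `reader.consume(n)` actually discards bytes.
}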

@abonander
Contributor

The performance improvement is only a beneficial side effect. bytes doesn't require a buffer for correctness.

@alexcrichton
Member Author

Hm, I don't think this is a correctness problem that can be solved by "just moving to BufRead": the buffer could always be 1 byte in size, in which case it behaves the same as reading one byte at a time. Basically, BufRead just reduces the number of calls to read, but it essentially always has the same problems with respect to partial reads, short reads, and consuming data.

@gyscos

gyscos commented Nov 7, 2015

@SimonSapin was saying that BufReader could potentially put things back in the buffer, but it's not actually exposed by the BufRead trait.

@derekdreery
Contributor

derekdreery commented Jan 6, 2017

Does this sound correct?:

The issue could be summarised as: how to handle reading chars over a data stream, specifically how to handle incomplete UTF-8 byte sequences and bytes that are invalid UTF-8.

It's hard because when you encounter a sequence of bytes that is invalid on its own but might become valid with more data, it could either be an error or simply mean you need to read more bytes (it's ambiguous).

If you know it's incomplete, you may want to call read again with the incomplete part prepended, but if it's incomplete because something has errored, you want to return the error along with the offending bytes.

It seems the consensus is that for both errors and incomplete bytes, you return a struct containing an enum variant saying whether it is an error or a possibly incomplete set of bytes, along with the bytes themselves. It's then the responsibility of a higher-level iterator to decide how to handle these cases (as not all use cases will want to handle them the same way).
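
A rough sketch of the kind of result type being described (the names are invented for illustration):

/// Hypothetical outcome of trying to decode the bytes read so far.
enum DecodeOutcome<'a> {
    /// The bytes decode cleanly to this string slice.
    Decoded(&'a str),
    /// The trailing bytes could still become a valid char if more input arrives.
    Incomplete(&'a [u8]),
    /// The bytes can never form valid UTF-8, whatever comes next.
    Invalid(&'a [u8]),
}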

lotabout added a commit to lotabout/skim that referenced this issue Jan 19, 2017
Previously, skim relied on nightly Rust for `io::chars`.
Now it uses the utf8parse crate instead.
Check rust-lang/rust#27802 (comment)
brookst added a commit to brookst/skim that referenced this issue Jan 19, 2017
Use private char iterator as done in kkawakam/rustyline#38 while waiting
for stabilisation of the chars method per rust-lang/rust#27802
This removes the need for `#[feature(io)]` letting skim compile on rust
stable.
@SimonSapin
Contributor

TL;DR: I think it is very hard to come up with an abstraction that: is zero-cost, covers all use cases, and is not terrible to use.

I’m in favor of deprecating and eventually removing this with no in-std replacement.


I think that anything that looks at one u8 or one char at a time is gonna have abysmal performance. Instead we probably want &str slices that reference fragments of some [u8] buffer.

I spent some time thinking of a low-level API that would make no assumptions about how one would want to use it ("pushing" vs "pulling" bytes and string slices, buffer allocation strategy, error handling, etc.) I came up with this:

pub fn decode_utf8(input: &[u8]) -> DecodeResult { /* ... */ }

pub enum DecodeResult<'a> {
    Ok(&'a str),

    /// These three slices cover all of the original input.
    /// `decode` should be called again with the third one as the new input.
    Error(&'a str, InvalidSequence<'a>, &'a [u8]),

    Incomplete(&'a str, IncompleteChar),
}

pub struct InvalidSequence<'a>(pub &'a [u8]);

pub struct IncompleteChar {
    // Fields are private. They include a [u8; 4] buffer.
}

impl IncompleteChar {
    pub fn try_complete<'char, 'input>(&'char mut self, mut input: &'input [u8])
                                       -> TryCompleteResult<'char, 'input> { /* ... */ }
}

pub enum TryCompleteResult<'char, 'input> {
    Ok(&'char str, &'input [u8]),  // str.chars().count() == 1
    Error(InvalidSequence<'char>, &'input [u8]),
    StillIncomplete,
}

It’s complicated. It requires the user to think about a lot of corner cases, especially around IncompleteChar. Explaining how to properly use it takes several paragraphs of docs.

We can hide some of the details with a stateful decoder:

pub struct Decoder { /* Private. Also with a [u8; 4] buffer. */ }

impl Decoder {
    pub fn new() -> Self;

    pub fn decode<'decoder, 'input>(&'decoder mut self, input: &'input [u8])
                                    -> DecoderResult<'decoder, 'input>;

    /// Signal that there is no more input.
    /// The decoder might contain a partial `char` which becomes an error.
    pub fn end<'decoder>(&'decoder self) -> Result<(), InvalidSequence<'decoder>>;
}

/// Order of fields indicates order in the input
pub struct DecoderResult<'decoder, 'input> {
    /// str in the `Ok` case is either empty or one `char` (up to 4 bytes)
    pub partially_from_previous_input_chunk: Result<&'decoder str, InvalidSequence<'decoder>>,

    /// Up to the first error, if any
    pub decoded: &'input str,

    /// Whether we did find an error
    pub error: Result<(), InvalidSequence<'input>>,

    /// Call `decoder.decode()` again with this, if non-empty
    pub remaining_input_after_error: &'input [u8],
}

/// Never more than 3 bytes.
pub struct InvalidSequence<'a>(pub &'a [u8]);

Even so, it’s very easy to misuse, for example by ignoring part of DecoderResult. Using a tuple instead of a struct makes it more visible when a field is ignored, but then we can’t name them to help explain which is what.

Either of these is complicated enough that I don’t think it belongs in libcore or libstd.

@SimonSapin
Contributor

By the way, I’ve submitted #40212 to make something like the above (possibly specialized to one use case) easier to implement outside of std based on std::str::from_utf8. Doing so avoids re-implementing most of the decoding logic, and allows taking advantage of optimizations in std like #30740.

@taralx
Contributor

taralx commented Mar 5, 2017

I would support deprecating Read::chars in favor of seeing what the community can produce.

@SimonSapin
Contributor

Another attempt turned out almost nice:

pub struct Decoder { 
    buffer: [u8; 4],
    /* ... */
}

impl Decoder {
    pub fn new() -> Self { /* ... */ }
    pub fn next_chunk<'a>(&'a mut self, input_chunk: &'a [u8]) -> DecoderIter<'a> { /* ... */ }
    pub fn last_chunk<'a>(&'a mut self, input_chunk: &'a [u8]) -> DecoderIter<'a> { /* ... */ }
}

pub struct DecoderIter<'a> {
    decoder: &'a mut Decoder,
    /* ... */ 
}

impl<'a> Iterator for DecoderIter<'a> {
    type Item = Result<&'a str, &'a [u8]>;
}

Except it doesn’t work. &'a str in the result conflicts with &'a mut Decoder in the iterator. (It is sometimes backed by the buffer in the decoder.) The result needs to borrow the iterator, which means the Iterator trait can’t be implemented, and for loops can’t be used:

impl<'a> DecoderIter<'a> {
    pub fn next(&mut self) -> Option<Result<&str, &[u8]>> { /* ... */ }
}
    let mut iter = decoder.next_chunk(input);
    while let Some(result) = iter.next() {
         // ...
    }

This compiles, but something like String::from_utf8_lossy(&[u8]) -> Cow<str> can’t be implemented on top of it because str fragments always borrow the short-lived decoder (and iterator), not only the original input.

We can work around that by adding enough lifetime parameters and one weird enum… but yeah, no.

pub struct Decoder { /* ... */ }

impl Decoder {
    pub fn new() -> Self { /* ... */ }

    pub fn next_chunk<'decoder, 'input>(&'decoder mut self, input_chunk: &'input [u8])
                                        -> DecoderIter<'decoder, 'input> { /* ... */ }

    pub fn last_chunk<'decoder, 'input>(&'decoder mut self, input_chunk: &'input [u8])
                                        -> DecoderIter<'decoder, 'input> { /* ... */ }
}

pub struct DecoderIter<'decoder, 'input> { /* ... */ }

impl<'decoder, 'input> DecoderIter<'decoder, 'input> {
    pub fn next<'buffer>(&'buffer mut self) 
        -> Option<Result<EitherLifetime<'buffer, 'input, str>,
                         EitherLifetime<'buffer, 'input, [u8]>>> { /* ... */ }
}

pub enum EitherLifetime<'buffer, 'input, T: ?Sized + 'static> {
    Buffer(&'buffer T),
    Input(&'input T),
}

impl<'buffer, 'input, T: ?Sized> EitherLifetime<'buffer, 'input, T> {
    pub fn get<'a>(&self) -> &'a T where 'buffer: 'a, 'input: 'a {
        match *self {
            EitherLifetime::Input(x) => x,
            EitherLifetime::Buffer(x) => x,
        }
    }
}

@taralx
Contributor

taralx commented Mar 7, 2017

Except it doesn’t work. &'a str in the result conflicts with &'a mut Decoder in the iterator.

Can you elaborate? I don't follow here.

@SimonSapin
Contributor

@taralx

  • DecoderIter<'a> contains &'a mut Decoder.
  • Decoder contains a [u8; 4] buffer.
  • The DecoderIter::next method takes &'b mut self. This implies 'a: 'b ('a outlives 'b).
  • In some cases, the next method wants to return a borrow of the buffer with (simplified) std::str::from_utf8_unchecked(&self.decoder.buffer).
  • Since this borrow is going through &'b mut self, its lifetime cannot be longer than 'b. This conflicts with the 'a: 'b requirement. If we could somehow make that borrow (such as with unsafe), then &mut Decoder in DecoderIter would no longer be exclusive since there’s another borrow of part of it (the buffer).
  • The solution is to return &'b str instead of &'a str. But the Iterator trait cannot express that since the return type of next is Option<Self::Item>, and there is no way to include the lifetime of the iterator itself in the associated type Item.

Perhaps it’s clearer with code. This does not compile: https://gist.github.com/anonymous/0587b4484ec9a15f5c5ce6908b3807c1, unless you change Result<&'a str, &'a [u8]> to Result<&str, &[u8]> (borrowing self) and make next an inherent method instead of an Iterator impl.

@Mark-Simulacrum Mark-Simulacrum added the C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. label Jul 22, 2017
@wilysword

I tend to agree that this should be removed from std, though for a slightly different reason: I'd like the interface for "returning a stream of chars from an io::Reader" to work with any encoding, not just UTF-8. If the general purpose of io::Read is to represent a stream of arbitrary bytes, then some other trait should represent a stream of decoded characters (though for completeness, I would probably also have such a trait implement io::Read). This would be similar to how you can chain streams in other languages: Decoder(BufferedReader(FileReader("somefile.txt"))). Unless/until general encoding functionality lives in std, I don't see much benefit in such a half-hearted implementation.

@SimonSapin
Contributor

hsivonen/encoding_rs#8 has some discussion of Unicode stream and decoders for not-only-UTF-8 encodings.

@SimonSapin
Contributor

The libs team discussed this and consensus was to deprecate the Read::chars method and its Chars iterator.

@rfcbot fcp close

Code that does not care about processing data incrementally can use Read::read_to_string instead. Code that does care presumably also wants to control its buffering strategy and work with &[u8] and &str slices that are as large as possible, rather than one char at a time. It should be based on the str::from_utf8 function as well as the valid_up_to and error_len methods of the Utf8Error type. One tricky aspect is dealing with cases where a single char is represented in UTF-8 by multiple bytes where those bytes happen to be split across separate read calls / buffer chunks. (Utf8Error::error_len returning None indicates that this may be the case.) The utf-8 crate solves this, but in order to be flexible provides an API that probably has too much surface to be included in the standard library.

Of course the above is for data that is always UTF-8. If other character encodings need to be supported, consider using the encoding_rs or encoding crate.
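
As a rough illustration of the UTF-8 approach described above (a sketch with invented names; carrying a split char over to the next chunk is left to the caller):

use std::str;

/// What to do with the bytes left over after the valid prefix of a chunk.
enum Rest<'a> {
    /// Possibly a char split across chunks: prepend these bytes to the next read.
    MaybeIncomplete(&'a [u8]),
    /// Definitely invalid UTF-8: skip `len` bytes (or report an error).
    Invalid { len: usize },
}

/// Decode as much of one buffer chunk as possible.
fn decode_chunk(chunk: &[u8]) -> (&str, Option<Rest<'_>>) {
    match str::from_utf8(chunk) {
        Ok(s) => (s, None),
        Err(e) => {
            // Safe: `valid_up_to` marks the end of the valid UTF-8 prefix.
            let valid = unsafe { str::from_utf8_unchecked(&chunk[..e.valid_up_to()]) };
            let rest = match e.error_len() {
                // `None` means the error is at the very end of the input and
                // might only be an incomplete char.
                None => Rest::MaybeIncomplete(&chunk[e.valid_up_to()..]),
                Some(len) => Rest::Invalid { len },
            };
            (valid, Some(rest))
        }
    }
}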

@rfcbot

rfcbot commented Mar 30, 2018

Team member @SimonSapin has proposed to close this. The next step is review by the rest of the tagged teams:

No concerns currently listed.

Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot rfcbot added the proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. label Mar 30, 2018
@rfcbot

rfcbot commented Apr 4, 2018

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot rfcbot added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Apr 4, 2018
@rfcbot

rfcbot commented Apr 14, 2018

The final comment period is now complete.

kennytm added a commit to kennytm/rust that referenced this issue Apr 24, 2018
@SimonSapin
Contributor

Deprecated in #49970

@Mark-Simulacrum Mark-Simulacrum removed the final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. label Jul 28, 2018