
Tracking issue for Read::chars #27802

Closed
alexcrichton opened this issue Aug 13, 2015 · 88 comments
Labels
B-unstable Blocker: Implemented in the nightly compiler and unstable. C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@alexcrichton
Member

alexcrichton commented Aug 13, 2015

This is a tracking issue for the deprecated std::io::Read::chars API.

@alexcrichton alexcrichton added A-io T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. B-unstable Blocker: Implemented in the nightly compiler and unstable. labels Aug 13, 2015
@eminence
Contributor

It would be nice if std::io::Chars was an Iterator<Item=char>, just like std::str::Chars. I don't have a proposal for how decoding errors should be handled, though.

@gyscos

gyscos commented Sep 10, 2015

If we want to make std::io::Chars a simple iterator on char, a solution is to have it return None on a UTF-8 error, and set an error flag in the Chars itself (with an associated was_valid_utf8() method or something). An empty sequence is considered valid UTF-8.
It does make error detection slightly less convenient, as writing for c in reader.chars() does not keep the Chars for later verification.
Here is an example use-case where we try to recover after an invalid sub-sequence:

let mut chars = reader.chars();
loop {
    // `by_ref()` keeps ownership of `chars` so it can be inspected afterwards.
    for c in chars.by_ref() {
        // ...
    }
    if chars.was_valid_utf8() { break; }
    println!("Encountered invalid byte sequence.");
}

Or it could provide a more informative message, similar to the current CharsError.
Maybe this could apply to the other adapters as well? Or is this a bad pattern?

On the other hand, it is not so difficult to treat the Result item explicitly, or to wrap the current Chars to get the behavior I described (an unwrapper wrapping, interesting notion), so maybe the current situation is acceptable as is.

@alexcrichton
Member Author

Nominating for 1.6 discussion

@alexcrichton
Member Author

🔔 This issue is now entering its cycle-long final comment period for deprecation 🔔

These functions seem like excellent candidates to move out-of-tree into an ioutil crate or something like that. This deprecation won't actually happen until that crate exists, however.

@alexcrichton alexcrichton added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed I-nominated labels Nov 5, 2015
@abonander
Contributor

.chars() is a nice operation to have; it'd be a shame to have to pull in a separate crate just to get such a useful iterator. The error type doesn't seem problematic; you'd have to handle an io::Error anyways.

Perhaps we could add a .chars_lossy() iterator that yields the UTF-8 replacement character on a UTF-8 error and stops at the first io::Error.
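
A minimal sketch of what such an adapter could look like, assuming it sits on top of BufRead (CharsLossy is a made-up name, and a char split across two fill_buf() calls is deliberately not handled):

use std::io::BufRead;
use std::str;

/// Hypothetical lossy adapter: yields chars, substituting U+FFFD for invalid
/// bytes and stopping at EOF or the first I/O error.
struct CharsLossy<R: BufRead> {
    inner: R,
}

impl<R: BufRead> Iterator for CharsLossy<R> {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        let (c, len) = {
            let buf = match self.inner.fill_buf() {
                Ok(buf) if !buf.is_empty() => buf,
                _ => return None, // EOF or I/O error ends the iteration
            };
            // A char is at most 4 bytes, so inspecting that much is enough.
            let head = &buf[..buf.len().min(4)];
            match str::from_utf8(head) {
                Ok(s) => {
                    let c = s.chars().next().unwrap();
                    (c, c.len_utf8())
                }
                Err(e) if e.valid_up_to() > 0 => {
                    // The first char is valid even if later bytes are not.
                    let c = str::from_utf8(&head[..e.valid_up_to()])
                        .unwrap()
                        .chars()
                        .next()
                        .unwrap();
                    (c, c.len_utf8())
                }
                // Invalid leading byte: emit U+FFFD and skip one byte.
                Err(_) => ('\u{FFFD}', 1),
            }
        };
        self.inner.consume(len);
        Some(c)
    }
}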

@SimonSapin
Contributor

I’m in favor of stabilizing Read::chars eventually, but it’s not ready yet:

It’s unstable because

the semantics of a partial read/write of where errors happen is currently unclear and may change

(The same would apply to chars_lossy.)

This behavior should be per Unicode Standard §5.22 "Best Practice for U+FFFD Substitution" http://www.unicode.org/versions/Unicode8.0.0/ch05.pdf#G40630

Roughly, that means stopping at the first unexpected byte. This is not the behavior currently implemented, which reads as many bytes as indicated by the first byte and then checks them. This is a problem as, with only Read (as opposed to, say, BufRead), you can’t put a byte "back" in the stream after reading it.

Here are some failing tests.

        let mut buf = Cursor::new(&b"\xf0\x9fabc"[..]);
        let mut chars = buf.chars();
        assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
        assert_eq!(chars.next().unwrap().unwrap(), 'a');
        assert_eq!(chars.next().unwrap().unwrap(), 'b');
        assert_eq!(chars.next().unwrap().unwrap(), 'c');
        assert!(chars.next().is_none());

        let mut buf = Cursor::new(&b"\xed\xa0\x80a"[..]);
        let mut chars = buf.chars();
        assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
        assert_eq!(chars.next().unwrap().unwrap(), 'a');
        assert!(chars.next().is_none());

        let mut buf = Cursor::new(&b"\xed\xa0a"[..]);
        let mut chars = buf.chars();
        assert!(match chars.next() { Some(Err(CharsError::NotUtf8)) => true, _ => false });
        assert_eq!(chars.next().unwrap().unwrap(), 'a');
        assert!(chars.next().is_none());

I’ve looked at fixing this, but it basically involves duplicating all of the UTF-8 decoding logic from str::from_utf8, which I’m not really happy with. (That many more tests would need to be added.) I’ll try to think of some way to have a more basic decoder that can be used by both.

@sfackler
Member

sfackler commented Nov 5, 2015

Moving chars to BufRead does not seem unreasonable.

@abonander
Contributor

@sfackler I concur. I was thinking the exact same thing. In fact, it is unfortunate that Read::bytes() is already stabilized because, like chars(), it is almost always preferable to have it on a buffered source. A lot of the Read types really do not tolerate small, frequent reads well (#28073)

@alexcrichton
Member Author

While I agree that chars is useful, questions like those brought up by @SimonSapin make me think that it's too weighty/controversial to be in the standard library. I'm not sure that this is really ever going to be seriously used either due to the performance concerns, and I agree that bytes is a bit of a misnomer in that it's super non-performant much of the time as well.

@SimonSapin
Contributor

How does bytes perform when used with BufReader<_>?

@abonander
Contributor

@SimonSapin there were some perf numbers reading from a file with bytes in #28073. The difference is pretty significant.

@abonander
Contributor

I think chars would work pretty well if implemented on top of a buffer. It doesn't have to consume any bytes it doesn't need to, so no unnecessary loss of data.

@softprops

I may be doing something hugely inefficient here, but I really needed Read::chars for a crate I was working on that attempted to produce an iterator of rustc-serialize JSON items from a read stream. I had to fall back on vendoring most of the source around Read::chars as a workaround to get it working on stable a while back. It was almost a requirement, as the rustc-serialize builder for JSON requires an iterable of chars. It would be really nice to have this stabilized, even if in some form of method name that implied lossiness.

@alexcrichton
Member Author

@SimonSapin

I would expect it to perform much better than on a raw File because the number of syscalls issued is far less, but on the other hand it is likely much slower than another equivalent operation such as mmap + iterate the bytes via a slice.


@cybergeek94

Yeah I suspect the performance isn't abysmal, but when compared to str::chars it's likely still much slower (just more stuff that has to be checked during I/O)


@softprops

Yeah that's actually a case where I believe chars is inappropriate because it's just going to be inherently slower than any equivalent operation of "iterate over the characters of this string". For example a json deserializer may want to take an iterator of characters, but the actual iterator does something like BufReader where it reads a huge string, then literally uses str::chars, only reading a new string once it's exhausted (handling the boundary).

@gyscos

gyscos commented Nov 6, 2015

@alexcrichton
Wouldn't a chars method on a BufReader-backed Read serve this exact purpose? It reads a big slice of bytes, hopefully valid UTF-8, which is exactly what a string would be; shouldn't iterating chars over those bytes then perform about the same as str::chars?

@softprops

Yeah that's actually a case where I believe chars is inappropriate because it's just going to be inherently slower than any equivalent operation of "iterate over the characters of this string"

In this case it's literally a stream of JSON I'm working with. My use case was interfacing with Docker's streaming JSON endpoints, where JSON objects are pushed through a streamed response. I'm not sure how I'd accomplish that with a string.

@SimonSapin
Contributor

@alexcrichton Handling the boundary is tricky. It’s handled in https://github.com/SimonSapin/rust-utf8/blob/master/lib.rs

@alexcrichton
Member Author

@gyscos hm yes, I guess it would! That would definitely mean that chars could only be on BufReader.

@softprops in theory you can transform an iterator of &str slices into an iterator of char values, which is kinda what BufReader would be doing
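
For illustration, the &str-to-char transformation itself is just a flat_map; the chunking and char-boundary handling that would feed it is the hard part and is omitted here:

fn main() {
    // Suppose the buffered reader hands us complete, already-validated &str chunks.
    let chunks = vec!["héllo ", "wörld"];

    // Flattening each chunk's `str::chars()` yields the same stream of chars
    // that `Read::chars` would, but decoded a whole slice at a time.
    let decoded: String = chunks.iter().flat_map(|s| s.chars()).collect();
    assert_eq!(decoded, "héllo wörld");
}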

@gyscos

gyscos commented Nov 6, 2015

So a solution is to move chars to BufRead, as @sfackler mentioned. I was worried this would mean needlessly wrapping some Readers, but I just noticed Cursor<Vec<u8>> and Stdin already provide BufRead access.
But all of that is about the performance aspect.
Concerning error handling, BufRead::lines returns an iterator over io::Result<String> to properly propagate errors, including invalid UTF-8 (as an ErrorKind::InvalidData error). Couldn't chars do the same, instead of using a custom error?
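
A sketch of that error mapping, mirroring what lines does when a line is not valid UTF-8 (the helper below is illustrative, not an existing API):

use std::io::{self, ErrorKind};
use std::str;

// Hypothetical: produce the Item of such an iterator from the bytes of one char,
// converting a UTF-8 failure into an io::Error the way `lines` does.
fn to_io_result(bytes: &[u8]) -> io::Result<char> {
    match str::from_utf8(bytes) {
        Ok(s) => Ok(s.chars().next().expect("non-empty input")),
        Err(_) => Err(io::Error::new(
            ErrorKind::InvalidData,
            "stream did not contain valid UTF-8",
        )),
    }
}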

@alexcrichton
Member Author

Yeah, I have a feeling that would be sufficient. There's more fine-grained error information we can give (such as the bytes that were read, if any), but iterators like lines are already lossy in that you lose access to the underlying data if there's an error, so it's probably not too bad.

It does raise a question, though: if chars exists only on BufRead for performance reasons, why doesn't bytes exist only on BufRead as well? Just a minor inconsistency.

@sfackler
Member

sfackler commented Nov 6, 2015

I think the reasoning to move chars to BufRead is for correctness wrt @SimonSapin's comment - to be able to peek at the next byte without consuming it.
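
For reference, BufRead already lets you look at buffered bytes without consuming them via fill_buf/consume; a minimal sketch (peek_byte is just an illustration, not an existing API):

use std::io::{self, BufRead};

// Look at the next byte of a BufRead without consuming it.
fn peek_byte<R: BufRead>(reader: &mut R) -> io::Result<Option<u8>> {
    let buf = reader.fill_buf()?; // does not advance the stream
    Ok(buf.first().copied())
    // Only a later `reader.consume(n)` actually discards bytes.
}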

@abonander
Contributor

The performance improvement is only a beneficial side effect. bytes doesn't require a buffer for correctness.

@alexcrichton
Member Author

Hm, I don't think this is a correctness problem that can be solved by "just moving to BufRead": the buffer could always be 1 byte in size, in which case it behaves the same as reading one byte at a time. Basically, BufRead just reduces the number of calls to read, but it essentially always has the same problems with respect to partial reads, short reads, and consuming data.

@gyscos

gyscos commented Nov 7, 2015

@SimonSapin was saying that BufReader could potentially put things back in the buffer, but it's not actually exposed by the BufRead trait.

@derekdreery
Contributor

derekdreery commented Jan 6, 2017

Does this sound correct?:

The issue could be summarised as: how to handle reading chars over a data stream, specifically how to handle incomplete UTF-8 byte sequences and bytes that are invalid UTF-8.

It's hard because when you encounter a sequence of bytes that is invalid on its own but might become valid with more data, it could either be an error or simply mean you need to read more bytes (it's ambiguous).

If you know it's incomplete, you may want to call read again with the incomplete part prepended, but if it's incomplete because something has errored, you want to return the error along with the offending bytes.

It seems the consensus is that for both errors and incomplete bytes, you return a struct containing an enum variant saying whether it is an error or a possibly incomplete set of bytes, along with the bytes themselves. It's then the responsibility of a higher-level iterator to decide how to handle these cases (as not all use cases will want to handle them the same way).
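
A rough sketch of the kind of result type being described (the names are invented for illustration):

/// Hypothetical outcome of trying to decode the bytes read so far.
enum DecodeOutcome<'a> {
    /// The bytes decode cleanly to this string slice.
    Decoded(&'a str),
    /// The trailing bytes could still become a valid char if more input arrives.
    Incomplete(&'a [u8]),
    /// The bytes can never form valid UTF-8, whatever comes next.
    Invalid(&'a [u8]),
}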

lotabout added a commit to lotabout/skim that referenced this issue Jan 19, 2017
Previously, skim relied on nightly Rust for `io::chars`.
Now it uses the utf8parse crate instead.
Check rust-lang/rust#27802 (comment)
brookst added a commit to brookst/skim that referenced this issue Jan 19, 2017
Use private char iterator as done in kkawakam/rustyline#38 while waiting
for stabilisation of the chars method per rust-lang/rust#27802
This removes the need for `#[feature(io)]` letting skim compile on rust
stable.
@SimonSapin
Contributor

TL;DR: I think it is very hard to come up with an abstraction that: is zero-cost, covers all use cases, and is not terrible to use.

I’m in favor of deprecating and eventually removing this with no in-std replacement.


I think that anything that looks at one u8 or one char at a time is gonna have abysmal performance. Instead we probably want &str slices that reference fragments of some [u8] buffer.

I spent some time thinking of a low-level API that would make no assumptions about how one would want to use it ("pushing" vs "pulling" bytes and string slices, buffer allocation strategy, error handling, etc.) I came up with this:

pub fn decode_utf8(input: &[u8]) -> DecodeResult { /* ... */ }

pub enum DecodeResult<'a> {
    Ok(&'a str),

    /// These three slices cover all of the original input.
    /// `decode` should be called again with the third one as the new input.
    Error(&'a str, InvalidSequence<'a>, &'a [u8]),

    Incomplete(&'a str, IncompleteChar),
}

pub struct InvalidSequence<'a>(pub &'a [u8]);

pub struct IncompleteChar {
    // Fields are private. They include a [u8; 4] buffer.
}

impl IncompleteChar {
    pub fn try_complete<'char, 'input>(&'char mut self, mut input: &'input [u8])
                                       -> TryCompleteResult<'char, 'input> { /* ... */ }
}

pub enum TryCompleteResult<'char, 'input> {
    Ok(&'char str, &'input [u8]),  // str.chars().count() == 1
    Error(InvalidSequence<'char>, &'input [u8]),
    StillIncomplete,
}

It’s complicated. It requires the user to think about a lot of corner cases, especially around IncompleteChar. Explaining how to properly use it takes several paragraphs of docs.

We can hide some of the details with a stateful decoder:

pub struct Decoder { /* Private. Also with a [u8; 4] buffer. */ }

impl Decoder {
    pub fn new() -> Self;

    pub fn decode<'decoder, 'input>(&'decoder mut self, input: &'input [u8])
                                    -> DecoderResult<'decoder, 'input>;

    /// Signal that there is no more input.
    /// The decoder might contain a partial `char` which becomes an error.
    pub fn end<'decoder>(&'decoder self) -> Result<(), InvalidSequence<'decoder>>;
}

/// Order of fields indicates order in the input
pub struct DecoderResult<'decoder, 'input> {
    /// str in the `Ok` case is either empty or one `char` (up to 4 bytes)
    pub partially_from_previous_input_chunk: Result<&'decoder str, InvalidSequence<'decoder>>,

    /// Up to the first error, if any
    pub decoded: &'input str,

    /// Whether we did find an error
    pub error: Result<(), InvalidSequence<'input>>,

    /// Call `decoder.decode()` again with this, if non-empty
    pub remaining_input_after_error: &'input [u8],
}

/// Never more than 3 bytes.
pub struct InvalidSequence<'a>(pub &'a [u8]);

Even so, it’s very easy to misuse, for example by ignoring part of DecoderResult. Using a tuple instead of a struct makes it more visible when a field is ignored, but then we can’t name them to help explain which is what.

Either of these is complicated enough that I don’t think it belongs in libcore or libstd.

@SimonSapin
Contributor

By the way, I’ve submitted #40212 to make something like the above (possibly specialized to one use case) easier to implement outside of std based on std::str::from_utf8. Doing so avoids re-implementing most of the decoding logic, and allows taking advantage of optimizations in std like #30740.

@taralx
Contributor

taralx commented Mar 5, 2017

I would support deprecating Read::chars in favor of seeing what the community can produce.

@SimonSapin
Contributor

Another attempt turned out almost nice:

pub struct Decoder { 
    buffer: [u8; 4],
    /* ... */
}

impl Decoder {
    pub fn new() -> Self { /* ... */ }
    pub fn next_chunk<'a>(&'a mut self, input_chunk: &'a [u8]) -> DecoderIter<'a> { /* ... */ }
    pub fn last_chunk<'a>(&'a mut self, input_chunk: &'a [u8]) -> DecoderIter<'a> { /* ... */ }
}

pub struct DecoderIter<'a> {
    decoder: &'a mut Decoder,
    /* ... */ 
}

impl<'a> Iterator for DecoderIter<'a> {
    type Item = Result<&'a str, &'a [u8]>;
}

Except it doesn’t work. &'a str in the result conflicts with &'a mut Decoder in the iterator. (It is sometimes backed by the buffer in the decoder.) The result needs to borrow the iterator, which means the Iterator trait can’t be implemented, and for loops can’t be used:

impl<'a> DecoderIter<'a> {
    pub fn next(&mut self) -> Option<Result<&str, &[u8]>> { /* ... */ }
}
    let mut iter = decoder.next_chunk(input);
    while let Some(result) = iter.next() {
         // ...
    }

This compiles, but something like String::from_utf8_lossy(&[u8]) -> Cow<str> can’t be implemented on top of it because str fragments always borrow the short-lived decoder (and iterator), not only the original input.

We can work around that by adding enough lifetime parameters and one weird enum… but yeah, no.

pub struct Decoder { /* ... */ }

impl Decoder {
    pub fn new() -> Self { /* ... */ }

    pub fn next_chunk<'decoder, 'input>(&'decoder mut self, input_chunk: &'input [u8])
                                        -> DecoderIter<'decoder, 'input> { /* ... */ }

    pub fn last_chunk<'decoder, 'input>(&'decoder mut self, input_chunk: &'input [u8])
                                        -> DecoderIter<'decoder, 'input> { /* ... */ }
}

pub struct DecoderIter<'decoder, 'input> { /* ... */ }

impl<'decoder, 'input> DecoderIter<'decoder, 'input> {
    pub fn next<'buffer>(&'buffer mut self) 
        -> Option<Result<EitherLifetime<'buffer, 'input, str>,
                         EitherLifetime<'buffer, 'input, [u8]>>> { /* ... */ }
}

pub enum EitherLifetime<'buffer, 'input, T: ?Sized + 'static> {
    Buffer(&'buffer T),
    Input(&'input T),
}

impl<'buffer, 'input, T: ?Sized> EitherLifetime<'buffer, 'input, T> {
    pub fn get<'a>(&self) -> &'a T where 'buffer: 'a, 'input: 'a {
        match *self {
            EitherLifetime::Input(x) => x,
            EitherLifetime::Buffer(x) => x,
        }
    }
}

@taralx
Contributor

taralx commented Mar 7, 2017

Except it doesn’t work. &'a str in the result conflicts with &'a mut Decoder in the iterator.

Can you elaborate? I don't follow here.

@SimonSapin
Contributor

@taralx

  • DecoderIter<'a> contains &'a mut Decoder.
  • Decoder contains a [u8; 4] buffer.
  • The DecoderIter::next method takes &'b mut self. This implies 'a: 'b ('a outlives 'b).
  • In some cases, the next method wants to return a borrow of the buffer with (simplified) std::str::from_utf8_unchecked(&self.decoder.buffer).
  • Since this borrow is going through &'b mut self, its lifetime cannot be longer than 'b. This conflicts with the 'a: 'b requirement. If we could somehow make that borrow (such as with unsafe), then &mut Decoder in DecoderIter would no longer be exclusive since there’s another borrow of part of it (the buffer).
  • The solution is to return &'b str instead of &'a str. But the Iterator trait cannot express that since the return type of next is Option<Self::Item>, and there is no way to include the lifetime of the iterator itself in the associated type Item.

Perhaps it’s clearer with code. This does not compile: https://gist.github.com/anonymous/0587b4484ec9a15f5c5ce6908b3807c1, unless you change Result<&'a str, &'a [u8]> to Result<&str, &[u8]> (borrowing self) and make next an inherent method instead of an Iterator impl.

@Mark-Simulacrum Mark-Simulacrum added the C-tracking-issue Category: A tracking issue for an RFC or an unstable feature. label Jul 22, 2017
@wilysword

I tend to agree that this should be removed from std, though for a slightly different reason: I'd like the interface for "returning a stream of chars from an io::Reader" to work with any encoding, not just UTF-8. If the general purpose of io::Read is to represent a stream of arbitrary bytes, then some other trait should represent a stream of decoded characters (though for completeness, I would probably also have such a trait implement io::Read). This would be similar to how you can chain streams in other languages: Decoder(BufferedReader(FileReader("somefile.txt"))). Unless/until general encoding functionality lives in std, I don't see much benefit in such a half-hearted implementation.

@SimonSapin
Contributor

hsivonen/encoding_rs#8 has some discussion of Unicode stream and decoders for not-only-UTF-8 encodings.

@SimonSapin
Contributor

The libs team discussed this and consensus was to deprecate the Read::chars method and its Chars iterator.

@rfcbot fcp close

Code that does not care about processing data incrementally can use Read::read_to_string instead. Code that does care presumably also wants to control its buffering strategy and work with &[u8] and &str slices that are as large as possible, rather than one char at a time. It should be based on the str::from_utf8 function as well as the valid_up_to and error_len methods of the Utf8Error type. One tricky aspect is dealing with cases where a single char is represented in UTF-8 by multiple bytes where those bytes happen to be split across separate read calls / buffer chunks. (Utf8Error::error_len returning None indicates that this may be the case.) The utf-8 crate solves this, but in order to be flexible provides an API that probably has too much surface to be included in the standard library.

Of course the above is for data that is always UTF-8. If other character encodings need to be supported, consider using the encoding_rs or encoding crate.
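
As a rough illustration of the UTF-8 approach described above (a sketch with invented names; carrying a split char over to the next chunk is left to the caller):

use std::str;

/// What to do with the bytes left over after the valid prefix of a chunk.
enum Rest<'a> {
    /// Possibly a char split across chunks: prepend these bytes to the next read.
    MaybeIncomplete(&'a [u8]),
    /// Definitely invalid UTF-8: skip `len` bytes (or report an error).
    Invalid { len: usize },
}

/// Decode as much of one buffer chunk as possible.
fn decode_chunk(chunk: &[u8]) -> (&str, Option<Rest<'_>>) {
    match str::from_utf8(chunk) {
        Ok(s) => (s, None),
        Err(e) => {
            // Safe: `valid_up_to` marks the end of the valid UTF-8 prefix.
            let valid = unsafe { str::from_utf8_unchecked(&chunk[..e.valid_up_to()]) };
            let rest = match e.error_len() {
                // `None` means the error is at the very end of the input and
                // might only be an incomplete char.
                None => Rest::MaybeIncomplete(&chunk[e.valid_up_to()..]),
                Some(len) => Rest::Invalid { len },
            };
            (valid, Some(rest))
        }
    }
}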

@rfcbot

rfcbot commented Mar 30, 2018

Team member @SimonSapin has proposed to close this. The next step is review by the rest of the tagged teams:

No concerns currently listed.

Once a majority of reviewers approve (and none object), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up!

See this document for info about what commands tagged team members can give me.

@rfcbot rfcbot added the proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. label Mar 30, 2018
@rfcbot

rfcbot commented Apr 4, 2018

🔔 This is now entering its final comment period, as per the review above. 🔔

@rfcbot rfcbot added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Apr 4, 2018
@rfcbot

rfcbot commented Apr 14, 2018

The final comment period is now complete.

kennytm added a commit to kennytm/rust that referenced this issue Apr 24, 2018
@SimonSapin
Contributor

Deprecated in #49970

@Mark-Simulacrum Mark-Simulacrum removed the final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. label Jul 28, 2018