Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Impl TryFrom<char> for u8 #2854

Closed
ebkalderon opened this issue Jan 22, 2020 · 11 comments
Closed

Impl TryFrom<char> for u8 #2854

ebkalderon opened this issue Jan 22, 2020 · 11 comments
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.

Comments

@ebkalderon
Copy link

ebkalderon commented Jan 22, 2020

I ran into a case using the csv crate in combination with structopt where I needed to parse a single char from an argument to use as a CSV delimiter. However, it appears that csv::StringBuilder only accepts u8 delimiters. I was surprised to see that there was no type-safe way to convert between char to u8 in the standard library. It would be nice if u8 implemented TryFrom<char>, where the conversion fails if the char value is outside of the U+0000 to U+00FF codepoint range (0-255).

An example use would look like this:

fn main() {
    let success = u8::try_from('A').expect("Conversion should have succeeded");
    let failure = u8::try_from('我').expect_err("Conversion should have failed");
}
@kennytm
Copy link
Member

kennytm commented Jan 22, 2020

ASCII range

This differs from impl From<u8> for char which supports U+0000 to U+00FF.

@ebkalderon
Copy link
Author

@kennytm Good point! Thanks for pointing out the semantic difference in From<u8>. In that case, I think it's necessary to revise my proposal to specifically U+0000 to U+00FF (value range of 0-255).

@Lokathor
Copy link
Contributor

if all of 0..=255 is allowed I believe what you want can be done as follows:

if c as u32 <= u8::max_value() as u32 {
  Some(c as u8)
} else {
  None
}

@ebkalderon
Copy link
Author

@Lokathor Thanks for the code sample! That's actually very similar to what I've incorporated into my own project, so it's nice to see external confirmation that this general pattern exists. I was hoping to at least raise some discussion on whether a similar conversion could or should be included as part of the standard library.

@SimonSapin
Copy link
Contributor

This also works: u8::try_from(u32::from(some_char))

@fstirlitz
Copy link

If this is added, I'd prefer if it's a method on char called something like try_into_latin1, to clarify that using it constitutes a semantic conversion (and not just type-checking bureaucracy), and to discourage using it as the least-effort/least-thought-out way to convert a stream of chars into a stream of u8s (which should explicitly refer to some character encoding).

@SimonSapin
Copy link
Contributor

“Latin-1” may refer to slightly different things though.

  • Originally, it is Part 1: Latin alphabet No. 1 of ISO/IEC 8859. It is a set of 191 characters, each assigned an 8-bit value. The ranges 0x00 to 0x1F and 0x7F to 0x9F are not assigned.

  • IANA calls ISO-8859-1 an extension of the above that assigns the C0 and C1 control codes to these two ranges. This makes it compatible with ASCII, and all 256 bytes are assigned a character or control code. Unicode chose to make its first 256 code points U+0000 to U+00FF compatible with this, so u8::try_from(u32::from(some_char)) effectively encodes to IANA’s ISO-8859-1. To avoid confusion, https://infra.spec.whatwg.org/#isomorphic-encode calls this operation isomorphic encode instead.

  • Windows code page 1252 is a different extension of ISO/IEC 8859-1 that assigns various punctuation or letters to some of the 0x80 to 0x9F range. (In 0x00 to 0x7F it is compatible with ASCII and IANA’s ISO-8859-1 though) Windows-1252 is sometimes incorrectly called ISO-8859-1 or Latin-1.

  • In web browsers, at this point for compatibility reasons, many labels (used for example in charset in Content-Type) including ascii, latin1 and iso-8859-1 map to an extension of Windows-1252 where the 5 unassigned byte values are mapped to corresponding C1 control codes like in IANA’s ISO-8859-1.

Back to Rust API naming, we already have the precedent of impl From<u8> for char that does the opposite conversion. And my pre-1.0 proposal to rename str::as_bytes to str::as_utf8 was rejected.

@scottmcm
Copy link
Member

scottmcm commented Nov 10, 2020

FWIW, I think this makes sense. In general, any time there's A: From<B>, I think it makes sense to have B: TryFrom<A> for going the other way. (Assuming, of course, that it's possible to implement, but if it isn't then I'm also skeptical of the From...)

EDIT: Said otherwise, we have u8: TryFrom<u32>, and char: From<u8> is precedent for "u8 is a USV in 0..256" (as opposed to a code unit, or something), so we might as well have u8: TryFrom<char>.

@scottmcm scottmcm added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Nov 10, 2020
@burdges
Copy link

burdges commented Nov 11, 2020

Oh? Is there a convention to not provide A: From<B> if B: TryFrom<A> makes no sense, like say if A is a hash of B or something? Or did you mean only in the context of std?

I think wrapper types improve code quality of course, but sometimes people then use From/Into more once they have specific types, which harms readability over named methods.

I suppose one guideline would be impl From<B> for A if code would be improved by using the bound B: Into<A> but avoid doing so otherwise.

@scottmcm
Copy link
Member

Is there a convention to [...]

Nothing official, as far as I know. It's just a heuristic I use, since if you can't recover the original, then it generally seems to me like it's doing something non-obvious enough that it would be better as a method with a more illustrative name. (Of course, this is also only for Froms with the same ownership type -- I'm not expecting to be able to recover the original from something like String: From<&str>.)

Scrolling through std's From impls they do mostly match this. I probably wouldn't bother ever writing out the the TryFrom opposite of RecvTimeoutError: From<RecvError>, since it's just matches!(foo, RecvTimeoutError::Disconnected), but it would be completely plausible if someone wanted to have it. Similarly for String: From<char> -- I'm not sure that char: TryFrom<String> would be a good API (would one really ever want to .try_into() such that it fails if there's not exactly one char?) but it's pretty clear what the behaviour would be.

The ones that are a bit weirder are things like BinaryHeap<T>: From<Vec<T>> -- there's a From back the other way, but the first way was actually a heapify operation, so the original value can't be recovered exactly. I think that's a place where I'd have actually preferred it just be BinaryHeap::new without the from existing -- it gives a more accessible place to document complexity and such.

All that said, #2484 includes some text proposing rules more along these lines.

ehuss pushed a commit to ehuss/rust that referenced this issue Jan 8, 2022
Previously suggested in rust-lang/rfcs#2854.

It makes sense to have this since `char` implements `From<u8>`. Likewise
`u32`, `u64`, and `u128` (since rust-lang#79502) implement `From<char>`.
ehuss added a commit to ehuss/rust that referenced this issue Jan 8, 2022
Implement `TryFrom<char>` for `u8`

Previously suggested in rust-lang/rfcs#2854.

It makes sense to have this since `char` implements `From<u8>`. Likewise `u32`, `u64`, and `u128` (since rust-lang#79502) implement `From<char>`.
@ebkalderon
Copy link
Author

Done in rust-lang/rust#84640, which landed in 1.59.0! 🎉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the RFC.
Projects
None yet
Development

No branches or pull requests

7 participants