Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor low-level UTF-16 decoding and move Borrow(Mut) to libcore. #27808

Merged
merged 2 commits into from
Aug 27, 2015

Conversation

SimonSapin
Copy link
Contributor

  • Rename Utf16Items to Utf16Decoder. "Items" is meaningless.
  • Generalize it to any u16 iterator, not just [u16].iter()
  • Make it yield Result instead of a custom Utf16Item enum that was isomorphic to Result. This enable using the FromIterator for Result impl.
  • Replace Utf16Item::to_char_lossy with a Utf16Decoder::lossy iterator adaptor.

This is a [breaking change], but only for users of the unstable rustc_unicode crate.

I’d like this functionality to be stabilized and re-exported in std eventually, as the "low-level equivalent" of String::from_utf16 and String::from_utf16_lossy like #27784 is the low-level equivalent of #27714.

CC @aturon, @alexcrichton

@rust-highfive
Copy link
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @pnkfelix (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. The way Github handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

}
let c = (((n1 - 0xD800) as u32) << 10 |
(n2 - 0xDC00) as u32) + 0x1_0000;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we’re only dealing with a single pair of u16s here, I felt this was simpler than creating a one-shot iterator. But that’s still possible.

@SimonSapin
Copy link
Contributor Author

I’d like re-exporting Utf16Decoder and Utf16LossyDecoder in std while I’m at it (still unstable for now), but I don’t know what module they would go into. Maybe std::char, since they (sort of) return char iterators?

@bors
Copy link
Contributor

bors commented Aug 14, 2015

☔ The latest upstream changes (presumably #27684) made this pull request unmergeable. Please resolve the merge conflicts.

@SimonSapin
Copy link
Contributor Author

Rebased and added commits to move the iterators to rustc_unicode::char (reexported/exposed in std::char) with an unstable utf16_decoder feature name (#27830), and add issue annotations per #27818.

@bluss
Copy link
Member

bluss commented Aug 14, 2015

Should be [breaking-change] with hyphen in the message.

@SimonSapin
Copy link
Contributor Author

@bluss fixed

@alexcrichton
Copy link
Member

Thanks for the PR @SimonSapin! I like the reduction in API surface area here, and I think that this can actually go even farther and remove the lossy decoder (as it's just a simple .map(...) iterator anyway). I'm not 100% certain that this is an API we'd want to expose as stable at some point (we have little precedent for iterator constructors today), but I have no outstanding issue with the API as-is.

Could you leave behind the old support, but mark it all as #[deprecated]? (to help ease current users on nightly).

@SimonSapin
Copy link
Contributor Author

  • Added deprecated wrappers for the old API
  • Removed the lossy decoder
  • Added a std::char::REPLACEMENT_CHARACTER const instead, and documented how to do lossy decoding with it.
  • Moved Utf16Decoder::new to Iterator::decode_utf16, which seems more idiomatic for iterator adaptors. That moves it to libcore and (indirectly) into the prelude, but then that is already the case for char::encode_utf16. It’s also the first Iterator method that only works with one concrete Iterator::Item type.

I’m not so sure about that last change. What do you think?

@pnkfelix
Copy link
Member

@SimonSapin sorry, I was on PTO for much of the time this PR was open.

I'll skim over it but I'm not certain I'm the best person to review. (I assume the Travis failure is due to a timeout, though I was under a rough impression that timeouts were not supposed to end up being marked as fatal checks on the PRs...)

@SimonSapin
Copy link
Contributor Author

No worries @pnkfelix, @alexcrichton took over.

After a chat with @bluss on IRC I undid the move to Iterator::decode_utf16 method (although method chaining is nice, this is not the right place for it) and moved to a std::char::decode_utf16 function instead. I also generalized u16 to Borrow<u16> (moving Borrow to libcore) and (in the constructor function) Iterator to IntoIterator. This allows using a &[u16] slice directly, like before.

@bluss
Copy link
Member

bluss commented Aug 19, 2015

Nice. I think borrow should move to libcore anyway, since it is becoming a commonly used trait, not just for collections. I think the docs are clear enough for Borrow, to make its important guarantees clear.

@bluss
Copy link
Member

bluss commented Aug 19, 2015

We also found out free functions make sense to construct iterators because the focus is not at all on the struct carrying the iteration, that's almost an implementation detail. And it's consistent with libstd's precedents (std::ascii::escape_default)

@SimonSapin
Copy link
Contributor Author

(Before moving just Borrow and BorroMut I tried moving the entire collections::borrow module, but couldn’t because of coherence rules.)

@alexcrichton
Copy link
Member

I think it's fine to move the Borrow and BorrowMut traits into libcore, but I feel like it's a bit too much magic to rely on it in iterators currently. The only precedent we have for being generic over Borrow is just in collections, and starting to extend that to iterators isn't quite as idiomatic. For example the cloned iterator adapter doesn't strive to use Borrow even though it's technically all it needs.

@alexcrichton
Copy link
Member

I'm also ok with a free function for now to construct this iterator instead of Iterator::new, but I don't quite feel that the char module is the best place for a stable location for this eventually. Unfortunately, though, I don't think we have a great place for it to be located, so it at least works for an unstable location for now.

@SimonSapin
Copy link
Contributor Author

Borrow<u16> is used here as a way to express “u16 or &u16”. The former is the most general case, &u16 doesn’t work if you’re generating u16’s incrementally without allocating buffers. The latter, when combined with IntoIterator, conveniently allows passing a &[u16] slice directly.

I don’t care so much about generalizing &u16 to other kinds of references.

If you feel this is too much magic I can reverse to just u16, but decode_utf16(slice) would have to be more verbose: decode_utf16(slice.iter().cloned()).

@alexcrichton
Copy link
Member

I definitely agree that decode_utf16(slice) is more ergonomic than the alternative, but this is also a pretty niche function that's very rarely called, so I don't think it's necessarily important to have the best ergonomics right off the bat (especially when it's not an established convention in the standard library).

@SimonSapin
Copy link
Contributor Author

Alright, I’ll revert the use of Borrow. Should I keep the move of Borrow to libcore? It becomes unrelated to this PR.

@alexcrichton
Copy link
Member

I'm fine either way on that, it'll probably happen eventually regardless.

@SimonSapin SimonSapin changed the title Refactor low-level UTF-16 decoding. Refactor low-level UTF-16 decoding and move Borrow(Mut) to libcore. Aug 20, 2015
@bors
Copy link
Contributor

bors commented Aug 22, 2015

☔ The latest upstream changes (presumably #27871) made this pull request unmergeable. Please resolve the merge conflicts.

@SimonSapin SimonSapin force-pushed the utf16decoder branch 2 times, most recently from a1a936a to 2b70890 Compare August 22, 2015 12:30
@SimonSapin
Copy link
Contributor Author

Rebased.

* Rename `utf16_items` to `decode_utf16`. "Items" is meaningless.
* Move it to `rustc_unicode::char`, exposed in `std::char`.
* Generalize it to any `u16` iterable, not just `&[u16]`.
* Make it yield `Result` instead of a custom `Utf16Item` enum that was isomorphic to `Result`. This enable using the `FromIterator for Result` impl.
* Add a `REPLACEMENT_CHARACTER` constant.
* Document how `result.unwrap_or(REPLACEMENT_CHARACTER)` replaces `Utf16Item::to_char_lossy`.
@alexcrichton alexcrichton added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. I-needs-decision Issue: In need of a decision. labels Aug 23, 2015
@aturon
Copy link
Member

aturon commented Aug 26, 2015

LGTM!

@alexcrichton
Copy link
Member

We talked about this at the libs meeting today and decided to move forward, thanks @SimonSapin!

@bors: r+ 6174b8d

@alexcrichton alexcrichton removed the I-needs-decision Issue: In need of a decision. label Aug 27, 2015
@bors
Copy link
Contributor

bors commented Aug 27, 2015

⌛ Testing commit 6174b8d with merge fd302a9...

bors added a commit that referenced this pull request Aug 27, 2015
* Rename `Utf16Items` to `Utf16Decoder`. "Items" is meaningless.
* Generalize it to any `u16` iterator, not just `[u16].iter()`
* Make it yield `Result` instead of a custom `Utf16Item` enum that was isomorphic to `Result`. This enable using the `FromIterator for Result` impl.
* Replace `Utf16Item::to_char_lossy` with a `Utf16Decoder::lossy` iterator adaptor.

This is a [breaking change], but only for users of the unstable `rustc_unicode` crate.

I’d like this functionality to be stabilized and re-exported in `std` eventually, as the "low-level equivalent" of `String::from_utf16` and `String::from_utf16_lossy` like #27784 is the low-level equivalent of #27714.

CC @aturon, @alexcrichton
@bors bors merged commit 6174b8d into rust-lang:master Aug 27, 2015
@SimonSapin SimonSapin deleted the utf16decoder branch August 27, 2015 07:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants