Regular subtag APIs #1932
Conversation
Praise: generally this looks awesome. It will make the code easier to maintain. There's some cost to readability (macros are by nature a bit more challenging), but I think it's a good ROI.
My only larger concern is about the performance of LSR subtags. I handcrafted the parsing of each down to the cycle. My reasoning is that by far the hottest paths in parsing a `Locale` are:
- Are the first couple of characters a valid `Language`?
- Is the second subtag a `Script` or not?
- Is the second/third subtag a `Region`?
You may notice that each of the three has a different codepath that tries to always check the length first and bail early, then check the first character if possible and bail early.
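The length-first, first-character early-bail pattern described above could be sketched roughly as follows. This is a standalone illustration, not the actual ICU4X code; the function names are made up:

```rust
// Hypothetical sketch of the length-first / first-character early-bail
// pattern; not the actual ICU4X parsing code.
fn looks_like_script(subtag: &[u8]) -> bool {
    // Bail early on length: a script subtag is always exactly 4 chars.
    if subtag.len() != 4 {
        return false;
    }
    // Then bail on the first character: scripts start with a letter,
    // which distinguishes them from digit-initial 4-char variants.
    subtag[0].is_ascii_alphabetic()
}

fn looks_like_region(subtag: &[u8]) -> bool {
    // A region is 2 letters or 3 digits; any other length bails immediately.
    match subtag.len() {
        2 => subtag.iter().all(u8::is_ascii_alphabetic),
        3 => subtag.iter().all(u8::is_ascii_digit),
        _ => false,
    }
}

fn main() {
    assert!(looks_like_script(b"Latn"));
    assert!(!looks_like_script(b"en"));
    assert!(looks_like_region(b"US"));
    assert!(looks_like_region(b"419"));
}
```

The point of ordering the checks this way is that the length comparison is a single branch on data already in a register, so the common negative cases never touch the subtag bytes at all.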
I also tested (with criterion, not iai though!) whether it is faster to first check ASCII and conditionally convert to tinystr, or to convert and then test: testing tinystr casing is much faster, but parsing into tinystr has a cost.
This whole microoptimization may be moot, though, and maybe the actual result won't be affected in a significant way, but please test it, with iai if possible.
In particular, rejecting invalid first characters and fast-skipping `Script` (as it is commonly omitted): the handcrafting allowed us to not have a separate `Script::is_valid` apart from parsing, because parsing was instruction-optimized for that.
You also have a couple of places where, for normalization reasons, you repeat validation calls that weren't there before. That may or may not be needed, and I'm OK with some overhead in all subtags except LID, so please test those. Maybe no-ops will be enough to avoid complicating the macro?
Force-pushed from 1d2be8d to 4baa586
Force-pushed from 72df063 to 5ae99df
Force-pushed from b3c0de1 to 44da72d
Parsing now does:
The from-raw path does:
This is identical to the previous behavior for LSRV. For other tags this path might be new, and validation and normalization are sometimes two TinyStr calls, as the combined predicate doesn't exist.
Benchmarks look bad though, and I don't know why...
Force-pushed from 74306f5 to e24979e
^ rebased on #2068
Looks great! I'm requesting changes because we shouldn't land unsafe code that isn't documented.
don't have strong opinions on this component but looks good 👍
@robertbastian Can you determine whether this change can be decoupled from 1.0? If there are any methods being removed or renamed, you can do that in a standalone PR first.
Benchmarks look OK to me.
Praise: Nice work!
```rust
/// Compare with a potentially unnormalized BCP-47 string.
///
/// The return value is equivalent to what would happen if you first parsed the
/// BCP-47 string and then performed a structucal comparison.
```
Suggested change:
```diff
-/// BCP-47 string and then performed a structucal comparison.
+/// BCP-47 string and then performed a structural comparison.
```
```rust
pub const fn as_bytes(&self) -> &[u8] {
    // Safe because `self.bytes.as_slice()` pointer-casts to `&[u8]`,
    // and changing the length of that slice to self.len() < N is safe.
    unsafe { core::mem::transmute((self.bytes.as_slice().as_ptr(), self.len())) }
}
```
Ping @Manishearth to review this unsafe block
In general I want to move away from relying on the slice layout matching tuples (miri will eventually not like this, but also we'll have better slice APIs by then).
Can you instead take `self.bytes.as_slice()` and just transmute that reference? It should work.
Also, explicitly include the from/to types in the `transmute::<_, _>()` turbofish (and perhaps import `core::mem` somewhere).
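The suggested shape could look roughly like the sketch below. The `AsciiByte`/`RawSubtag` types are stand-ins invented for illustration (the real field types in the PR may differ); the point is transmuting the slice reference itself, with explicit from/to types, instead of assembling a `(ptr, len)` tuple:

```rust
use core::mem;

// Hypothetical stand-ins for the PR's storage: a transparent ASCII byte
// wrapper plus a fixed-size buffer with a logical length. Names and
// layout are assumptions for illustration only.
#[repr(transparent)]
#[derive(Copy, Clone)]
struct AsciiByte(u8);

struct RawSubtag<const N: usize> {
    bytes: [AsciiByte; N],
    len: usize,
}

impl<const N: usize> RawSubtag<N> {
    fn as_bytes(&self) -> &[u8] {
        // Transmute the slice reference directly, with explicit from/to
        // types in the turbofish, rather than relying on slice layout
        // matching a (pointer, length) tuple.
        // SAFETY: AsciiByte is #[repr(transparent)] over u8, so
        // &[AsciiByte] and &[u8] have identical layout.
        let full = unsafe { mem::transmute::<&[AsciiByte], &[u8]>(self.bytes.as_slice()) };
        // Truncating to the logical length is ordinary safe slicing.
        &full[..self.len]
    }
}

fn main() {
    let tag = RawSubtag {
        bytes: [AsciiByte(b'e'), AsciiByte(b'n'), AsciiByte(0)],
        len: 2,
    };
    assert_eq!(tag.as_bytes(), b"en");
}
```

Keeping the length adjustment as a safe slice operation also shrinks the unsafe surface to the single, well-scoped layout claim.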
```rust
s.to_ascii_lowercase(),
s.is_ascii_lowercase()
    && s.is_ascii_alphanumeric()
    && (s.len() != 4 || s.all_bytes()[0].is_ascii_digit()),
```
Praise: This is a clearer way to express the invariant than we had before:
```
variant = 5*8alphanum         ; registered variants
        / (DIGIT 3alphanum)
```
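A standalone check of that ABNF (assuming the subtag is already normalized to lowercase ASCII, and setting aside where the real code enforces the length bounds) might look like this hypothetical function:

```rust
// Hypothetical validator for the variant ABNF:
//   variant = 5*8alphanum / (DIGIT 3alphanum)
// Assumes input is already normalized to lowercase ASCII.
fn is_variant_subtag(s: &[u8]) -> bool {
    // Every byte must be a lowercase letter or a digit.
    let chars_ok = s.iter().all(|b| b.is_ascii_lowercase() || b.is_ascii_digit());
    chars_ok
        && match s.len() {
            // 5 to 8 alphanumerics: registered variants such as "posix".
            5..=8 => true,
            // Exactly 4 characters only if the first is a digit, e.g. "1996".
            4 => s[0].is_ascii_digit(),
            _ => false,
        }
}

fn main() {
    assert!(is_variant_subtag(b"posix"));
    assert!(is_variant_subtag(b"1996"));
    assert!(!is_variant_subtag(b"abcd")); // 4 chars must start with a digit
    assert!(!is_variant_subtag(b"toolongvariant"));
}
```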
I want this so I'm going to merge it.