Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Beginning of "tinystr" optimization #8

Closed
wants to merge 3 commits into from

Conversation

raphlinus
Copy link

@raphlinus raphlinus commented Mar 18, 2019

This patch implements a "tiny string" datatype, very efficient but limited to bounded lengths, and starts with the integration into the Locale datatype. As of this PR, it's just the language subtag, but I wanted to upload it to get feedback before proceeding further.

Progress towards #7

This commit adds a compact string representation, but doesn't wire them up.

Part of the plan for projectfluent#7
Changes the Locale and parser to use TinyStr for language.
@raphlinus
Copy link
Author

I'm not asking for this to be merged just yet. There are some unused code warnings that will go away.

A few questions. I see that the full generality of bcp-47 is not supported, for example language can only be 2-3 alpha characters, so "i-enochian" would be failed, similarly for registered language subtag. Supporting these uncommon values is the reason it's TinyStr8, but I can change this.

This implementation uses "fake SIMD". I was able to get that to work well, but @SimonSapin has had success getting LLVM to auto-vectorize in rust-lang/rust#59283 . That code would probably be clearer, and a hair more efficient, but I wasn't able to get auto-vectorization to work for bytewise operations in a u64.

If this approach generally looks good, I'll keep going.

@zbraniecki
Copy link
Collaborator

I see that the full generality of bcp-47 is not supported, for example language can only be 2-3 alpha characters, so "i-enochian" would be failed, similarly for registered language subtag.

Yes, we're aiming for Unicode Locale Identifier, not Language Tag.

Please, consult http://unicode.org/reports/tr35/#BCP_47_Conformance for differences.

@raphlinus
Copy link
Author

Ah, that's super helpful, thanks. I was using bcp-47 as my normative spec. I'll adjust.

@raphlinus
Copy link
Author

As of this commit, this PR moves bench_locale from to 3214ns to 1096ns. Of the remaining time, roughly 40% is in the split on ('-', '_'), and roughly 15% is .is_ascii(), both of which are in stdlib. It should definitely be possible improve these as well, but my main focus is making the representation efficient, so it can be cloned and compared quickly; the representation stuff has gone from the lion's share to a few percent, so I'm pretty happy with how this is going.

@zbraniecki
Copy link
Collaborator

@raphlinus - can we move this to unic-langid crate?

@raphlinus
Copy link
Author

Sure, I'm not attached. Feel free to take my draft whichever direction you like, and if you need my attention on something, just let me know exactly what you'd like.

@zbraniecki
Copy link
Collaborator

closing in favor of zbraniecki/unic-locale#7

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants