Bloat binary size for fantastic cold start performance #54
Conversation
Causes some issues with a bindgen function not being exposed.
Hi Martin, this is a very interesting case study. Thanks a lot for the effort you have put into it. It is surprising how much time is needed for unzipping and parsing, even though Rust is already very efficient at this kind of task. But the binary size is problematic, I agree. Do you perhaps know an alternative to rkyv that produces smaller binaries but is still significantly faster than JSON decoding?

Can you do me a favor and make your PR compile properly? The CI pipeline has failed. I would then merge the PR into a feature branch of my own for future reference and experiments. Thank you.
If it matters, I was going to do some binary size investigation on this change to see if there were any space-saving opportunities that rkyv is missing. Do you have an idea of what an acceptable binary size threshold would be?
We'd just be testing rkyv serialization/deserialization.
Since Display on a path contains backslashes on Windows, this gives us trouble when directly using that in generated code. Using Debug formatting escapes them for us.
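The difference is easy to demonstrate in plain std Rust (the path below is made up for illustration):

```rust
use std::path::Path;

fn main() {
    // Hypothetical Windows-style path; the backslashes are literal characters.
    let p = Path::new(r"C:\models\fivegrams.bin");

    // Display keeps the raw backslashes, which become invalid escape
    // sequences if spliced directly into generated Rust source.
    println!("{}", p.display()); // C:\models\fivegrams.bin

    // Debug quotes the path and escapes the backslashes, so the output
    // is a valid Rust string literal.
    println!("{:?}", p); // "C:\\models\\fivegrams.bin"
}
```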
@pemistahl I don't know of an alternative that ticks all the boxes. There are formats that are faster to deserialize, and there are even faster JSON deserializers. I fixed the remaining issues so the tests pass. But the pipeline takes over 25 minutes now and the one for Windows can't even complete because it runs out of disk space, I guess that speaks for itself 😄

@djkoloski Great to see you here! By the way, I first learned about rkyv from swc-project/swc#2635. Pretty amazing how this approach solved a major blocker and contributed to making the new plugin system feasible; that's a pretty big deal for SWC and the sizable (and only growing) ecosystem around it. I can't give you a hard number for a size limit, but I don't think it matters either. Looking at the fivegrams for the German language model as an example, the intermediate file containing the serialized bytes takes up 6'732'788 bytes on disk for a total of 336'605 entries.
That's awesome! I've been idly watching swc for a while and it's been exciting to see everything they're working on. There's some new tech coming down the pipeline that I think will help make those kinds of use cases better too. The archived hash map implementation for rkyv has an overhead of up to four bytes per key, so reducing that even by a byte or two could yield some big savings. Using a B-Tree might result in lower overhead too, at the expense of maybe slower lookup. An added bonus of using a B-Tree would also be faster serialization, since hash maps do a lot of heavy compute to build perfect hash tables. In situations like this, it might be worth resurrecting the older swisstable-based implementation since it has similar performance characteristics and better serialization speed.
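To put rough numbers on that, here is a quick back-of-the-envelope calculation using the German fivegram figures quoted above and the up-to-four-bytes-per-key overhead as an upper bound:

```rust
fn main() {
    let entries: u64 = 336_605;          // fivegram entries in the German model
    let bytes_on_disk: u64 = 6_732_788;  // serialized size reported above

    // Average serialized footprint per entry
    println!("{:.1} bytes/entry", bytes_on_disk as f64 / entries as f64);

    // Upper bound on savings if the archived hash map's per-key
    // overhead (up to four bytes) were eliminated entirely
    println!("up to {} bytes saved", entries * 4);
}
```

That works out to roughly 20 bytes per entry, so the per-key overhead alone could account for well over a megabyte per n-gram file.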
Nice, you're right, a couple of bytes might actually make a difference here. This is turning out to be a pretty good benchmark for that sort of thing! I'm still going to close this PR because I just opened it to document the results of this little investigation. We can keep discussing here if anything comes up.
Worth noting: we've had a similar desire in ICU4X, but unfortunately we don't use rkyv. For this purpose, we designed CrabBake, which allows you to serialize an object straight to Rust code. We've got a working experimental implementation here, hoping to have time to polish it up and start using it by the end of this month.
Very interesting, thanks for putting it out there! This is exactly the kind of input I was hoping for when posting the writeup here. It sounds like a much more elegant version of my first attempt before I switched over to rkyv.
As the slightly ironic title hopefully conveys, I'm not very convinced of this idea, I'm just putting it here for the sake of discussion.
Motivation
Lingua stores language models (which really are just n-grams and their associated probabilities) as zipped JSON files in the binary. Depending on user preference, either upfront at startup or lazily on demand, it will unzip, parse and then transform this data into the in-memory representation `HashMap<Ngram, f64>` used by part of the detection pipeline. That carries a certain computational cost, which is negligible for most use cases: where the detector is reused, this initial cost is offset over its lifetime, as all following detections can use the already "warmed up" detector.

In certain environments however, for example serverless computing or when using WebAssembly in plugins for other software, the detector cannot be used in such a way. In these cases our code (containing a call to Lingua) is very short-lived and no reuse over subsequent invocations is possible, meaning each time it has to be constructed from scratch. In these permanent "cold start" situations, the described startup cost ends up dominating the overall runtime needed for detection. In my case, where I invoke Lingua from C# by means of wasmtime, that results in taking 540 ms to detect the language of a short text, most of which is spent in startup. The following flamegraph illustrates this nicely. You can see some time being spent deflating the zipped data, some more deserializing the JSON and then some constructing the hash map.
I would like to point out that this is not a weakness of Lingua. It's a problem that comes with the domain of language detection and I'm not aware of a library that has solved this. And as I mentioned, it's really only an issue in specialized use cases.
Solution
Idea
We could save a lot of time by embedding the language models into the binary in a way that is significantly cheaper to read. With rkyv's zero-copy deserialization it's possible to encode a data structure in a format that is the same as its in-memory representation. This means there's no additional work to be done, as language models can be read directly from the binary's data segment where they are stored.
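The principle can be illustrated with a deliberately simplified sketch in plain Rust. This is not rkyv's actual format, just the core idea: answer lookups straight from the stored bytes instead of first parsing them into an owned `HashMap`:

```rust
// Toy "model": fixed-width records of 1 key byte + 8 probability bytes.
// rkyv generalizes this idea to arbitrary types, including real hash maps.
fn serialize(entries: &[(u8, f64)]) -> Vec<u8> {
    let mut out = Vec::new();
    for &(key, p) in entries {
        out.push(key);
        out.extend_from_slice(&p.to_le_bytes());
    }
    out
}

// Lookup walks the byte slice in place: no allocation, no parsing pass.
fn lookup(bytes: &[u8], key: u8) -> Option<f64> {
    bytes
        .chunks_exact(9)
        .find(|rec| rec[0] == key)
        .map(|rec| f64::from_le_bytes(rec[1..9].try_into().unwrap()))
}

fn main() {
    // In the real setup these bytes would live in the binary's data
    // segment (e.g. via include_bytes!) rather than being built at runtime.
    let bytes = serialize(&[(b'a', 0.25), (b'b', 0.5)]);
    assert_eq!(lookup(&bytes, b'b'), Some(0.5));
}
```

rkyv's archived types make the same trick work for structured data like `HashMap<Ngram, f64>`, which is what removes the unzip/parse/build steps from the flamegraph entirely.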
Implementation
I hacked the demo together by adding a build script for language crates. It reads the JSON models for a language and writes them to binary files in their rkyv encoding. These bytes are then simply statically included and accessed at runtime. Instead of exposing a `Dir<'static>`, every language crate now has an `[Option<&'static [u8]>; 5]`, an array of bytes for all 5 possible n-gram types. The bytes are just the archived representation of `HashMap<Ngram, f64>`.

To read the JSON models from the build script I had to move the `Fraction` and `Ngram` types out into a separate `common` crate used by both the build script and Lingua.

Results
I measured the C# scenario I described earlier.
The effect of zero-copy deserialization of the language models in this cold start scenario is as follows:
The new flamegraph looks much less informative, mainly because there's just not a whole lot going on anymore. Some time is spent on what I think is language runtime stuff, and the rest in the actual detection; there is nothing obvious left to optimize away.
These results are impressive in both the positive and the negative sense. The build takes much longer now because it's serializing a bunch of pretty sizable files to disk for all languages. And the actual binary grows to a size that is no longer convenient to deliver in all environments, for example if you imagine it having to be downloaded over a mobile connection to a phone. On the other hand, it's pretty amazing to see that you can initialize and run a sophisticated language detection suite in just under 9 ms.