
Bloat binary size for fantastic cold start performance #54

Closed
wants to merge 13 commits

Conversation

@martindisch (Contributor) commented Mar 27, 2022

As the slightly ironic title hopefully conveys, I'm not very convinced of this idea, I'm just putting it here for the sake of discussion.

Motivation

Lingua stores its language models (which really are just n-grams and their associated probabilities) as zipped JSON files in the binary. Depending on user preference, either upfront at startup or lazily on demand, it unzips and parses this data and transforms it into the in-memory representation HashMap<Ngram, f64> used by part of the detection pipeline. That carries a certain computational cost, which is negligible for most use cases: the detector is usually reused, so the initial cost is amortized over its lifetime and all subsequent detections run on an already "warmed up" detector.
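
To make that cost concrete, here is a minimal sketch of this kind of load path. It is illustrative only, not Lingua's actual code: the flate2 usage, the direct parse into the final map, and the String key (Ngram is just a wrapper around a String) are simplifying assumptions.

```rust
use std::collections::HashMap;
use std::io::Read;

use flate2::read::DeflateDecoder; // compression crate assumed for illustration

// Sketch: inflate the embedded model, parse the JSON, build the hash map.
// All three steps run at detector construction time.
fn load_model(compressed: &'static [u8]) -> HashMap<String, f64> {
    let mut json = String::new();
    DeflateDecoder::new(compressed)
        .read_to_string(&mut json)
        .expect("model should inflate");
    // Lingua's real format goes through an intermediate representation;
    // parsing straight into the final map keeps the sketch short without
    // changing the nature of the work.
    serde_json::from_str(&json).expect("model should parse")
}
```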

In certain environments, however, such as serverless computing or WebAssembly plugins for other software, the detector cannot be used in that way. There, our code (containing a call to Lingua) is very short-lived and no reuse across invocations is possible, meaning the detector has to be constructed from scratch every time. In these permanent "cold start" situations, the startup cost described above ends up dominating the overall runtime of a detection. In my case, where I invoke Lingua from C# by means of wasmtime, detecting the language of a short text takes 540 ms, most of which is spent in startup. The following flamegraph illustrates this nicely: some time is spent deflating the zipped data, some more deserializing the JSON, and then some constructing the hash map.

Flamegraph before changes

I would like to point out that this is not a weakness of Lingua. It's a problem that comes with the domain of language detection and I'm not aware of a library that has solved this. And as I mentioned, it's really only an issue in specialized use cases.

Solution

Idea

We could save a lot of time by embedding the language models into the binary in a way that is significantly cheaper to read. With rkyv's zero-copy deserialization it's possible to encode a data structure in a format that is the same as its in-memory representation. This means there's no additional work to be done, as language models can be read directly from the binary's data segment where they are stored.
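
As a sketch of what that looks like with rkyv 0.7 (the API available at the time of this PR; file and function names are illustrative), reading a model becomes a pointer cast rather than a parse:

```rust
use std::collections::HashMap;

// The rkyv-encoded model is compiled straight into the binary's data segment.
// Note: real code must also guarantee that these bytes are suitably aligned.
static FIVEGRAMS: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/fivegrams.bin"));

fn fivegram_probability(ngram: &str) -> Option<f64> {
    // Zero-copy access: reinterpret the embedded bytes as the archived map.
    // No inflating, no parsing, no allocation. This is `unsafe` because the
    // bytes are trusted (we generated them ourselves at build time).
    let map = unsafe { rkyv::archived_root::<HashMap<String, f64>>(FIVEGRAMS) };
    map.get(ngram).copied()
}
```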

Implementation

I hacked the demo together by adding a build script to the language crates. It reads the JSON models for a language and writes them to binary files in their rkyv encoding. These bytes are then statically included and accessed at runtime. Instead of exposing a Dir<'static>, every language crate now exposes an [Option<&'static [u8]>; 5], an array of optional byte slices, one for each of the 5 possible n-gram types. The bytes are simply the archived representation of HashMap<Ngram, f64>.
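
A build script along those lines might look roughly like this. It's a sketch under the simplifying assumption that the JSON already maps n-grams to f64 probabilities; paths and names are made up.

```rust
// build.rs
use std::collections::HashMap;
use std::{env, fs, path::Path};

fn main() {
    // Read the JSON model and convert it into the runtime representation.
    let json = fs::read_to_string("models/fivegrams.json").unwrap();
    let model: HashMap<String, f64> = serde_json::from_str(&json).unwrap();

    // Serialize with rkyv (0.7): the resulting bytes have the same layout as
    // the archived in-memory representation, so no work is left for runtime.
    let bytes = rkyv::to_bytes::<_, 1024>(&model).unwrap();

    let out = Path::new(&env::var("OUT_DIR").unwrap()).join("fivegrams.bin");
    fs::write(out, &bytes).unwrap();

    println!("cargo:rerun-if-changed=models/fivegrams.json");
}
```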

To read the JSON models from the build script I had to move the Fraction and Ngram types out into a separate common crate used by both the build script and Lingua.

Results

I measured the C# scenario I described earlier.

  • Benchmark implemented with BenchmarkDotNet on .NET 6.0
  • Uses wasmtime to call a WebAssembly module built with lingua to detect the language of a single short text (a sketch of the wasm side follows this list)
  • Instance of WebAssembly module (its memory space) is destroyed after each iteration, while the JIT-compiled module (bytecode) itself can be reused
  • Only 4 languages were included in the binary
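
The wasm side of the benchmark is, in essence, a function like the following sketch. This is illustrative only; the actual module's interface isn't shown in this thread, and the text, language set, and return-value mapping are placeholders.

```rust
use lingua::{Language, LanguageDetectorBuilder};

#[no_mangle]
pub extern "C" fn detect() -> i32 {
    // Constructed from scratch on every invocation, because the instance's
    // memory is torn down after each iteration: this is the cold-start cost.
    let detector = LanguageDetectorBuilder::from_languages(&[
        Language::English,
        Language::French,
        Language::German,
        Language::Spanish,
    ])
    .build();
    match detector.detect_language_of("languages are awesome") {
        Some(Language::English) => 1,
        _ => 0,
    }
}
```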

The effect of zero-copy deserialization of the language models in this cold start scenario is as follows:

|             | before | after  | change       |
| ----------- | ------ | ------ | ------------ |
| Binary size | 8.8 MB | 33 MB  | ×3.75 bigger |
| Time        | 543 ms | 8.9 ms | ×61 faster   |

The new flamegraph is much less informative, mainly because there's just not a whole lot going on anymore. Some time is spent on what I think is language runtime work and the rest on the actual detection; there is nothing obvious left to optimize away.

Flamegraph after changes

These results are impressive in both the positive and the negative sense. The build takes much longer now because it serializes a bunch of sizable files to disk for all languages. And the binary grows to a size that is no longer convenient to deliver in all environments, for example if it has to be downloaded over a mobile connection to a phone. On the other hand, it's pretty amazing to see that you can initialize and run a sophisticated language detection suite in just under 9 ms.

@pemistahl (Owner)

Hi Martin, this is a very interesting case study. Thanks a lot for the effort you have put into it. It is surprising how much time is needed for unzipping and parsing, even though Rust is already very efficient at this kind of task. But the binary size is problematic, I agree. Do you perhaps know an alternative to rkyv that produces smaller binaries but is still significantly faster than JSON decoding?

Can you do me a favor and make your PR compile properly? The CI pipeline has failed. I would then merge the PR into a feature branch of my own for future reference and experiments. Thank you.

@djkoloski

If it matters, I was going to do some binary size investigation on this change to see if there were any space-saving opportunities that rkyv is missing. Do you have an idea of what an acceptable binary size threshold would be?

We'd just be testing rkyv serialization/deserialization.
(Commit message from the PR: "Since Display on a path contains backslashes on Windows, this gives us trouble when directly using that in generated code. Using Debug formatting escapes them for us.")
@martindisch (Author)

@pemistahl I don't know of an alternative that ticks all the boxes. There are formats that are faster to deserialize, and there are even faster JSON deserializers than serde_json, but I don't think it would make that much of a difference. Nothing comes close to zero-copy deserialization, where absolutely no additional work has to be done. Deserialization is also only part of the equation: we currently spend even slightly more time populating the n-gram hash map after deserializing the language model.

I fixed the remaining issues so the tests pass. But the pipeline takes over 25 minutes now, and the Windows one can't even complete because it runs out of disk space; I guess that speaks for itself 😄

@djkoloski Great to see you here! By the way, I first learned about rkyv from swc-project/swc#2635. It's pretty amazing how this approach solved a major blocker and helped make the new plugin system feasible; that's a big deal for SWC and the sizable (and still growing) ecosystem around it.

I can't give you a hard number for a size limit, but I don't think it matters much either. Looking at the fivegrams of the German language model as an example, the intermediate file containing the serialized bytes takes up 6'732'788 bytes on disk for a total of 336'605 entries in the HashMap<Ngram, f64> (an Ngram is just a wrapper around a String). That's about 20 bytes per entry, of which the f64 takes up 8, and the n-gram is a string of 5 characters, so at least 5 bytes (more if some characters aren't ASCII). That's 13 bytes we're guaranteed to need for every entry, and I imagine the rest supports the hash map itself. My napkin math may be wrong, but I'd be very impressed if there were much room for improvement here. There's just no way around it: at some point during execution, all this space for the uncompressed data will be taken up. It used to exist only in memory at runtime; now that we're shifting the work to compile time, it ends up in the binary too. Through no fault of either Lingua or rkyv, there are some theoretical limits we can't beat.
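
Spelling the napkin math out:

```rust
fn main() {
    let per_entry = 6_732_788.0 / 336_605.0; // ≈ 20.0 bytes per entry on disk
    let guaranteed = 8.0 + 5.0; // f64 probability + five ASCII chars = 13 bytes
    println!("overhead ≈ {:.1} bytes/entry", per_entry - guaranteed); // ≈ 7.0
}
```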

@djkoloski

That's awesome! I've been idly watching swc for a while and it's been exciting to see everything they're working on. There's some new tech coming down the pipeline that I think will help make those kinds of use cases better too.

The archived hash map implementation for rkyv has an overhead of up to four bytes per key, so reducing that even by a byte or two could yield some big savings. Using a B-Tree might result in lower overhead too, at the expense of maybe slower lookup. An added bonus of using a B-Tree would also be faster serialization, since hash maps do a lot of heavy compute to build perfect hash tables. In situations like this, it might be worth resurrecting the older swisstable-based implementation since it has similar performance characteristics and better serialization speed.
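
For reference, a sketch of what the B-Tree variant could look like with rkyv 0.7. This is untested and assumes the archived B-Tree map supports str-keyed lookup the way the archived hash map does.

```rust
use std::collections::BTreeMap;

// Same zero-copy idea, but archiving a BTreeMap: serialization skips the
// perfect-hash construction, and lookups become O(log n) instead of O(1).
fn encode(model: &BTreeMap<String, f64>) -> rkyv::AlignedVec {
    rkyv::to_bytes::<_, 1024>(model).expect("model should serialize")
}

fn probability(bytes: &[u8], ngram: &str) -> Option<f64> {
    // As before, only safe for trusted bytes we produced at build time.
    let map = unsafe { rkyv::archived_root::<BTreeMap<String, f64>>(bytes) };
    map.get(ngram).copied()
}
```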

@martindisch (Author)

Nice, you're right that a couple of bytes might actually make a difference here. This is turning out to be a pretty good benchmark for that sort of thing! I'm still going to close this PR, because I only opened it to document the results of this little investigation. We can keep discussing here if anything comes up.

@Manishearth

Worth noting: we've had a similar desire in ICU4X, but unfortunately we don't use rkyv (we use serde and zerovec instead), so there's no way to do a zero-cost, unvalidated deserialization of trusted data.

For this purpose, we designed CrabBake, which allows you to serialize an object straight to Rust code. We've got a working experimental implementation here, and we're hoping to have time to polish it up and start using it by the end of this month.

@martindisch (Author)

Very interesting, thanks for putting it out there! This is exactly the kind of input I was hoping for when posting the writeup here.

It sounds like a much more elegant version of my first attempt, before I switched over to rkyv. I started out by generating code that constructs a PHF map and then doing an include!() to pull the code in at compile time. I ended up abandoning that approach because I didn't like how it had to generate enormous source files. That would be way more convenient with CrabBake.
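
For comparison, that first attempt looked roughly like this sketch using phf_codegen (the n-gram and probability value here are made up): the build script prints Rust source that constructs a perfect-hash map.

```rust
// build.rs
use std::{env, fs::File, io::{BufWriter, Write}, path::Path};

fn main() {
    let path = Path::new(&env::var("OUT_DIR").unwrap()).join("fivegrams.rs");
    let mut out = BufWriter::new(File::create(&path).unwrap());

    let mut map = phf_codegen::Map::new();
    // One entry per n-gram; values are emitted verbatim as Rust expressions.
    // With hundreds of thousands of entries, this is where the enormous
    // generated source files came from.
    map.entry("abcde", "0.000123_f64");

    writeln!(
        &mut out,
        "static FIVEGRAMS: phf::Map<&'static str, f64> = {};",
        map.build()
    )
    .unwrap();
}
```

The library side then pulls the generated source in with include!(concat!(env!("OUT_DIR"), "/fivegrams.rs")).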
