
Bloat binary size for fantastic cold start performance #54

Closed
wants to merge 13 commits

Conversation

@martindisch (Contributor) commented Mar 27, 2022

As the slightly ironic title hopefully conveys, I'm not very convinced of this idea, I'm just putting it here for the sake of discussion.

Motivation

Lingua stores its language models (which really are just n-grams and their associated probabilities) as zipped JSON files in the binary. Depending on user preference, either upfront at startup or lazily on demand, it unzips and parses this data and transforms it into the in-memory representation HashMap<Ngram, f64> used by part of the detection pipeline. That carries a certain computational cost, which is negligible for most use cases: the detector is usually reused, so the initial cost is amortized over its lifetime and all subsequent detections run on an already "warmed up" detector.
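
To make that cost concrete, here is a minimal sketch of this kind of load path. It is illustrative only, not Lingua's actual code: the flate2 usage, the direct parse into the final map, and the String key (Ngram is just a wrapper around a String) are simplifying assumptions.

```rust
use std::collections::HashMap;
use std::io::Read;

use flate2::read::DeflateDecoder; // compression crate assumed for illustration

// Sketch: inflate the embedded model, parse the JSON, build the hash map.
// All three steps run at detector construction time.
fn load_model(compressed: &'static [u8]) -> HashMap<String, f64> {
    let mut json = String::new();
    DeflateDecoder::new(compressed)
        .read_to_string(&mut json)
        .expect("model should inflate");
    // Lingua's real format goes through an intermediate representation;
    // parsing straight into the final map keeps the sketch short without
    // changing the nature of the work.
    serde_json::from_str(&json).expect("model should parse")
}
```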

In certain environments, however, such as serverless computing or WebAssembly plugins for other software, the detector cannot be used in that way. There, our code (containing a call to Lingua) is very short-lived and no reuse across invocations is possible, meaning the detector has to be constructed from scratch every time. In these permanent "cold start" situations, the startup cost described above ends up dominating the overall runtime of a detection. In my case, where I invoke Lingua from C# by means of wasmtime, detecting the language of a short text takes 540 ms, most of which is spent in startup. The following flamegraph illustrates this nicely: some time is spent deflating the zipped data, some more deserializing the JSON, and then some constructing the hash map.

Flamegraph before changes

I would like to point out that this is not a weakness of Lingua. It's a problem that comes with the domain of language detection and I'm not aware of a library that has solved this. And as I mentioned, it's really only an issue in specialized use cases.

Solution

Idea

We could save a lot of time by embedding the language models into the binary in a way that is significantly cheaper to read. With rkyv's zero-copy deserialization it's possible to encode a data structure in a format that is the same as its in-memory representation. This means there's no additional work to be done, as language models can be read directly from the binary's data segment where they are stored.
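
As a sketch of what that looks like with rkyv 0.7 (the API available at the time of this PR; file and function names are illustrative), reading a model becomes a pointer cast rather than a parse:

```rust
use std::collections::HashMap;

// The rkyv-encoded model is compiled straight into the binary's data segment.
// Note: real code must also guarantee that these bytes are suitably aligned.
static FIVEGRAMS: &[u8] = include_bytes!(concat!(env!("OUT_DIR"), "/fivegrams.bin"));

fn fivegram_probability(ngram: &str) -> Option<f64> {
    // Zero-copy access: reinterpret the embedded bytes as the archived map.
    // No inflating, no parsing, no allocation. This is `unsafe` because the
    // bytes are trusted (we generated them ourselves at build time).
    let map = unsafe { rkyv::archived_root::<HashMap<String, f64>>(FIVEGRAMS) };
    map.get(ngram).copied()
}
```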

Implementation

I hacked the demo together by adding a build script to the language crates. It reads the JSON models for a language and writes them to binary files in their rkyv encoding. These bytes are then statically included and accessed at runtime. Instead of exposing a Dir<'static>, every language crate now exposes an [Option<&'static [u8]>; 5], an array of optional byte slices, one for each of the 5 possible n-gram types. The bytes are simply the archived representation of HashMap<Ngram, f64>.
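
A build script along those lines might look roughly like this. It's a sketch under the simplifying assumption that the JSON already maps n-grams to f64 probabilities; paths and names are made up.

```rust
// build.rs
use std::collections::HashMap;
use std::{env, fs, path::Path};

fn main() {
    // Read the JSON model and convert it into the runtime representation.
    let json = fs::read_to_string("models/fivegrams.json").unwrap();
    let model: HashMap<String, f64> = serde_json::from_str(&json).unwrap();

    // Serialize with rkyv (0.7): the resulting bytes have the same layout as
    // the archived in-memory representation, so no work is left for runtime.
    let bytes = rkyv::to_bytes::<_, 1024>(&model).unwrap();

    let out = Path::new(&env::var("OUT_DIR").unwrap()).join("fivegrams.bin");
    fs::write(out, &bytes).unwrap();

    println!("cargo:rerun-if-changed=models/fivegrams.json");
}
```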

To read the JSON models from the build script I had to move the Fraction and Ngram types out into a separate common crate used by both the build script and Lingua.

Results

I measured the C# scenario I described earlier.

  • Benchmark implemented with BenchmarkDotNet on .NET 6.0
  • Uses wasmtime to call a WebAssembly module built with lingua to detect the language of a single short text (a sketch of the wasm side follows this list)
  • Instance of WebAssembly module (its memory space) is destroyed after each iteration, while the JIT-compiled module (bytecode) itself can be reused
  • Only 4 languages were included in the binary
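
The wasm side of the benchmark is, in essence, a function like the following sketch. This is illustrative only; the actual module's interface isn't shown in this thread, and the text, language set, and return-value mapping are placeholders.

```rust
use lingua::{Language, LanguageDetectorBuilder};

#[no_mangle]
pub extern "C" fn detect() -> i32 {
    // Constructed from scratch on every invocation, because the instance's
    // memory is torn down after each iteration: this is the cold-start cost.
    let detector = LanguageDetectorBuilder::from_languages(&[
        Language::English,
        Language::French,
        Language::German,
        Language::Spanish,
    ])
    .build();
    match detector.detect_language_of("languages are awesome") {
        Some(Language::English) => 1,
        _ => 0,
    }
}
```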

The effect of zero-copy deserialization of the language models in this cold start scenario is as follows:

|             | before | after  | change       |
| ----------- | ------ | ------ | ------------ |
| Binary size | 8.8 MB | 33 MB  | ×3.75 bigger |
| Time        | 543 ms | 8.9 ms | ×61 faster   |

The new flamegraph is much less informative, mainly because there's just not a whole lot going on anymore. Some time is spent on what I think is language runtime work and the rest on the actual detection; there is nothing obvious left to optimize away.

Flamegraph after changes

These results are impressive in both the positive and the negative sense. The build takes much longer now because it serializes a bunch of sizable files to disk for all languages. And the binary grows to a size that is no longer convenient to deliver in all environments, for example if it has to be downloaded over a mobile connection to a phone. On the other hand, it's pretty amazing to see that you can initialize and run a sophisticated language detection suite in just under 9 ms.

@pemistahl (Owner)

Hi Martin, this is a very interesting case study. Thanks a lot for the effort you have put into it. It is surprising how much time is needed for unzipping and parsing, even though Rust is already very efficient at this kind of task. But the binary size is problematic, I agree. Do you perhaps know an alternative to rkyv that produces smaller binaries but is still significantly faster than JSON decoding?

Can you do me a favor and make your PR compile properly? The CI pipeline has failed. I would then merge the PR into a feature branch of my own for future reference and experiments. Thank you.

@djkoloski

If it matters, I was going to do some binary size investigation on this change to see if there were any space-saving opportunities that rkyv is missing. Do you have an idea of what an acceptable binary size threshold would be?

We'd just be testing rkyv serialization/deserialization.
(Commit message from the PR: "Since Display on a path contains backslashes on Windows, this gives us trouble when directly using that in generated code. Using Debug formatting escapes them for us.")
@martindisch (Author)

@pemistahl I don't know of an alternative that ticks all the boxes. There are formats that are faster to deserialize, and there are even faster JSON deserializers than serde_json, but I don't think it would make that much of a difference. Nothing comes close to zero-copy deserialization, where absolutely no additional work has to be done. Deserialization is also only part of the equation: we currently spend even slightly more time populating the n-gram hash map after deserializing the language model.

I fixed the remaining issues so the tests pass. But the pipeline takes over 25 minutes now, and the Windows one can't even complete because it runs out of disk space; I guess that speaks for itself 😄

@djkoloski Great to see you here! By the way, I first learned about rkyv from swc-project/swc#2635. It's pretty amazing how this approach solved a major blocker and helped make the new plugin system feasible; that's a big deal for SWC and the sizable (and still growing) ecosystem around it.

I can't give you a hard number for a size limit, but I don't think it matters much either. Looking at the fivegrams of the German language model as an example, the intermediate file containing the serialized bytes takes up 6'732'788 bytes on disk for a total of 336'605 entries in the HashMap<Ngram, f64> (an Ngram is just a wrapper around a String). That's about 20 bytes per entry, of which the f64 takes up 8, and the n-gram is a string of 5 characters, so at least 5 bytes (more if some characters aren't ASCII). That's 13 bytes we're guaranteed to need for every entry, and I imagine the rest supports the hash map itself. My napkin math may be wrong, but I'd be very impressed if there were much room for improvement here. There's just no way around it: at some point during execution, all this space for the uncompressed data will be taken up. It used to exist only in memory at runtime; now that we're shifting the work to compile time, it ends up in the binary too. Through no fault of either Lingua or rkyv, there are some theoretical limits we can't beat.
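
Spelling the napkin math out:

```rust
fn main() {
    let per_entry = 6_732_788.0 / 336_605.0; // ≈ 20.0 bytes per entry on disk
    let guaranteed = 8.0 + 5.0; // f64 probability + five ASCII chars = 13 bytes
    println!("overhead ≈ {:.1} bytes/entry", per_entry - guaranteed); // ≈ 7.0
}
```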

@djkoloski

That's awesome! I've been idly watching swc for a while and it's been exciting to see everything they're working on. There's some new tech coming down the pipeline that I think will help make those kinds of use cases better too.

The archived hash map implementation for rkyv has an overhead of up to four bytes per key, so reducing that even by a byte or two could yield some big savings. Using a B-Tree might result in lower overhead too, at the expense of maybe slower lookup. An added bonus of using a B-Tree would also be faster serialization, since hash maps do a lot of heavy compute to build perfect hash tables. In situations like this, it might be worth resurrecting the older swisstable-based implementation since it has similar performance characteristics and better serialization speed.
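
For reference, a sketch of what the B-Tree variant could look like with rkyv 0.7. This is untested and assumes the archived B-Tree map supports str-keyed lookup the way the archived hash map does.

```rust
use std::collections::BTreeMap;

// Same zero-copy idea, but archiving a BTreeMap: serialization skips the
// perfect-hash construction, and lookups become O(log n) instead of O(1).
fn encode(model: &BTreeMap<String, f64>) -> rkyv::AlignedVec {
    rkyv::to_bytes::<_, 1024>(model).expect("model should serialize")
}

fn probability(bytes: &[u8], ngram: &str) -> Option<f64> {
    // As before, only safe for trusted bytes we produced at build time.
    let map = unsafe { rkyv::archived_root::<BTreeMap<String, f64>>(bytes) };
    map.get(ngram).copied()
}
```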

@martindisch (Author)

Nice, you're right that a couple of bytes might actually make a difference here. This is turning out to be a pretty good benchmark for that sort of thing! I'm still going to close this PR, because I only opened it to document the results of this little investigation. We can keep discussing here if anything comes up.

@Manishearth

Worth noting: we've had a similar desire in ICU4X, but unfortunately we don't use rkyv (we use serde and zerovec instead), so there's no way to do a zero-cost, unvalidated deserialization of trusted data.

For this purpose, we designed CrabBake, which allows you to serialize an object straight to Rust code. We've got a working experimental implementation here, and we're hoping to have time to polish it up and start using it by the end of this month.

@martindisch (Author)

Very interesting, thanks for putting it out there! This is exactly the kind of input I was hoping for when posting the writeup here.

It sounds like a much more elegant version of my first attempt, before I switched over to rkyv. I started out by generating code that constructs a PHF map and then doing an include!() to pull the code in at compile time. I ended up abandoning that approach because I didn't like how it had to generate enormous source files. That would be way more convenient with CrabBake.
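
For comparison, that first attempt looked roughly like this sketch using phf_codegen (the n-gram and probability value here are made up): the build script prints Rust source that constructs a perfect-hash map.

```rust
// build.rs
use std::{env, fs::File, io::{BufWriter, Write}, path::Path};

fn main() {
    let path = Path::new(&env::var("OUT_DIR").unwrap()).join("fivegrams.rs");
    let mut out = BufWriter::new(File::create(&path).unwrap());

    let mut map = phf_codegen::Map::new();
    // One entry per n-gram; values are emitted verbatim as Rust expressions.
    // With hundreds of thousands of entries, this is where the enormous
    // generated source files came from.
    map.entry("abcde", "0.000123_f64");

    writeln!(
        &mut out,
        "static FIVEGRAMS: phf::Map<&'static str, f64> = {};",
        map.build()
    )
    .unwrap();
}
```

The library side then pulls the generated source in with include!(concat!(env!("OUT_DIR"), "/fivegrams.rs")).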
