Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incremental compilation cache in Cranelift #4155

Closed
bnjbvr opened this issue May 16, 2022 · 14 comments · Fixed by #4551
Closed

Incremental compilation cache in Cranelift #4155

bnjbvr opened this issue May 16, 2022 · 14 comments · Fixed by #4551
Assignees
Labels
cranelift Issues related to the Cranelift code generator enhancement

Comments

@bnjbvr
Copy link
Member

bnjbvr commented May 16, 2022

Feature

Allow Wasmtime/Cranelift to store compiled artifacts of single functions in a global cache, and when trying to compile a function with the same IR, reload the compiled artifact from the cache instead of recompiling it from scratch.

Benefit

  • Speed up recompilation of wasm modules that have already been compiled once, but for which only a very small subset of code changed. Quite impactful for hot-reload use cases and incremental workflows, in particular.
  • Speed up compilation of wasm modules sharing lots of inner code, creating a global compilation cache (e.g. when reusing the same libraries in different modules, it's possible the lib code may be present and the same in all the modules).

Implementation

  • Since this will require moar dependencies in Cranelift (namely serde and bincode) and some users might not need the compilation cache, I plan to implement this behind a Cargo feature (and behind a dynamic runtime config in Cranelift too, to enable/disable without having to recompile it all).
  • In Cranelift, simple memoization strategy when compiling:
    • Use the function's meaningful CLIF as the input (cache key).
    • Use the MachCompileResult as the output (cache value)

To make this really efficient, the caching system should be resilient to non-meaningful changes in the function's input. For instance, it should be independent of the source locations, but also to adding/removing functions (which would shift function references indices, and thus cause mostly only cache misses). This may involve relocations or other fixups to make this work, just after reloading from the cache.

I also want to make sure that we don't accidentally miss a new field in the ir::Function being meaningful in the cache key, so here I think of using a mix of strategies:

  • static typing, i.e. use new "cacheable" types for MachCompileResult and its fields, that require explicit conversion from/to the non-cacheable types. So when we developers add new fields to MachCompileResult we think about their being needed in the key or not.
  • fuzzing. Credit goes to @cfallin for the idea: generate a random CLIF/wasm function F, compile it with the incremental compilation cache, slightly alter the function to obtain another function F", recompile it; if the cache keys for F and F" are the same, compare compiling F" from scratch with the result of loading from the cache + applying the fixups: the results should be exactly the same.

(more implementations strategies / design choices below)

  • The caching mechanism should also check the compilation configuration (for which target are we compiling, with which cranelift compile flags).

    • I hoped to be able to reuse wasmtime-cache's mechanism in there, but most of the metadata doesn't apply cleanly when translated to Cranelift. So I am planning to either split the metadata struct in the bits that apply to Cranelift and those that don't, or to create a new similar data structure.
  • Provide a way to configure how to store/retrieve the cached bytes (basically a key/value store for which keys and values are Vec<u8> or similar). That allows user to implement their cache policy on top of that, and use different backing storage (in memory, on disk), etc.

    • Wasmtime would expose a new public API to make this configurable, when enabling the incremental compilation cache. (It could be in the same public API that allows to enable the cache at all, in fact.)
    • Is that fine to use a Rust trait here, or will it cause hardships for the C bindings?

Alternatives

  • Speed up compilation in general using algorithmic/micro-architectural/memory optimizations, which is a fine but orthogonal goal.
  • Have a compilation cache at the wasm level (that is, use the wasm function as the input). Probably a bit simpler, but would not benefit non-wasmtime consumers like cg_clif (... which may not need it, because it might benefit from rustc's incremental compilation framework, likely?)
  • [OTHER SMART ALTERNATIVES GO HERE]
@bnjbvr bnjbvr added enhancement cranelift Issues related to the Cranelift code generator labels May 16, 2022
@bnjbvr bnjbvr self-assigned this May 16, 2022
@bnjbvr
Copy link
Member Author

bnjbvr commented May 16, 2022

I should add: I've started working on this. I already have something that works for the simplest case, that is, changing a single statement in the source code and recompiling only recompiles the one function that changed. I'm now looking into making it resilient when adding/removing functions internally-defined in the same wasm module (viz. not imported functions yet).

@bjorn3
Copy link
Contributor

bjorn3 commented May 16, 2022

Can you hash the entirety of Function as cache key and instead ensure that the Function only changes when there are meaningful changes to the input wasm? For example you could renumber all FuncRef starting from 0 in a deterministic way. So for example a function calling functions 5 and 3 could get FuncRef(0) and FuncRef(1) to reference functions 5 and 3 respectively. If a new function shifts said functions to offset 6 and 4, you would still get FuncRef(0) and FuncRef(1), but a different mapping table. This should even be resilient against a completely different function with the same signature being called as the FuncRef's wouldn't change. For VMContext offsets you could introduce a new GlobalValueData variant containing a constant allowing relocating of VMContext offsets after compilation. For source locations you could tell cranelift-codegen the source locations relative to the start of the function and then add the function offset when actually using them.

Also I think this caching implementation can be kept in a separate crate. There is already a cranelift-codegen feature to enable serde serialization and deserialization of Function, which should be enough.

Probably a bit simpler, but would not benefit non-wasmtime consumers like cg_clif (... which may not need it, because it might benefit from rustc's incremental compilation framework, likely?)

Cg_clif benefits from rustc's incremental compilation framework, but that one only works at the level of codegen units for codegen which is less fine grained than individual functions.

@cfallin
Copy link
Member

cfallin commented May 16, 2022

Thanks for writing all of this up, @bnjbvr! I am very excited to see how this works out, and early results already show that it should be a huge improvement for users with tight edit-compile-reload loops.

Two things I wanted to add:

  1. A little more detail on the "relocations" idea that we came up with, both to document it here and to see if others can think of related ideas or any issues. The basic idea is that parts of input (CLIF) to Cranelift may vary slightly in unimportant ways when other parts of a Wasm module are changed. These are "renumbering" kinds of problems: the shape of the CLIF is the same, but particular values change.

    Two particular examples are: SourceLocs (e.g. all the source code for the slightly-edited input Wasm module shifts down by one line) and offsets in Wasmtime's vmctx struct at which imported function structs are loaded (e.g. adding an import shifts all of these after the import down by 16 bytes).

    The basic idea is to make these constants "symbolic" somehow and feed them into the code as fixups after the point at which it's cached. (We called them "relocations" in our discussion but actually I like something like "constant parameters" a bit better now...) Inserting concrete values for constant parameters may involve fixups in the MachCompileResult's metadata (e.g. for line-number info) but also maybe in the machine code itself (e.g. a constant offset for a load).

    The latter case makes things interesting too: e.g. a load may go out-of-range (for example with the 12-bit offset for a reg+imm addressing mode on AArch64). In this case I think we can say that parameter insertion may "fail", and we just fall back to a compilation from scratch. So adding imports might hit a cliff at some point at which all the generated code needs to start using a more verbose access sequence to a huge vmctx, but that's OK.

  2. Regarding the caching API itself: I think I might actually slightly prefer an "inverted control" scheme here where the embedder controls all access to the KV store, rather than try to reuse something from Wasmtime. In particular, if the CLIF lets us get a "cache key", and if the compiler context exposes the two-step process "compile to cacheable object" and "finalize cacheable object given concrete params/actual CLIF/..." then one could write a wrapper around this that does the right thing with a KV store, but it's more flexible.

    The specific reason I favor this is that it keeps Cranelift agnostic to any notions of storage or IO; a compiler really should live as much as possible in the "pure function from data to data" space. And there may be other ways a user would want to use this functionality than we initially imagine and bake into a KV-store trait. Two particular cases that are hard are the C API, as you mention; but also e.g. async (what if the KV store is async? Do large parts of Cranelift become async too?).

@alexcrichton
Copy link
Member

From an internal design perspective I would recommend following what wasmtime-cache is already doing which is to have a function that looks like:

    pub fn get_data_raw<T, U, E>(
        &self,
        state: &T,
        // NOTE: These are function pointers instead of closures so that they
        // don't accidentally close over something not accounted in the cache.
        compute: fn(&T) -> Result<U, E>,
        serialize: fn(&T, &U) -> Option<Vec<u8>>,
        deserialize: fn(&T, Vec<u8>) -> Option<U>,
    ) -> Result<U, E>
    where
        T: Hash,
    {
        // ...
    }

Here state is explicitly Hash to generate a sha or something like that and the compute, serialize, and deserialize functions are all explicitly function pointers that cannot close over state. This means that the computation of U, the memoization of the compute computation over T, is guaranteed to reasonably be a function of T.

Ideally the wasmtime-cache bits could either be reused for this or somehow refactored to not duplicate this. Historically I have found that stale caches, especially with incremental compilation, are so difficult to debug and diagnose that getting the interface right so that it's practically never broken is extremely important.

@bnjbvr
Copy link
Member Author

bnjbvr commented May 17, 2022

(Will use ICC for incremental compilation cache from now on :))

Can you hash the entirety of Function as cache key and instead ensure that the Function only changes when there are meaningful changes to the input wasm? For example you could renumber all FuncRef starting from 0 in a deterministic way. So for example a function calling functions 5 and 3 could get FuncRef(0) and FuncRef(1) to reference functions 5 and 3 respectively. If a new function shifts said functions to offset 6 and 4, you would still get FuncRef(0) and FuncRef(1), but a different mapping table. This should even be resilient against a completely different function with the same signature being called as the FuncRef's wouldn't change. For VMContext offsets you could introduce a new GlobalValueData variant containing a constant allowing relocating of VMContext offsets after compilation. For source locations you could tell cranelift-codegen the source locations relative to the start of the function and then add the function offset when actually using them.

I think that's a fine idea, but I wonder if this wouldn't bias the code towards the incremental cache a bit too much, making it non "zero-cost" for users who disabled the incremental compilation cache (ICC) feature? Say, when it comes to source locations, this means that first we'd transform all source locations into relative (from start) ones, then if the ICC isn't used, re-transform them back later into absolute source locations (for the MachRelocs to be correct) -- or carry different meanings for these fields whether the ICC is enabled or not. This back-and-forth wouldn't add any value to users not using the ICC, so I tend to slightly prefer the opposite choice which is to have bespoke "mirror" data structures for the compilation cache (which could contain plain references to Function's fields, when they don't need any change, and only introduce new ones when they do need internal changes).

The strategies you mention here can still be used, though. In fact the relative-from-start source location was what I implemented for that specific problem indeed, and I had a similar idea of a scheme for function refs.

Also I think this caching implementation can be kept in a separate crate. There is already a cranelift-codegen feature to enable serde serialization and deserialization of Function, which should be enough.

That would be really nice if it could be a separate crate, yet I am unclear if it's doable because currently the ICC uses lots of cranelift-codegen internals. Will keep this in mind, and we can reconsider this later when closer to the "final" design and implementation.

@bnjbvr
Copy link
Member Author

bnjbvr commented May 17, 2022

@alexcrichton @cfallin I am trying to consolidate the two design suggestions, so rephrasing to make sure I've understood clearly:

  • Alex is proposing to reuse the backing storage that's used for Wasmtime caching of modules, since there's already a nice API that handles that, with safety backed in the function signature and generics.
    • If I have understood clearly: this is something I would like to avoid, since our use case implies having control over the storage itself, and not reuse the on-disk cache directory that wasmtime uses. (In fact, we're not using wasmtime-cache because of that need for control over the backing storage for the cache.) Also, I think that the compute in this case would be one of the methods exposed in cranelift_codegen and proposed by Chris, so it wouldn't really need to be provided by a wasmtime user, then, right?
  • Chris is proposing that cranelift-codegen exposes 2 methods: one to perform the compile step with constants, and then another one to finish the compilation by fixing up the constants.
    • This makes sense to me, and indeed makes it possible to have all sorts of IO providers on top of that. Then from looking at the callers of cranelift_codegen::compile, I think that we'd need a way to specify another wasmtime_environ::compilation::Compiler impl in Module::build_artifacts, so we can use the ICC methods provided by cranelift-codegen there. Am I understanding this correctly? If so, would refactoring the Engine's code to use a Box<dyn Compiler> instead of the concrete wasmtime_environ::Compiler impl make sense, as well as providing a public API to set a dyn Compiler impl that should be used at runtime? (If not, I'm happy to chat on Zulip!)

@alexcrichton
Copy link
Member

To clarify I'm talking more about methods of caching moreso than using specific code exactly as-is without any modifications. The strategy used by wasmtime-cache I feel is pretty solid and we should be quite deliberate about migrating away from that if we do.

I do think it would be a not-great state to be in to have two caching systems at the end of this, so I think in the long run we don't want two entirely separate systems for caching things that know nothing of each other. In the short-to-mid term though it's fine to develop whatever and refactor things later to share more code.

@bjorn3
Copy link
Contributor

bjorn3 commented May 17, 2022

Say, when it comes to source locations, this means that first we'd transform all source locations into relative (from start) ones, then if the ICC isn't used, re-transform them back later into absolute source locations (for the MachRelocs to be correct)

We already need transforming between the output format of cranelift and the serialized format of wasmtime. It is just a cheap addition which can be folded into this transformation. Basically just add the addition here:

FilePos::new(loc.bits())

so I tend to slightly prefer the opposite choice which is to have bespoke "mirror" data structures for the compilation cache (which could contain plain references to Function's fields, when they don't need any change, and only introduce new ones when they do need internal changes).

That would make the incremental compilation cache depend on the original version of the function, right? I'm afraid that would turn hard to debug bugs like is normally the case for incremental compilation related bugs into straight up nightmares.

@bnjbvr
Copy link
Member Author

bnjbvr commented May 17, 2022

It is just a cheap addition which can be folded into this transformation. Basically just add the addition here:

I guess it's not "just" that :-). In practice every transitive field that contains a SourceLoc in MachCompileResult should get the same treatment, so that also means tweaking mach_reloc_to_reloc, mach_trap_to_trap, etc. That also means moving the responsibility of doing those fixups from Cranelift to the consumer of the compiled artifact, when the initial intent was to make it as easy as possible for every Cranelift user to have something that works out of the box without requiring them to do extra work. (This argument has limited applicability as the two main users are wasmtime and cg_clif, but still hold in theory at least.)

I do hear your concerns about running into incremental compilation related bugs, though, which would be mitigated by the mix of static typing sanity + fuzzing; any other idea to make it more robust is appreciated here too.

@cfallin
Copy link
Member

cfallin commented May 17, 2022

To unify things a bit: I think that we could, and probably should, use the same caching mechanism for both the Wasmtime module-level cache and any Cranelift function-level cache. I was mainly concerned above about flexibility (unsure with a synchronous API like wasmtime-cache: how would one use an async external storage provider, e.g. a key-value store across a network?).

I agree with @alexcrichton that we should design to make it impossible to get cache keys wrong. To that end if we have a "put together the pieces yourself" option, I think that we can still make it completely safe by storing the expected CLIF hash in the cached value as well, and checking that on return. In essence the interface that I think provides maximum flexibility is something like:

  • Context::compile_to_cacheable() -> (CacheKey, Vec<u8>): from a Cranelift function, get bytes and a cache key
  • Context::get_expected_cache_key() -> CacheKey: from a Cranelift function, get the cache key for the cached compilation result we need, if any
  • Context::compile_maybe_with_cache(bytes: Option<&[u8]>) -> MachCompileResult: from a Cranelift function, compile to machine code. If the given bytes are a valid compilation result from compile_to_cacheable() with a matching hash to what's embedded in the bytes, then use them; otherwise fall back to from-scratch compilation. Also fall back to from-scratch if cached result isn't usable due to e.g. an offset going out-of-bounds with vmctx shifts.

This is fully "dynamically safe" in that the cached data is only used if the cache key matches, and the cache key production is internal to Cranelift and opaque to outside. We can then use the type-safety ideas that Ben mentions (wrapper types on fields) to make that CacheKey correct.

This together with the fuzzing to verify correctness of incremental compilation I suspect is enough, and this whole approach grants the flexibility that Ben's case needs.

Then tying this back to Wasmtime: wasmtime-cache could potentially use these building blocks to cache functions as well as whole modules; or an advanced user of Cranelift could use them directly. Thoughts?

@bjorn3
Copy link
Contributor

bjorn3 commented May 17, 2022

I guess it's not "just" that :-). In practice every transitive field that contains a SourceLoc in MachCompileResult should get the same treatment, so that also means tweaking mach_reloc_to_reloc, mach_trap_to_trap, etc.

mach_reloc_to_reloc ignores srcloc and so does mach_trap_to_trap. collect_address_maps seems to be the only that actually reads srcloc.

@bnjbvr
Copy link
Member Author

bnjbvr commented May 23, 2022

mach_reloc_to_reloc ignores srcloc and so does mach_trap_to_trap. collect_address_maps seems to be the only that actually reads srcloc.

Indeed, I even started a patch to remove those as they're really unused in the code base,... and then I recalled that Spidermonkey would use those (if it integrated with the new backend). Anyhow, I don't still think it would make for a good API to request any cranelift embedder to do this kind of transform themselves, when it can just be done in Cranelift directly, so I'm going to stick with my plan (which is more aligned with what Chris suggests too), incorporating the good ideas from yours :)

@bjorn3
Copy link
Contributor

bjorn3 commented May 23, 2022

Indeed, I even started a patch to remove those as they're really unused in the code base,... and then I recalled that Spidermonkey would use those (if it integrated with the new backend).

mach_reloc_to_reloc and friends are in wasmtime-cranelift so spidermonkey can't use them even if it wanted to. They are the translation between Cranelift's representation and Wasmtime's serializable representation of the compilation results.

@bnjbvr
Copy link
Member Author

bnjbvr commented May 23, 2022

I was talking about the srcloc fields in MachTrap et al. In fact I've received confirmation from the Spidermonkey folks that they might not need it, so will open a PR to remove these, as an incremental step towards this.

(Sorry all for the noise caused by this sidetracked discussion, and @bjorn3 happy to follow up in Zulip if you'd like to chat more about this.)

bnjbvr added a commit to bnjbvr/wasmtime that referenced this issue Jun 13, 2022
This fixes a bug when the `cold` field would not be serialized, since
we're using a custom (de)serializer for `Layout`. This is now properly
handled by adding a boolean in the serialized stream.

This was caught during the work on bytecodealliance#4155, as this would result in cache
mismatches between a function and itself.
cfallin pushed a commit that referenced this issue Jun 13, 2022
…4265)

This fixes a bug when the `cold` field would not be serialized, since
we're using a custom (de)serializer for `Layout`. This is now properly
handled by adding a boolean in the serialized stream.

This was caught during the work on #4155, as this would result in cache
mismatches between a function and itself.
bnjbvr added a commit that referenced this issue Aug 12, 2022
This is the implementation of #4155, using the "inverted API" approach suggested by @cfallin (thanks!) in Cranelift, and trait object to provide a backend for an all-included experience in Wasmtime. 

After the suggestion of Chris, `Function` has been split into mostly two parts:

- on the one hand, `FunctionStencil` contains all the fields required during compilation, and that act as a compilation cache key: if two function stencils are the same, then the result of their compilation (`CompiledCodeBase<Stencil>`) will be the same. This makes caching trivial, as the only thing to cache is the `FunctionStencil`.
- on the other hand, `FunctionParameters` contain the... function parameters that are required to finalize the result of compilation into a `CompiledCode` (aka `CompiledCodeBase<Final>`) with proper final relocations etc., by applying fixups and so on.

Most changes are here to accomodate those requirements, in particular that `FunctionStencil` should be `Hash`able to be used as a key in the cache:

- most source locations are now relative to a base source location in the function, and as such they're encoded as `RelSourceLoc` in the `FunctionStencil`. This required changes so that there's no need to explicitly mark a `SourceLoc` as the base source location, it's automatically detected instead the first time a non-default `SourceLoc` is set.
- user-defined external names in the `FunctionStencil` (aka before this patch `ExternalName::User { namespace, index }`) are now references into an external table of `UserExternalNameRef -> UserExternalName`, present in the `FunctionParameters`, and must be explicitly declared using `Function::declare_imported_user_function`.
- some refactorings have been made for function names:
  - `ExternalName` was used as the type for a `Function`'s name; while it thus allowed `ExternalName::Libcall` in this place, this would have been quite confusing to use it there. Instead, a new enum `UserFuncName` is introduced for this name, that's either a user-defined function name (the above `UserExternalName`) or a test case name.
  - The future of `ExternalName` is likely to become a full reference into the `FunctionParameters`'s mapping, instead of being "either a handle for user-defined external names, or the thing itself for other variants". I'm running out of time to do this, and this is not trivial as it implies touching ISLE which I'm less familiar with.

The cache computes a sha256 hash of the `FunctionStencil`, and uses this as the cache key. No equality check (using `PartialEq`) is performed in addition to the hash being the same, as we hope that this is sufficient data to avoid collisions.

A basic fuzz target has been introduced that tries to do the bare minimum:

- check that a function successfully compiled and cached will be also successfully reloaded from the cache, and returns the exact same function.
- check that a trivial modification in the external mapping of `UserExternalNameRef -> UserExternalName` hits the cache, and that other modifications don't hit the cache.
  - This last check is less efficient and less likely to happen, so probably should be rethought a bit.

Thanks to both @alexcrichton and @cfallin for your very useful feedback on Zulip.

Some numbers show that for a large wasm module we're using internally, this is a 20% compile-time speedup, because so many `FunctionStencil`s are the same, even within a single module. For a group of modules that have a lot of code in common, we get hit rates up to 70% when they're used together. When a single function changes in a wasm module, every other function is reloaded; that's still slower than I expect (between 10% and 50% of the overall compile time), so there's likely room for improvement. 

Fixes #4155.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift Issues related to the Cranelift code generator enhancement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants