diff --git a/text/2603-symbol-name-mangling-v2.md b/text/2603-symbol-name-mangling-v2.md new file mode 100644 index 00000000000..e082ced2adf --- /dev/null +++ b/text/2603-symbol-name-mangling-v2.md @@ -0,0 +1,1156 @@ +- Feature Name: `symbol_name_mangling_v2` +- Start Date: 2018-11-27 +- RFC PR: [rust-lang/rfcs#2603](https://github.com/rust-lang/rfcs/pull/2603) +- Rust Issue: [rust-lang/rust#60705](https://github.com/rust-lang/rust/issues/60705) + +# Summary +[summary]: #summary + +This RFC proposes a new mangling scheme that describes what the symbol +names generated by the Rust compiler look like. This new scheme has a number of +advantages over the existing one which has grown over time without a +clear direction. The new scheme is consistent, depends less on +compiler internals, and the information it stores in symbol names can +be decoded again which provides an improved experience for users of +external tools that work with Rust symbol names. + +Note that, at this point, the new mangling scheme would not be part of +the language specification or the specification of a stable Rust ABI. +In the future it _could_ be part of both and it is designed to be +stable and extensible; but for the time being it would still be an +implementation detail of the Rust compiler. + +# Motivation +[motivation]: #motivation + +Due to its ad-hoc nature, the compiler's current name mangling scheme +has a number of drawbacks: + +- Information about generic parameters and other things is lost in the + mangling process. One cannot extract the type arguments of a + monomorphized function from its symbol name. + +- The current scheme is inconsistent: most paths use + [Itanium ABI][itanium-mangling] style encoding, but some don't. + +- The symbol names it generates can contain `.` characters which is + not generally supported on all platforms. \[[1][gas]\] + \[[2][lld-windows-bug]\] \[[3][thin-lto-bug]\] + +- It depends on compiler internals and its results cannot be replicated + by another compiler implementation or external tool. + +[gas]: https://sourceware.org/binutils/docs/as/Symbol-Names.html#Symbol-Names +[lld-windows-bug]: https://github.com/rust-lang/rust/issues/54190 +[thin-lto-bug]: https://github.com/rust-lang/rust/issues/53912 + +The proposed scheme solves these problems: + +- It encodes information about generic parameters in a reversible way. +- It has a consistent definition that does not rely on pretty-printing + certain language constructs. +- It generates symbols that only consist of the characters `A-Z`, `a-z`, + `0-9`, and `_`. +- While the proposed scheme still contains things that are implementation + defined it has a clearer path towards full name predictability in future. + +These properties should make it easier for third party tools to work +with Rust binaries. + +# Guide-level explanation +[guide-level-explanation]: #guide-level-explanation + +The following section will lay out the requirements for a name mangling +scheme and then introduce the actual scheme through a series of ever +more complex examples. + +## Requirements for a Symbol Mangling Scheme + +A symbol mangling scheme has a few goals, one of them essential, +the rest of them desirable. The essential one is: + +- The scheme must provide an unambiguous string encoding for + everything that can end up in a binary's symbol table. + +"Unambiguous" means that no two distinct compiler-generated entities +(that is, mostly object code for functions) must be mapped to the same +symbol name. This disambiguation is the main purpose of the hash-suffix +in the current, legacy mangling scheme. The scheme proposed here, on +the other hand, achieves it in a way that allows to also satisfy a +number of additional desirable properties of a mangling scheme: + + - A mangled symbol should be *decodable* to some degree. That is, it + is desirable to be able to tell which exact concrete instance of e.g. + a polymorphic function a given symbol identifies. This is true for + external tools, backtraces, or just people only having the binary + representation of some piece of code available to them. With the + current scheme, this kind of information gets lost in the magical + hash-suffix. + + - A mangling scheme should be platform-independent. This is mainly + achieved by restricting the character set to `A-Z`, `a-z`, `0-9`, + `_`. All other characters might have special meaning in some + context (e.g. `.` for MSVC `DEF` files) or are simply not + supported (e.g. Unicode). + + - The scheme should be efficient, meaning that the symbols it + produces are not unnecessarily long (because that takes up space + in object files and means more work for the compiler and the linker). + In addition, generating or demangling a symbol name should not be + too computationally expensive. + + - When used as part of a stable ABI, it should be possible to predict + the symbol name for a given source-level construct. For example, + given the definition `fn foo() { ... }`, the scheme should allow + to construct, by hand, the symbol names for e.g. `foo` or + `foo, ...) -> !>()`. + Since the current scheme generates its hash from the values of + various compiler internal data structures, an alternative compiler + implementation could not predict the symbol name, even for + simple cases. Note that the scheme proposed here does not fulfill + this requirement either (yet) as some things are still left to + the compiler implementation. + +The RFC also has a couple of non-goals: + + - The mangling scheme does not try to be compatible with an existing + (e.g. C++) mangling scheme. While it might sound tempting to encode Rust + symbols with an existing scheme, it is the author's opinion that + the actual benefits are small (C++ tools would not demangle to Rust + syntax, demanglings would be hard to read) and at the same time + supporting a Rust-specific scheme in existing tools seems quite + feasible (many tools like GDB, LLDB, binutils, and valgrind already + have specialized code paths for Rust symbols). + + - The RFC does not try to define a standardized _demangled_ form for + symbol names. It defines the mangled form and makes sure it can be + demangled in an efficient manner but different demanglers still + have some degree of freedom regarding how symbol names are presented + to the user. + + +## The Mangling Scheme by Example + +This section will develop an overview of the mangling scheme by walking +through a number of examples. We'll start with the simplest case -- and +will see how that already involves things that might be surprising. + +### Free-standing Functions and Statics + +A free-standing function is fully identified via its absolute path. +For example, the following function + +```rust +mod foo { + fn bar() {} +} +``` + +has the path `foo::bar` and `NN3foo3bar` is a possible mangling of that path +that complies to the character set we are restricted to. Why this format with +numbers embedded in it? It is a run-length encoding, similar to what the +[Itanium C++ ABI][itanium-mangling] name mangling scheme uses for +identifiers. The scheme proposed here will also use this +format because it does not need termination tokens for identifiers +(which are hard to come by with our limited character set). + +Note that each component in the path (i.e. `foo` and `bar`) also has an +accompanying _start-tag_ (here `N`) at the beginning. This start-tag is +needed in order for the syntax to be able to represent complex, nested +structures as we will see later. + +The symbol name above, unfortunately, does not unambiguously identify the +function in every context. It is perfectly valid for another crate +to also define `mod foo { fn bar() {} }` somewhere. So in order to +avoid conflicts in such cases, the absolute path must always include +the crate-id, as in `NNC7mycrate3foo3bar`. The crate-id has a `C` +start-tag. + +There is another possible ambiguity that we have to take care of. +Rust has two distinct namespaces: the type and the value namespace. +This leads to a path of the form `crate_id::foo::bar` not uniquely +identifying the item `bar` because the following snippet is legal +Rust code: + +```rust +fn foo() { + fn bar() {} +} + +mod foo { + fn bar() {} +} +``` + +The function `foo` lives in the value namespace while the module `foo` +lives in the type namespace. They don't interfere. In order to make the +symbol names for the two distinct `bar` functions unique, we thus add a +namespace identifier to the start-tag of components where necessary, as in +`NvNvC7mycrate3foo3bar` for the first case and `NvNtC7mycrate3foo3bar` +second case (notice the difference: `NvNv...` vs `NvNt...`). + +There is one final case of name ambiguity that we have to take care of. +Because of macro hygiene, multiple items with the same name can appear in +the same context. The compiler internally disambiguates such names by +augmenting them with a numeric index. For example, the first occurrence +of the name `foo` within its parent is actually treated as `foo'0`, the +second occurrence would be `foo'1`, the next `foo'2`, and so one. The +mangling scheme will adopt this setup by prepending a disambiguation +prefix to each identifier with a non-zero index. So if macro expansion +would result in the following code: + +```rust +mod foo { + fn bar'0() {} + // The second `bar` function was introduced by macro expansion. + fn bar'1() {} +} +``` + +then we would encode the two functions symbols as `NvNtC7mycrate3foo3bar` +and `NvNtC7mycrate3foos_3bar` respectively (note the `s_` prefix in the +second case). A very similar disambiguation is needed for avoiding +conflicts between crates of the same name but different versions. The +same syntactic prefix is thus used for crate-id where we encode the +crate disambiguator as in `NtNvCs1234_7mycrate3foo3bar`. Details on +the shape of this prefix are provided in the reference-level description. + +As opposed to C++ and other languages that support function overloading, +we don't need to include function parameter types in the symbol name. +Rust does not allow two functions of the same name but different arguments. + +The final symbol name for the function would also include the prefix +`_R` that is common to all symbol names generated by this scheme: + +``` + _RNvNtCs1234_7mycrate3foo3bar + <>^^^^^<----><------><--><--> + |||||| | | | | + |||||| | | | +--- "bar" identifier + |||||| | | +------- "foo" identifier + |||||| | +------------- "mycrate" identifier + |||||| +-------------------- disambiguator for "mycrate" + |||||+------------------------ start-tag for "mycrate" + ||||+------------------------- namespace tag for "foo" + |||+-------------------------- start-tag for "foo" + ||+--------------------------- namespace tag for "bar" + |+---------------------------- start-tag for "bar" + +----------------------------- common Rust symbol prefix +``` + + +### Generic Functions + +Each monomorphization of a generic function has its own symbol name. +The monomorphizations are disambiguated by the list of concrete generic +arguments. These arguments are added to the symbol name by a pair of `I` +start-tag at the beginning and a list of the actual arguments at the end. +So the instance + +```rust +std::mem::align_of:: +``` + +would be mangled to + +``` +_RINvNtC3std3mem8align_ofdE + ^ ^^ + | || + | |+--- end of argument list + | +--- f64 + +--- start-tag +``` + +where `I` precedes the thing the arguments belong to, `d` designates `f64` +and `E` ends the argument list. As we can see, we need to be able to +represent all kinds of types that can be part of such an argument list. +(In the future, when const generics get added to the language, we will +also need to represent *values*) These kinds of types are: + + - basic types (`char`, `()`, `str`, `!`, `i8`, `i16`, ...) + - reference and pointers types, shared, `mut` and `const` + - tuples + - arrays, with and without fixed size (e.g. `[u8]`, `[u8; 17]`) + - structs, enums, closures, and other named types, possibly with their + own set of type arguments + - function types such as `fn(&i32) -> u16` + - `dyn` traits + +Basic types are all encoded via a single lower-case letter, like in the +Itanium scheme. Named types are encoded as their absolute path +(including arguments) like is done for function symbols. Composites like +references, tuples, and function types all follow a simple grammar given +in the reference-level explanation below. Here are some example manglings +to get a general feel of what they look like: + + - `std::mem::align_of::`: `_RINvNtC3std3mem8align_ofjE` + - `std::mem::align_of::<&char>`: `_RINvNtC3std3mem8align_ofRcE` + - `std::mem::align_of::`: + `_RINvNtC3std3mem8align_ofNtNtC3std3mem12DiscriminantE` + - `std::mem::align_of::<&mut (&str,())>`: `_RINvNtC3std3mem8align_ofQTReuEE` + +There's one more thing we have to take into account for generic functions: +The compiler may produce "crate-local" copies of a monomorphization. +That is, if there is a function `foo` which gets used as `foo` +in two different crates, the compiler (depending on the optimization level) +might generate two distinct functions at the LLVM IR level, each with it's +own symbol. In order to support this without running into conflicts, symbol +names for monomorphizations must include the id of the crate they are +instantiated for. This scheme does this by appending an `` suffix +to the symbol. So for example the mangling for `std::mem::align_of::` +would actually look like this: + +``` +_RINvNtC3std3mem8align_ofjEC3foo (for crate "foo") +_RINvNtC3std3mem8align_ofjEC3bar (for crate "bar") +``` + + +### Closures and Closure Environments + +The scheme needs to be able to generate symbol names for the function +containing the code of a closure and it needs to be able to refer to +the type of a closure if it occurs as a type argument. As closures +don't have a name, we need to generate one. The scheme proposes to +use the namespace and disambiguation mechanisms already introduced +above for this purpose. Closures get their own "namespace" (i.e. +they are neither in the type nor the value namespace), and each closure +has an empty name with a disambiguation index (like for macro hygiene) +identifying them within their parent. The full name of a closure is +then constructed like for any other named item: + +```rust +mod foo { + fn bar(x: u32) { + let a = |x| { x + 1 }; // local name: NC<...>0 + let b = |x| { x + 2 }; // local name: NC<...>s_0 + + a(b(x)) + } +} + +``` + +In the above example we have two closures, the one assigned to `a` +and the one assigned to `b`. The first one would get the local name +`NC<...>0` and the second one the name `NC<...>s_0`. The `0` signifies +the length of their (empty) name. The `<...>` part is the path of the +parent. The `C` is the namespace tag, analogous to the `v` tag for +the value namespace. The `s_` for the second closure is the +disambiguation index (index `0` is, again, encoded by not prepending +a prefix). Their full names would then be `NCNvNtC7mycrate3foo3bar0` +and `NCNvNtC7mycrate3foo3bars_0` respectively. + + +### Methods + +Methods are nested within `impl` or `trait` items. As such it would be +possible to construct their symbol names as paths like +`my_crate::foo::{{impl}}::some_method` where `{{impl}}` somehow identifies +the the `impl` in question. Since `impl`s don't have names, we'd have to +use an indexing scheme like the one used for closures (and indeed, this is +what the compiler does internally). Adding in generic arguments to, this +would lead to symbol names looking like +`my_crate::foo::impl'17::::some_method`. + +However, in the opinion of the author these symbols are very hard to map +back to the method they represent. Consider a module containing dozens of +types, each with multiple `impl` blocks generated via `#[derive(...)]`. +In order to find out which method a symbol maps to, one would have to count +the number of handwritten _and_ macro generated `impl` blocks in the module, +and hope that one correctly guessed the number of `impl` blocks introduced +by the given derive-macro (each macro invocation can introduce `0..n` such +blocks). The name of the method might give a hint, but there are still +likely to be dozens of methods named `clone`, `hash`, `eq`, et cetera. + +The RFC therefore proposes to keep symbol names close to how methods are +represented in error messages, that is: + +- `>::some_method` for inherent methods, and +- ` as SomeTrait>::some_method` for trait methods. + +This can be achieved by extending the definition of paths that we have +used so far. Instead of the path root always being a crate-id, we now +also allow a path to start with an `impl` production that contains the +self-type and (for trait methods) the name of the trait being implemented. + +Thus, this extended form of paths would have the following syntax: + +``` + = C // crate-id root + | M // inherent impl root + | X // trait impl root + | N // nested path + | I {} E // generic arguments +``` + +Here are some examples for complete symbol names: + +``` +>::foo => _RNvMINtC7mycrate3FoomE3foo +::foo => _RNvXmNtC7mycrate3Foo3foo + as mycrate::Bar>::foo => _RNvXINtC7mycrate3FoomEINtC7mycrate3BaryE3foo +``` + + +### Items Within Generic Impls + +In Rust one can define items within generic items, e.g. functions or +impls, like in the following example: + +```rust +struct Foo(T); + +impl From for Foo { + fn from(x: T) -> Self { + static MSG: &str = "..."; + panic!("{}", MSG) + } +} +``` + +The `MSG` here (or any other such nested definition) does not inherit +the generic context from the `impl`. `MSG` is non-generic, and a +function defined in its place would be too. The fully qualified name +of `MSG`, according to our examples so far, is thus +` as std::convert::From<_>>::from::MSG` and its symbol name: + +``` +_RNvNvXINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG +``` + +However, with trait specialization, this symbol can be ambiguous. +Consider the following piece of code: + +```rust +struct Foo(T); + +impl From for Foo { + default fn from(x: T) -> Self { + static MSG: &str = "..."; + panic!("{}", MSG) + } +} + +impl From for Foo { + fn from(x: T) -> Self { + static MSG: &str = "123"; + panic!("{}", MSG) + } +} +``` + +Notice that both `MSG` statics have the path ` as From<_>>::foo::MSG`. +We somehow have to disambiguate the `impls`. We do so by adding the path of +the `impl` to the symbol name. + +``` + = C // crate-id root + | M // inherent impl root + | X // trait impl root + | N // nested path + | I {} E // generic arguments + + = [] +``` + +The two symbol names would then look something like: + +``` +_RNvNvXs2_C7mycrateINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG +_RNvNvXs3_C7mycrateINtC7mycrate3FoopEINtNtC3std7convert4FrompE4from3MSG + <----------><----------------><-----------------------> + impl-path self-type trait-name +``` + +Like other disambiguation information, this path would usually not actually +be shown by demanglers. + +### Unicode Identifiers + +Rust allows Unicode identifiers but our character set is restricted +to ASCII alphanumerics, and `_`. In order to transcode the former to +the latter, we use the same approach as Swift, which is: encode all +non-ASCII identifiers via [Punycode][punycode], a standardized and +efficient encoding that keeps encoded strings in a rather +human-readable format. So for example, the string + +``` +"Gödel, Escher, Bach" +``` + +is encoded as + +``` +"Gdel, Escher, Bach-d3b" +``` + +which, as opposed to something like _Base64_, still gives a pretty +good idea of what the original string looked like. + +Each component of a name, i.e. anything that starts with the number +of bytes to read in the examples above, is encoded individually. +Components encoded this way are augmented with a `u` prefix so that +demanglers know that the identifier needs further decoding. As an +example, the function: + +```rust +mod gödel { + mod escher { + fn bach() {} + } +} +``` + +would be mangled as: + +``` +_RNvNtNtC7mycrateu8gdel_Fqa6escher4bach + <--------> + Unicode component +``` + +### Compression/Substitution + +The length of symbol names has an influence on how much work the compiler, +linker, and loader have to perform. The shorter the names, the better. +At the same time, Rust's generics can lead to rather long names (which +are often not visible in the code because of type inference and +`impl Trait`). For example, the return type of the following function: + +```rust +fn quux(s: Vec) -> impl Iterator { + s.into_iter() + .map(|x| x + 1) + .filter(|&x| x > 10) + .zip(0..) + .chain(iter::once((0, 0))) +} +``` + +is + +```rust +std::iter::Chain< + std::iter::Zip< + std::iter::Filter< + std::iter::Map< + std::vec::IntoIter, + [closure@src/main.rs:16:11: 16:18]>, + [closure@src/main.rs:17:14: 17:25]>, + std::ops::RangeFrom>, + std::iter::Once<(u32, usize)>> +``` + +It would make for a long symbol name if this type is used (maybe +repeatedly) as a generic argument somewhere. C++ has the same problem +with its templates; which is why the Itanium mangling introduces the +concept of compression. If a component of a definition occurs more than +once, it will not be repeated and instead be emitted as a substitution +marker that allows to reconstruct which component it refers to. The +scheme proposed here will use the same approach (but with a simpler +definition). + +The exact scheme will be described in detail in the reference level +explanation below but it roughly works as follows: As a mangled symbol +name is being built, we remember the position of every substitutable item +in the output string, that is, we keep track of things a subsequent +occurrence of which could be replaced by a back reference. + +The things that are eligible for substitution are (1) all prefixes of +paths (including the entire path itself), (2) all types except for +basic types, and (3) type-level constants (array lengths and values +passed to const generic params). + +Here's an example in order to illustrate the concept. The name + +``` +std::iter::Chain, std::vec::IntoIter>> +``` + +is mangled to the following uncompressed string. The lines below show parts +of the mangled string that already occurred before and can thus be replaced +by a back reference. The number of at the beginning of each span given +the 0-based byte position of where it occurred the first time. + +``` + 0 10 20 30 40 50 60 70 80 90 +_RINtNtC3std4iter5ChainINtNtC3std4iter3ZipINtNtC3std3vec8IntoItermEINtNtC3std3vec8IntoItermEEE + 5---- 5---- 5---- + 3----------- 43--------- + 41-------------------- + 40----------------------- +``` + +The compiler is always supposed to use the longest replacement possible +in order to achieve the best compression. The compressed symbol looks +as follows: + +``` +_RINtNtC3std4iter5ChainINtB2_3ZipINtNtB4_3vec8IntoItermEBt_EE + ^^^ ^^^ ^^^ back references +``` + +Back references have the form `B_`. + +# Reference-level explanation +[reference-level-explanation]: #reference-level-explanation + +The reference-level explanation consists of three parts: + +1. A specification of the syntax mangled names conform to. +2. A specification of the compression scheme. +3. A mapping of Rust entities to the mangling syntax. + +For implementing a demangler, only the first two sections are of +interest, that is, a demangler only needs to understand syntax and +compression of names, but it does not have to care about how the +compiler generates mangled names. + + +## Syntax Of Mangled Names + +The syntax of mangled names is given in extended Backus-Naur form: + + - Non-terminals are within angle brackets (as in ``) + - Terminals are within quotes (as in `"_R"`), + - Optional parts are in brackets (as in `[]`), + - Repetition (zero or more times) is signified by curly braces (as in `{}`) + - Comments are marked with `//`. + +Mangled names conform to the following grammar: + +``` +// The specifies the encoding version. + = "_R" [] [] + + = "C" // crate root + | "M" // (inherent impl) + | "X" // (trait impl) + | "Y" // (trait definition) + | "N" // ...::ident (nested path) + | "I" {} "E" // ... (generic args) + | + +// Path to an impl (without the Self type or the trait). +// The is the parent, while the distinguishes +// between impls in that same parent (e.g. multiple impls in a mod). +// This exists as a simple way of ensure uniqueness, and demanglers +// don't need to show it (unless the location of the impl is desired). + = [] + +// The is the length of the identifier in bytes. +// is the identifier itself and must not start with a decimal digit. +// If the "u" is present then is Punycode-encoded. + = [] + = "s" + = ["u"] + +// Namespace of the identifier in a (nested) path. +// It's an a-zA-Z character, with a-z reserved for implementation-internal +// disambiguation categories (and demanglers should never show them), while +// A-Z are used for special namespaces (e.g. closures), which the demangler +// can show in a special way (e.g. `NC...` as `...::{closure}`), or just +// default to showing the uppercase character. + = "C" // closure + | "S" // shim + | // other special namespaces + | // internal namespaces + + = + | + | "K" // forward-compat for const generics + +// An anonymous (numbered) lifetime, either erased or higher-ranked. +// Index 0 is always erased (can show as '_, if at all), while indices +// starting from 1 refer (as de Bruijn indices) to a higher-ranked +// lifetime bound by one of the enclosing s. + = "L" + +// Specify the number of higher-ranked (for<...>) lifetimes to bound. +// can then later refer to them, with lowest indices for +// innermost lifetimes, e.g. in `for<'a, 'b> fn(for<'c> fn(...))`, +// any s in ... (but not inside more binders) will observe +// the indices 1, 2, and 3 refer to 'c, 'b, and 'a, respectively. + = "G" + + = + | // named type + | "A" // [T; N] + | "S" // [T] + | "T" {} "E" // (T1, T2, T3, ...) + | "R" [] // &T + | "Q" [] // &mut T + | "P" // *const T + | "O" // *mut T + | "F" // fn(...) -> ... + | "D" // dyn Trait + Send + 'a + | + + = "a" // i8 + | "b" // bool + | "c" // char + | "d" // f64 + | "e" // str + | "f" // f32 + | "h" // u8 + | "i" // isize + | "j" // usize + | "l" // i32 + | "m" // u32 + | "n" // i128 + | "o" // u128 + | "s" // i16 + | "t" // u16 + | "u" // () + | "v" // ... + | "x" // i64 + | "y" // u64 + | "z" // ! + | "p" // placeholder (e.g. for generic params), shown as _ + +// If the "U" is present then the function is `unsafe`. +// The return type is always present, but demanglers can +// choose to omit the ` -> ()` by special-casing "u". + := ["U"] ["K" ] {} "E" + + = "C" + | + + = {} "E" + = {} + = "p" + = + | "p" // placeholder (e.g. for polymorphic constants), shown as _: T + | + +// The encoding of a constant depends on its type, currently only +// unsigned integers (mainly usize, for arrays) are supported, and they +// use their value, in base 16 (0-9a-f), not their memory representation. + = {} "_" + +// uses 0-9-a-z-A-Z as digits, i.e. 'a' is decimal 10 and +// 'Z' is decimal 61. +// "_" with no digits indicates the value 0, while any other value is offset +// by 1, e.g. "0_" is 1, "Z_" is 62, "10_" is 63, etc. + = {<0-9a-zA-Z>} "_" + + = "B" + +// We use here, so that we don't have to add a special rule for +// compression. In practice, only a crate root is expected. + = +``` + +### Namespace Tags + +Namespaces are identified by an implementation defined single character tag +(the `` production). Only closures (`C`) and shims (`S`) have a +specific character assigned to them so that demanglers can reliable +adjust their output accordingly. Other namespace tags have to be omitted +or shown verbatim during demangling. + +This is a concession to the compiler's current implementation. While the +language only knows two namespaces (the type and the value namespace), the +compiler uses many more in some important data structures and disambiguation +indices are assigned according to these internal data structures. So, in +order not to force the compiler to waste processing time on re-constructing +different disambiguation indices, the internal unspecified "namespaces" are +used. This may change in the future. + +### Type-Level Constants + +As described above, the grammar encodes constant values via the +` = {} "_"` production, where `{}` is +the numeric value of the constant, not its representation as bytes. Using +the numeric value is platform independent but does not easily scale to +non-integer data types. + +In the future it is likely that Rust will support complex type-level +constants (i.e. not just integers). This RFC suggests to develop a +proper mangling for these as part of the future const-generics work, +and, for now, only define a mangling for integer values. + +### Punycode Identifiers + +Punycode generates strings of the form `([[:ascii:]]+-)?[[:alnum:]]+`. +This is problematic for two reasons: + +- Generated strings can contain a `-` character; which is not in the + supported character set. +- Generated strings can start with a digit; which makes them clash + with the byte-count prefix of the `` production. + +For these reasons, vanilla Punycode string are further encoded during mangling: + +- The `-` character is simply replaced by a `_` character. +- The part of the Punycode string that encodes the non-ASCII characters + is a base-36 number, using `[a-z0-9]` as its "digits". We want to get + rid of the decimal digits in there, so we simply remap `0-9` to `A-J`. + +Here are some examples: + +| Original | Punycode | Punycode + Encoding | +|-----------------|-----------------|---------------------| +| føø | f-5gaa | f_Fgaa | +| α_ω | _-ylb7e | __ylbHe | +| 铁锈 | n84amf | nIEamf | +| 🤦 | fq9h | fqJh | +| ρυστ | 2xaedc | Cxaedc | + +With this post-processing in place the Punycode strings can be treated +like regular identifiers and need no further special handling. + + +## Compression + +Symbol name compression works by substituting parts of the mangled +name that have already been seen for a back reference. Compression +is directly built into the mangling algorithm, as shown by the +following piece of pseudocode: + +```rust +fn mangle(node, output_string, substitution_dictionary) { + if let Some(backref) = substitution_dictionary.get(node) { + // Emit the backref instead of the node's contents + mangle(backref, output_string) + } else { + // Remember where the current node starts in the output + let start_position = output_string.len() + + // Do the actual mangling, including recursive mangling of child nodes + + // Add the current node to the substitution dictionary + if node.is_substitutable() { + substitution_dictionary.insert(node, start_position) + } + } +} +``` + +This algorithm automatically chooses the best compression because +parent nodes (which are always larger) are visited before child +nodes. + +Note that this kind of compression relies on the fact that all +substitutable AST nodes have a self-terminating mangled form, +that is, given the start position of the encoded node, the +grammar guarantees that it is always unambiguous where the +node ends. This is ensured by not allowing optional or +repeating elements at the end of substitutable productions. + + +### Decompression + +Decompression too is built directly into demangling/parsing. When a back +reference is encountered, we decode the referenced position and use a +temporary demangler/parser to do the decoding of the node's actual content: + +```rust +fn demangle_at(&mut pos, mangled, output_string) { + if is_backref(*pos, mangled) { + // Read the byte offset of the referenced node and + // advance `pos` past the backref. + let mut referenced_pos = decode(pos, mangled); + demangle_at(&mut referenced_pos, mangled, output_string) + } else { + // do regular demangling + } +} +``` + +Using byte offsets as backref keys (as this RFC does) instead of post-order +traversal indices (as Itanium mangling does) has the advantage that the +demangler does not need to duplicate the mangler's substitution indexing logic, +something that can become quite complex (as demonstrated by the compression +scheme proposed in the initial version of this RFC). + +### A Note On Implementing Efficient Demanglers + +The mangling syntax is constructed in a way that allows for implementing +efficient demanglers: + + - Mangled names contain information in the same order as unmangled + names are expected to contain it. Therefore, a demangler can directly + generate its output while parsing the mangled form. There is no need + to explicitly instantiate the AST in memory. + + - The same is true for decompression. Decompression can be done without + allocating memory outside of the stack. Alternatively the demangler + can keep a simple array that maps back-ref indices to ranges in the + already generated output. When it encounters a `` in need + of expansion, it can just look up corresponding range and do a + simple `memcpy`. + +Parsing, decompression, and demangling can thus be done in a single pass +over the mangled name without the need for complex data structures, which +is useful when having to implement `#[no_std]` or C demanglers. (Note that +Punycode can complicate decoding slightly because it needs dynamic memory +allocation in the general case but it can be implemented with an on-stack +buffer for a reasonable maximum supported length). + +## Mapping Rust Language Entities to Symbol Names + +This RFC suggests the following mapping of Rust entities to mangled names: + +- Named functions, methods, and statics shall be represented by a + `` production. + +- Paths should be rooted at the inner-most entity that can act + as a path root. Roots can be crate-ids, inherent impls, trait impls, and + (for items within default methods) trait definitions. + +- The compiler is free to choose disambiguation indices and namespace tags from + the reserved ranges as long as it ascertains identifier unambiguity. + +- Generic arguments that are equal to the default should not be encoded in + order to save space. + + +# Drawbacks +[drawbacks]: #drawbacks + +Why should we *not* do this? + +- The current/legacy scheme based on symbol-hashes is flexible in that + hashes can be changed at will. That is, the unstable part of the + current mangling scheme is nicely contained and does not keep breaking + external tools. The danger of breakage is greater with the scheme + proposed here because it exposes more information. + + +# Rationale and alternatives +[rationale-and-alternatives]: #rationale-and-alternatives + +The alternatives considered are: + + 1. Keeping the current scheme. It does meet the minimum requirements + after all. However, the general consensus seems to be that this + leads to situations where people are unpleasantly surprised when + they come across (demangled) symbol names in backtraces or profilers. + + 2. Keeping the current scheme but cleaning it up by making the non-hash + part more consistent and more expressive. Keep the hash part as a + safeguard against symbol conflicts and the rest as something just + for demangling. The downside of this is that the hash would still + not be predictable, and symbols would get rather long if they should + contain more human-readable information about generic arguments. + + 3. Define a standardized pretty-printing format for things that end up + as symbols, and then encode that via Punycode in order to meet the + character set restrictions. This would be rather simple. Symbol names + would remain somewhat human-readable (but not very, because all + separators would be stripped out). But without some kind of additional + compression, symbol names would become rather long. + + 4. Use the scheme from the previous bullet point but apply the compression + scheme described above. We could do this but it wouldn't really be less + complex than the scheme proposed by the RFC. + + 5. Define a standardized pretty-printing format for things that end up as + symbols, compress with `zstd` (specially trained for Rust symbols) and + encode the result as `base63`. This is rather simple but loses all + human-readability. It's unclear how well this would compress. It would + pull the `zstd` specification into the mangling scheme specification, + as well as the pre-trained dictionary. + +# Prior art +[prior-art]: #prior-art + +One of the major modern mangling schemes with a public specification is the +[Itanium C++ ABI][itanium-mangling] scheme for C++ which is used by the GCC +toolchain. An initial version of this RFC sticked closely to Itanium mangling, +however, the latest version only retains the run-length encoding for +identifiers and some literals for tagging things like basic types. The +Itanium scheme has been criticized for being overly complex, due to its +extensive grammar and two separate compression schemes. + +The idea of using [Punycode][punycode] for handling of unicode identifiers +is taken from the [Swift][swift-gh] programming language's +[mangling scheme][swift-mangling]. + +[punycode]: https://tools.ietf.org/html/rfc3492 +[itanium-mangling]: http://refspecs.linuxbase.org/cxxabi-1.86.html#mangling +[swift-gh]: https://github.com/apple/swift +[swift-mangling]: https://github.com/apple/swift/blob/master/docs/ABI/Mangling.rst#identifiers + + +# Unresolved questions +[unresolved-questions]: #unresolved-questions + +### Punycode vs UTF-8 +During the pre-RFC phase, it has been suggested that Unicode identifiers +should be encoded as UTF-8 instead of Punycode on platforms that allow it. +GCC, Clang, and MSVC seem to do this. The author of the RFC has a hard +time making up their mind about this issue. Here are some interesting +points that might influence the final decision: + +- Using UTF-8 instead of Punycode would make mangled strings containing + non-ASCII identifiers a bit more human-readable. For demangled strings, + there would be no difference. + +- Punycode support is non-optional since some platforms only allow a very + limited character set for symbol names. Thus, we would be using UTF-8 on + some platforms and Punycode on others, making it harder to predict what a + symbol name for a given item looks like. + +- Punycode encoding and decoding is more runtime effort for the mangler + and demangler. + +- Once a demangler supports Punycode, it is not much effort to support + both encodings. The `u` identifier prefix tells the demangler whether + it is Punycode. Otherwise it can just assume UTF-8 which already + subsumes ASCII. + +**UPDATE**: This RFC recommends that Punycode encoded identifiers must +be supported by demanglers but that it is up to the compiler implementation +(for now) to decide whether to use it for a given platform. This question +will have to be revisited if Rust ever wants to define a stable ABI. + +### Encoding parameter types for function symbols + +It has been suggested that parameter types for functions and methods should +be encoded in mangled form too. This is not necessary for symbol name +uniqueness but it would provide an additional safeguard against silent +ABI-related errors where definition and callers of some function make +different assumptions about what parameters a function takes. The RFC +does not propose to do this because: + + - Rust makes sure this cannot happen via crate metadata, + - it would make symbol names longer, and + - only some but not all ABI related errors are caught by the safeguard. + +However, a final decision on the topic has not been made yet. + +**UPDATE**: This RFC suggests that parameter types are *not* encoded into +function and method symbols. Symbol names will already get significantly +longer due to encoding additional information and the additional +safeguard provided against ABI mismatches is less relevant for Rust +than it is for other languages that don't have a concept of +library/crate metadata. + + +# Appendix A - Suggested Demangling + +This RFC suggests that names are demangled to a form that matches Rust +syntax as it is used in source code, compiler error messages and `rustdoc`: + + - Path components should be separated by `::`. + + - If the path root is a `` it should be printed as the crate name. + If the context requires it for correctness, the crate disambiguator can be + printed too, as in, for example, `std[a0b1c2d3]::collections::HashMap`. + In this case `a0b1c2d3` would be the disambiguator. Usually, the + disambiguator can be omitted for better readability. + + - If the path root is an impl, it should be printed as `` (for + inherent impls) or `` (for trait impls), like the + compiler does in error messages. The `` also contained in the + AST node should usually be omitted. + + - The list of generic arguments should be demangled as ``. + + - Identifiers can have a numeric disambiguator + (the `` production). The syntactic version of the numeric + disambiguator maps to a numeric index. If the disambiguator is not + present, this index is 0. If it is of the form `s_` then the index is 1. + If it is of the form `s_` then the index is + ` + 2`. The suggested demangling of a disambiguator is + `[]`. However, for better readability, these disambiguators + should usually be omitted in the demangling altogether. Disambiguators + with index zero can _always_ be omitted. + + The exception here are closures. Since these do not have a name, the + disambiguator is the only thing identifying them. The suggested + demangling for closures is thus `{closure}[]`. + +# Appendix B - Examples + +We assume that all examples are defined in a crate named `mycrate[1234]`. + + +### Free-standing Item + +```rust +mod foo { + mod bar { + fn baz() {} + } +} +``` +- unmangled: `mycrate::foo::bar::baz` +- mangled: `_RNvNtNtCs1234_7mycrate3foo3bar3baz` + + +### Item Defined In Inherent Method + +```rust +struct Foo(T); + +impl Foo { + pub fn bar(_: U) { + static QUUX: u32 = 0; + // ... + } +} +``` + +- unmangled: `>::bar::QUUX` +- mangled: `_RNvNvMCs1234_7mycrateINtCs1234_7mycrate3FoopE3bar4QUUX` + + +### Item Defined In Trait Method + +```rust +struct Foo(T); + +impl Clone for Foo { + fn clone(&self) -> Self { + static QUUX: u32 = 0; + // ... + } +} +``` +- unmangled: ` as std::clone::Clone>::clone::QUUX` +- mangled: `_RNvNvXCs1234_7mycrateINtCs1234_7mycrate3FoopENtNtC3std5clone5Clone5clone4QUUX` + + +### Item Defined In Initializer Of A Static + +```rust +pub static QUUX: u32 = { + static FOO: u32 = 1; + FOO + FOO +}; +``` +- unmangled: `mycrate::QUUX::FOO` +- mangled: `_RNvNvCs1234_7mycrate4QUUX3FOO` + + +### Compressed Prefix Constructed From Prefix That Contains A Substitution Itself - TODO +- unmangled: `mycrate::foo` +- mangled: `_RINvCs1234_7mycrate3fooNvB4_3barNvBn_3bazE` + + +### Progressive type compression +- unmangled: `std::foo<(std::Bar,std::Bar),(std::Bar,std::Bar)>` +- mangled: `_RINxC3std3fooTNyB4_3BarBe_EBd_E` + + +# Appendix C - Change LOG +- Removed mention of Itanium mangling in introduction. +- Weakened "predictability" goal. +- Removed non-goal of not providing a mangling for lifetimes. +- Added non-goal for not trying to standardize the demangled form. +- Updated specification and examples to new grammar as proposed by eddyb. +- `impl` disambiguation strategy changed to using the impl path instead of param bounds. +- Updated prior art section to not say this RFC is an adaptation of Itanium mangling. +- Updated compiler's expected assignment of disambiguation indices and namespace tags. +- Removed "complexity" drawback since the scheme is not very complex anymore. +- Removed unresolved question "Re-use `` for crate disambiguator". +- Added note about default generic arguments to reference-level-explanation. +- Added note about Punycode making decoding more complicated. +- Resolve question of complex constant data. +- Add a recommended resolution for open question around Punycode identifiers. +- Add a recommended resolution for open question around encoding function parameter types.