Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trim-paths: Remap rules for different dependency kinds #13171

Open
weihanglo opened this issue Dec 14, 2023 · 25 comments
Open

trim-paths: Remap rules for different dependency kinds #13171

weihanglo opened this issue Dec 14, 2023 · 25 comments
Labels
S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted. Z-trim-paths Nightly: path sanitization

Comments

@weihanglo
Copy link
Member

weihanglo commented Dec 14, 2023

This is part of #12137

As of 6982b44, the remap of -Ztrim-paths is relatively simple. It follows what is described in RFC 3127:

For the the current package (where the current working directory is in), from the the absolute path of the package root to empty string. For other packages, from the absolute path of the package root to [package name]-[package version].

In Cargo, that is:

// Remapped to path relative to workspace root:
//
// * path dependencies under workspace root directory
//
// Remapped to `<pkg>-<version>`
//
// * registry dependencies
// * git dependencies
// * path dependencies outside workspace root directory

However, this is not ideal for debugger. Several questions emerge in rust-lang/rust#111540 (comment).

  • Impossible to manually configure remap for debugger — People need to configure one remap rules per per dependency. That is not possible for people to do it manually for each <pkg>-<version>/ back to absolute paths. It would require a dedicated tool to restore all the paths.
  • The same issue also affects path dependency remap. For path dependencies outside the current workspace root, we remap paths to <pkg>-<version> as well.
  • There is no way to distinguish relative path from different Cargo workspace — For path dependencies inside a workspace Cargo currently remaps to paths relative to workspace root. Imagine you have multiple Cargo workspaces and there is one non-Rust root project compiles them together. Now all your debuginfo files are messed up since they were remapped to relative path like src/some/file.rs.

Possible options

Registry and git dependencies

Updated as suggested in #13171 (comment)

In rust-lang/rust#111540 (comment), @michaelwoerister brought up a smarter remap strategy. We could remap paths from $CARGO_HOME/registry/src and $CARGO_HOME/git/checkouts to a empty string for registry and git dependencies respectively, so that there is a finite set of rule (two exactly) to configure in debugger for each platform. That is to say, cargo could unconditionally pass these to the compiler:

  • --remap-path-prefix=$CARGO_HOME/.cargo/registry/src=
  • --remap-path-prefix=$CARGO_HOME/.cargo/git/checkouts=

(or conditionally if we want to reduce number of args passed in to rustc)

That is, example

/home/user/.cargo/registry/src/index.crates.io-6f17d22bba15001f/serde-1.0.193/src/lib.rs
-> index.crates.io-6f17d22bba15001f/serde-1.0.193/src/lib.rs

/home/user/.cargo/git/checkouts/cargo-new-release-57ee03ad5db79cc6/589ded5/src/lib.rs
-> cargo-new-release-57ee03ad5db79cc6/589ded5/src/lib.rs

To restore source paths for debuggers, you only need to set two paths. For example in gdb, set

directory /home/user/.cargo/registry/src
directory /home/user/.cargo/git/checkouts

Path dependencies

For path dependencies, I wonder if we can always remap to relative path to workspace root, regardless whether the package is inside or outside the workspace. Developers should be responsible for locals dependencies, and if they want, they are able to have their own remap rules via --remap-path-prefix.

Other sources

Vendor and local-registry dependencies

They are both sort of local dependencies. I would suggest treating them as path dependencies. We can always apply new rules to them later on.

Patched dependencies

Just apply all the suggested remap rule above, if not sufficient, people can write their own rule via --remap-path-prefix.

@weihanglo weihanglo added S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted. Z-trim-paths Nightly: path sanitization labels Dec 14, 2023
@epage
Copy link
Contributor

epage commented Dec 15, 2023

We should make sure we keep the comment from https://github.com/rust-lang/cargo/blob/master/src/cargo/util/workspace.rs#L97 in mind

/// The path that we pass to rustc is actually fairly important because it will
/// show up in error messages (important for readability), debug information
/// (important for caching), etc. As a result we need to be pretty careful how we
/// actually invoke rustc.
///
/// In general users don't expect `cargo build` to cause rebuilds if you change
/// directories. That could be if you just change directories in the package or
/// if you literally move the whole package wholesale to a new directory. As a
/// result we mostly don't factor in `cwd` to this calculation. Instead we try to
/// track the workspace as much as possible and we update the current directory
/// of rustc/rustdoc where appropriate.
///
/// The first returned value here is the argument to pass to rustc, and the
/// second is the cwd that rustc should operate in.

I think we're fine but wanted to make sure we keep this in mind

@epage
Copy link
Contributor

epage commented Dec 15, 2023

In <rust-lang/rust#111540 (comment), @michaelwoerister brought up a smarter remap strategy. We could remap paths from $CARGO_HOME/registry/src and $CARGO_HOME/git/checkouts to a fixed string for registry and git dependencies respectively, so that there is a finite set of rule (two exactly) to configure in debugger. That is to say, cargo could unconditionally pass these to the compiler:

  • --remap-path-prefix=$CARGO_HOME/.cargo/registry/src=registry@

  • --remap-path-prefix=$CARGO_HOME/.cargo/git/checkouts=git@

For registry, the path prefix would be things like index.crates.io-6f17d22bba15001f/winnow-0.5.26

For git, the path prefix would be things like winnow-110da4674123f04f/

What are the stability guarantees for those paths?

@epage
Copy link
Contributor

epage commented Dec 15, 2023

Vendor and local-registry dependencies

They are both sort of local dependencies. I would suggest treating them as path dependencies. We can always apply new rules to them later on.

I wonder what code path these would go down and what the "for free" option is. If its registries, I expect the do-nothing would be good enough.

@weihanglo
Copy link
Member Author

weihanglo commented Dec 15, 2023

For registry, the path prefix would be things like index.crates.io-6f17d22bba15001f/winnow-0.5.26

For git, the path prefix would be things like winnow-110da4674123f04f/

What are the stability guarantees for those paths?

They should be fairly stable on each platform across different Rust versions. See here and here.

Granted, we should have a test to verify short_name is stable.

Vendor and local-registry dependencies
They are both sort of local dependencies. I would suggest treating them as path dependencies. We can always apply new rules to them later on.

I wonder what code path these would go down and what the "for free" option is. If its registries, I expect the do-nothing would be good enough.

let is_local = unit.pkg.package_id().source_id().is_path();

I was wrong with local-registry. Unpacked sources go into the same ~/.cargo/registry/src directory for local registry, so no special treatment needed.

People might want to remap vendor sources, as they might do release builds with them. However, from the above code path cargo only knows the pkg is from a Registry source, since in fact it is a ReplacedSource. We have no way to tell whether a package is from a vendor source.

I still think it's better to remap to relative paths for vendor sources, as they are designed to be used as local files. If people can get free debugging experience without configuring remap restore, then why not?

@michaelwoerister
Copy link
Member

Thanks for picking this up, @weihanglo!

  • --remap-path-prefix=$CARGO_HOME/.cargo/registry/src=registry@
  • --remap-path-prefix=$CARGO_HOME/.cargo/git/checkouts=git@

Please note that the registry@ and git@ prefixes would complicate things again. Remapped paths would then look like:

registry@index.crates.io-6f17d22bba15001f\regex-1.10.2\src\lib.rs

instead of just

index.crates.io-6f17d22bba15001f\regex-1.10.2\src\lib.rs

A debugger would look for a file registry@index.crates.io-6f17d22bba15001f\regex-1.10.2\src\lib.rs in its source directories, but would fail finding a directory called registry@index.crates.io-6f17d22bba15001f because it's actually called index.crates.io-6f17d22bba15001f.

@michaelwoerister
Copy link
Member

They should be fairly stable on each platform across different Rust versions. See here and here.

Granted, we should have a test to verify short_name is stable.

Stability across platforms and across time would indeed be desirable. The test case linked above excludes a few platforms. The compiler has an example of a hasher the gives the same result on different kinds of platforms.

@weihanglo
Copy link
Member Author

Please note that the registry@ and git@ prefixes would complicate things again.

Hmm… Maybe I understand the original concern wrong 🤔 ?

I would expect remapped paths to be like

registry@/index.crates.io-6f17d22bba15001f\regex-1.10.2\src\lib.rs

So that we can restore the prefix-map by searching a specific prefix like registry@. For example,

# in gdb
set substitute-path registry@ /home/user/.cargo/registry/src
# in lldb
settings set target.source-map git@ /home/user/.cargo/git/checkouts

That makes the substitution rules down to 2.

@weihanglo
Copy link
Member Author

The compiler has an example of a hasher the gives the same result on different kinds of platforms.

Good reference! Thank you! I guess cargo might want to adopt some techniques there to have a stable hash across platforms (though it may break things on some platforms).

Despite that, is having a cross-platform stable hash in debuginfo a hard requirement? I assume things like DWARF are platform/architecture dependant. We cannot use DWARF generated on aarch64 to debug on x86_64. But again, I might be wrong 😞.

@michaelwoerister
Copy link
Member

That makes the substitution rules down to 2.

Yes, that looks correct to me. The only change I'm suggesting is to omit the registry@ and git@ prefixes, so that one doesn't need the substitution feature. So instead of doing this in GDB:

set substitute-path registry@ /home/user/.cargo/registry/src

we would do this:

directory /home/user/.cargo/registry/src

or add -d /home/user/.cargo/registry/src to GDB's command line.

That should have the same effect, but path substitution is less widely supported than specifying a source path. For example, I was not able to find an equivalent feature in WinDbg or the MSVC debugger. And when using any kind of source server, having something that does not match the file system path will likely also lead to complications. It's quite possible that all of these issues could be worked around somehow, but why make things more complicated than necessary? I'd like trim-paths to be the default (at least for release builds), and I think that will be easier if we remove as much friction as possible.

Despite that, is having a cross-platform stable hash in debuginfo a hard requirement?

I'm not 100% sure if it has to be a hard requirement -- but it might also be something where we can prevent headaches down the line. The hash value basically represents a URL/URI, right? Registries in general are not platform specific, so there is no reason why an identifier for a registry should be platform specific.

And there are quite a few cross platform scenarios involving debuginfo. Analyzing crashdumps from platform X on platform Y is pretty common, or cross-building from one platform to another. E.g. most 32-bits builds probably happen on 64-bit hosts, right?

I think it might be worth biting the bullet now and aligning these "short-names" on all platforms.

@weihanglo
Copy link
Member Author

Registries in general are not platform specific, so there is no reason why an identifier for a registry should be platform specific.

Totally agree on this. We might want to make hashes cross-platform stable. Just that would invalidate source caches on some platforms.

Do you think this is a major blocker on -Ztrim-paths? I am seeing that this could be a follow-up even after -Ztrim-paths stabilized, if we don't get stable hashes in time.

@epage
Copy link
Contributor

epage commented Dec 27, 2023

Sorry I haven't gotten back to this.

Maybe to put my earlier question in a different way.

The old prefixes basically said "this source is from some place, get it yourself". The new format seems to imply that the user should expect the source to be made available in CARGO_HOME (by doing a cargo check first?).

How strong do we expect these expectations to be, particularly

  • CARGO_HOME layout
  • The meaning of the path prefixes for lookup in CARGO_HOME or for manually looking up the code

For example,

  • We are looking at transitioning CARGO_HOMEs layout to be different, see https://internals.rust-lang.org/t/pre-rfc-split-cargo-home/19747
  • If cargo changes anything about hash, then this would no longer work unless rustc versions match before doing cargo check
    • Maybe remap is enough?
  • If a user wants to debug locally without a local project, what steps are expected for them to do so and are we ok with any assumptions made, particularly when automating?

@weihanglo
Copy link
Member Author

How strong do we expect these expectations to be

I am pretty sure that we don't want to change the layout too often, as it may hurt caches. It is also the reason that we want a cross-platform stable hash for the path prefixes. That said, if necessary, we can definitely change the layout.

If a user wants to debug locally without a local project, what steps are expected for them to do so and are we ok with any assumptions made, particularly when automating?

I don't fully get this. But the assumption is the user has an existing CARGO_HOME with source code. And they only need to configure directory /home/user/.cargo/registry/src for their debugger.
I understand the RFC as “let's get paths trimmed by default, but maximize the usability of debug info.” In that regard, we should reduce the burden of configuring debuggers for common cases. Users can always turn off -Ztrim-paths and set their own remap rules when needed.

If cargo changes anything about hash, then this would no longer work unless rustc versions match before doing cargo check

That's unfortunately yes. Alternatively we could provide a cargo debug subcommand to generate required debugger configurations. I don't feel that is better though.

@michaelwoerister
Copy link
Member

Some thoughts:

How strong do we expect these expectations to be, particularly: CARGO_HOME layout

It looks to me like there are two separate aspects here:

  • What do paths within $CARGO_HOME/registry/src and $CARGO_HOME/git/checkouts look like. That defines what remapping Cargo should pass to rustc. This part is affected by the hashes Cargo generates for registries. If Cargo changes the hashing and then one tries to debug a program without re-compiling it then the debugger would not find the source files. I think that would be acceptable.

  • Where are the $CARGO_HOME/registry/src and $CARGO_HOME/git/checkouts directories located on disk. Debuggers would basically have to add all the cases specified here. The fewer cases the better, I'd say. What the real impact of this is depends on who is in charge of telling the debugger about these directories. If we, the Rust project, do it (e.g. via rust-gdb or a cargo debug command) then we can update the logic whenever needed. If this is somehow built into debuggers themselves, then things might be more complicated and might take a long time.

    One way to possibly mitigate this is to provide something like a $CARGO_HOME/source_links directory that has a stable, well-defined layout with symlinks to the various directories:

    $CARGO_HOME
     |
     +-- source_links
         |
         +-- registry_sources
         |
         +-- git_sources
         |
         +-- path_sources
    

    That could also provide a solution of dealing with path-dependencies. Cargo would have to take care of keeping this up-to-date. Something similar could be done for rustup for the rust-src component.

If a user wants to debug locally without a local project, what steps are expected for them to do so and are we ok with any assumptions made, particularly when automating?

I think that use case is supported best via symbol servers and/or SourceLink. In both cases, there is a snapshot of the source available that's persistently linked to a specific executable, so we don't have to worry about any changes.

Alternatively we could provide a cargo debug subcommand to generate required debugger configurations.

This is something I'd like to explore (together with the debugging working group) in any case. It could provide a lot of useful functionality that the current rust-gdb and rust-lldb wrappers can't because they don't know about cargo projects. But that is kind of a separate topic.

@weihanglo
Copy link
Member Author

  • One way to possibly mitigate this is to provide something like a $CARGO_HOME/source_links directory that has a stable, well-defined layout with symlinks to the various directories:

This is a brilliant idea! However, people might want a platform specific source links directory instead of ~/.cargo/sourcelinks. So still it won't be stable before splitting CARGO_HOME happens.

That could also provide a solution of dealing with path-dependencies. Cargo would have to take care of keeping this up-to-date.

That's a good idea, but might need to resolve name conflict issues like #12516.

@epage
Copy link
Contributor

epage commented Jan 19, 2024

Sorry for how slow I've been on this.

What do paths within $CARGO_HOME/registry/src and $CARGO_HOME/git/checkouts look like.

Where are the $CARGO_HOME/registry/src and $CARGO_HOME/git/checkouts directories located on disk.

So for the first, "rebuild" and for the second, "remap".

If we're going in this direction, should we simplify this by just remaping to CARGO_HOME? That means users have fewer paths to add to their debugger.

I think that use case is supported best via symbol servers and/or SourceLink. In both cases, there is a snapshot of the source available that's persistently linked to a specific executable, so we don't have to worry about any changes.

SourceLink is the inspiration for the comment. The problem is, last I looked, gdb doesn't have the concept. When I last dealt with this (porting a Windows shop to Linux), we used fully qualified, cross-system paths that you can remap as needed. Hmm, maybe what we have is enough to do something like that? I've been so slow at this (sorry) that I lost track!

One way to possibly mitigate this is to provide something like a $CARGO_HOME/source_links directory that has a stable, well-defined layout with symlinks to the various directories:

We have at least one major change to CARGO_HOME in the works, XDG paths. I'd like to avoid declaring a cargo-written, user-read location like this to be stable.

@michaelwoerister
Copy link
Member

Sorry for how slow I've been on this.

No worries -- as you can see, I'm not particularly fast either 🙂

If we're going in this direction, should we simplify this by just remaping to CARGO_HOME? That means users have fewer paths to add to their debugger.

I don't think I understand. Can you elaborate?

SourceLink is the inspiration for the comment. The problem is, last I looked, gdb doesn't have the concept.

Yes, it sounds like something like this is planned for the next version of DWARF, but LLVM doesn't seem to implement this yet either. There are the debuginfod and Microsoft symbol servers, that seem to have existing support. I'll try to find out if these would need anything special from trim-paths. But overall we just have to make sure not to create any footguns for those, I think.

We have at least one major change to CARGO_HOME in the works, XDG paths. I'd like to avoid declaring a cargo-written, user-read location like this to be stable.

Is there any such path that we could declare stable? I think overall the approach might solve a couple of problems.

@briansmith
Copy link

Usually when using a hash function for such things, you want a strongly-collision-resistant hash function. But the "rustc-stable-hasher" is infamously not a strongly-collision-resistant hash function. It would be good to avoid making more use of it.

@weihanglo
Copy link
Member Author

Could you elaborate more on why this needs to be strong in terms of collision resistant? The current algorithm SipHash is strong for HashDOS protection but not really a cryptographic hash function either.

@michaelwoerister
Copy link
Member

Most of the interesting parts of making the hash value stable (~platform independent) is happening in the outer layer of StableHasher, so it seems feasible to make it configurable which hash algorithm is underneath. But I would also be interested to hear why a cryptographic hash function is required for this use case.

weihanglo added a commit to weihanglo/cargo that referenced this issue Jun 20, 2024
This helps `-Ztrim-paths` build a stable cross-platform path for the
registry and git sources. Sources files then can be found from the same
path when debugging.

See rust-lang#13171 (comment)

Two caveats:

* This will invalidate the current downloaded caches.
  Need to put this in the Cargo CHANGELOG.
* The performance of rustc-stable-hash is slightly worse than the old
  SipHasher in std on short things.
weihanglo added a commit to weihanglo/cargo that referenced this issue Jun 20, 2024
This helps `-Ztrim-paths` build a stable cross-platform path for the
registry and git sources. Sources files then can be found from the same
path when debugging.

See rust-lang#13171 (comment)

Two caveats:

* This will invalidate the current downloaded caches.
  Need to put this in the Cargo CHANGELOG.
* The performance of rustc-stable-hash is slightly worse than the old
  SipHasher in std on short things.
weihanglo added a commit to weihanglo/cargo that referenced this issue Jun 20, 2024
This helps `-Ztrim-paths` build a stable cross-platform path for the
registry and git sources. Sources files then can be found from the same
path when debugging.

See rust-lang#13171 (comment)

A few caveats:

* This will invalidate the current downloaded caches.
  Need to put this in the Cargo CHANGELOG.
* As a consequnce of changing how `SourceId` is hashed, the global cache
  tracker is also affected becauseCargo writes source identifiers (e.g.
  `index.crates.io-6f17d22bba15001f`) to SQLite.
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/global_cache_tracker.rs#L388-L391
* The performance of rustc-stable-hash is slightly worse than the old
  SipHasher in std on short things like `SourceId`, but for long stuff
  like fingerprint. See appendix.

StableHasher is used in several places (some might not be needed?):

* Rebuild detection (fingerprints)
  * Rustc version, including all the CLI args running `rustc -vV`.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/util/rustc.rs#L326
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/util/rustc.rs#L381
  * Build caches
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/fingerprint/mod.rs#L1456
* Compute rustc `-C metadata`
  * stable hash for SourceId
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/package_id.rs#L207
  * Also read and hash contents from custom target JSON file.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/compile_kind.rs#L81-L91
* `UnitInner::dep_hash`
  * This is to distinguish same units having different features set between normal and build dependencies.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/ops/cargo_compile/mod.rs#L627
* Hash file contents for `cargo package` to verify if files were modified before and after the build.
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/ops/cargo_package.rs#L999
* Rusc diagnostics deduplication
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/job_queue/mod.rs#L311
* Places using `SourceId` identifier like `registry/src` path,
  and `-Zscript` target directories.

Appendix
--------

Benchmark on x86_64-unknown-linux-gnu

```
bench_hasher/RustcStableHasher/URL
                        time:   [33.843 ps 33.844 ps 33.845 ps]
                        change: [-0.0167% -0.0049% +0.0072%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) low severe
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/URL
                        time:   [18.954 ns 18.954 ns 18.955 ns]
                        change: [-0.1281% -0.0951% -0.0644%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe
bench_hasher/RustcStableHasher/lorem ipsum
                        time:   [659.18 ns 659.20 ns 659.22 ns]
                        change: [-0.0192% -0.0062% +0.0068%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/lorem ipsum
                        time:   [1.2006 µs 1.2008 µs 1.2010 µs]
                        change: [+0.0117% +0.0467% +0.0808%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```
weihanglo added a commit to weihanglo/cargo that referenced this issue Jun 20, 2024
This helps `-Ztrim-paths` build a stable cross-platform path for the
registry and git sources. Sources files then can be found from the same
path when debugging.

See rust-lang#13171 (comment)

A few caveats:

* This will invalidate the current downloaded caches.
  Need to put this in the Cargo CHANGELOG.
* As a consequence of changing how `SourceId` is hashed, the global cache
  tracker is also affected because Cargo writes source identifiers (e.g.
  `index.crates.io-6f17d22bba15001f`) to SQLite.
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/global_cache_tracker.rs#L388-L391
* The performance of rustc-stable-hash is slightly worse than the old
  SipHasher in std on short things like `SourceId`, but for long stuff
  like fingerprint. See appendix.

StableHasher is used in several places (some might not be needed?):

* Rebuild detection (fingerprints)
  * Rustc version, including all the CLI args running `rustc -vV`.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/util/rustc.rs#L326
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/util/rustc.rs#L381
  * Build caches
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/fingerprint/mod.rs#L1456
* Compute rustc `-C metadata`
  * stable hash for SourceId
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/package_id.rs#L207
  * Also read and hash contents from custom target JSON file.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/compile_kind.rs#L81-L91
* `UnitInner::dep_hash`
  * This is to distinguish same units having different features set between normal and build dependencies.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/ops/cargo_compile/mod.rs#L627
* Hash file contents for `cargo package` to verify if files were modified before and after the build.
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/ops/cargo_package.rs#L999
* Rusc diagnostics deduplication
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/job_queue/mod.rs#L311
* Places using `SourceId` identifier like `registry/src` path,
  and `-Zscript` target directories.

Appendix
--------

Benchmark on x86_64-unknown-linux-gnu

```
bench_hasher/RustcStableHasher/URL
                        time:   [33.843 ps 33.844 ps 33.845 ps]
                        change: [-0.0167% -0.0049% +0.0072%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) low severe
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/URL
                        time:   [18.954 ns 18.954 ns 18.955 ns]
                        change: [-0.1281% -0.0951% -0.0644%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe
bench_hasher/RustcStableHasher/lorem ipsum
                        time:   [659.18 ns 659.20 ns 659.22 ns]
                        change: [-0.0192% -0.0062% +0.0068%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/lorem ipsum
                        time:   [1.2006 µs 1.2008 µs 1.2010 µs]
                        change: [+0.0117% +0.0467% +0.0808%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```
@weihanglo
Copy link
Member Author

Posted by @briansmith in #14116 (comment):

From a <1 minute reading of "-Ztrim-paths build a stable cross-platform path for the registry and git sources."

My understanding is that the intent here is to use a hash function to create a stable path to a particular set of source files or artifacts. If there is a hash collision then potentially hash(malicous-sources) == hash(trusted-sources) and so malicous-sources could be used instead of trusted-sources, silently.

The collision could happen, yes. However, for registry index the identifier is like index.crates.io-6f17d22bba15001f, which the first part is the URL of the registry. I can't immediately think of any approach to fake that. For the Git source, since the transformation is proto://host/path/repo -> repo-<hash-of-th-url> it is possible to make a collision like that.

@briansmith
Copy link

index.crates.io-6f17d22bba15001f

Presumably the "6f17d22bba15001f" there is the hash of something. What happens when, in the same registry, somebody publishes two crates X and Y that end up with the same hash? Is the assumption that the registry will check for a collision at publication time and reject the second one? If so, potentially you may be free of issues related to collisions for the registry case, but you may open yourself up to DoS if somebody could precompute of your hash that would collide with somebody else's to-be-published crate. (Precomputing a hash is called preimage and it is a harder problem than collisions.)

@bjorn3
Copy link
Member

bjorn3 commented Jun 20, 2024

It is the hash of the registry url itself. If you have two registries under the same domain, you could get a hash collision. The infra team would have to explicitly place another registry on index.crates.io to get a collision here. This is not something a random package uploader can do. If you have two crates in the same registry with the same name and version however, cargo will consider it as a single version of a single crate and if there is a lockfile, refuse to download the replaced version as the checksum differs from what it should be according to the lockfile. And crates.io should never allow this either.

@briansmith
Copy link

OK, but then why use siphash for this? It doesn't seem like it would be justified by the performance.

@weihanglo
Copy link
Member Author

SipHash doesn't need to be the only choice. What we need for source mapping is a stable hash that won't change between different platforms, endianness, bit-width, and so on.

@michaelwoerister
Copy link
Member

I've opened an issue in the stable hash repo about supporting other hash algorithms. It should be straightforward.

Supporting multiple hash algorithms within rustc might lead to some code bloat or dynamic dispatch overhead, but is a different story 🙂

weihanglo added a commit to weihanglo/cargo that referenced this issue Jul 9, 2024
This helps `-Ztrim-paths` build a stable cross-platform path for the
registry and git sources. Sources files then can be found from the same
path when debugging.

See rust-lang#13171 (comment)

A few caveats:

* This will invalidate the current downloaded caches.
  Need to put this in the Cargo CHANGELOG.
* As a consequence of changing how `SourceId` is hashed, the global cache
  tracker is also affected because Cargo writes source identifiers (e.g.
  `index.crates.io-6f17d22bba15001f`) to SQLite.
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/global_cache_tracker.rs#L388-L391
* The performance of rustc-stable-hash is slightly worse than the old
  SipHasher in std on short things like `SourceId`, but for long stuff
  like fingerprint. See appendix.

StableHasher is used in several places (some might not be needed?):

* Rebuild detection (fingerprints)
  * Rustc version, including all the CLI args running `rustc -vV`.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/util/rustc.rs#L326
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/util/rustc.rs#L381
  * Build caches
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/fingerprint/mod.rs#L1456
* Compute rustc `-C metadata`
  * stable hash for SourceId
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/package_id.rs#L207
  * Also read and hash contents from custom target JSON file.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/compile_kind.rs#L81-L91
* `UnitInner::dep_hash`
  * This is to distinguish same units having different features set between normal and build dependencies.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/ops/cargo_compile/mod.rs#L627
* Hash file contents for `cargo package` to verify if files were modified before and after the build.
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/ops/cargo_package.rs#L999
* Rusc diagnostics deduplication
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/job_queue/mod.rs#L311
* Places using `SourceId` identifier like `registry/src` path,
  and `-Zscript` target directories.

Appendix
--------

Benchmark on x86_64-unknown-linux-gnu

```
bench_hasher/RustcStableHasher/URL
                        time:   [33.843 ps 33.844 ps 33.845 ps]
                        change: [-0.0167% -0.0049% +0.0072%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) low severe
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/URL
                        time:   [18.954 ns 18.954 ns 18.955 ns]
                        change: [-0.1281% -0.0951% -0.0644%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe
bench_hasher/RustcStableHasher/lorem ipsum
                        time:   [659.18 ns 659.20 ns 659.22 ns]
                        change: [-0.0192% -0.0062% +0.0068%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/lorem ipsum
                        time:   [1.2006 µs 1.2008 µs 1.2010 µs]
                        change: [+0.0117% +0.0467% +0.0808%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```
weihanglo added a commit to weihanglo/cargo that referenced this issue Jul 9, 2024
This helps `-Ztrim-paths` build a stable cross-platform path for the
registry and git sources. Sources files then can be found from the same
path when debugging.

See rust-lang#13171 (comment)

A few caveats:

* This will invalidate the current downloaded caches.
  Need to put this in the Cargo CHANGELOG.
* As a consequence of changing how `SourceId` is hashed, the global cache
  tracker is also affected because Cargo writes source identifiers (e.g.
  `index.crates.io-6f17d22bba15001f`) to SQLite.
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/global_cache_tracker.rs#L388-L391
* The performance of rustc-stable-hash is slightly worse than the old
  SipHasher in std on short things like `SourceId`, but for long stuff
  like fingerprint. See appendix.

StableHasher is used in several places (some might not be needed?):

* Rebuild detection (fingerprints)
  * Rustc version, including all the CLI args running `rustc -vV`.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/util/rustc.rs#L326
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/util/rustc.rs#L381
  * Build caches
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/fingerprint/mod.rs#L1456
* Compute rustc `-C metadata`
  * stable hash for SourceId
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/package_id.rs#L207
  * Also read and hash contents from custom target JSON file.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/compile_kind.rs#L81-L91
* `UnitInner::dep_hash`
  * This is to distinguish same units having different features set between normal and build dependencies.
    * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/ops/cargo_compile/mod.rs#L627
* Hash file contents for `cargo package` to verify if files were modified before and after the build.
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/ops/cargo_package.rs#L999
* Rusc diagnostics deduplication
  * https://github.com/rust-lang/cargo/blob/6e236509b2331eef64df844b7bbc8ed352294107/src/cargo/core/compiler/job_queue/mod.rs#L311
* Places using `SourceId` identifier like `registry/src` path,
  and `-Zscript` target directories.

Appendix
--------

Benchmark on x86_64-unknown-linux-gnu

```
bench_hasher/RustcStableHasher/URL
                        time:   [33.843 ps 33.844 ps 33.845 ps]
                        change: [-0.0167% -0.0049% +0.0072%] (p = 0.44 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  5 (5.00%) low severe
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/URL
                        time:   [18.954 ns 18.954 ns 18.955 ns]
                        change: [-0.1281% -0.0951% -0.0644%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 14 outliers among 100 measurements (14.00%)
  3 (3.00%) low severe
  4 (4.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe
bench_hasher/RustcStableHasher/lorem ipsum
                        time:   [659.18 ns 659.20 ns 659.22 ns]
                        change: [-0.0192% -0.0062% +0.0068%] (p = 0.34 > 0.05)
                        No change in performance detected.
Found 12 outliers among 100 measurements (12.00%)
  4 (4.00%) low severe
  3 (3.00%) low mild
  3 (3.00%) high mild
  2 (2.00%) high severe
bench_hasher/SipHasher/lorem ipsum
                        time:   [1.2006 µs 1.2008 µs 1.2010 µs]
                        change: [+0.0117% +0.0467% +0.0808%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-needs-design Status: Needs someone to work further on the design for the feature or fix. NOT YET accepted. Z-trim-paths Nightly: path sanitization
Projects
None yet
Development

No branches or pull requests

5 participants