Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unsmudge fetchGit for reproducibility and security #4635

Closed
wants to merge 3 commits into from

Conversation

roberth
Copy link
Member

@roberth roberth commented Mar 13, 2021

Motivation

This PR makes builtins.fetchGit more secure and reproducible by using the clean version of the files.

Previously, only remotely fetched files would be in their clean state. With this change, files from local repositories will also be clean instead of smudged.

This also fixes the documentation which was wrong about the default behavior for ref (not HEAD when local).

Background

Git filters allow the user to change how files are represented
in the worktree.
On checkout, the configured smudge converts blobs to files in the worktree.
On checkin, the configured clean command converts files back into blobs.

Notable uses include

  • allow the user to work with a platform-specific representation, conveniently
  • git-crypt: only allow some users to see file contents, transparently
  • git-lfs: work with large files without inflating the repository

See also https://git-scm.com/docs/gitattributes#_filter

Why ignore filters

To quote the git docs

the intent is that if someone unsets the filter driver definition, or
does not have the appropriate filter program, the project should still
be usable.

So the feature was designed to be optional. This confirms that we have a
choice. Let's look at the individual use cases.

Allow the user to work with a platform-specific representation

While this might seem convenient, any such processing can also be done in
postUnpack, so it isn't necessary here.
Tarballs from GitHub and such don't apply the smudge filter either, so if
the project is going to be packaged in Nixpkgs, it will have to process its
files like this anyway.
The real kicker here is that running the smudge filter creates an
unreproducible dependency, because the filter does not come from a pinned
immutable source and it could inject information from arbitrary sources.

Git-crypt

The nix store can be read by any process on the system, or in some cases,
when using a cache, literally world-readable.
Running the filters in fetchGit would essentially make impossible the use of
git-crypt and Nix flakes in the same repository.
Even without flakes (or with changes to the flakes feature for that matter),
the software you want to build generally does not depend on credentials, so
not decrypting is not only a secure default, but a good one.
In a rare case where a build does not to decrypt a git-crypted file, one could
still pass the decrypted file or the git-crypt key explicitly (at the cost of
exposing it in the store, which is inevitable for nix-built paths).

Git LFS

Git LFS was designed to prevent excessive bloat in a repository, so the
"smudged" versions of these files will be huge.

If we were to include these directly in the fetchGit output, this creates
copies of all the large files for each commit we check out, or even for
each uncommitted but built local change (with fetchGit ./.).

In many cases, those files may not even be used in the build process. If
they are required, it seems feasible to fetch them explicitly with a
fetcher that fetches from LFS based on the sha256 in the unsmudged files.
It is more fine grained than downloading all LFS files and it does not even
require IFD because it happens after fetchGit, which runs at evaluation time.

If for some reason LFS support can not be achieved in Nix expressions, we
should add support for LFS itself, without running any other filters.

Conclusion

Not running the filters is more reproducible, secure and potentially more
efficient than running them.

@roberth
Copy link
Member Author

roberth commented Mar 18, 2021

@edolstra This is ready for review now.

@roberth
Copy link
Member Author

roberth commented Mar 31, 2021

Rebased and updated the docs and test of #4676 to reflect the actual behavior when the url is local and ref is unset.

@panicgh
Copy link

panicgh commented May 19, 2021

Given this proposal ignores git filters, how might Git LFS be supported
in the future?

I am quite new to Nix and not sure if I understand you correctly, so I
try to paraphrase your points in my own words:

Git LFS was designed to prevent excessive bloat in a repository, so the
"smudged" versions of these files will be huge.

If we were to include these directly in the fetchGit output, this creates
copies of all the large files for each commit we check out, or even for
each uncommitted but built local change (with fetchGit ./.).

  1. Git LFS files are likely huge
  2. Creating a new cache/store entry for each new fetched revision of a
    non-LFS repo is fine, since contents are small and little space is
    wasted. Because of (1.), space waste is likely to be huge for LFS
    repositories even if LFS files remain unchanged and only a small file
    was modified.

If [those huge files] are required, it seems feasible to fetch them explicitly with a
fetcher that fetches from LFS based on the sha256 in the unsmudged files.
It is more fine grained than downloading all LFS files and it does not even
require IFD because it happens after fetchGit, which runs at evaluation time.

Therefore, you propose never to include LFS files at fetchGit level
and rather use a to-be-implemented, e.g., fetchGitLfs builtin that
runs at evaluation time, works on top of fetchGit output, discovers
which files are from LFS, authenticates to the server, and downloads the
files into /nix/store/.... As a result, huge LFS files are refactored out of
the git repo sources (less space waste).

If for some reason LFS support can not be achieved in Nix expressions, we
should add support for LFS itself, without running any other filters.

The big advantage of fetchGit is that it runs at evaluation time,
allowing to use the (SSH) credentials of the user. If a fetchGitLfs
was not evaluation time, it would be a step backwards in my opinion.

For instance, I have some cases where LFS is used during the build
process, e.g. Java sources improperly bundled together with some jar
files or for (large) test input/reference files for (unit-)tests.

How would you imagine fetchGitLfs to be used? Would it wrap fetchGit
(or the archive fetchers for GitHub/-lab), or would it be a phase in the
builder as for patches?

{ stdenv }:
stdenv.mkDerivation {
  name = "myPackage";
  src = builtins.fetchGitLfs {
    src = builtins.fetchGit {
      url = "git+ssh://git@github.com/my-secret/repository.git";
      ref = "master";
      rev = "adab8b916a45068c044658c4158d81878f9ed1c3";
    };
  };
}

How would a fetchGitLfs be implemented? I imagine for example: read
the .gitattributes to learn the LFS file pattern, then scan the
unsmudged repo contents for LFS files, authentiate to LFS server,
download files using their oid and size, storing them in /nix/store
or ~/.cache/nix/ (?). The Nix output may be a function that copies the
unsmudged source and overwrites the LFS files, similar to how patches
works.

@roberth
Copy link
Member Author

roberth commented May 19, 2021

It seems that we can choose between two directions.
a. One is reimplementing certain git features in terms of the simple, well-defined tree and blob data model. Possibly using libgit2. This will be most reproducible and potentially more efficient as well. It will take more effort though. LFS support may require complicated logic.
b. The other is to simplify it, always base it on a non-bare worktree, allowing things like submodules and LFS to work in the way they're expected to. Unlike in (a) we'll need to make sure to "retroactively" block all sources of impurity, such as smudge filters that aren't LFS.

@edolstra which direction do you think we should take?

In this PR I haven't chosen a specific direction. In the simple case, it's like (a), but for submodules it behaves more like (b).

@panicgh Regardless of which direction, it seems more consistent to add an enableLFS option. I'm also not sure if fetchGit returns enough information for a subsequent LFS fetcher to do its job. Also flakes users won't have the freedom to put a fetcher in their fetcher.

@panicgh
Copy link

panicgh commented May 19, 2021

@panicgh Regardless of which direction, it seems more consistent to add an enableLFS option. I'm also not sure if fetchGit returns enough information for a subsequent LFS fetcher to do its job. Also flakes users won't have the freedom to put a fetcher in their fetcher.

Understood. So there would be an optional control flow within fetchGit.

Maybe an implementation could wrap the default LFS smudge/process hooks (modifying repo/.git/config?) and replace the LFS files with sym-/hardlinks to /nix/store? It may be necessary to disable the global/user-wide LFS config when git is called by fetchGit (instead git lfs install --local could be useful).

b. The other is to simplify it, always base it on a non-bare worktree, allowing things like submodules and LFS to work in the way they're expected to. Unlike in (a) we'll need to make sure to "retroactively" block all sources of impurity, such as smudge filters that aren't LFS.

In this PR I haven't chosen a specific direction. In the simple case, it's like (a), but for submodules it behaves more like (b).

Maybe submodules could also be refactored out to separate /nix/store entries: instead of relying on git to clone submodules, fetchGit could get only one repo at a time and then recursively call fetchGit for the submodules. Each repo goes into its own /nix/store path and the directories of the owning repo are replaced by symlinks to the submodule.

auto ss = std::stringstream{result.second};
StringSet filters;

for (std::string line; std::getline(ss, line, '\n');) {
Copy link
Contributor

@greedy greedy Sep 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git config values can have multiple lines (try git -c $'foo=two\nlines' config -l to see for yourself), so splitting lines here can cause incorrect results. I suggest to use git config --name-only --get-regexp '^filter\..*\.(smudge|process)$' to avoid this issue (config keys cannot contain newlines so splitting on newlines is now acceptable) and, as a bonus, only return relevant names.

Additionally, it isn't necessary to set the smudge filter to cat. If the filter is set to an empty string then no filtering is done.

With these two changes, noSmudgeOptions can be constructed by just appending = to the names returned from the git config invocation.

@greedy
Copy link
Contributor

greedy commented Sep 30, 2021

I think a pathway to explicit git-lfs support can be found by avoiding the smudge filters as done here and then, when selected, running git lfs pull -I '*' -X '' $URL` post-checkout.

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/using-git-crypt-with-flakes/15372/5

@stale
Copy link

stale bot commented Apr 16, 2022

I marked this as stale due to inactivity. → More info

@stale stale bot added the stale label Apr 16, 2022
@fricklerhandwerk fricklerhandwerk added security Security-related issues UX The way in which users interact with Nix. Higher level than UI. labels Sep 9, 2022
@stale stale bot removed stale labels Sep 9, 2022
@fricklerhandwerk
Copy link
Contributor

fricklerhandwerk commented Jan 13, 2023

Discussed on the Nix team meeting 2023-01-16:

  • @roberth: has been a while since this has been worked on, probably not mergeable
    • should instead reconsider the whole git thing to be based on the git objects API instead of the CLI, as the CLI does to much
    • @edolstra: have been thinking about a fetcher that uses libgit so we don't have to dump the entire source tree to the store, similar to the zip accessor for lazy trees
    • @thufschmitt: any idea how hard that would be?
    • @edolstra: have to test performance first, no idea how fast it is
    • @thufschmitt: another thing is Git LFS
      • @edolstra: reluctant to add more corner cases to the Git fetcher, it's already complicated
        • also LFS is probably not the right use case for Nix, as everything is copied to the store. might be okay with lazy trees
  • @thufschmitt: continue with this or work on libgit instead?
    • @roberth: prefer to work on libgit
    • @fricklerhandwerk: depends on how much work it would take to just merge this as an incremental improvement
      • @roberth: this will be quite some pain due to additional lock files
    • @edolstra: could we just disable smudging unconditionally?
  • to discuss, have to decide what to work on first

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-01-13-nix-team-meeting-minutes-23/24644/1

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2021-01-30-nix-team-meeting-minutes-25/25094/1

@thufschmitt
Copy link
Member

Closing since we'd rather use libgit for that, and the implementation is on the way.

@fricklerhandwerk
Copy link
Contributor

Discussed in Nix team meeting:

  • @regnat: this and libgit are solving the same problem. the question is how long libgit would take to finish
  • @edolstra: it won't take long. it already works on the lazy trees branch, but there is still a breakage with libssh.
  • @roberth: this implementaton is useless, let's close the PR
  • @edolstra: unfortunately we cannot completely get rid of the git dependency, because libgit uses libssh2 which doesn't support SSH config files
    • but there is a PR to fix this

@nixos-discourse
Copy link

This pull request has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/2023-04-14-nix-team-meeting-minutes-48/27358/1

@zebreus zebreus mentioned this pull request Sep 24, 2023
13 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
security Security-related issues UX The way in which users interact with Nix. Higher level than UI.
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

6 participants